# Idefics2

## Overview

The Idefics2 model was proposed in [What matters when building vision-language models?](https://huggingface.co/papers/2405.02246) by Léo Tronchon, Hugo Laurençon, and Victor Sanh. The accompanying blog post can be found [here](https://huggingface.co/blog/idefics2).

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.

The abstract from the paper is the following:

*The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.*

Idefics2 architecture. Taken from the original paper.

This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://huggingface.co/HuggingFaceM4/idefics2).

## Usage tips

- Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images in a batch for input to the model.
- The processor has a `do_image_splitting` option. If `True`, each input image will be split into 4 sub-images, and concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure `processor.image_processor.do_image_splitting` is set to `False` if the model was not trained with this option.
- `text` passed to the processor should have the `<image>` tokens where the images should be inserted, and `<end_of_utterance>` at the end of each utterance if the text is a chat message.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as `text` to the processor.
Example of how to use the processor on chat messages:

```python
import requests
from PIL import Image
from transformers import Idefics2ForConditionalGeneration, Idefics2Processor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

# one {"type": "image"} entry per image, in the order the images are passed
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b", device_map="auto")

# at inference time, one needs to pass `add_generation_prompt=True` in order to make sure the model completes the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)
print(text)
# 'User: What’s the difference between these two images?<image><image><end_of_utterance>\nAssistant:'

inputs = processor(images=images, text=text, return_tensors="pt").to(model.device)

generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_text, skip_special_tokens=True)[0]
print("Generated text:", generated_text)
```

- During training, it's important to determine which tokens the model should not learn. For Idefics2, this typically comes down to the image and padding tokens.
This means that one can create the labels as follows:

```python
import requests
from PIL import Image
from transformers import Idefics2ForConditionalGeneration, Idefics2Processor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
},
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b", device_map="auto")

# the assistant turn is already part of the conversation, so no generation prompt is added
text = processor.apply_chat_template(messages, add_generation_prompt=False)
inputs = processor(images=images, text=text, return_tensors="pt").to(model.device)

# start from the input ids and mask the tokens the model should not learn
labels = inputs.input_ids.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == model.config.image_token_id] = -100

inputs["labels"] = labels

outputs = model(**inputs)
loss = outputs.loss
loss.backward()
```

Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to `-100`.

## Model optimizations: Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
```bash
pip install -U flash-attn --no-build-isolation
```

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). Also make sure to load your model in half-precision (e.g. `torch.float16`).

To load and run a model using Flash Attention 2, apply the following change to the code snippet above:

```diff
+ import torch

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    dtype=torch.float16,
+    attn_implementation="flash_attention_2",
    device_map="auto",
)
```

## Shrinking down Idefics2 using quantization

As the Idefics2 model has 8 billion parameters, it requires about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 4GB of RAM.

Quantizing a model is as simple as passing a `quantization_config` to the model. One can change the code snippet above with the changes below. We'll leverage bitsandbytes quantization (but refer to [this page](../quantization) for other quantization methods):

```diff
+ import torch
+ from transformers import BitsAndBytesConfig

+ quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.float16
+ )
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    dtype=torch.float16,
+    quantization_config=quantization_config,
    device_map="auto",
)
```

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it!
The resource should ideally demonstrate something new instead of duplicating an existing resource.

- A notebook on how to fine-tune Idefics2 on a custom dataset using the [Trainer](../main_classes/trainer) can be found [here](https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing). It supports both full fine-tuning and (quantized) LoRA.
- A script showing how to fine-tune Idefics2 using the TRL library can be found [here](https://gist.github.com/edbeeching/228652fc6c2b29a1641be5a5778223cb).
- A demo notebook on fine-tuning Idefics2 for JSON extraction use cases can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Idefics2). 🌎

## Idefics2Config[[transformers.Idefics2Config]]

#### transformers.Idefics2Config[[transformers.Idefics2Config]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/configuration_idefics2.py#L99)

This is the configuration class to store the configuration of an Idefics2Model. It is used to instantiate an Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the documentation from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.
Example:

```python
>>> from transformers import Idefics2Model, Idefics2Config

>>> # Initializing configuration
>>> configuration = Idefics2Config()

>>> # Initializing a model from the configuration
>>> model = Idefics2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

use_cache (`bool`, *optional*, defaults to `True`) : Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True` or when the model is a decoder-only generative model.

image_token_id (`int`, *optional*, defaults to `32001`) : The image token index used as a placeholder for input images.

tie_word_embeddings (`bool`, *optional*, defaults to `False`) : Whether to tie weight embeddings according to model's `tied_weights_keys` mapping.

vision_config (`Union[dict, ~configuration_utils.PreTrainedConfig]`, *optional*) : The config object or dictionary of the vision backbone.

perceiver_config (`IdeficsPerceiverConfig` or `dict`, *optional*) : Custom perceiver config or dict.

text_config (`Union[dict, ~configuration_utils.PreTrainedConfig]`, *optional*) : The config object or dictionary of the text backbone.

## Idefics2VisionConfig[[transformers.Idefics2VisionConfig]]

#### transformers.Idefics2VisionConfig[[transformers.Idefics2VisionConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/configuration_idefics2.py#L27)

This is the configuration class to store the configuration of an Idefics2Model. It is used to instantiate an Idefics2 model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the documentation from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.

Example:

```python
>>> from transformers.models.idefics2.modeling_idefics2 import Idefics2VisionTransformer
>>> from transformers.models.idefics2.configuration_idefics2 import Idefics2VisionConfig

>>> # Initializing a Idefics2VisionConfig with google/siglip-base-patch16-224 style configuration
>>> configuration = Idefics2VisionConfig()

>>> # Initializing a Idefics2VisionTransformer (with random weights) from the google/siglip-base-patch16-224 style configuration
>>> model = Idefics2VisionTransformer(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations.

intermediate_size (`int`, *optional*, defaults to `3072`) : Dimension of the MLP representations.

num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the Transformer decoder.

num_attention_heads (`int`, *optional*, defaults to `12`) : Number of attention heads for each attention layer in the Transformer decoder.

num_channels (`int`, *optional*, defaults to `3`) : The number of input channels.

image_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `224`) : The size (resolution) of each image.

patch_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `32`) : The size (resolution) of each patch.
hidden_act (`str`, *optional*, defaults to `gelu_pytorch_tanh`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.

layer_norm_eps (`float`, *optional*, defaults to `1e-06`) : The epsilon used by the layer normalization layers.

attention_dropout (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities.

initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

## Idefics2PerceiverConfig[[transformers.Idefics2PerceiverConfig]]

#### transformers.Idefics2PerceiverConfig[[transformers.Idefics2PerceiverConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/configuration_idefics2.py#L63)

This is the configuration class to store the configuration of an Idefics2Model. It is used to instantiate an Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the documentation from [PreTrainedConfig](/docs/transformers/v5.8.0/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.

**Parameters:**

hidden_act (`str`, *optional*, defaults to `silu`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.

hidden_size (`int`, *optional*, defaults to `4096`) : Dimension of the hidden representations.

rms_norm_eps (`float`, *optional*, defaults to `1e-06`) : The epsilon used by the rms normalization layers.
resampler_n_latents (`int`, *optional*, defaults to 64) : Number of latent embeddings to resample ("compress") the input sequence to (usually < 128).

resampler_depth (`int`, *optional*, defaults to 3) : Depth of the Perceiver Resampler (Transformer w/ cross attention). Should be shallow (<= 3).

resampler_n_heads (`int`, *optional*, defaults to 16) : Number of heads in each Transformer block (for multi-headed self-attention).

resampler_head_dim (`int`, *optional*, defaults to 96) : Dimensionality of each head projection in the Transformer block.

num_key_value_heads (`int`, *optional*, defaults to `4`) : This is the number of key_value heads that should be used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out [this paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `num_attention_heads`.

attention_dropout (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities.

initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

## Idefics2Model[[transformers.Idefics2Model]]

#### transformers.Idefics2Model[[transformers.Idefics2Model]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L774)

Idefics2 model consisting of a SIGLIP vision encoder and Mistral language decoder.

This model inherits from [PreTrainedModel](/docs/transformers/v5.8.0/en/main_classes/model#transformers.PreTrainedModel).
Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

#### forward[[transformers.Idefics2Model.forward]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L876)

`forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, pixel_values: torch.FloatTensor | None = None, pixel_attention_mask: torch.BoolTensor | None = None, image_hidden_states: torch.FloatTensor | None = None, use_cache: bool | None = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs])`

- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using [AutoTokenizer](/docs/transformers/v5.8.0/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/v5.8.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and [PreTrainedTokenizer.__call__()](/docs/transformers/v5.8.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details.
[What are input IDs?](../glossary#input-ids)
- **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.

  [What are attention masks?](../glossary#attention-mask)
- **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
- **past_key_values** (`~cache_utils.Cache`, *optional*) -- Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only [Cache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.Cache) instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). If no `past_key_values` are passed, [DynamicCache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.DynamicCache) will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) -- The tensors corresponding to the input images. Pixel values can be obtained using [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor). See `Idefics2ImageProcessor.__call__()` for details ([Idefics2Processor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Processor) uses [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor) for processing images).
- **pixel_attention_mask** (`torch.Tensor` of shape `(batch_size, image_size, image_size)`, *optional*) -- Mask to avoid performing attention on padding pixel indices.
- **image_hidden_states** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- The hidden states of the image encoder after modality projection and perceiver resampling.
- **use_cache** (`bool`, *optional*) -- If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).

Returns `Idefics2BaseModelOutputWithPast` or `tuple(torch.FloatTensor)`: an `Idefics2BaseModelOutputWithPast` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) and inputs.

Inputs fed to the model can have an arbitrary number of images. To account for this, `pixel_values` fed to the model have image padding -> `(batch_size, max_num_images, 3, max_heights, max_widths)` where `max_num_images` is the maximum number of images among the `batch_size` samples in the batch.
Padding images are not needed beyond padding the `pixel_values` at the entrance of the model. For efficiency, we only pass through the vision_model's forward the real images by discarding the padding images, i.e. `pixel_values` of size `(image_batch_size, 3, height, width)` where `image_batch_size` would be 7 when `num_images_per_sample=[1, 3, 1, 2]` and `max_num_images` would be 3.

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, hidden_size)` is output.
- **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
- **hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- **image_hidden_states** (`tuple(torch.FloatTensor)`, *optional*) -- Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images, sequence_length, hidden_size)`. `image_hidden_states` of the model produced by the vision encoder, and optionally by the perceiver.

**Parameters:**

config ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.8.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

`Idefics2BaseModelOutputWithPast` or `tuple(torch.FloatTensor)`

An `Idefics2BaseModelOutputWithPast` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) and inputs.

#### get_image_features[[transformers.Idefics2Model.get_image_features]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L823)

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g.
for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

**Parameters:**

pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) : The tensors corresponding to the input images.

pixel_attention_mask (`torch.LongTensor`, *optional*) : The attention mask indicating padded regions in the image.

**Returns:**

[BaseModelOutputWithPooling](/docs/transformers/v5.8.0/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)`

A [BaseModelOutputWithPooling](/docs/transformers/v5.8.0/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) and inputs.
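The image-padding note in the `Idefics2Model.forward` description can be checked with a few lines of plain Python. This is only a sketch of the bookkeeping, not the library's implementation: the helper names `real_image_mask` and `image_batch_stats` are hypothetical, and the input assumes every batch holds at least one sample.

```python
def real_image_mask(num_images_per_sample):
    """Return one row per sample; True marks a real image slot, False a padding slot."""
    max_num_images = max(num_images_per_sample)
    return [[slot < n for slot in range(max_num_images)] for n in num_images_per_sample]


def image_batch_stats(num_images_per_sample):
    """Return (max_num_images, image_batch_size, num_padding_images)."""
    mask = real_image_mask(num_images_per_sample)
    max_num_images = len(mask[0])
    # only real images would be passed through the vision encoder
    image_batch_size = sum(sum(row) for row in mask)
    num_padding_images = len(mask) * max_num_images - image_batch_size
    return max_num_images, image_batch_size, num_padding_images


# the example from the text: four samples holding 1, 3, 1 and 2 images
print(image_batch_stats([1, 3, 1, 2]))  # (3, 7, 5)
```

With `num_images_per_sample=[1, 3, 1, 2]`, the padded tensor holds 4 × 3 = 12 image slots, of which 7 are real and 5 are padding that can be discarded before the vision encoder.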
## Idefics2ForConditionalGeneration[[transformers.Idefics2ForConditionalGeneration]]

#### transformers.Idefics2ForConditionalGeneration[[transformers.Idefics2ForConditionalGeneration]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L970)

The Idefics2 Model with a language modeling head. It is made up of a SigLIP vision encoder, with a language modeling head on top.

This model inherits from [PreTrainedModel](/docs/transformers/v5.8.0/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

#### forward[[transformers.Idefics2ForConditionalGeneration.forward]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L1007)

`forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, pixel_values: torch.FloatTensor | None = None, pixel_attention_mask: torch.BoolTensor | None = None, image_hidden_states: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])`

- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using [AutoTokenizer](/docs/transformers/v5.8.0/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/v5.8.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and [PreTrainedTokenizer.__call__()](/docs/transformers/v5.8.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. [What are input IDs?](../glossary#input-ids)
- **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.

  [What are attention masks?](../glossary#attention-mask)
- **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
- **past_key_values** (`~cache_utils.Cache`, *optional*) -- Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only [Cache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.Cache) instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
  If no `past_key_values` are passed, [DynamicCache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.DynamicCache) will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) -- The tensors corresponding to the input images. Pixel values can be obtained using [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor). See `Idefics2ImageProcessor.__call__()` for details ([Idefics2Processor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Processor) uses [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor) for processing images).
- **pixel_attention_mask** (`torch.Tensor` of shape `(batch_size, image_size, image_size)`, *optional*) -- Mask to avoid performing attention on padding pixel indices.
- **image_hidden_states** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- The hidden states of the image encoder after modality projection and perceiver resampling.
- **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or `model.image_token_id` (where `model` is your instance of `Idefics2ForConditionalGeneration`). Tokens with indices set to `model.image_token_id` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- **use_cache** (`bool`, *optional*) -- If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- **logits_to_keep** (`Union[int, torch.Tensor]`, *optional*, defaults to `0`) -- If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a `torch.Tensor`, must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

The [Idefics2ForConditionalGeneration](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ForConditionalGeneration) forward method overrides the `__call__` special method. Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

**Returns:** `Idefics2CausalLMOutputWithPast` or `tuple(torch.FloatTensor)`

An `Idefics2CausalLMOutputWithPast` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) and inputs.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss (for next-token prediction).
- **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/v5.8.0/en/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
- **hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- **image_hidden_states** (`tuple(torch.FloatTensor)`, *optional*) -- Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images, sequence_length, hidden_size)`.
  Image hidden states of the model, produced by the vision encoder and optionally by the perceiver.

Example:

```python
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
>>> model = AutoModelForImageTextToText.from_pretrained("HuggingFaceM4/idefics2-8b-base", device_map="auto")

>>> # Prevent the model from generating the image placeholder tokens
>>> BAD_WORDS_IDS = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> EOS_WORDS_IDS = [processor.tokenizer.eos_token_id]

>>> # Create inputs: each "<image>" marks where one image is inserted,
>>> # so the number of placeholders per prompt must match the number of images per sample
>>> prompts = [
...     "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
...     "In which city is that bridge located?<image>",
... ]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to(model.device)

>>> # Generate
>>> generated_ids = model.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=20)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts)
['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of New York, and more specifically the Statue of Liberty.\n\n', 'In which city is that bridge located?\n\nThe bridge is located in the city of Pittsburgh, Pennsylvania.\n\n\nThe bridge is']
```

**Parameters:**

- **config** ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) -- Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.8.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

#### get_image_features[[transformers.Idefics2ForConditionalGeneration.get_image_features]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/modeling_idefics2.py#L990)

Fields of the returned output:

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function.
  The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Example:

```python
>>> from PIL import Image
>>> from transformers import AutoProcessor, Idefics2ForConditionalGeneration

>>> model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

>>> messages = [
...     {
...         "role": "user", "content": [
...             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
...             {"type": "text", "text": "Where is the cat standing?"},
...         ]
...     },
... ]

>>> inputs = processor.apply_chat_template(
...     messages,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
...     add_generation_prompt=True
... )

>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
```

**Parameters:**

- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- The tensors corresponding to the input images.
- **pixel_attention_mask** (`torch.LongTensor`, *optional*) -- The attention mask indicating padded regions in the image.

**Returns:** [BaseModelOutputWithPooling](/docs/transformers/v5.8.0/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)`

A [BaseModelOutputWithPooling](/docs/transformers/v5.8.0/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([Idefics2Config](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Config)) and inputs.

## Idefics2ImageProcessor[[transformers.Idefics2ImageProcessor]]

#### transformers.Idefics2ImageProcessor[[transformers.Idefics2ImageProcessor]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/image_processing_idefics2.py#L114)

Constructs an Idefics2ImageProcessor image processor.

#### preprocess[[transformers.Idefics2ImageProcessor.preprocess]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/image_processing_idefics2.py#L132)

`preprocess(images, **kwargs: Unpack[Idefics2ImageProcessorKwargs])`

- **images** (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) -- Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- **do_image_splitting** (`bool`, *kwargs*, *optional*, defaults to `self.do_image_splitting`) -- Whether to split the image into a sequence of 4 equal sub-images concatenated with the original image.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.8.0/en/internal/file_utils#transformers.TensorType), *optional*) -- Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors.
- **\*\*kwargs** ([ImagesKwargs](/docs/transformers/v5.8.0/en/main_classes/processors#transformers.ImagesKwargs), *optional*) -- Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

**Returns:** `BatchFeature`

- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the `__call__` method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

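The splitting behaviour described for `do_image_splitting` can be illustrated with a minimal pure-Python sketch. The helper `split_into_five` is hypothetical and is not the library's implementation; it assumes the image is a nested list of rows and that the four sub-images are the quadrants obtained by cutting at the midpoints. With odd dimensions the quadrants would not be exactly equal in size.

```python
def split_into_five(image):
    """Return the four quadrant crops followed by the original image.

    Hypothetical helper for illustration only. `image` is assumed to be
    a non-empty list of equally long rows (one entry per pixel).
    """
    height, width = len(image), len(image[0])
    mid_h, mid_w = height // 2, width // 2

    def crop(top, bottom, left, right):
        # Keep rows top..bottom-1 and columns left..right-1
        return [row[left:right] for row in image[top:bottom]]

    return [
        crop(0, mid_h, 0, mid_w),           # top-left
        crop(0, mid_h, mid_w, width),       # top-right
        crop(mid_h, height, 0, mid_w),      # bottom-left
        crop(mid_h, height, mid_w, width),  # bottom-right
        image,                              # original, kept as the fifth image
    ]


pieces = split_into_five([[1, 2, 3, 4], [5, 6, 7, 8]])
print(len(pieces))  # one input image yields five images
```

Because every input image becomes five images under this scheme, the number of image tokens in the prompt grows accordingly, which is worth keeping in mind when budgeting sequence length.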
## Idefics2ImageProcessorPil[[transformers.Idefics2ImageProcessorPil]]

#### transformers.Idefics2ImageProcessorPil[[transformers.Idefics2ImageProcessorPil]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/image_processing_pil_idefics2.py#L116)

Constructs an Idefics2ImageProcessor image processor.

#### preprocess[[transformers.Idefics2ImageProcessorPil.preprocess]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/image_processing_pil_idefics2.py#L134)

`preprocess(images, **kwargs: Unpack[Idefics2ImageProcessorKwargs])`

- **images** (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) -- Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- **do_image_splitting** (`bool`, *kwargs*, *optional*, defaults to `self.do_image_splitting`) -- Whether to split the image into a sequence of 4 equal sub-images concatenated with the original image.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.8.0/en/internal/file_utils#transformers.TensorType), *optional*) -- Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors.
- **\*\*kwargs** ([ImagesKwargs](/docs/transformers/v5.8.0/en/main_classes/processors#transformers.ImagesKwargs), *optional*) -- Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

**Returns:** `BatchFeature`

- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the `__call__` method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

## Idefics2Processor[[transformers.Idefics2Processor]]

#### transformers.Idefics2Processor[[transformers.Idefics2Processor]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/processing_idefics2.py#L59)

Constructs an Idefics2Processor which wraps an image processor and a tokenizer into a single processor.

[Idefics2Processor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2Processor) offers all the functionalities of [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor) and [LlamaTokenizer](/docs/transformers/v5.8.0/en/model_doc/llama2#transformers.LlamaTokenizer).
See [Idefics2ImageProcessor](/docs/transformers/v5.8.0/en/model_doc/idefics2#transformers.Idefics2ImageProcessor) and [LlamaTokenizer](/docs/transformers/v5.8.0/en/model_doc/llama2#transformers.LlamaTokenizer) for more information.

#### __call__[[transformers.Idefics2Processor.__call__]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/idefics2/processing_idefics2.py#L98)

`__call__(images=None, text=None, **kwargs: Unpack[Idefics2ProcessorKwargs])`

- **images** (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor], list[Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]], list[list[Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]]]]`, *optional*) -- Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- **text** (`Union[str, PreTokenizedInput, list[str], list[PreTokenizedInput]]`, *optional*) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If you pass a pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.8.0/en/internal/file_utils#transformers.TensorType), *optional*) -- If set, will return tensors of a particular framework. Acceptable values are:
  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return NumPy `np.ndarray` objects.
- **\*\*kwargs** ([ProcessingKwargs](/docs/transformers/v5.8.0/en/main_classes/processors#transformers.ProcessingKwargs), *optional*) -- Additional processing options for each modality (text, images, videos, audio). Model-specific parameters are listed above; see the TypedDict class for the complete list of supported arguments.

**Returns:** `BatchFeature`

- **data** (`dict`, *optional*) -- Dictionary of lists/arrays/tensors returned by the `__call__`/`pad` methods ('input_values', 'attention_mask', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
- **skip_tensor_conversion** (`list[str]` or `set[str]`, *optional*) -- List or set of keys that should NOT be converted to tensors, even when `tensor_type` is specified.

**Parameters:**

- **image_processor** (`Idefics2ImageProcessor`) -- The image processor is a required input.
- **tokenizer** (`LlamaTokenizer`) -- The tokenizer is a required input.
- **image_seq_len** (`int`, *optional*, defaults to 64) -- The length of the image sequence, i.e. the number of tokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the `config.perceiver_config.resampler_n_latents` value for the model used.
- **chat_template** (`str`, *optional*) -- A Jinja template to convert lists of messages in a chat into a tokenizable string.
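
To make the role of `image_seq_len` more concrete, the following is a minimal sketch of how a prompt string might be built from image placeholders. It is not the processor's actual code: the helper `expand_image_tokens`, the token strings `<image>` and `<fake_token_around_image>`, and the step that collapses adjacent delimiters are all assumptions made for illustration.

```python
IMAGE_TOKEN = "<image>"                   # assumed placeholder string
FAKE_TOKEN = "<fake_token_around_image>"  # assumed delimiter string


def expand_image_tokens(prompt, image_seq_len=64, do_image_splitting=False):
    """Replace each image placeholder with a delimited run of image tokens.

    Hypothetical helper for illustration only.
    """
    # One image occupies `image_seq_len` positions, wrapped in delimiters
    image_str = f"{FAKE_TOKEN}{IMAGE_TOKEN * image_seq_len}{FAKE_TOKEN}"
    if do_image_splitting:
        # With splitting, one input image stands for five images
        image_str = image_str * 5
    expanded = prompt.replace(IMAGE_TOKEN, image_str)
    # Assumed cleanup: merge delimiters that end up side by side
    return expanded.replace(FAKE_TOKEN + FAKE_TOKEN, FAKE_TOKEN)


text = expand_image_tokens("Describe this picture.<image>", image_seq_len=2)
print(text)
```

Under these assumptions, a prompt with one placeholder and `image_seq_len=64` would reserve 64 image positions, or 320 when image splitting is enabled, which is why the value needs to match the number of latents the model produces per image.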