This page documents the Python Protocol classes that multimodal models must implement in vLLM, how they signal capabilities, and how they embed multimodal inputs into the language model's representation space. It covers SupportsMultiModal and related protocols (SupportsLoRA, SupportsPP, SupportsMRoPE, SupportsTranscription, SupportsMultiModalPruning, SupportsEagle3, SupportsQuant), with concrete examples from LLaVA, Pixtral, Qwen2-VL, Qwen3-VL, Whisper, and MiniCPM-V.
For the preprocessing pipeline that converts raw images/audio/video into tensor kwargs before they reach the model, see Multimodal Processing Pipeline. For how the model runner invokes encoder passes, see GPU Model Runner.
All protocols are defined in vllm/model_executor/models/interfaces.py and are decorated with @runtime_checkable. Membership is checked via isinstance at runtime rather than requiring explicit inheritance.
| Protocol | Class-variable sentinel | Key method(s) |
|---|---|---|
| `SupportsMultiModal` | `supports_multimodal = True` | `embed_multimodal`, `embed_input_ids`, `get_language_model` |
| `SupportsMultiModalPruning` | `supports_multimodal_pruning = True` | `recompute_mrope_positions` |
| `SupportsLoRA` | `supports_lora = True` | (data fields: `embedding_modules`, `packed_modules_mapping`) |
| `SupportsPP` | `supports_pipeline_parallel = True` | (pipeline stage splitting) |
| `SupportsMRoPE` | `supports_mrope = True` | `get_mrope_input_positions` |
| `SupportsTranscription` | `supports_transcription = True` | (encoder-decoder language identification) |
| `SupportsEagle3` | `supports_eagle3 = True` | (intermediate hidden-state hooks for draft generation) |
| `SupportsQuant` | `supports_quantization = True` | (quantization method compatibility) |
| `SupportsScoreTemplate` | `supports_score_template = True` | `get_score_template`, `post_process_tokens` |
Helper functions such as `supports_multimodal(model)`, `supports_multimodal_pruning(model)`, etc. are provided for safe `isinstance`-style checks that return `TypeIs`-narrowed types.
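A minimal sketch of this pattern (simplified: the sketch's helper returns a plain `bool` and is renamed to avoid shadowing the protocol, whereas vLLM's helper is named `supports_multimodal` and returns a `TypeIs`-narrowed result):

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class SupportsMultiModal(Protocol):
    # Class-variable sentinel: membership is structural, not inherited.
    supports_multimodal: ClassVar[Literal[True]]


def supports_multimodal_check(model: object) -> bool:
    # isinstance on a @runtime_checkable Protocol only checks that the
    # declared members are present on the object.
    return isinstance(model, SupportsMultiModal)


class LlavaLike:
    supports_multimodal: ClassVar[Literal[True]] = True


class TextOnly:
    pass
```

Because the check is structural, a model never needs to inherit from the protocol; declaring the sentinel attribute is enough for `supports_multimodal_check(LlavaLike())` to return `True` while `supports_multimodal_check(TextOnly())` returns `False`.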
Sources: vllm/model_executor/models/interfaces.py:87-467
SupportsMultiModal — The Core Interface

`SupportsMultiModal` is a `@runtime_checkable` `Protocol` that all models capable of processing non-text inputs must implement.
Sources: vllm/model_executor/models/interfaces.py:87-116
| Flag | Type | Default | Effect |
|---|---|---|---|
| `supports_multimodal` | `ClassVar[Literal[True]]` | `True` | Marks model as multimodal (set by base class) |
| `supports_multimodal_raw_input_only` | `ClassVar[bool]` | `False` | Model consumes raw media, never pre-encoded embeddings |
| `supports_encoder_tp_data` | `ClassVar[bool]` | `False` | Model supports `mm_encoder_tp_mode="data"` |
| `requires_raw_input_tokens` | `ClassVar[bool]` | `False` | Forward pass needs raw token IDs, not embeddings |
supports_encoder_tp_data = True is set explicitly in Qwen2_5_VLForConditionalGeneration and Qwen3-VL to enable data-parallel ViT execution over tensor-parallel groups.
Sources: vllm/model_executor/models/interfaces.py:91-116, vllm/model_executor/models/qwen2_5_vl.py:1016
During __init__, models use context managers provided by SupportsMultiModal to annotate which submodules are the language model and which are the media tower (vision/audio encoder). This information is used for:
- `--mm-encoder-only` mode: language model layers are replaced with `StageMissingLayer` stubs.
- `--limit-mm-per-prompt 0` mode: the tower itself is stubbed out when the modality count is zero.

Component Marking Diagram
The three context managers available are:
| Context Manager | Purpose |
|---|---|
| `_mark_language_model(vllm_config, targets=None)` | Mark submodules assigned in this block as language model components |
| `_mark_tower_model(vllm_config, modalities, targets=None)` | Mark submodules as vision/audio tower components for given modalities |
| `_mark_composite_model(vllm_config, language_targets, tower_targets)` | Convenience wrapper combining both of the above |
Sources: vllm/model_executor/models/interfaces.py:189-297
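A toy sketch of how such marking can work (assumed mechanics: the real context managers also take `vllm_config` and optional `targets`; here, attributes assigned on `self` inside each `with` block are simply recorded by name):

```python
import contextlib


class ComponentMarker:
    def __init__(self):
        self.language_model_components: list[str] = []
        self.tower_components: dict[str, list[str]] = {}

    @contextlib.contextmanager
    def _mark_language_model(self):
        # Snapshot attribute names before the block; anything new afterwards
        # was assigned inside the `with` and belongs to the language model.
        before = set(vars(self))
        yield
        self.language_model_components += sorted(set(vars(self)) - before)

    @contextlib.contextmanager
    def _mark_tower_model(self, modalities):
        before = set(vars(self))
        yield
        new = sorted(set(vars(self)) - before)
        for modality in modalities:
            self.tower_components.setdefault(modality, []).extend(new)


class ToyVLM(ComponentMarker):
    def __init__(self):
        super().__init__()
        with self._mark_tower_model(["image"]):
            self.vision_tower = "vit"       # stand-in for the encoder module
        with self._mark_language_model():
            self.language_model = "llm"     # stand-in for the LM backbone
```

With this bookkeeping, a runner can later stub out either group of submodules by name, which is the behavior the two modes above rely on.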
MultiModalEmbeddings Type

Sources: vllm/model_executor/models/interfaces.py:57-64
Returned by `embed_multimodal`. Accepted formats:

- A list or tuple of 2D tensors, one per multimodal data item (the feature length may differ per item).
- A single 3D tensor, batching the per-item 2D tensors along the first dimension.
The ordering must match the order in which the corresponding media items appear in the prompt.
The model runner calls into the model in two stages per forward pass:
1. `embed_multimodal(**kwargs)` — encodes raw pixel values / audio features through the tower and returns `MultiModalEmbeddings`.
2. `embed_input_ids(input_ids, multimodal_embeddings, is_multimodal=...)` — applies the token embedding table to text tokens, then scatters multimodal embeddings into the positions indicated by the `is_multimodal` boolean mask.

Embedding Merge Sequence Diagram
The default implementation of embed_input_ids in SupportsMultiModal handles the scatter via _merge_multimodal_embeddings from vllm/model_executor/models/utils.py. Models with out-of-vocabulary multimodal token IDs can set handle_oov_mm_token=True to skip calling the LM embedding table for those positions.
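The scatter step can be sketched as follows (a minimal stand-in for `_merge_multimodal_embeddings`, assuming already-flattened 2D inputs; the real helper also handles the OOV-token path):

```python
import torch


def merge_multimodal_embeddings(
    text_embeds: torch.Tensor,    # (num_tokens, hidden) from the LM table
    mm_embeds: torch.Tensor,      # (num_mm_tokens, hidden), in prompt order
    is_multimodal: torch.Tensor,  # (num_tokens,) bool mask of mm positions
) -> torch.Tensor:
    if int(is_multimodal.sum()) != mm_embeds.shape[0]:
        raise ValueError("mask count must match number of mm embeddings")
    # Overwrite masked positions with the multimodal embeddings, in order.
    merged = text_embeds.clone()
    merged[is_multimodal] = mm_embeds
    return merged


text = torch.zeros(5, 4)
image = torch.ones(2, 4)
mask = torch.tensor([False, True, True, False, False])
out = merge_multimodal_embeddings(text, image, mask)
```

Here positions 1 and 2 (the image placeholder tokens) end up holding the image embeddings while the text positions keep their LM-table embeddings.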
Sources: vllm/model_executor/models/interfaces.py:140-387
SupportsMultiModalPruning

Sources: vllm/model_executor/models/interfaces.py:389-426
For models that support dynamic pruning of multimodal embeddings at prefill time (e.g., Efficient Video Sampling / EVS in Qwen2.5-VL and Qwen3-VL). Requires implementing:
```
recompute_mrope_positions(
    input_ids, multimodal_embeddings, mrope_positions, num_computed_tokens
) -> (MultiModalEmbeddings, mrope_positions, mrope_position_delta)
```
When tokens are pruned, the MRoPE position IDs for the remaining sequence must be recalculated. This method updates the positions starting at num_computed_tokens to reflect pruning.
Implementing classes: Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration.
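A toy 1D analogue of the recomputation (an assumption-laden sketch: the real method returns `(3, N)` temporal/height/width IDs, and `mrope_position_delta` is used to offset later decode-time positions):

```python
import torch


def recompute_positions_after_pruning(positions, keep_mask, num_computed_tokens):
    # Drop pruned tokens, then give the survivors consecutive position IDs
    # starting at num_computed_tokens, so decode continues without gaps.
    kept = positions[keep_mask]
    new_positions = torch.arange(
        num_computed_tokens, num_computed_tokens + kept.numel())
    delta = kept.numel() - positions.numel()  # negative when tokens pruned
    return new_positions, delta


positions = torch.arange(4, 10)  # 6 tokens starting at position 4
keep = torch.tensor([True, True, False, False, True, True])
new_positions, delta = recompute_positions_after_pruning(positions, keep, 4)
```

Pruning two tokens shortens the position range by two, and the delta records how much later positions must shift.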
SupportsLoRA

Sources: vllm/model_executor/models/interfaces.py:514-535
Fields that a model must populate:
| Field | Type | Purpose |
|---|---|---|
| `embedding_modules` | `ClassVar[dict[str, str]]` | Maps embedding weight names to their shard names |
| `packed_modules_mapping` | `dict[str, list[str]]` | Maps fused weight names to their component shards |
| `lora_skip_prefixes` | `ClassVar[list[str]]` | Module name prefixes to skip during LoRA application |
| `is_3d_moe_weight` | `ClassVar[bool]` | MoE weight layout flag |
| `is_non_gated_moe` | `ClassVar[bool]` | Whether MoE uses non-gated routing |
Example from Qwen2_5_VLForConditionalGeneration:
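An illustrative sketch of the typical shape of these fields (assumed values, not copied verbatim from qwen2_5_vl.py):

```python
# Fused weights are mapped to the component shards that LoRA adapters
# target individually (e.g. a fused QKV projection splits into q/k/v).
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

# Embedding weight name -> shard name, used when a LoRA adapter touches
# the input or output embeddings.
embedding_modules = {
    "embed_tokens": "input_embeddings",
    "lm_head": "output_embeddings",
}
```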
Sources: vllm/model_executor/models/qwen2_5_vl.py:998-1002
SupportsMRoPE

Used by models that apply Multimodal Rotary Position Embedding — a 3D extension of RoPE where each token has separate position IDs along temporal, height, and width axes. The key method is `get_mrope_input_positions`, which returns a `(3, N)` tensor of position IDs and a `mrope_position_delta` integer.
Implementing classes: Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration, Ernie4_5_VLForConditionalGeneration, Glm4vForConditionalGeneration.
The computation in Qwen2_5_VLForConditionalGeneration.get_mrope_input_positions iterates over multimodal features sorted by their prompt offset, building per-modality grid position IDs using iter_mm_grid_thw.
Sources: vllm/model_executor/models/qwen2_5_vl.py:1056-1100
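A simplified sketch of the 3D position layout for one text/image/text prompt (an assumption-laden toy: the real `get_mrope_input_positions` also handles spatial merge size and video timestamp scaling via `iter_mm_grid_thw`):

```python
import torch


def toy_mrope_positions(num_text_before, t, h, w, num_text_after):
    parts = []
    # Text tokens: the same scalar position on all three axes.
    idx = torch.arange(num_text_before)
    parts.append(torch.stack([idx, idx, idx]))
    # Media grid: per-patch temporal / height / width indices, offset so
    # they start where the preceding text left off.
    base = num_text_before
    tt = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    hh = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    ww = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    parts.append(torch.stack([tt, hh, ww]) + base)
    # Trailing text resumes one past the maximum position used so far.
    nxt = int(parts[-1].max()) + 1
    idx = torch.arange(num_text_after) + nxt
    parts.append(torch.stack([idx, idx, idx]))
    return torch.cat(parts, dim=1)  # shape (3, N)


pos = toy_mrope_positions(num_text_before=3, t=1, h=2, w=2, num_text_after=2)
```

For 3 text tokens, a 1×2×2 image grid, and 2 trailing text tokens, this yields a `(3, 9)` tensor whose trailing text positions continue from the grid's maximum.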
SupportsTranscription

For encoder-decoder speech-to-text models (Whisper, Voxtral). These models take audio (mel-spectrogram features) as encoder input and produce transcribed text token-by-token in the decoder.
Implementing classes: WhisperForConditionalGeneration, VoxtralForConditionalGeneration, Qwen3OmniMoeForConditionalGeneration.
The interface works in conjunction with SpeechToTextConfig (from vllm.config) and ExplicitEncoderDecoderPrompt to support language-forced decoding and prompt-prefix injection for language identification.
Sources: vllm/model_executor/models/whisper.py:67-71, vllm/model_executor/models/voxtral.py:66
SupportsEagle3

Signals that a model exposes intermediate hidden states at configurable layer indices for use by EAGLE3 speculative decoding draft heads. Implementing classes include `LlavaForConditionalGeneration` and `Qwen2_5_VLForConditionalGeneration`.
Sources: vllm/model_executor/models/llava.py:57-62, vllm/model_executor/models/qwen2_5_vl.py:88-97
Input Schemas (TensorSchema)

Each model defines typed input schema classes that inherit from `TensorSchema` (in vllm/utils/tensor_schema.py). These use `Annotated` + `TensorShape` to declare expected tensor dimensions, and are used as discriminated unions (via a `type` field) to select the correct forward path.
Example schema classes:
| Model | Pixel input class | Embedding input class |
|---|---|---|
| LLaVA | LlavaImagePixelInputs | LlavaImageEmbeddingInputs |
| Pixtral | PixtralImagePixelInputs | — |
| Qwen2-VL | Qwen2VLImagePixelInputs, Qwen2VLVideoPixelInputs | Qwen2VLImageEmbeddingInputs, Qwen2VLVideoEmbeddingInputs |
| Qwen2.5-VL | Qwen2_5_VLImagePixelInputs, Qwen2_5_VLVideoPixelInputs | Qwen2_5_VLImageEmbeddingInputs, Qwen2_5_VLVideoEmbeddingInputs |
| MiniCPM-V | MiniCPMVImagePixelInputs | MiniCPMVImageEmbeddingInputs |
| Whisper | WhisperAudioInputs | — |
| Gemma3 | Gemma3ImagePixelInputs | — |
Models support embedding inputs (pre-encoded features) when users bypass the vision encoder and supply vectors directly, which enables encoder caching or external preprocessing.
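A toy sketch of the discriminated-union pattern (class and field names here are illustrative; the real classes inherit `TensorSchema` and annotate shapes with `TensorShape` rather than using `TypedDict`):

```python
from typing import Literal, TypedDict, Union

import torch


class ImagePixelInputs(TypedDict):
    type: Literal["pixel_values"]
    pixel_values: torch.Tensor    # raw patches destined for the vision tower


class ImageEmbeddingInputs(TypedDict):
    type: Literal["image_embeds"]
    image_embeds: torch.Tensor    # pre-encoded (num_tokens, hidden)


ImageInputs = Union[ImagePixelInputs, ImageEmbeddingInputs]


def embed_image(inputs: ImageInputs) -> torch.Tensor:
    # Dispatch the forward path on the `type` discriminator.
    if inputs["type"] == "pixel_values":
        # Stand-in for the vision tower + projector.
        return torch.zeros(inputs["pixel_values"].shape[0], 8)
    return inputs["image_embeds"]  # pre-encoded: bypass the encoder entirely
```

The embedding branch is what makes encoder caching and external preprocessing possible: pre-encoded features flow straight to the merge step without touching the tower.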
Sources: vllm/model_executor/models/llava.py:76-126, vllm/model_executor/models/qwen2_vl.py:121-239, vllm/model_executor/models/qwen2_5_vl.py:121-258, vllm/model_executor/models/whisper.py:89-101
Protocol Implementation Matrix
Sources: vllm/model_executor/models/llava.py:56-62, vllm/model_executor/models/pixtral.py:71-76, vllm/model_executor/models/qwen2_5_vl.py:988-997, vllm/model_executor/models/qwen3_vl.py:99-108, vllm/model_executor/models/whisper.py:67-71, vllm/model_executor/models/minicpmv.py:91-96, vllm/model_executor/models/gemma3_mm.py:41-45
LLaVA (llava.py)

Architecture: CLIP/SigLIP/Pixtral vision encoder → `LlavaMultiModalProjector` (two-layer MLP) → language model.
- `embed_multimodal` passes pixel values through the vision encoder then the projector.
- The projector (`LlavaMultiModalProjector`) uses two `ColumnParallelLinear`/`RowParallelLinear` layers with an activation between them.
- Supports both raw pixel inputs (`LlavaImagePixelInputs`) and pre-encoded embeddings (`LlavaImageEmbeddingInputs`).

Sources: vllm/model_executor/models/llava.py:129-161
Pixtral (pixtral.py)

Architecture: native Mistral vision transformer (`PixtralVisionModel`) with dynamic resolution (no fixed image size) → language model.
- Uses `PixtralProcessorAdapter` to wrap mistral_common's `ImageEncoder` in a HF-compatible interface.
- Uses 2D rotary position embeddings (`PixtralRotaryEmbedding`) in the vision attention layers.

Sources: vllm/model_executor/models/pixtral.py:105-122, vllm/model_executor/models/pixtral.py:124-208
These three model families share a vision architecture pattern:
pixel patches (flat) → PatchEmbed (Conv3d) → Transformer blocks → PatchMerger → LLM embeddings
Vision Encoder Stages (Qwen family)
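The first stage of the pipeline above can be sketched as a toy patch embed (assumed sizes; the real modules also handle flattened variable-resolution input and spatial merging):

```python
import torch
import torch.nn as nn


class ToyPatchEmbed(nn.Module):
    # A Conv3d whose kernel and stride both equal the patch size carves
    # the (temporal, height, width) volume into non-overlapping patches
    # and projects each one to the hidden dimension.
    def __init__(self, hidden=32, temporal_patch=2, patch=14, in_ch=3):
        super().__init__()
        self.proj = nn.Conv3d(
            in_ch, hidden,
            kernel_size=(temporal_patch, patch, patch),
            stride=(temporal_patch, patch, patch))

    def forward(self, pixels):   # (B, C, T, H, W)
        x = self.proj(pixels)    # (B, hidden, T', H', W')
        # Flatten the patch grid into a token sequence for the transformer.
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, hidden)


embed = ToyPatchEmbed()
tokens = embed(torch.randn(1, 3, 2, 28, 28))
```

A 2×28×28 input with 2×14×14 patches yields 1×2×2 = 4 patch tokens, which then flow through the transformer blocks and the patch merger.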
Key differences across the three:
| Model | Patch embed | Position encoding | Special features |
|---|---|---|---|
| `Qwen2VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE (h, w axes) | — |
| `Qwen2_5_VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE + window attention | Window-partitioned attention blocks; windowed vs. full attention chosen by layer index |
| `Qwen3_VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE + learned pos embed | `fast_pos_embed_interpolate` for variable resolution; deepstack feature extraction |
All three models implement SupportsMRoPE and compute 3D MRoPE position IDs for the combined token sequence (text + image patches + video frames), tracked via get_mrope_input_positions.
Sources: vllm/model_executor/models/qwen2_vl.py:525-694, vllm/model_executor/models/qwen2_5_vl.py:562-875, vllm/model_executor/models/qwen3_vl.py:316-618
Whisper (whisper.py)

Architecture: Convolutional audio encoder → cross-attention decoder.
- Implements `SupportsTranscription` in addition to `SupportsMultiModal`.
- `WhisperAudioInputs` wraps mel-spectrogram features as `input_features`.
- The encoder (`WhisperEncoder`) uses 1D convolutions to downsample the spectrogram, then self-attention transformer blocks with `WhisperEncoderAttention` (an `MMEncoderAttention` subclass supporting 2D tensors).
- The decoder uses cross-attention (`CrossAttention`) to attend to encoder outputs.
- `ExplicitEncoderDecoderPrompt` is used to pass the audio as encoder input while the decoder generates transcription text.

Sources: vllm/model_executor/models/whisper.py:89-101, vllm/model_executor/models/whisper.py:103-128
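The convolutional downsampling can be sketched as follows (sizes follow the published Whisper architecture; this is a stand-in for the stem of `WhisperEncoder`, not the vLLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyWhisperEncoderStem(nn.Module):
    # Two 1D convolutions over the mel-spectrogram; the second uses
    # stride 2, halving the time axis before the transformer blocks.
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, stride=2,
                               padding=1)

    def forward(self, mel):          # (B, n_mels, T)
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        return x.transpose(1, 2)     # (B, T // 2, hidden)


stem = ToyWhisperEncoderStem()
encoded = stem(torch.randn(1, 80, 10))
```

A 10-frame spectrogram becomes 5 encoder positions, each a hidden-size vector that the decoder later attends to via cross-attention.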
MiniCPM-V (minicpmv.py)

Architecture: `Idefics2VisionTransformer` → `Resampler2` (perceiver-style cross-attention) → language model.
- `Resampler2` compresses variable-length patch sequences into a fixed number of latent queries using cross-attention and 2D sinusoidal position embeddings.
- `MiniCPMVImagePixelInputs` batches images across slices using a combined `bns` (batch × images × slices) dimension.

Sources: vllm/model_executor/models/minicpmv.py:45-48, vllm/model_executor/models/minicpmv.py:91-97
All multimodal models are registered with the MULTIMODAL_REGISTRY via the @MULTIMODAL_REGISTRY.register_processor decorator. This decorator is applied to the model class itself and links three objects to it:
| Argument | Class | Role |
|---|---|---|
| First positional | `BaseMultiModalProcessor` subclass | Tokenizes and processes raw inputs; produces `MultiModalKwargsItems` |
| `info=` | `BaseProcessingInfo` subclass | Config queries (supported modalities, max tokens, image sizes) |
| `dummy_inputs=` | `BaseDummyInputsBuilder` subclass | Generates dummy inputs for CUDA graph capture and profiling |
The decorator stores the factory on the class as _processor_factory (a ClassVar), which the registry uses later to instantiate a processor per model configuration.
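A toy sketch of this registration mechanism (simplified: names like `_ProcessorFactory` are illustrative, and vLLM's real registry instantiates the processor per model configuration rather than eagerly):

```python
from dataclasses import dataclass


@dataclass
class _ProcessorFactory:
    processor_cls: type
    info_cls: type
    dummy_inputs_cls: type


class MultiModalRegistry:
    def register_processor(self, processor_cls, *, info, dummy_inputs):
        # Store the three collaborating classes on the model class itself,
        # so the registry can build a processor for it later.
        def wrapper(model_cls):
            model_cls._processor_factory = _ProcessorFactory(
                processor_cls, info, dummy_inputs)
            return model_cls
        return wrapper


MULTIMODAL_REGISTRY = MultiModalRegistry()


class MyProcessor: ...
class MyProcessingInfo: ...
class MyDummyInputsBuilder: ...


@MULTIMODAL_REGISTRY.register_processor(
    MyProcessor, info=MyProcessingInfo, dummy_inputs=MyDummyInputsBuilder)
class MyModel: ...
```

Because the decorator returns the class unchanged, registration is a pure side effect: the model class gains a `_processor_factory` attribute and is otherwise untouched.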