This page documents the Python Protocol classes that multimodal models must implement in vLLM, how they signal capabilities, and how they embed multimodal inputs into the language model's representation space. It covers SupportsMultiModal and related protocols (SupportsLoRA, SupportsPP, SupportsMRoPE, SupportsTranscription, SupportsMultiModalPruning, SupportsEagle3, SupportsQuant), with concrete examples from LLaVA, Pixtral, Qwen2-VL, Qwen3-VL, Whisper, and MiniCPM-V.
For the preprocessing pipeline that converts raw images/audio/video into tensor kwargs before they reach the model, see Multimodal Processing Pipeline. For how the model runner invokes encoder passes, see GPU Model Runner.
All protocols are defined in vllm/model_executor/models/interfaces.py and are decorated with @runtime_checkable. Membership is checked via isinstance at runtime rather than requiring explicit inheritance.
| Protocol | Class-variable sentinel | Key method(s) |
|---|---|---|
| `SupportsMultiModal` | `supports_multimodal = True` | `embed_multimodal`, `embed_input_ids`, `get_language_model` |
| `SupportsMultiModalPruning` | `supports_multimodal_pruning = True` | `recompute_mrope_positions` |
| `SupportsLoRA` | `supports_lora = True` | (data fields: `embedding_modules`, `packed_modules_mapping`) |
| `SupportsPP` | `supports_pipeline_parallel = True` | (pipeline stage splitting) |
| `SupportsMRoPE` | `supports_mrope = True` | `get_mrope_input_positions` |
| `SupportsTranscription` | `supports_transcription = True` | (encoder-decoder language identification) |
| `SupportsEagle3` | `supports_eagle3 = True` | (intermediate hidden-state hooks for draft generation) |
| `SupportsQuant` | `supports_quantization = True` | (quantization method compatibility) |
| `SupportsScoreTemplate` | `supports_score_template = True` | `get_score_template`, `post_process_tokens` |
Helper functions such as `supports_multimodal(model)`, `supports_multimodal_pruning(model)`, etc. are provided for safe `isinstance`-style checks that return `TypeIs`-narrowed types.
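A minimal sketch of this pattern (simplified: the sketch's helper returns a plain `bool` and is renamed to avoid shadowing the protocol, whereas vLLM's helper is named `supports_multimodal` and returns a `TypeIs`-narrowed result):

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class SupportsMultiModal(Protocol):
    # Class-variable sentinel: membership is structural, not inherited.
    supports_multimodal: ClassVar[Literal[True]]


def supports_multimodal_check(model: object) -> bool:
    # isinstance on a @runtime_checkable Protocol only checks that the
    # declared members are present on the object.
    return isinstance(model, SupportsMultiModal)


class LlavaLike:
    supports_multimodal: ClassVar[Literal[True]] = True


class TextOnly:
    pass
```

Because the check is structural, a model never needs to inherit from the protocol; declaring the sentinel attribute is enough for `supports_multimodal_check(LlavaLike())` to return `True` while `supports_multimodal_check(TextOnly())` returns `False`.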
Sources: vllm/model_executor/models/interfaces.py:87-467
SupportsMultiModal — The Core Interface

`SupportsMultiModal` is a `@runtime_checkable` `Protocol` that all models capable of processing non-text inputs must implement.
Sources: vllm/model_executor/models/interfaces.py:87-116
| Flag | Type | Default | Effect |
|---|---|---|---|
| `supports_multimodal` | `ClassVar[Literal[True]]` | `True` | Marks model as multimodal (set by base class) |
| `supports_multimodal_raw_input_only` | `ClassVar[bool]` | `False` | Model consumes raw media, never pre-encoded embeddings |
| `supports_encoder_tp_data` | `ClassVar[bool]` | `False` | Model supports `mm_encoder_tp_mode="data"` |
| `requires_raw_input_tokens` | `ClassVar[bool]` | `False` | Forward pass needs raw token IDs, not embeddings |
supports_encoder_tp_data = True is set explicitly in Qwen2_5_VLForConditionalGeneration and Qwen3-VL to enable data-parallel ViT execution over tensor-parallel groups.
Sources: vllm/model_executor/models/interfaces.py:91-116, vllm/model_executor/models/qwen2_5_vl.py:1016
During __init__, models use context managers provided by SupportsMultiModal to annotate which submodules are the language model and which are the media tower (vision/audio encoder). This information is used for:
- `--mm-encoder-only` mode: language model layers are replaced with `StageMissingLayer` stubs.
- `--limit-mm-per-prompt 0` mode: the tower itself is stubbed out when the modality count is zero.

Component Marking Diagram
The three context managers available are:
| Context Manager | Purpose |
|---|---|
| `_mark_language_model(vllm_config, targets=None)` | Mark submodules assigned in this block as language model components |
| `_mark_tower_model(vllm_config, modalities, targets=None)` | Mark submodules as vision/audio tower components for given modalities |
| `_mark_composite_model(vllm_config, language_targets, tower_targets)` | Convenience wrapper combining both of the above |
Sources: vllm/model_executor/models/interfaces.py:189-297
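A toy sketch of how such marking can work (assumed mechanics: the real context managers also take `vllm_config` and optional `targets`; here, attributes assigned on `self` inside each `with` block are simply recorded by name):

```python
import contextlib


class ComponentMarker:
    def __init__(self):
        self.language_model_components: list[str] = []
        self.tower_components: dict[str, list[str]] = {}

    @contextlib.contextmanager
    def _mark_language_model(self):
        # Snapshot attribute names before the block; anything new afterwards
        # was assigned inside the `with` and belongs to the language model.
        before = set(vars(self))
        yield
        self.language_model_components += sorted(set(vars(self)) - before)

    @contextlib.contextmanager
    def _mark_tower_model(self, modalities):
        before = set(vars(self))
        yield
        new = sorted(set(vars(self)) - before)
        for modality in modalities:
            self.tower_components.setdefault(modality, []).extend(new)


class ToyVLM(ComponentMarker):
    def __init__(self):
        super().__init__()
        with self._mark_tower_model(["image"]):
            self.vision_tower = "vit"       # stand-in for the encoder module
        with self._mark_language_model():
            self.language_model = "llm"     # stand-in for the LM backbone
```

With this bookkeeping, a runner can later stub out either group of submodules by name, which is the behavior the two modes above rely on.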
MultiModalEmbeddings Type

Sources: vllm/model_executor/models/interfaces.py:57-64
Returned by `embed_multimodal`. Accepted formats:

- A list or tuple of 2D tensors, one per multimodal data item (the feature length may differ per item).
- A single 3D tensor, batching the per-item 2D tensors along the first dimension.
The ordering must match the order in which the corresponding media items appear in the prompt.
The model runner calls into the model in two stages per forward pass:
1. `embed_multimodal(**kwargs)` — encodes raw pixel values / audio features through the tower and returns `MultiModalEmbeddings`.
2. `embed_input_ids(input_ids, multimodal_embeddings, is_multimodal=...)` — applies the token embedding table to text tokens, then scatters multimodal embeddings into the positions indicated by the `is_multimodal` boolean mask.

Embedding Merge Sequence Diagram
The default implementation of embed_input_ids in SupportsMultiModal handles the scatter via _merge_multimodal_embeddings from vllm/model_executor/models/utils.py. Models with out-of-vocabulary multimodal token IDs can set handle_oov_mm_token=True to skip calling the LM embedding table for those positions.
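The scatter step can be sketched as follows (a minimal stand-in for `_merge_multimodal_embeddings`, assuming already-flattened 2D inputs; the real helper also handles the OOV-token path):

```python
import torch


def merge_multimodal_embeddings(
    text_embeds: torch.Tensor,    # (num_tokens, hidden) from the LM table
    mm_embeds: torch.Tensor,      # (num_mm_tokens, hidden), in prompt order
    is_multimodal: torch.Tensor,  # (num_tokens,) bool mask of mm positions
) -> torch.Tensor:
    if int(is_multimodal.sum()) != mm_embeds.shape[0]:
        raise ValueError("mask count must match number of mm embeddings")
    # Overwrite masked positions with the multimodal embeddings, in order.
    merged = text_embeds.clone()
    merged[is_multimodal] = mm_embeds
    return merged


text = torch.zeros(5, 4)
image = torch.ones(2, 4)
mask = torch.tensor([False, True, True, False, False])
out = merge_multimodal_embeddings(text, image, mask)
```

Here positions 1 and 2 (the image placeholder tokens) end up holding the image embeddings while the text positions keep their LM-table embeddings.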
Sources: vllm/model_executor/models/interfaces.py:140-387
SupportsMultiModalPruning

Sources: vllm/model_executor/models/interfaces.py:389-426
For models that support dynamic pruning of multimodal embeddings at prefill time (e.g., Efficient Video Sampling / EVS in Qwen2.5-VL and Qwen3-VL). Requires implementing:
```
recompute_mrope_positions(
    input_ids, multimodal_embeddings, mrope_positions, num_computed_tokens
) -> (MultiModalEmbeddings, mrope_positions, mrope_position_delta)
```
When tokens are pruned, the MRoPE position IDs for the remaining sequence must be recalculated. This method updates the positions starting at num_computed_tokens to reflect pruning.
Implementing classes: Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration.
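A toy 1D analogue of the recomputation (an assumption-laden sketch: the real method returns `(3, N)` temporal/height/width IDs, and `mrope_position_delta` is used to offset later decode-time positions):

```python
import torch


def recompute_positions_after_pruning(positions, keep_mask, num_computed_tokens):
    # Drop pruned tokens, then give the survivors consecutive position IDs
    # starting at num_computed_tokens, so decode continues without gaps.
    kept = positions[keep_mask]
    new_positions = torch.arange(
        num_computed_tokens, num_computed_tokens + kept.numel())
    delta = kept.numel() - positions.numel()  # negative when tokens pruned
    return new_positions, delta


positions = torch.arange(4, 10)  # 6 tokens starting at position 4
keep = torch.tensor([True, True, False, False, True, True])
new_positions, delta = recompute_positions_after_pruning(positions, keep, 4)
```

Pruning two tokens shortens the position range by two, and the delta records how much later positions must shift.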
SupportsLoRA

Sources: vllm/model_executor/models/interfaces.py:514-535
Fields that a model must populate:
| Field | Type | Purpose |
|---|---|---|
| `embedding_modules` | `ClassVar[dict[str, str]]` | Maps embedding weight names to their shard names |
| `packed_modules_mapping` | `dict[str, list[str]]` | Maps fused weight names to their component shards |
| `lora_skip_prefixes` | `ClassVar[list[str]]` | Module name prefixes to skip during LoRA application |
| `is_3d_moe_weight` | `ClassVar[bool]` | MoE weight layout flag |
| `is_non_gated_moe` | `ClassVar[bool]` | Whether MoE uses non-gated routing |
Example from Qwen2_5_VLForConditionalGeneration:
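An illustrative sketch of the typical shape of these fields (assumed values, not copied verbatim from qwen2_5_vl.py):

```python
# Fused weights are mapped to the component shards that LoRA adapters
# target individually (e.g. a fused QKV projection splits into q/k/v).
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

# Embedding weight name -> shard name, used when a LoRA adapter touches
# the input or output embeddings.
embedding_modules = {
    "embed_tokens": "input_embeddings",
    "lm_head": "output_embeddings",
}
```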
Sources: vllm/model_executor/models/qwen2_5_vl.py:998-1002
SupportsMRoPE

Used by models that apply Multimodal Rotary Position Embedding — a 3D extension of RoPE where each token has separate position IDs along temporal, height, and width axes. The key method is `get_mrope_input_positions`, which returns a `(3, N)` tensor of position IDs and a `mrope_position_delta` integer.
Implementing classes: Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration, Ernie4_5_VLForConditionalGeneration, Glm4vForConditionalGeneration.
The computation in Qwen2_5_VLForConditionalGeneration.get_mrope_input_positions iterates over multimodal features sorted by their prompt offset, building per-modality grid position IDs using iter_mm_grid_thw.
Sources: vllm/model_executor/models/qwen2_5_vl.py:1056-1100
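A simplified sketch of the 3D position layout for one text/image/text prompt (an assumption-laden toy: the real `get_mrope_input_positions` also handles spatial merge size and video timestamp scaling via `iter_mm_grid_thw`):

```python
import torch


def toy_mrope_positions(num_text_before, t, h, w, num_text_after):
    parts = []
    # Text tokens: the same scalar position on all three axes.
    idx = torch.arange(num_text_before)
    parts.append(torch.stack([idx, idx, idx]))
    # Media grid: per-patch temporal / height / width indices, offset so
    # they start where the preceding text left off.
    base = num_text_before
    tt = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    hh = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    ww = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    parts.append(torch.stack([tt, hh, ww]) + base)
    # Trailing text resumes one past the maximum position used so far.
    nxt = int(parts[-1].max()) + 1
    idx = torch.arange(num_text_after) + nxt
    parts.append(torch.stack([idx, idx, idx]))
    return torch.cat(parts, dim=1)  # shape (3, N)


pos = toy_mrope_positions(num_text_before=3, t=1, h=2, w=2, num_text_after=2)
```

For 3 text tokens, a 1×2×2 image grid, and 2 trailing text tokens, this yields a `(3, 9)` tensor whose trailing text positions continue from the grid's maximum.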
SupportsTranscription

For encoder-decoder speech-to-text models (Whisper, Voxtral). These models take audio (mel-spectrogram features) as encoder input and produce transcribed text token-by-token in the decoder.
Implementing classes: WhisperForConditionalGeneration, VoxtralForConditionalGeneration, Qwen3OmniMoeForConditionalGeneration.
The interface works in conjunction with SpeechToTextConfig (from vllm.config) and ExplicitEncoderDecoderPrompt to support language-forced decoding and prompt-prefix injection for language identification.
Sources: vllm/model_executor/models/whisper.py:67-71, vllm/model_executor/models/voxtral.py:66
SupportsEagle3

Signals that a model exposes intermediate hidden states at configurable layer indices for use by EAGLE3 speculative decoding draft heads. Implementing classes include `LlavaForConditionalGeneration` and `Qwen2_5_VLForConditionalGeneration`.
Sources: vllm/model_executor/models/llava.py:57-62, vllm/model_executor/models/qwen2_5_vl.py:88-97
Input Schemas (TensorSchema)

Each model defines typed input schema classes that inherit from `TensorSchema` (in vllm/utils/tensor_schema.py). These use `Annotated` + `TensorShape` to declare expected tensor dimensions, and are used as discriminated unions (via a `type` field) to select the correct forward path.
Example schema classes:
| Model | Pixel input class | Embedding input class |
|---|---|---|
| LLaVA | LlavaImagePixelInputs | LlavaImageEmbeddingInputs |
| Pixtral | PixtralImagePixelInputs | — |
| Qwen2-VL | Qwen2VLImagePixelInputs, Qwen2VLVideoPixelInputs | Qwen2VLImageEmbeddingInputs, Qwen2VLVideoEmbeddingInputs |
| Qwen2.5-VL | Qwen2_5_VLImagePixelInputs, Qwen2_5_VLVideoPixelInputs | Qwen2_5_VLImageEmbeddingInputs, Qwen2_5_VLVideoEmbeddingInputs |
| MiniCPM-V | MiniCPMVImagePixelInputs | MiniCPMVImageEmbeddingInputs |
| Whisper | WhisperAudioInputs | — |
| Gemma3 | Gemma3ImagePixelInputs | — |
Models support embedding inputs (pre-encoded features) when users bypass the vision encoder and supply vectors directly, which enables encoder caching or external preprocessing.
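A toy sketch of the discriminated-union pattern (class and field names here are illustrative; the real classes inherit `TensorSchema` and annotate shapes with `TensorShape` rather than using `TypedDict`):

```python
from typing import Literal, TypedDict, Union

import torch


class ImagePixelInputs(TypedDict):
    type: Literal["pixel_values"]
    pixel_values: torch.Tensor    # raw patches destined for the vision tower


class ImageEmbeddingInputs(TypedDict):
    type: Literal["image_embeds"]
    image_embeds: torch.Tensor    # pre-encoded (num_tokens, hidden)


ImageInputs = Union[ImagePixelInputs, ImageEmbeddingInputs]


def embed_image(inputs: ImageInputs) -> torch.Tensor:
    # Dispatch the forward path on the `type` discriminator.
    if inputs["type"] == "pixel_values":
        # Stand-in for the vision tower + projector.
        return torch.zeros(inputs["pixel_values"].shape[0], 8)
    return inputs["image_embeds"]  # pre-encoded: bypass the encoder entirely
```

The embedding branch is what makes encoder caching and external preprocessing possible: pre-encoded features flow straight to the merge step without touching the tower.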
Sources: vllm/model_executor/models/llava.py:76-126, vllm/model_executor/models/qwen2_vl.py:121-239, vllm/model_executor/models/qwen2_5_vl.py:121-258, vllm/model_executor/models/whisper.py:89-101
Protocol Implementation Matrix
Sources: vllm/model_executor/models/llava.py:56-62, vllm/model_executor/models/pixtral.py:71-76, vllm/model_executor/models/qwen2_5_vl.py:988-997, vllm/model_executor/models/qwen3_vl.py:99-108, vllm/model_executor/models/whisper.py:67-71, vllm/model_executor/models/minicpmv.py:91-96, vllm/model_executor/models/gemma3_mm.py:41-45
LLaVA (llava.py)

Architecture: CLIP/SigLIP/Pixtral vision encoder → `LlavaMultiModalProjector` (two-layer MLP) → language model.
- `embed_multimodal` passes pixel values through the vision encoder then the projector.
- The projector (`LlavaMultiModalProjector`) uses two `ColumnParallelLinear`/`RowParallelLinear` layers with an activation between them.
- Supports both raw pixel inputs (`LlavaImagePixelInputs`) and pre-encoded embeddings (`LlavaImageEmbeddingInputs`).

Sources: vllm/model_executor/models/llava.py:129-161
Pixtral (pixtral.py)

Architecture: native Mistral vision transformer (`PixtralVisionModel`) with dynamic resolution (no fixed image size) → language model.
- Uses `PixtralProcessorAdapter` to wrap mistral_common's `ImageEncoder` in a HF-compatible interface.
- Uses 2D rotary position embeddings (`PixtralRotaryEmbedding`) in the vision attention layers.

Sources: vllm/model_executor/models/pixtral.py:105-122, vllm/model_executor/models/pixtral.py:124-208
These three model families share a vision architecture pattern:
pixel patches (flat) → PatchEmbed (Conv3d) → Transformer blocks → PatchMerger → LLM embeddings
Vision Encoder Stages (Qwen family)
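The first stage of the pipeline above can be sketched as a toy patch embed (assumed sizes; the real modules also handle flattened variable-resolution input and spatial merging):

```python
import torch
import torch.nn as nn


class ToyPatchEmbed(nn.Module):
    # A Conv3d whose kernel and stride both equal the patch size carves
    # the (temporal, height, width) volume into non-overlapping patches
    # and projects each one to the hidden dimension.
    def __init__(self, hidden=32, temporal_patch=2, patch=14, in_ch=3):
        super().__init__()
        self.proj = nn.Conv3d(
            in_ch, hidden,
            kernel_size=(temporal_patch, patch, patch),
            stride=(temporal_patch, patch, patch))

    def forward(self, pixels):   # (B, C, T, H, W)
        x = self.proj(pixels)    # (B, hidden, T', H', W')
        # Flatten the patch grid into a token sequence for the transformer.
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, hidden)


embed = ToyPatchEmbed()
tokens = embed(torch.randn(1, 3, 2, 28, 28))
```

A 2×28×28 input with 2×14×14 patches yields 1×2×2 = 4 patch tokens, which then flow through the transformer blocks and the patch merger.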
Key differences across the three:
| Model | Patch embed | Position encoding | Special features |
|---|---|---|---|
| `Qwen2VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE (h, w axes) | — |
| `Qwen2_5_VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE + window attention | Window-partitioned attention blocks; windowed vs. full attention chosen by layer index |
| `Qwen3_VisionTransformer` | `Conv3dLayer` (t, h, w) | 2D RoPE + learned pos embed | `fast_pos_embed_interpolate` for variable resolution; deepstack feature extraction |
All three models implement SupportsMRoPE and compute 3D MRoPE position IDs for the combined token sequence (text + image patches + video frames), tracked via get_mrope_input_positions.
Sources: vllm/model_executor/models/qwen2_vl.py:525-694, vllm/model_executor/models/qwen2_5_vl.py:562-875, vllm/model_executor/models/qwen3_vl.py:316-618
Whisper (whisper.py)

Architecture: Convolutional audio encoder → cross-attention decoder.
- Implements `SupportsTranscription` in addition to `SupportsMultiModal`.
- `WhisperAudioInputs` wraps mel-spectrogram features as `input_features`.
- The encoder (`WhisperEncoder`) uses 1D convolutions to downsample the spectrogram, then self-attention transformer blocks with `WhisperEncoderAttention` (an `MMEncoderAttention` subclass supporting 2D tensors).
- The decoder uses cross-attention (`CrossAttention`) to attend to encoder outputs.
- `ExplicitEncoderDecoderPrompt` is used to pass the audio as encoder input while the decoder generates transcription text.

Sources: vllm/model_executor/models/whisper.py:89-101, vllm/model_executor/models/whisper.py:103-128
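The convolutional downsampling can be sketched as follows (sizes follow the published Whisper architecture; this is a stand-in for the stem of `WhisperEncoder`, not the vLLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyWhisperEncoderStem(nn.Module):
    # Two 1D convolutions over the mel-spectrogram; the second uses
    # stride 2, halving the time axis before the transformer blocks.
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, stride=2,
                               padding=1)

    def forward(self, mel):          # (B, n_mels, T)
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        return x.transpose(1, 2)     # (B, T // 2, hidden)


stem = ToyWhisperEncoderStem()
encoded = stem(torch.randn(1, 80, 10))
```

A 10-frame spectrogram becomes 5 encoder positions, each a hidden-size vector that the decoder later attends to via cross-attention.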
MiniCPM-V (minicpmv.py)

Architecture: `Idefics2VisionTransformer` → `Resampler2` (perceiver-style cross-attention) → language model.
- `Resampler2` compresses variable-length patch sequences into a fixed number of latent queries using cross-attention and 2D sinusoidal position embeddings.
- `MiniCPMVImagePixelInputs` batches images across slices using a combined `bns` (batch × images × slices) dimension.

Sources: vllm/model_executor/models/minicpmv.py:45-48, vllm/model_executor/models/minicpmv.py:91-97
All multimodal models are registered with the MULTIMODAL_REGISTRY via the @MULTIMODAL_REGISTRY.register_processor decorator. This decorator is applied to the model class itself and links three objects to it:
| Argument | Class | Role |
|---|---|---|
| First positional | `BaseMultiModalProcessor` subclass | Tokenizes and processes raw inputs; produces `MultiModalKwargsItems` |
| `info=` | `BaseProcessingInfo` subclass | Config queries (supported modalities, max tokens, image sizes) |
| `dummy_inputs=` | `BaseDummyInputsBuilder` subclass | Generates dummy inputs for CUDA graph capture and profiling |
The decorator stores the factory on the class as _processor_factory (a ClassVar), which the registry uses later to instantiate a processor per model configuration.
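A toy sketch of this registration mechanism (simplified: names like `_ProcessorFactory` are illustrative, and vLLM's real registry instantiates the processor per model configuration rather than eagerly):

```python
from dataclasses import dataclass


@dataclass
class _ProcessorFactory:
    processor_cls: type
    info_cls: type
    dummy_inputs_cls: type


class MultiModalRegistry:
    def register_processor(self, processor_cls, *, info, dummy_inputs):
        # Store the three collaborating classes on the model class itself,
        # so the registry can build a processor for it later.
        def wrapper(model_cls):
            model_cls._processor_factory = _ProcessorFactory(
                processor_cls, info, dummy_inputs)
            return model_cls
        return wrapper


MULTIMODAL_REGISTRY = MultiModalRegistry()


class MyProcessor: ...
class MyProcessingInfo: ...
class MyDummyInputsBuilder: ...


@MULTIMODAL_REGISTRY.register_processor(
    MyProcessor, info=MyProcessingInfo, dummy_inputs=MyDummyInputsBuilder)
class MyModel: ...
```

Because the decorator returns the class unchanged, registration is a pure side effect: the model class gains a `_processor_factory` attribute and is otherwise untouched.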