This page covers how vLLM identifies, loads, and configures model architectures at startup — from reading the architectures field in a model's config.json all the way to instantiating a torch.nn.Module with weights. It also covers the multimodal processing pipeline and the plugin system for adding external model architectures.
For detailed treatment of individual subsystems, refer to:

- For engine-level configuration of the model (dtype, max_model_len, quantization), see page 2.2.
When a user specifies a model (e.g., LLM(model="meta-llama/Llama-3.2-1B-Instruct")), vLLM must read the model's config.json to extract the architectures field, resolve the architecture name to an implementation class, and instantiate that class with the model's weights.

All native vLLM model implementations live under vllm/model_executor/models/. Models supported only through the Hugging Face Transformers backend are handled by vllm/model_executor/models/transformers/.
The central registry is defined in vllm/model_executor/models/registry.py. Architecture names (strings that appear in config.json) are mapped to (module, class) tuples. There are four primary groupings:
| Dictionary | Task Type | Example Entry |
|---|---|---|
| _TEXT_GENERATION_MODELS | Decoder-only causal LM | "LlamaForCausalLM": ("llama", "LlamaForCausalLM") |
| _EMBEDDING_MODELS | Embedding / pooling | "BertModel": ("bert", "BertEmbeddingModel") |
| _CROSS_ENCODER_MODELS | Sequence classification / reranking | "BertForSequenceClassification": ("bert", "BertForSequenceClassification") |
| _MULTIMODAL_MODELS | Vision-language, audio-language | "LlavaForConditionalGeneration": ("llava", "LlavaForConditionalGeneration") |
Each tuple (module, class) refers to a submodule inside vllm.model_executor.models and the class within it. Imports are lazy — the module is only imported when the architecture is actually needed.
Architecture name aliasing is common. For example, several architecture strings all resolve to LlamaForCausalLM (e.g., AquilaModel, InternLMForCausalLM, InternLM3ForCausalLM, CwmForCausalLM, XverseForCausalLM). This allows vLLM to support fine-tuned variants that inherit Llama's architecture without requiring separate implementations.
Sources: vllm/model_executor/models/registry.py:70-512
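The registry-with-aliases pattern described above can be sketched as follows. This is a hypothetical miniature, not vLLM's actual code: the dictionary entries are taken from the text, while resolve_architecture is an illustrative stand-in for the registry's lookup logic.

```python
# Miniature of the architecture registry: config.json architecture strings
# map to (module, class) tuples. Several alias strings share one target,
# so Llama-derived fine-tunes need no separate implementation.
_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    # Fine-tuned variants that reuse Llama's architecture:
    "AquilaModel": ("llama", "LlamaForCausalLM"),
    "InternLMForCausalLM": ("llama", "LlamaForCausalLM"),
}


def resolve_architecture(arch: str) -> tuple[str, str]:
    """Map an architecture string from config.json to its (module, class).

    In vLLM, the module (under vllm.model_executor.models) is only imported
    once the architecture is actually requested, keeping startup lazy.
    """
    try:
        return _TEXT_GENERATION_MODELS[arch]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {arch}") from None


# An alias and the canonical name resolve to the same implementation.
assert resolve_architecture("AquilaModel") == resolve_architecture("LlamaForCausalLM")
```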
Model resolution flow — from architecture name to class
Sources: vllm/model_executor/models/registry.py:70-512, vllm/model_executor/model_loader/utils.py:35-70
Before the registry is consulted, vLLM must parse the model's configuration. This is handled by vllm/transformers_utils/config.py.
vLLM supports two config formats:
| Format | Parser Class | Config File |
|---|---|---|
"hf" | HFConfigParser | config.json via transformers.AutoConfig |
"mistral" | MistralConfigParser | params.json (Mistral native format) |
get_config_parser(config_format) returns the appropriate parser. Additional parsers can be registered via register_config_parser(config_format).
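The parser-registry pattern can be sketched as below. This is a minimal illustration of the mechanism, assuming a decorator-style registration; the real implementation in vllm/transformers_utils/config.py differs in detail, and the class bodies here are stand-ins.

```python
# Registry mapping a config format name ("hf", "mistral", ...) to a parser.
_CONFIG_PARSERS: dict[str, type] = {}


def register_config_parser(config_format: str):
    """Decorator registering a parser class under a format name."""
    def wrap(cls: type) -> type:
        _CONFIG_PARSERS[config_format] = cls
        return cls
    return wrap


def get_config_parser(config_format: str):
    """Return a parser instance for the requested format."""
    try:
        return _CONFIG_PARSERS[config_format]()
    except KeyError:
        raise ValueError(f"Unknown config format: {config_format}") from None


@register_config_parser("hf")
class HFConfigParser:
    config_file = "config.json"   # parsed via transformers.AutoConfig


@register_config_parser("mistral")
class MistralConfigParser:
    config_file = "params.json"   # Mistral native format


assert get_config_parser("mistral").config_file == "params.json"
```

Additional formats plug in the same way: defining a new class with the decorator makes it reachable through get_config_parser without touching the dispatch logic.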
When model_type in config.json is not recognized by the standard Transformers AutoConfig, vLLM checks its own _CONFIG_REGISTRY (a LazyConfigDict) defined at vllm/transformers_utils/config.py:73-111. This maps model_type strings to custom config classes in vllm/transformers_utils/configs/.
Examples:
- "chatglm" → ChatGLMConfig
- "RefinedWeb" → RWConfig (original Falcon models)
- "kimi_vl" → KimiVLConfig

The LazyConfigDict defers actual imports until access, keeping startup overhead low.
Many older models use non-standard field names for rotary position embedding parameters (rope_theta, rotary_emb_base, etc.). The patch_rope_parameters function in vllm/transformers_utils/config.py:316-365 normalizes these fields to the standard rope_parameters format expected by Transformers v5. It also handles legacy rope_type aliases like "su" → "longrope" and "mrope" → "default".
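The normalization can be sketched roughly as follows. This is a hedged approximation built only from the field names and aliases mentioned above; patch_rope_parameters itself covers more cases and edge conditions.

```python
# Sketch: fold legacy rope fields and rope_type aliases into a single
# rope_parameters dict. Field names follow the text above; the real
# function in vllm/transformers_utils/config.py is more thorough.
_LEGACY_THETA_KEYS = ("rope_theta", "rotary_emb_base")
_ROPE_TYPE_ALIASES = {"su": "longrope", "mrope": "default"}


def normalize_rope(config: dict) -> dict:
    """Return a normalized rope_parameters dict from a raw config dict."""
    rope = dict(config.get("rope_parameters") or {})

    # Migrate legacy theta field names into the canonical key.
    for key in _LEGACY_THETA_KEYS:
        if key in config and "rope_theta" not in rope:
            rope["rope_theta"] = config[key]

    # Resolve legacy rope_type aliases ("su" → "longrope", etc.).
    rope_type = rope.get("rope_type", "default")
    rope["rope_type"] = _ROPE_TYPE_ALIASES.get(rope_type, rope_type)
    return rope


assert normalize_rope({"rotary_emb_base": 10000.0})["rope_theta"] == 10000.0
assert normalize_rope({"rope_parameters": {"rope_type": "su"}})["rope_type"] == "longrope"
```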
The _CONFIG_ATTRS_MAPPING dict remaps config field names to canonical names expected by vLLM (e.g., llm_config → text_config). This is applied via _maybe_remap_hf_config_attrs after config loading.
Sources: vllm/transformers_utils/config.py:63-240, vllm/transformers_utils/configs/__init__.py:1-70
Configuration loading flow
Sources: vllm/transformers_utils/config.py:132-240
For models not explicitly registered in vLLM's registry, vLLM falls back to the Transformers modeling backend in vllm/model_executor/models/transformers/. This backend wraps any HuggingFace Transformers model that satisfies a compatibility check (is_backend_compatible()).
To be compatible, the model's base class must:
- Pass kwargs down to attention layers so the attention implementation can be swapped.
- Compute attention via ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation].
- Set _supports_attention_backend = True on the model class.

When loaded through this backend, vLLM sets config._attn_implementation = "vllm" so vLLM's paged attention kernels are used instead of the default HF attention.
The Transformers backend supports encoder-only, decoder-only, and MoE architectures, and integrates with tensor parallelism via base_model_tp_plan and pipeline parallelism via base_model_pp_plan config attributes.
To force use of this backend even for natively-supported models: LLM(model=..., model_impl="transformers").
Architectures not supported by vLLM natively or the Transformers backend can be added via vLLM's plugin system. For example, the bart-plugin repository adds support for BartForConditionalGeneration and Florence2ForConditionalGeneration as separate packages.
Sources: docs/models/supported_models.md:16-186, vllm/model_executor/models/registry.py:514-900
vLLM uses Python Protocol classes and class-level flags to express model capabilities. These are defined in vllm/model_executor/models/interfaces.py.
| Protocol / Flag | Purpose | Checked by |
|---|---|---|
| SupportsMultiModal | Model accepts image/audio/video inputs | supports_multimodal() |
| SupportsLoRA | Model supports LoRA adapter injection | Registry / LoRA subsystem |
| SupportsPP | Model supports pipeline parallelism | supports_pp() |
| SupportsTranscription | Model supports audio transcription | supports_transcription() |
| has_inner_state | Model has recurrent state (e.g., Mamba) | Scheduler / cache management |
| is_attention_free | Model has no attention layers | Attention backend selection |
| is_hybrid | Model mixes attention and SSM layers | is_hybrid() |
| supports_mamba_prefix_caching | Mamba model supports prefix caching | KV cache |
These flags are read by vllm/model_executor/models/registry.py via imports from interfaces.py and interfaces_base.py. The ModelRegistry uses them to validate that a requested feature (e.g., LoRA) is supported before proceeding.
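The flag-checking pattern can be sketched as follows. The class and function here are illustrative stand-ins under the naming from the table above, not vLLM's actual definitions from interfaces.py.

```python
from typing import ClassVar, Literal

# Capability is expressed as a class-level marker; helper functions check
# it on the *class*, so no model needs to be instantiated to validate a
# requested feature.
class SupportsMultiModal:
    supports_multimodal: ClassVar[Literal[True]] = True


def supports_multimodal(model_cls: type) -> bool:
    """Check the presence marker on a model class (sketch)."""
    return getattr(model_cls, "supports_multimodal", False) is True


class LlavaLike(SupportsMultiModal):
    """Hypothetical multimodal model class."""


class TextOnly:
    """Hypothetical text-only model class."""


assert supports_multimodal(LlavaLike)
assert not supports_multimodal(TextOnly)
```

Checking the class rather than an instance is what lets the registry reject unsupported feature combinations (e.g., LoRA on a model without SupportsLoRA) before any weights are loaded.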
SupportsMultiModal in Detail

Any model that handles images, audio, or video must implement SupportsMultiModal. Key members:
- supports_multimodal: ClassVar[Literal[True]] — presence marker
- _processor_factory: ClassVar[_ProcessorFactories] — set by MultiModalRegistry.register_processor
- embed_multimodal(**kwargs) -> MultiModalEmbeddings — returns embeddings for multimodal inputs
- get_language_model() -> VllmModel — returns the text backbone
- get_placeholder_str(modality, i) -> str | None — returns the placeholder token string for the i-th multimodal item

Sources: vllm/model_executor/models/interfaces.py:87-200
Model capability resolution
Sources: vllm/model_executor/models/interfaces.py:87-300, vllm/model_executor/models/registry.py:44-67
Multimodal models in vLLM follow a consistent structure, though each model may customize each stage.
A typical multimodal model (e.g., LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration) is composed of:
The get_language_model() method on SupportsMultiModal returns the language model component. The embed_multimodal() method returns embeddings from the encoder+projector.
Multimodal processors are registered at class definition time via the @MULTIMODAL_REGISTRY.register_processor(...) decorator. This sets _processor_factory on the model class. The MULTIMODAL_REGISTRY is a MultiModalRegistry singleton in vllm/multimodal/__init__.py.
Each processor is a subclass of BaseMultiModalProcessor and implements:
- get_supported_mm_limits() — maximum number of each modality per prompt
- _get_prompt_updates() — rules for replacing placeholder tokens in prompts
- _process_mm_inputs() — converts raw modality data into tensors

To avoid re-encoding the same image/audio in batched requests (e.g., when the same image appears in multiple prompts in a batch), vLLM uses EncoderCacheManager. Encoder outputs are stored keyed by content hash and reused across requests.
Sources: vllm/model_executor/models/interfaces.py:87-200, vllm/model_executor/models/llava.py:1-80, vllm/model_executor/models/pixtral.py:1-85, vllm/model_executor/models/whisper.py:1-50
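The content-hash keying idea behind the encoder cache can be sketched as below. This is only an illustration of the caching strategy; EncoderCacheManager itself also handles eviction and scheduler integration, and the class here is hypothetical.

```python
import hashlib

# Sketch: encoder outputs are keyed by a hash of the raw modality bytes,
# so identical images/audio across requests are encoded exactly once.
class EncoderCache:
    def __init__(self) -> None:
        self._cache: dict[str, object] = {}
        self.encode_calls = 0  # instrumentation for the example

    def get_or_encode(self, raw: bytes, encode):
        """Return cached encoder output, running `encode` only on a miss."""
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._cache:
            self.encode_calls += 1
            self._cache[key] = encode(raw)
        return self._cache[key]


cache = EncoderCache()
image = b"\x89PNG-fake-image-bytes"

out1 = cache.get_or_encode(image, lambda b: len(b))  # miss: runs the encoder
out2 = cache.get_or_encode(image, lambda b: len(b))  # hit: reuses the output
assert out1 == out2 and cache.encode_calls == 1
```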
For models that either lack HuggingFace config support or require overrides, vLLM defines custom config classes in vllm/transformers_utils/configs/. These are registered in two places:
- _CONFIG_REGISTRY in vllm/transformers_utils/config.py:73-111 — maps model_type strings to config class names (resolved lazily via LazyConfigDict).
- _CLASS_TO_MODULE in vllm/transformers_utils/configs/__init__.py:17-70 — maps class name strings to the module path where the class is defined.

This two-level lazy loading avoids importing all config classes at startup. Accessing a key in _CONFIG_REGISTRY triggers getattr(configs, value), which uses __getattr__ in configs/__init__.py to import the module on demand.
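The module-level __getattr__ mechanism (PEP 562) behind this lazy loading can be demonstrated with a self-contained sketch. To keep it runnable, the mapping here points at stdlib modules; vLLM's real _CLASS_TO_MODULE points into vllm/transformers_utils/configs/.

```python
import importlib
import sys
import types

# Class name → module path, mirroring _CLASS_TO_MODULE (stdlib stand-ins).
_CLASS_TO_MODULE = {"JSONDecoder": "json", "OrderedDict": "collections"}

# Build a synthetic package whose attribute access imports on demand,
# the same trick configs/__init__.py uses via a module-level __getattr__.
pkg = types.ModuleType("lazy_configs")


def _lazy_getattr(name: str):
    try:
        module = importlib.import_module(_CLASS_TO_MODULE[name])
    except KeyError:
        raise AttributeError(name) from None
    return getattr(module, name)


pkg.__getattr__ = _lazy_getattr      # PEP 562 hook
sys.modules["lazy_configs"] = pkg

import lazy_configs

# `json` is only imported at this attribute access, not at registry setup.
decoder_cls = lazy_configs.JSONDecoder
assert decoder_cls.__name__ == "JSONDecoder"
```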
Example custom configs:
| model_type | Config Class | Reason |
|---|---|---|
"chatglm" | ChatGLMConfig | Not in Transformers library |
"RefinedWeb" / "RefinedWebModel" | RWConfig | Original Falcon model format |
"mlp_speculator" | MLPSpeculatorConfig | vLLM-specific speculative decoding config |
"eagle" | EAGLEConfig | EAGLE speculative decoding |
"ovis" | OvisConfig | Custom multimodal config |
Sources: vllm/transformers_utils/config.py:63-111, vllm/transformers_utils/configs/__init__.py:1-80
After the model class is resolved and instantiated, weights are loaded. The entry point is initialize_model in vllm/model_executor/model_loader/utils.py:35-70.
The function:
- Calls get_model_architecture(model_config) to obtain the model class.
- Applies quantization overrides via configure_quant_config.
- Instantiates the class, passing vllm_config and prefix for weight name scoping.

Models implement load_weights(weights: Iterable[tuple[str, torch.Tensor]]) to consume weight tensors. Common helpers:
- AutoWeightsLoader (vllm/model_executor/models/utils.py:109) — iterates named parameters and delegates to each module's own load_weights or weight_loader.
- WeightsMapper (vllm/model_executor/models/utils.py:46-106) — renames weight keys via prefix/suffix/substring substitution, used to bridge naming differences between HuggingFace checkpoints and vLLM's internal module layout.
- default_weight_loader — copies a tensor into a parameter in-place.

For quantized models (GPTQ, AWQ, FP8, etc.), the quantization config replaces standard nn.Linear layers with quantized equivalents before weight loading. See page 7 for details.
Sources: vllm/model_executor/model_loader/utils.py:35-120, vllm/model_executor/models/utils.py:46-200
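The key-renaming idea behind WeightsMapper can be sketched with a simplified prefix-only mapper. The class below is hypothetical; the real helper also supports suffix and substring rules, and the "tensors" here are placeholder strings to keep the example dependency-free.

```python
from collections.abc import Iterable, Iterator

# Sketch: rewrite checkpoint weight names by prefix substitution so they
# match the serving engine's internal module layout.
class PrefixMapper:
    def __init__(self, orig_to_new_prefix: dict[str, str]) -> None:
        self.rules = orig_to_new_prefix

    def apply(self, weights: Iterable[tuple[str, object]]) -> Iterator[tuple[str, object]]:
        """Yield (renamed_key, tensor) pairs; first matching rule wins."""
        for name, tensor in weights:
            for old, new in self.rules.items():
                if name.startswith(old):
                    name = new + name[len(old):]
                    break
            yield name, tensor


# HF-style checkpoint names → a hypothetical internal layout.
mapper = PrefixMapper({"model.layers.": "layers."})
ckpt = [("model.layers.0.self_attn.q_proj.weight", "W_q")]
assert list(mapper.apply(ckpt)) == [("layers.0.self_attn.q_proj.weight", "W_q")]
```

A model's load_weights would consume the remapped iterable, copying each tensor into the matching parameter (e.g., via something like default_weight_loader).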
The test suite maintains a parallel registry in tests/models/registry.py containing _HfExamplesInfo dataclass instances that record:
- default: the canonical HuggingFace model ID for CI testing
- extras: additional model variants (quantized, alternative sizes, etc.)
- min_transformers_version / max_transformers_version: version gates for skipping tests
- trust_remote_code, dtype, enforce_eager, max_model_len: test-specific overrides

The dictionaries mirror the main registry:
- _TEXT_GENERATION_EXAMPLE_MODELS — text-only generative models
- _EMBEDDING_EXAMPLE_MODELS — embedding / pooling models
- _MULTIMODAL_EXAMPLE_MODELS — vision, audio, and video-language models
- _SEQUENCE_CLASSIFICATION_EXAMPLE_MODELS — cross-encoders and classifiers
- _AUTOMATIC_CONVERTED_MODELS — models using automatic CausalLM → SequenceClassification conversion

When a new architecture is added to the main registry, the note at the top of vllm/model_executor/models/registry.py:1-6 explicitly requests a corresponding entry in the test registry.
Sources: tests/models/registry.py:1-200, vllm/model_executor/models/registry.py:1-6
From LLM(model=...) to a running model instance
Sources: vllm/model_executor/model_loader/utils.py:35-120, vllm/model_executor/models/registry.py:514-900, vllm/transformers_utils/config.py:132-240