This page covers how vLLM identifies, loads, and configures model architectures at startup — from reading the architectures field in a model's config.json all the way to instantiating a torch.nn.Module with weights. It also covers the multimodal processing pipeline and the plugin system for adding external model architectures.
For detailed treatment of individual subsystems, refer to:

- For engine-level configuration of the model (dtype, max_model_len, quantization), see page 2.2.
When a user specifies a model (e.g., LLM(model="meta-llama/Llama-3.2-1B-Instruct")), vLLM must read the model's config.json to extract the architectures field, resolve the architecture name to an implementation class, and instantiate that class with the model's weights.

All native vLLM model implementations live under vllm/model_executor/models/. Models supported only through the Hugging Face Transformers backend are handled by vllm/model_executor/models/transformers/.
The central registry is defined in vllm/model_executor/models/registry.py. Architecture names (strings that appear in config.json) are mapped to (module, class) tuples. There are four primary groupings:
| Dictionary | Task Type | Example Entry |
|---|---|---|
| _TEXT_GENERATION_MODELS | Decoder-only causal LM | "LlamaForCausalLM": ("llama", "LlamaForCausalLM") |
| _EMBEDDING_MODELS | Embedding / pooling | "BertModel": ("bert", "BertEmbeddingModel") |
| _CROSS_ENCODER_MODELS | Sequence classification / reranking | "BertForSequenceClassification": ("bert", "BertForSequenceClassification") |
| _MULTIMODAL_MODELS | Vision-language, audio-language | "LlavaForConditionalGeneration": ("llava", "LlavaForConditionalGeneration") |
Each tuple (module, class) refers to a submodule inside vllm.model_executor.models and the class within it. Imports are lazy — the module is only imported when the architecture is actually needed.
Architecture name aliasing is common. For example, several architecture strings all resolve to LlamaForCausalLM (e.g., AquilaModel, InternLMForCausalLM, InternLM3ForCausalLM, CwmForCausalLM, XverseForCausalLM). This allows vLLM to support fine-tuned variants that inherit Llama's architecture without requiring separate implementations.
Sources: vllm/model_executor/models/registry.py:70-512
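The registry-with-aliases pattern described above can be sketched as follows. This is a hypothetical miniature, not vLLM's actual code: the dictionary entries are taken from the text, while resolve_architecture is an illustrative stand-in for the registry's lookup logic.

```python
# Miniature of the architecture registry: config.json architecture strings
# map to (module, class) tuples. Several alias strings share one target,
# so Llama-derived fine-tunes need no separate implementation.
_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    # Fine-tuned variants that reuse Llama's architecture:
    "AquilaModel": ("llama", "LlamaForCausalLM"),
    "InternLMForCausalLM": ("llama", "LlamaForCausalLM"),
}


def resolve_architecture(arch: str) -> tuple[str, str]:
    """Map an architecture string from config.json to its (module, class).

    In vLLM, the module (under vllm.model_executor.models) is only imported
    once the architecture is actually requested, keeping startup lazy.
    """
    try:
        return _TEXT_GENERATION_MODELS[arch]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {arch}") from None


# An alias and the canonical name resolve to the same implementation.
assert resolve_architecture("AquilaModel") == resolve_architecture("LlamaForCausalLM")
```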
Model resolution flow — from architecture name to class
Sources: vllm/model_executor/models/registry.py:70-512, vllm/model_executor/model_loader/utils.py:35-70
Before the registry is consulted, vLLM must parse the model's configuration. This is handled by vllm/transformers_utils/config.py.
vLLM supports two config formats:
| Format | Parser Class | Config File |
|---|---|---|
"hf" | HFConfigParser | config.json via transformers.AutoConfig |
"mistral" | MistralConfigParser | params.json (Mistral native format) |
get_config_parser(config_format) returns the appropriate parser. Additional parsers can be registered via register_config_parser(config_format).
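The parser-registry pattern can be sketched as below. This is a minimal illustration of the mechanism, assuming a decorator-style registration; the real implementation in vllm/transformers_utils/config.py differs in detail, and the class bodies here are stand-ins.

```python
# Registry mapping a config format name ("hf", "mistral", ...) to a parser.
_CONFIG_PARSERS: dict[str, type] = {}


def register_config_parser(config_format: str):
    """Decorator registering a parser class under a format name."""
    def wrap(cls: type) -> type:
        _CONFIG_PARSERS[config_format] = cls
        return cls
    return wrap


def get_config_parser(config_format: str):
    """Return a parser instance for the requested format."""
    try:
        return _CONFIG_PARSERS[config_format]()
    except KeyError:
        raise ValueError(f"Unknown config format: {config_format}") from None


@register_config_parser("hf")
class HFConfigParser:
    config_file = "config.json"   # parsed via transformers.AutoConfig


@register_config_parser("mistral")
class MistralConfigParser:
    config_file = "params.json"   # Mistral native format


assert get_config_parser("mistral").config_file == "params.json"
```

Additional formats plug in the same way: defining a new class with the decorator makes it reachable through get_config_parser without touching the dispatch logic.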
When model_type in config.json is not recognized by the standard Transformers AutoConfig, vLLM checks its own _CONFIG_REGISTRY (a LazyConfigDict) defined at vllm/transformers_utils/config.py:73-111. This maps model_type strings to custom config classes in vllm/transformers_utils/configs/.
Examples:
- "chatglm" → ChatGLMConfig
- "RefinedWeb" → RWConfig (original Falcon models)
- "kimi_vl" → KimiVLConfig

The LazyConfigDict defers actual imports until access, keeping startup overhead low.
Many older models use non-standard field names for rotary position embedding parameters (rope_theta, rotary_emb_base, etc.). The patch_rope_parameters function in vllm/transformers_utils/config.py:316-365 normalizes these fields to the standard rope_parameters format expected by Transformers v5. It also handles legacy rope_type aliases like "su" → "longrope" and "mrope" → "default".
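The normalization can be sketched roughly as follows. This is a hedged approximation built only from the field names and aliases mentioned above; patch_rope_parameters itself covers more cases and edge conditions.

```python
# Sketch: fold legacy rope fields and rope_type aliases into a single
# rope_parameters dict. Field names follow the text above; the real
# function in vllm/transformers_utils/config.py is more thorough.
_LEGACY_THETA_KEYS = ("rope_theta", "rotary_emb_base")
_ROPE_TYPE_ALIASES = {"su": "longrope", "mrope": "default"}


def normalize_rope(config: dict) -> dict:
    """Return a normalized rope_parameters dict from a raw config dict."""
    rope = dict(config.get("rope_parameters") or {})

    # Migrate legacy theta field names into the canonical key.
    for key in _LEGACY_THETA_KEYS:
        if key in config and "rope_theta" not in rope:
            rope["rope_theta"] = config[key]

    # Resolve legacy rope_type aliases ("su" → "longrope", etc.).
    rope_type = rope.get("rope_type", "default")
    rope["rope_type"] = _ROPE_TYPE_ALIASES.get(rope_type, rope_type)
    return rope


assert normalize_rope({"rotary_emb_base": 10000.0})["rope_theta"] == 10000.0
assert normalize_rope({"rope_parameters": {"rope_type": "su"}})["rope_type"] == "longrope"
```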
The _CONFIG_ATTRS_MAPPING dict remaps config field names to canonical names expected by vLLM (e.g., llm_config → text_config). This is applied via _maybe_remap_hf_config_attrs after config loading.
Sources: vllm/transformers_utils/config.py:63-240, vllm/transformers_utils/configs/__init__.py:1-70
Configuration loading flow
Sources: vllm/transformers_utils/config.py:132-240
For models not explicitly registered in vLLM's registry, vLLM falls back to the Transformers modeling backend in vllm/model_executor/models/transformers/. This backend wraps any HuggingFace Transformers model that satisfies a compatibility check (is_backend_compatible()).
To be compatible, the model's base class must:
- Pass kwargs down to attention layers so the attention implementation can be swapped.
- Compute attention via ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation].
- Set _supports_attention_backend = True on the model class.

When loaded through this backend, vLLM sets config._attn_implementation = "vllm" so vLLM's paged attention kernels are used instead of the default HF attention.
The Transformers backend supports encoder-only, decoder-only, and MoE architectures, and integrates with tensor parallelism via base_model_tp_plan and pipeline parallelism via base_model_pp_plan config attributes.
To force use of this backend even for natively-supported models: LLM(model=..., model_impl="transformers").
Architectures not supported by vLLM natively or the Transformers backend can be added via vLLM's plugin system. For example, the bart-plugin repository adds support for BartForConditionalGeneration and Florence2ForConditionalGeneration as separate packages.
Sources: docs/models/supported_models.md:16-186, vllm/model_executor/models/registry.py:514-900
vLLM uses Python Protocol classes and class-level flags to express model capabilities. These are defined in vllm/model_executor/models/interfaces.py.
| Protocol / Flag | Purpose | Checked by |
|---|---|---|
| SupportsMultiModal | Model accepts image/audio/video inputs | supports_multimodal() |
| SupportsLoRA | Model supports LoRA adapter injection | Registry / LoRA subsystem |
| SupportsPP | Model supports pipeline parallelism | supports_pp() |
| SupportsTranscription | Model supports audio transcription | supports_transcription() |
| has_inner_state | Model has recurrent state (e.g., Mamba) | Scheduler / cache management |
| is_attention_free | Model has no attention layers | Attention backend selection |
| is_hybrid | Model mixes attention and SSM layers | is_hybrid() |
| supports_mamba_prefix_caching | Mamba model supports prefix caching | KV cache |
These flags are read by vllm/model_executor/models/registry.py via imports from interfaces.py and interfaces_base.py. The ModelRegistry uses them to validate that a requested feature (e.g., LoRA) is supported before proceeding.
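The flag-checking pattern can be sketched as follows. The class and function here are illustrative stand-ins under the naming from the table above, not vLLM's actual definitions from interfaces.py.

```python
from typing import ClassVar, Literal

# Capability is expressed as a class-level marker; helper functions check
# it on the *class*, so no model needs to be instantiated to validate a
# requested feature.
class SupportsMultiModal:
    supports_multimodal: ClassVar[Literal[True]] = True


def supports_multimodal(model_cls: type) -> bool:
    """Check the presence marker on a model class (sketch)."""
    return getattr(model_cls, "supports_multimodal", False) is True


class LlavaLike(SupportsMultiModal):
    """Hypothetical multimodal model class."""


class TextOnly:
    """Hypothetical text-only model class."""


assert supports_multimodal(LlavaLike)
assert not supports_multimodal(TextOnly)
```

Checking the class rather than an instance is what lets the registry reject unsupported feature combinations (e.g., LoRA on a model without SupportsLoRA) before any weights are loaded.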
SupportsMultiModal in Detail

Any model that handles images, audio, or video must implement SupportsMultiModal. Key members:
- supports_multimodal: ClassVar[Literal[True]] — presence marker
- _processor_factory: ClassVar[_ProcessorFactories] — set by MultiModalRegistry.register_processor
- embed_multimodal(**kwargs) -> MultiModalEmbeddings — returns embeddings for multimodal inputs
- get_language_model() -> VllmModel — returns the text backbone
- get_placeholder_str(modality, i) -> str | None — returns the placeholder token string for the i-th multimodal item

Sources: vllm/model_executor/models/interfaces.py:87-200
Model capability resolution
Sources: vllm/model_executor/models/interfaces.py:87-300, vllm/model_executor/models/registry.py:44-67
Multimodal models in vLLM follow a consistent structure, though each model may customize each stage.
A typical multimodal model (e.g., LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration) is composed of:
The get_language_model() method on SupportsMultiModal returns the language model component. The embed_multimodal() method returns embeddings from the encoder+projector.
Multimodal processors are registered at class definition time via the @MULTIMODAL_REGISTRY.register_processor(...) decorator. This sets _processor_factory on the model class. The MULTIMODAL_REGISTRY is a MultiModalRegistry singleton in vllm/multimodal/__init__.py.
Each processor is a subclass of BaseMultiModalProcessor and implements:
- get_supported_mm_limits() — maximum number of each modality per prompt
- _get_prompt_updates() — rules for replacing placeholder tokens in prompts
- _process_mm_inputs() — converts raw modality data into tensors

To avoid re-encoding the same image/audio in batched requests (e.g., when the same image appears in multiple prompts in a batch), vLLM uses EncoderCacheManager. Encoder outputs are stored keyed by content hash and reused across requests.
Sources: vllm/model_executor/models/interfaces.py:87-200, vllm/model_executor/models/llava.py:1-80, vllm/model_executor/models/pixtral.py:1-85, vllm/model_executor/models/whisper.py:1-50
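The content-hash keying idea behind the encoder cache can be sketched as below. This is only an illustration of the caching strategy; EncoderCacheManager itself also handles eviction and scheduler integration, and the class here is hypothetical.

```python
import hashlib

# Sketch: encoder outputs are keyed by a hash of the raw modality bytes,
# so identical images/audio across requests are encoded exactly once.
class EncoderCache:
    def __init__(self) -> None:
        self._cache: dict[str, object] = {}
        self.encode_calls = 0  # instrumentation for the example

    def get_or_encode(self, raw: bytes, encode):
        """Return cached encoder output, running `encode` only on a miss."""
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._cache:
            self.encode_calls += 1
            self._cache[key] = encode(raw)
        return self._cache[key]


cache = EncoderCache()
image = b"\x89PNG-fake-image-bytes"

out1 = cache.get_or_encode(image, lambda b: len(b))  # miss: runs the encoder
out2 = cache.get_or_encode(image, lambda b: len(b))  # hit: reuses the output
assert out1 == out2 and cache.encode_calls == 1
```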
For models that either lack HuggingFace config support or require overrides, vLLM defines custom config classes in vllm/transformers_utils/configs/. These are registered in two places:
- _CONFIG_REGISTRY in vllm/transformers_utils/config.py:73-111 — maps model_type strings to config class names (resolved lazily via LazyConfigDict).
- _CLASS_TO_MODULE in vllm/transformers_utils/configs/__init__.py:17-70 — maps class name strings to the module path where the class is defined.

This two-level lazy loading avoids importing all config classes at startup. Accessing a key in _CONFIG_REGISTRY triggers getattr(configs, value), which uses __getattr__ in configs/__init__.py to import the module on demand.
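The module-level __getattr__ mechanism (PEP 562) behind this lazy loading can be demonstrated with a self-contained sketch. To keep it runnable, the mapping here points at stdlib modules; vLLM's real _CLASS_TO_MODULE points into vllm/transformers_utils/configs/.

```python
import importlib
import sys
import types

# Class name → module path, mirroring _CLASS_TO_MODULE (stdlib stand-ins).
_CLASS_TO_MODULE = {"JSONDecoder": "json", "OrderedDict": "collections"}

# Build a synthetic package whose attribute access imports on demand,
# the same trick configs/__init__.py uses via a module-level __getattr__.
pkg = types.ModuleType("lazy_configs")


def _lazy_getattr(name: str):
    try:
        module = importlib.import_module(_CLASS_TO_MODULE[name])
    except KeyError:
        raise AttributeError(name) from None
    return getattr(module, name)


pkg.__getattr__ = _lazy_getattr      # PEP 562 hook
sys.modules["lazy_configs"] = pkg

import lazy_configs

# `json` is only imported at this attribute access, not at registry setup.
decoder_cls = lazy_configs.JSONDecoder
assert decoder_cls.__name__ == "JSONDecoder"
```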
Example custom configs:
| model_type | Config Class | Reason |
|---|---|---|
"chatglm" | ChatGLMConfig | Not in Transformers library |
"RefinedWeb" / "RefinedWebModel" | RWConfig | Original Falcon model format |
"mlp_speculator" | MLPSpeculatorConfig | vLLM-specific speculative decoding config |
"eagle" | EAGLEConfig | EAGLE speculative decoding |
"ovis" | OvisConfig | Custom multimodal config |
Sources: vllm/transformers_utils/config.py:63-111, vllm/transformers_utils/configs/__init__.py:1-80
After the model class is resolved and instantiated, weights are loaded. The entry point is initialize_model in vllm/model_executor/model_loader/utils.py:35-70.
The function:
- Calls get_model_architecture(model_config) to obtain the model class.
- Applies quantization overrides via configure_quant_config.
- Instantiates the class, passing vllm_config and prefix for weight name scoping.

Models implement load_weights(weights: Iterable[tuple[str, torch.Tensor]]) to consume weight tensors. Common helpers:
- AutoWeightsLoader (vllm/model_executor/models/utils.py:109) — iterates named parameters and delegates to each module's own load_weights or weight_loader.
- WeightsMapper (vllm/model_executor/models/utils.py:46-106) — renames weight keys via prefix/suffix/substring substitution, used to bridge naming differences between HuggingFace checkpoints and vLLM's internal module layout.
- default_weight_loader — copies a tensor into a parameter in-place.

For quantized models (GPTQ, AWQ, FP8, etc.), the quantization config replaces standard nn.Linear layers with quantized equivalents before weight loading. See page 7 for details.
Sources: vllm/model_executor/model_loader/utils.py:35-120, vllm/model_executor/models/utils.py:46-200
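The key-renaming idea behind WeightsMapper can be sketched with a simplified prefix-only mapper. The class below is hypothetical; the real helper also supports suffix and substring rules, and the "tensors" here are placeholder strings to keep the example dependency-free.

```python
from collections.abc import Iterable, Iterator

# Sketch: rewrite checkpoint weight names by prefix substitution so they
# match the serving engine's internal module layout.
class PrefixMapper:
    def __init__(self, orig_to_new_prefix: dict[str, str]) -> None:
        self.rules = orig_to_new_prefix

    def apply(self, weights: Iterable[tuple[str, object]]) -> Iterator[tuple[str, object]]:
        """Yield (renamed_key, tensor) pairs; first matching rule wins."""
        for name, tensor in weights:
            for old, new in self.rules.items():
                if name.startswith(old):
                    name = new + name[len(old):]
                    break
            yield name, tensor


# HF-style checkpoint names → a hypothetical internal layout.
mapper = PrefixMapper({"model.layers.": "layers."})
ckpt = [("model.layers.0.self_attn.q_proj.weight", "W_q")]
assert list(mapper.apply(ckpt)) == [("layers.0.self_attn.q_proj.weight", "W_q")]
```

A model's load_weights would consume the remapped iterable, copying each tensor into the matching parameter (e.g., via something like default_weight_loader).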
The test suite maintains a parallel registry in tests/models/registry.py containing _HfExamplesInfo dataclass instances that record:
- default: the canonical HuggingFace model ID for CI testing
- extras: additional model variants (quantized, alternative sizes, etc.)
- min_transformers_version / max_transformers_version: version gates for skipping tests
- trust_remote_code, dtype, enforce_eager, max_model_len: test-specific overrides

The dictionaries mirror the main registry:
- _TEXT_GENERATION_EXAMPLE_MODELS — text-only generative models
- _EMBEDDING_EXAMPLE_MODELS — embedding / pooling models
- _MULTIMODAL_EXAMPLE_MODELS — vision, audio, and video-language models
- _SEQUENCE_CLASSIFICATION_EXAMPLE_MODELS — cross-encoders and classifiers
- _AUTOMATIC_CONVERTED_MODELS — models using automatic CausalLM → SequenceClassification conversion

When a new architecture is added to the main registry, the note at the top of vllm/model_executor/models/registry.py:1-6 explicitly requests a corresponding entry in the test registry.
Sources: tests/models/registry.py:1-200, vllm/model_executor/models/registry.py:1-6
From LLM(model=...) to a running model instance
Sources: vllm/model_executor/model_loader/utils.py:35-120, vllm/model_executor/models/registry.py:514-900, vllm/transformers_utils/config.py:132-240