Model correctness testing verifies that vLLM produces outputs equivalent to reference implementations (primarily Hugging Face Transformers) for supported model architectures. This ensures that vLLM's optimized inference engine maintains accuracy while providing performance benefits.
This page covers the test infrastructure for validating model outputs. For information about which models are supported, see the Supported Models documentation. For details on how models are registered and loaded in the runtime system, see Model Registry and Architecture Detection.
Scope:
The model correctness testing infrastructure is built around a centralized registry system that maintains metadata about test models and their execution requirements.
The test registry in tests/models/registry.py is organized in parallel to the production model registry in vllm/model_executor/models/registry.py. Each architecture maps to an _HfExamplesInfo dataclass instance. The HfExampleModels wrapper class aggregates the per-category dicts and is used directly in tests.
Registry layout in tests/models/registry.py
Sources: tests/models/registry.py:187-530, tests/models/registry.py:532-696, tests/models/test_initialization.py:1-24
Key metadata fields on _HfExamplesInfo
Sources: tests/models/registry.py:15-114
The _HfExamplesInfo dataclass encapsulates all metadata needed to test a model architecture:
| Field | Type | Purpose |
|---|---|---|
| `default` | `str` | Primary model identifier for this architecture |
| `extras` | `Mapping[str, str]` | Additional model variants (e.g., quantized, different sizes) |
| `tokenizer` | `str \| None` | Override tokenizer to use |
| `tokenizer_mode` | `TokenizerMode \| str` | Tokenizer mode (`"auto"`, `"slow"`, etc.) |
| `speculative_model` | `str \| None` | Model for speculative decoding tests |
| `min_transformers_version` | `str \| None` | Minimum required Transformers version |
| `max_transformers_version` | `str \| None` | Maximum compatible Transformers version |
| `transformers_version_reason` | `dict` | Explanation for version constraints |
| `require_embed_inputs` | `bool` | Whether the model requires embedding inputs |
| `dtype` | `ModelDType` | Data type for model weights |
| `enforce_eager` | `bool` | Disable CUDA graphs |
| `is_available_online` | `bool` | Whether the model exists on the HF Hub |
| `trust_remote_code` | `bool` | Allow remote code execution |
| `hf_overrides` | `dict[str, Any]` | Config overrides |
| `max_model_len` | `int \| None` | Maximum sequence length for tests |
| `max_num_batched_tokens` | `int \| None` | Batch size limit |
| `revision` | `str \| None` | Specific model revision |
| `max_num_seqs` | `int \| None` | Maximum sequences per iteration |
| `use_original_num_layers` | `bool` | Use the full model instead of a reduced-layer version |
Sources: tests/models/registry.py:15-114
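As a rough sketch, the fields above map onto a dataclass like the following (a simplified stand-in for `_HfExamplesInfo`; defaults here are illustrative and several fields are omitted):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Simplified stand-in for _HfExamplesInfo: field names follow the table
# above, but defaults are illustrative and some fields are omitted.
@dataclass
class HfExamplesInfo:
    default: str
    extras: dict[str, str] = field(default_factory=dict)
    tokenizer: Optional[str] = None
    tokenizer_mode: str = "auto"
    min_transformers_version: Optional[str] = None
    max_transformers_version: Optional[str] = None
    transformers_version_reason: dict[str, str] = field(default_factory=dict)
    enforce_eager: bool = False
    is_available_online: bool = True
    trust_remote_code: bool = False
    hf_overrides: dict[str, Any] = field(default_factory=dict)
    max_model_len: Optional[int] = None

# An entry mirroring the LlamaForCausalLM registration described below:
llama = HfExamplesInfo(
    "meta-llama/Llama-3.2-1B-Instruct",
    extras={"tiny": "hmellor/tiny-random-LlamaForCausalLM"},
)
```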
The registry includes built-in version validation:
The check_transformers_version() method validates compatibility and can either raise an error, skip the test, or return a message. The check_version_reason parameter distinguishes between vLLM implementation issues ("vllm") and HF compatibility issues ("hf").
Sources: tests/models/registry.py:115-168
HfExampleModels is a wrapper class instantiated at module level. Tests import its instances directly rather than the raw dicts.
| Instance | Aggregates | Primary use |
|---|---|---|
| `HF_EXAMPLE_MODELS` | All major per-category dicts | `get_hf_info(arch)`, `get_supported_archs()`, `find_hf_info(model_id)` |
| `AUTO_EXAMPLE_MODELS` | `_AUTOMATIC_CONVERTED_MODELS` | Architectures auto-converted to pooling/classification tasks |
Key methods used in tests:
- `get_hf_info(model_arch: str) -> _HfExamplesInfo` — returns the registry entry for an architecture name.
- `find_hf_info(model_id: str) -> _HfExamplesInfo` — reverse lookup by HF model ID.
- `get_supported_archs() -> Set[str]` — returns the set of all registered architecture names.

`_TRANSFORMERS_BACKEND_MODELS` is a separate dict of model architectures tested via the HF Transformers modeling backend (see page 5.3). Processing tests use it to exclude Transformers-backend models from vLLM-native processing comparisons.
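Usage in tests looks roughly like this self-contained sketch. For brevity, the stand-in registry stores just the default model ID where the real registry stores full `_HfExamplesInfo` objects:

```python
# Stand-in for the HfExampleModels wrapper; the real class aggregates
# several per-category dicts and returns _HfExamplesInfo instances.
class HfExampleModels:
    def __init__(self, hf_models: dict[str, str]):
        self.hf_models = hf_models  # arch name -> default model ID

    def get_hf_info(self, model_arch: str) -> str:
        return self.hf_models[model_arch]

    def get_supported_archs(self) -> set[str]:
        return set(self.hf_models)

    def find_hf_info(self, model_id: str) -> str:
        # Reverse lookup: model ID -> architecture name.
        for arch, mid in self.hf_models.items():
            if mid == model_id:
                return arch
        raise ValueError(f"No architecture registered for {model_id}")

HF_EXAMPLE_MODELS = HfExampleModels(
    {"LlamaForCausalLM": "meta-llama/Llama-3.2-1B-Instruct"}
)
```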
Text generation model (with extras):
"LlamaForCausalLM" maps to _HfExamplesInfo("meta-llama/Llama-3.2-1B-Instruct", extras={"guard": "meta-llama/Llama-Guard-3-1B", "fp8": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8", "tiny": "hmellor/tiny-random-LlamaForCausalLM"}) — the extras field provides variant model IDs used to expand test parametrization.
Multimodal model with version constraint:
"KimiVLForConditionalGeneration" sets max_transformers_version="4.53.3" and transformers_version_reason={"hf": "HF model uses deprecated transformers API ..."}. The "hf" key means only tests that use the HfRunner will be skipped; vLLM-native tests continue.
Model with reduced layers for CI:
"Step3p5ForCausalLM" sets use_original_num_layers=False (default) and hf_overrides={"num_hidden_layers": 4} to force a minimal model structure, allowing fast initialization without downloading the full architecture.
Sources: tests/models/registry.py:352-360, tests/models/registry.py:495-505, tests/models/registry.py:826-838
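The override mechanism itself is easy to illustrate. In this hedged sketch, `apply_hf_overrides` is a made-up helper name (the real merge happens inside vLLM's config loading path), but the effect on the config is the same idea:

```python
# Illustrative only: merge hf_overrides into a model config dict so the
# test builds a 4-layer model instead of the full-depth one.
def apply_hf_overrides(config: dict, overrides: dict) -> dict:
    merged = dict(config)
    merged.update(overrides)
    return merged

full_config = {"num_hidden_layers": 48, "hidden_size": 4096}
test_config = apply_hf_overrides(full_config, {"num_hidden_layers": 4})
```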
vLLM uses a dual-runner pattern to validate correctness by comparing outputs from vLLM against reference implementations.
Sources: tests/conftest.py:290-497, tests/conftest.py:520-706
The HfRunner class provides a reference implementation using Hugging Face Transformers:
Key Methods:
| Method | Purpose |
|---|---|
| `__init__()` | Load model, tokenizer, and processor |
| `generate()` | Generate text using `model.generate()` |
| `generate_greedy()` | Generate with greedy sampling |
| `generate_beam_search()` | Generate with beam search |
| `generate_encoder_decoder_greedy()` | For seq2seq models |
| `encode()` | Encode inputs for embedding models |
Device Management:
- `wrap_device()` for placing inputs on the correct device
- `device_map="auto"` for multi-GPU

Sources: tests/conftest.py:290-497
The VllmRunner class wraps vLLM's inference engine:
Key Features:
| Feature | Implementation |
|---|---|
| Model loading | LLM(model=model_name, **init_kwargs) |
| Generation | llm.generate(prompts, sampling_params) |
| Encoding | llm.encode(prompts) (for pooling models) |
| Multi-modal | Automatic handling via MultiModalDataDict |
| LoRA | Dynamic LoRA adapter loading |
Context Manager Support:
Sources: tests/conftest.py:520-706
Sources: tests/models/utils.py tests/conftest.py290-706
Model correctness tests are organized by modality and functionality.
Sources: tests/models/multimodal/generation/vlm_utils/types.py
For text-only models, tests validate:
Test Pattern:
Sources: tests/models/multimodal/generation/test_common.py:260-350
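The comparison step of this pattern can be sketched with a helper in the spirit of `check_outputs_equal()` from tests/models/utils.py (the exact signature here is an assumption): each runner produces one `(token_ids, text)` pair per prompt, and the pairs must match exactly for greedy decoding.

```python
# Hedged sketch of a pairwise output comparison between the two runners.
def check_outputs_equal(outputs_0, outputs_1, name_0="hf", name_1="vllm"):
    assert len(outputs_0) == len(outputs_1)
    for i, (out_0, out_1) in enumerate(zip(outputs_0, outputs_1)):
        ids_0, str_0 = out_0
        ids_1, str_1 = out_1
        # Greedy decoding is deterministic, so token IDs must match exactly.
        assert ids_0 == ids_1, (
            f"Mismatch at prompt {i}: {name_0}={str_0!r} vs {name_1}={str_1!r}"
        )

check_outputs_equal(
    [([1, 2, 3], "hello")],
    [([1, 2, 3], "hello")],
)
```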
Multimodal tests add complexity by handling images, audio, and video:
Image Processing:
Image Size Variations: Tests validate correctness across different image sizes by applying size factors:
`image_size_factors=[(0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)]`

Sources: tests/models/multimodal/generation/test_common.py:1-650
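Assuming each factor rescales the base image resolution, with one tuple per test case (so a three-factor tuple yields three images of different sizes in one prompt), the expansion can be sketched as:

```python
# Illustrative expansion of one size-factor tuple into test image sizes;
# the helper name and exact semantics are assumptions, not vLLM code.
def expand_image_sizes(base_size: tuple[int, int],
                       factors: tuple[float, ...]) -> list[tuple[int, int]]:
    w, h = base_size
    return [(int(w * f), int(h * f)) for f in factors]

# One test case with three differently sized copies of a 1024x768 image:
sizes = expand_image_sizes((1024, 768), (0.25, 0.2, 0.15))
```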
Embedding models use a different validation approach since they don't generate text:
Validation Strategy:
Runner Configuration:
Sources: tests/conftest.py:520-706
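A common validation for embedding outputs (assumed here as an illustration; see tests/models/utils.py for vLLM's actual helper) is cosine similarity between the HF and vLLM vectors, which should be very close to 1.0:

```python
import math

# Embedding models are compared by vector similarity rather than exact
# text match. Plain-Python cosine similarity for illustration.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

hf_embedding = [0.1, 0.2, 0.3]
vllm_embedding = [0.1001, 0.1999, 0.3002]  # tiny numerical drift is expected
assert cosine_similarity(hf_embedding, vllm_embedding) > 0.999
```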
Custom input tests validate special cases and edge conditions:
Common Custom Test Scenarios:
| Test Case | Purpose |
|---|---|
| Multi-aspect ratio | Images of different sizes in one batch |
| Interleaved content | Mixed text and images |
| Empty inputs | Graceful handling of missing data |
| Maximum limits | Boundary testing for limit_mm_per_prompt |
Example:
Sources: tests/models/multimodal/generation/test_common.py:99-106
Sources: tests/models/utils.py
Some models require output post-processing before comparison:
Common Post-Processors:
| Function | Models | Transformation |
|---|---|---|
| `llava_image_vllm_to_hf_output()` | LLaVA variants | Adjust for image token handling |
| `qwen2_vllm_to_hf_output()` | Qwen2-VL, Qwen2.5-VL | Append `<\|im_end\|>` token |
| `paligemma_vllm_to_hf_output()` | PaliGemma | Handle special output format |
| `blip2_vllm_to_hf_output()` | BLIP-2 | Add newline to output |
Example Processor:
Sources: tests/models/multimodal/generation/vlm_utils/model_utils.py:74-82
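A post-processor in this style can be sketched as follows. The HF reference output keeps the trailing `<|im_end|>` token that vLLM strips, so the processor appends it before comparison (the token ID used here is a placeholder, and the signature is an assumption based on the table above):

```python
# Hedged sketch of a vLLM->HF output post-processor in the style of
# qwen2_vllm_to_hf_output(). The im_end_id default is a placeholder.
def qwen2_vllm_to_hf_output(vllm_output: tuple[list[int], str],
                            im_end_id: int = 151645) -> tuple[list[int], str]:
    output_ids, output_str = vllm_output
    # Re-add the end-of-turn token so the output matches HF's reference.
    return output_ids + [im_end_id], output_str + "<|im_end|>"
```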
The multimodal processing system includes caching that must maintain correctness:
The cache test validates that cached processing produces identical results to non-cached processing across multiple batches with varying input sizes and hit rates.
Sources: tests/models/multimodal/processing/test_common.py:261-375
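The property being tested reduces to a simple invariant: for every input, the cached path must return exactly what the uncached path would. A toy illustration (vLLM's cache is far more involved, keyed on processed multimodal inputs):

```python
from functools import lru_cache

def process(x: int) -> int:
    # Stand-in for (expensive) multimodal preprocessing.
    return x * x + 1

@lru_cache(maxsize=128)
def process_cached(x: int) -> int:
    return process(x)

# Cache correctness: repeated and interleaved inputs (cache hits and
# misses alike) must agree with the uncached computation.
for value in [1, 2, 2, 3, 1]:
    assert process_cached(value) == process(value)
```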
Logprobs comparison allows small numerical differences:
Tolerance Levels:
- `atol=1e-2, rtol=1e-2`
- `atol=1e-3, rtol=1e-3`
- `atol=5e-2, rtol=5e-2`

Implementation:
Sources: tests/models/utils.py
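The tolerance check can be sketched with `math.isclose`. This standalone sketch shows only the atol/rtol idea; the real helper in tests/models/utils.py operates on per-token logprob dictionaries:

```python
import math

# Hedged sketch: element-wise logprob comparison with absolute and
# relative tolerances, in the spirit of the tolerance levels above.
def logprobs_close(a: list[float], b: list[float],
                   atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    return all(
        math.isclose(x, y, abs_tol=atol, rel_tol=rtol)
        for x, y in zip(a, b)
    )

# Small numerical drift passes; a genuinely different logprob fails.
assert logprobs_close([-1.23, -4.56], [-1.235, -4.555])
assert not logprobs_close([-1.0], [-2.0])
```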
Basic Test Execution:
Sources: tests/models/multimodal/generation/test_common.py tests/models/multimodal/generation/vlm_utils/case_filtering.py
vLLM uses pytest markers to organize tests:
| Marker | Purpose |
|---|---|
| `@pytest.mark.core_model` | Critical models tested in every CI run |
| `@pytest.mark.cpu_model` | Models that can run on CPU |
| `@pytest.mark.skip_global_cleanup` | Skip cleanup for unit tests |
| `@pytest.mark.distributed` | Requires multiple GPUs |
Example:
Sources: tests/conftest.py:204-218
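Markers are stacked as plain pytest decorators; a minimal sketch (the test name and body are illustrative, not from the vLLM test suite):

```python
import pytest

# A hypothetical core-model test that is also runnable on CPU, so it is
# selected both by every-CI core runs and by CPU-only pipelines.
@pytest.mark.core_model
@pytest.mark.cpu_model
def test_llama_greedy_matches_hf():
    # ... load both runners and compare greedy outputs ...
    pass
```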
Multi-GPU tests use special coordination:
Coordination:
- `StatelessProcessGroup` to avoid global state pollution

Sources: tests/conftest.py:178-189, tests/distributed/test_utils.py
Test Sharding:
- `source_file_dependencies` in test-area YAML files

Sources: Diagram 6 from system architecture overview
Model correctness testing in vLLM follows a systematic approach:
- The test registry (`tests/models/registry.py`) maintains test metadata for all supported architectures

This infrastructure enables vLLM to confidently optimize inference while maintaining accuracy equivalent to reference implementations.
Sources: tests/models/registry.py tests/conftest.py tests/models/multimodal/generation/test_common.py tests/models/utils.py