Model correctness testing verifies that vLLM produces outputs equivalent to reference implementations (primarily Hugging Face Transformers) for supported model architectures. This ensures that vLLM's optimized inference engine maintains accuracy while providing performance benefits.
This page covers the test infrastructure for validating model outputs. For information about which models are supported, see the Supported Models documentation. For details on how models are registered and loaded in the runtime system, see Model Registry and Architecture Detection.
Scope:
The model correctness testing infrastructure is built around a centralized registry system that maintains metadata about test models and their execution requirements.
The test registry in tests/models/registry.py is organized in parallel to the production model registry in vllm/model_executor/models/registry.py. Each architecture maps to an _HfExamplesInfo dataclass instance. The HfExampleModels wrapper class aggregates the per-category dicts and is used directly in tests.
Registry layout in tests/models/registry.py
Sources: tests/models/registry.py:187-530, tests/models/registry.py:532-696, tests/models/test_initialization.py:1-24
Key metadata fields on _HfExamplesInfo
Sources: tests/models/registry.py:15-114
The _HfExamplesInfo dataclass encapsulates all metadata needed to test a model architecture:
| Field | Type | Purpose |
|---|---|---|
| `default` | `str` | Primary model identifier for this architecture |
| `extras` | `Mapping[str, str]` | Additional model variants (e.g., quantized, different sizes) |
| `tokenizer` | `str \| None` | Override tokenizer to use |
| `tokenizer_mode` | `TokenizerMode \| str` | Tokenizer mode (`"auto"`, `"slow"`, etc.) |
| `speculative_model` | `str \| None` | Model for speculative decoding tests |
| `min_transformers_version` | `str \| None` | Minimum required Transformers version |
| `max_transformers_version` | `str \| None` | Maximum compatible Transformers version |
| `transformers_version_reason` | `dict` | Explanation for version constraints |
| `require_embed_inputs` | `bool` | Whether the model requires embedding inputs |
| `dtype` | `ModelDType` | Data type for model weights |
| `enforce_eager` | `bool` | Disable CUDA graphs |
| `is_available_online` | `bool` | Whether the model exists on the HF Hub |
| `trust_remote_code` | `bool` | Allow remote code execution |
| `hf_overrides` | `dict[str, Any]` | Config overrides |
| `max_model_len` | `int \| None` | Maximum sequence length for tests |
| `max_num_batched_tokens` | `int \| None` | Batch size limit |
| `revision` | `str \| None` | Specific model revision |
| `max_num_seqs` | `int \| None` | Maximum sequences per iteration |
| `use_original_num_layers` | `bool` | Use the full model instead of a reduced-layer version |
Sources: tests/models/registry.py:15-114
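As a rough sketch, the fields above map onto a dataclass like the following (a simplified stand-in for `_HfExamplesInfo`; defaults here are illustrative and several fields are omitted):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Simplified stand-in for _HfExamplesInfo: field names follow the table
# above, but defaults are illustrative and some fields are omitted.
@dataclass
class HfExamplesInfo:
    default: str
    extras: dict[str, str] = field(default_factory=dict)
    tokenizer: Optional[str] = None
    tokenizer_mode: str = "auto"
    min_transformers_version: Optional[str] = None
    max_transformers_version: Optional[str] = None
    transformers_version_reason: dict[str, str] = field(default_factory=dict)
    enforce_eager: bool = False
    is_available_online: bool = True
    trust_remote_code: bool = False
    hf_overrides: dict[str, Any] = field(default_factory=dict)
    max_model_len: Optional[int] = None

# An entry mirroring the LlamaForCausalLM registration described below:
llama = HfExamplesInfo(
    "meta-llama/Llama-3.2-1B-Instruct",
    extras={"tiny": "hmellor/tiny-random-LlamaForCausalLM"},
)
```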
The registry includes built-in version validation:
The check_transformers_version() method validates compatibility and can either raise an error, skip the test, or return a message. The check_version_reason parameter distinguishes between vLLM implementation issues ("vllm") and HF compatibility issues ("hf").
Sources: tests/models/registry.py:115-168
HfExampleModels is a wrapper class instantiated at module level. Tests import its instances directly rather than the raw dicts.
| Instance | Aggregates | Primary use |
|---|---|---|
| `HF_EXAMPLE_MODELS` | All major per-category dicts | `get_hf_info(arch)`, `get_supported_archs()`, `find_hf_info(model_id)` |
| `AUTO_EXAMPLE_MODELS` | `_AUTOMATIC_CONVERTED_MODELS` | Architectures auto-converted to pooling/classification tasks |
Key methods used in tests:
- `get_hf_info(model_arch: str) -> _HfExamplesInfo` — returns the registry entry for an architecture name.
- `find_hf_info(model_id: str) -> _HfExamplesInfo` — reverse lookup by HF model ID.
- `get_supported_archs() -> Set[str]` — returns the set of all registered architecture names.

`_TRANSFORMERS_BACKEND_MODELS` is a separate dict of model architectures tested via the HF Transformers modeling backend (see page 5.3). Processing tests use it to exclude Transformers-backend models from vLLM-native processing comparisons.
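Usage in tests looks roughly like this self-contained sketch. For brevity, the stand-in registry stores just the default model ID where the real registry stores full `_HfExamplesInfo` objects:

```python
# Stand-in for the HfExampleModels wrapper; the real class aggregates
# several per-category dicts and returns _HfExamplesInfo instances.
class HfExampleModels:
    def __init__(self, hf_models: dict[str, str]):
        self.hf_models = hf_models  # arch name -> default model ID

    def get_hf_info(self, model_arch: str) -> str:
        return self.hf_models[model_arch]

    def get_supported_archs(self) -> set[str]:
        return set(self.hf_models)

    def find_hf_info(self, model_id: str) -> str:
        # Reverse lookup: model ID -> architecture name.
        for arch, mid in self.hf_models.items():
            if mid == model_id:
                return arch
        raise ValueError(f"No architecture registered for {model_id}")

HF_EXAMPLE_MODELS = HfExampleModels(
    {"LlamaForCausalLM": "meta-llama/Llama-3.2-1B-Instruct"}
)
```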
Text generation model (with extras):
"LlamaForCausalLM" maps to _HfExamplesInfo("meta-llama/Llama-3.2-1B-Instruct", extras={"guard": "meta-llama/Llama-Guard-3-1B", "fp8": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8", "tiny": "hmellor/tiny-random-LlamaForCausalLM"}) — the extras field provides variant model IDs used to expand test parametrization.
Multimodal model with version constraint:
"KimiVLForConditionalGeneration" sets max_transformers_version="4.53.3" and transformers_version_reason={"hf": "HF model uses deprecated transformers API ..."}. The "hf" key means only tests that use the HfRunner will be skipped; vLLM-native tests continue.
Model with reduced layers for CI:
"Step3p5ForCausalLM" sets use_original_num_layers=False (default) and hf_overrides={"num_hidden_layers": 4} to force a minimal model structure, allowing fast initialization without downloading the full architecture.
Sources: tests/models/registry.py:352-360, tests/models/registry.py:495-505, tests/models/registry.py:826-838
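The override mechanism itself is easy to illustrate. In this hedged sketch, `apply_hf_overrides` is a made-up helper name (the real merge happens inside vLLM's config loading path), but the effect on the config is the same idea:

```python
# Illustrative only: merge hf_overrides into a model config dict so the
# test builds a 4-layer model instead of the full-depth one.
def apply_hf_overrides(config: dict, overrides: dict) -> dict:
    merged = dict(config)
    merged.update(overrides)
    return merged

full_config = {"num_hidden_layers": 48, "hidden_size": 4096}
test_config = apply_hf_overrides(full_config, {"num_hidden_layers": 4})
```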
vLLM uses a dual-runner pattern to validate correctness by comparing outputs from vLLM against reference implementations.
Sources: tests/conftest.py:290-497, tests/conftest.py:520-706
The HfRunner class provides a reference implementation using Hugging Face Transformers:
Key Methods:
| Method | Purpose |
|---|---|
| `__init__()` | Load model, tokenizer, and processor |
| `generate()` | Generate text using `model.generate()` |
| `generate_greedy()` | Generate with greedy sampling |
| `generate_beam_search()` | Generate with beam search |
| `generate_encoder_decoder_greedy()` | For seq2seq models |
| `encode()` | Encode inputs for embedding models |
Device Management:
- `wrap_device()` for placing inputs on the correct device
- `device_map="auto"` for multi-GPU

Sources: tests/conftest.py:290-497
The VllmRunner class wraps vLLM's inference engine:
Key Features:
| Feature | Implementation |
|---|---|
| Model loading | LLM(model=model_name, **init_kwargs) |
| Generation | llm.generate(prompts, sampling_params) |
| Encoding | llm.encode(prompts) (for pooling models) |
| Multi-modal | Automatic handling via MultiModalDataDict |
| LoRA | Dynamic LoRA adapter loading |
Context Manager Support:
Sources: tests/conftest.py:520-706
Sources: tests/models/utils.py tests/conftest.py290-706
Model correctness tests are organized by modality and functionality.
Sources: tests/models/multimodal/generation/vlm_utils/types.py
For text-only models, tests validate:
Test Pattern:
Sources: tests/models/multimodal/generation/test_common.py:260-350
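The comparison step of this pattern can be sketched with a helper in the spirit of `check_outputs_equal()` from tests/models/utils.py (the exact signature here is an assumption): each runner produces one `(token_ids, text)` pair per prompt, and the pairs must match exactly for greedy decoding.

```python
# Hedged sketch of a pairwise output comparison between the two runners.
def check_outputs_equal(outputs_0, outputs_1, name_0="hf", name_1="vllm"):
    assert len(outputs_0) == len(outputs_1)
    for i, (out_0, out_1) in enumerate(zip(outputs_0, outputs_1)):
        ids_0, str_0 = out_0
        ids_1, str_1 = out_1
        # Greedy decoding is deterministic, so token IDs must match exactly.
        assert ids_0 == ids_1, (
            f"Mismatch at prompt {i}: {name_0}={str_0!r} vs {name_1}={str_1!r}"
        )

check_outputs_equal(
    [([1, 2, 3], "hello")],
    [([1, 2, 3], "hello")],
)
```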
Multimodal tests add complexity by handling images, audio, and video:
Image Processing:
Image Size Variations: Tests validate correctness across different image sizes by applying size factors:
`image_size_factors=[(0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)]`

Sources: tests/models/multimodal/generation/test_common.py:1-650
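Assuming each factor rescales the base image resolution, with one tuple per test case (so a three-factor tuple yields three images of different sizes in one prompt), the expansion can be sketched as:

```python
# Illustrative expansion of one size-factor tuple into test image sizes;
# the helper name and exact semantics are assumptions, not vLLM code.
def expand_image_sizes(base_size: tuple[int, int],
                       factors: tuple[float, ...]) -> list[tuple[int, int]]:
    w, h = base_size
    return [(int(w * f), int(h * f)) for f in factors]

# One test case with three differently sized copies of a 1024x768 image:
sizes = expand_image_sizes((1024, 768), (0.25, 0.2, 0.15))
```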
Embedding models use a different validation approach since they don't generate text:
Validation Strategy:
Runner Configuration:
Sources: tests/conftest.py:520-706
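A common validation for embedding outputs (assumed here as an illustration; see tests/models/utils.py for vLLM's actual helper) is cosine similarity between the HF and vLLM vectors, which should be very close to 1.0:

```python
import math

# Embedding models are compared by vector similarity rather than exact
# text match. Plain-Python cosine similarity for illustration.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

hf_embedding = [0.1, 0.2, 0.3]
vllm_embedding = [0.1001, 0.1999, 0.3002]  # tiny numerical drift is expected
assert cosine_similarity(hf_embedding, vllm_embedding) > 0.999
```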
Custom input tests validate special cases and edge conditions:
Common Custom Test Scenarios:
| Test Case | Purpose |
|---|---|
| Multi-aspect ratio | Images of different sizes in one batch |
| Interleaved content | Mixed text and images |
| Empty inputs | Graceful handling of missing data |
| Maximum limits | Boundary testing for limit_mm_per_prompt |
Example:
Sources: tests/models/multimodal/generation/test_common.py:99-106
Sources: tests/models/utils.py
Some models require output post-processing before comparison:
Common Post-Processors:
| Function | Models | Transformation |
|---|---|---|
| `llava_image_vllm_to_hf_output()` | LLaVA variants | Adjust for image token handling |
| `qwen2_vllm_to_hf_output()` | Qwen2-VL, Qwen2.5-VL | Append `<\|im_end\|>` token |
| `paligemma_vllm_to_hf_output()` | PaliGemma | Handle special output format |
| `blip2_vllm_to_hf_output()` | BLIP-2 | Add newline to output |
Example Processor:
Sources: tests/models/multimodal/generation/vlm_utils/model_utils.py:74-82
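A post-processor in this style can be sketched as follows. The HF reference output keeps the trailing `<|im_end|>` token that vLLM strips, so the processor appends it before comparison (the token ID used here is a placeholder, and the signature is an assumption based on the table above):

```python
# Hedged sketch of a vLLM->HF output post-processor in the style of
# qwen2_vllm_to_hf_output(). The im_end_id default is a placeholder.
def qwen2_vllm_to_hf_output(vllm_output: tuple[list[int], str],
                            im_end_id: int = 151645) -> tuple[list[int], str]:
    output_ids, output_str = vllm_output
    # Re-add the end-of-turn token so the output matches HF's reference.
    return output_ids + [im_end_id], output_str + "<|im_end|>"
```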
The multimodal processing system includes caching that must maintain correctness:
The cache test validates that cached processing produces identical results to non-cached processing across multiple batches with varying input sizes and hit rates.
Sources: tests/models/multimodal/processing/test_common.py:261-375
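The property being tested reduces to a simple invariant: for every input, the cached path must return exactly what the uncached path would. A toy illustration (vLLM's cache is far more involved, keyed on processed multimodal inputs):

```python
from functools import lru_cache

def process(x: int) -> int:
    # Stand-in for (expensive) multimodal preprocessing.
    return x * x + 1

@lru_cache(maxsize=128)
def process_cached(x: int) -> int:
    return process(x)

# Cache correctness: repeated and interleaved inputs (cache hits and
# misses alike) must agree with the uncached computation.
for value in [1, 2, 2, 3, 1]:
    assert process_cached(value) == process(value)
```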
Logprobs comparison allows small numerical differences:
Tolerance Levels:
- `atol=1e-2, rtol=1e-2`
- `atol=1e-3, rtol=1e-3`
- `atol=5e-2, rtol=5e-2`

Implementation:
Sources: tests/models/utils.py
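The tolerance check can be sketched with `math.isclose`. This standalone sketch shows only the atol/rtol idea; the real helper in tests/models/utils.py operates on per-token logprob dictionaries:

```python
import math

# Hedged sketch: element-wise logprob comparison with absolute and
# relative tolerances, in the spirit of the tolerance levels above.
def logprobs_close(a: list[float], b: list[float],
                   atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    return all(
        math.isclose(x, y, abs_tol=atol, rel_tol=rtol)
        for x, y in zip(a, b)
    )

# Small numerical drift passes; a genuinely different logprob fails.
assert logprobs_close([-1.23, -4.56], [-1.235, -4.555])
assert not logprobs_close([-1.0], [-2.0])
```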
Basic Test Execution:
Sources: tests/models/multimodal/generation/test_common.py tests/models/multimodal/generation/vlm_utils/case_filtering.py
vLLM uses pytest markers to organize tests:
| Marker | Purpose |
|---|---|
| `@pytest.mark.core_model` | Critical models tested in every CI run |
| `@pytest.mark.cpu_model` | Models that can run on CPU |
| `@pytest.mark.skip_global_cleanup` | Skip cleanup for unit tests |
| `@pytest.mark.distributed` | Requires multiple GPUs |
Example:
Sources: tests/conftest.py:204-218
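Markers are stacked as plain pytest decorators; a minimal sketch (the test name and body are illustrative, not from the vLLM test suite):

```python
import pytest

# A hypothetical core-model test that is also runnable on CPU, so it is
# selected both by every-CI core runs and by CPU-only pipelines.
@pytest.mark.core_model
@pytest.mark.cpu_model
def test_llama_greedy_matches_hf():
    # ... load both runners and compare greedy outputs ...
    pass
```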
Multi-GPU tests use special coordination:
Coordination:
- `StatelessProcessGroup` to avoid global state pollution

Sources: tests/conftest.py:178-189, tests/distributed/test_utils.py
Test Sharding:
- `source_file_dependencies` in test-area YAML files

Sources: Diagram 6 from system architecture overview
Model correctness testing in vLLM follows a systematic approach:
- The test registry (`tests/models/registry.py`) maintains test metadata for all supported architectures

This infrastructure enables vLLM to confidently optimize inference while maintaining accuracy equivalent to reference implementations.
Sources: tests/models/registry.py tests/conftest.py tests/models/multimodal/generation/test_common.py tests/models/utils.py