This page documents all configuration dataclasses in vllm/config/, covering VllmConfig (the engine-wide root object), each subordinate configuration class, their key fields and defaults, and how they validate against one another during initialization.
For the CLI/YAML argument parsing pipeline that constructs these objects from user input, see 2.1. For environment variables that supplement configuration at runtime, see 2.3. For CompilationConfig and torch.compile integration in depth, see 2.4. For how AttentionConfig drives attention backend selection, see 8.1.
All config classes live in vllm/config/ and are re-exported from vllm/config/__init__.py. Each class uses the @config decorator from vllm/config/utils.py, which overlays pydantic validation (field validators, model validators) on a dataclass-style declaration. Fields declared with pydantic.Field support ge/le constraints, default factories, and metadata for documentation generation.
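The overlay pattern can be pictured with a dependency-free sketch. The decorator and field names below are illustrative stand-ins (the real @config decorator in vllm/config/utils.py uses pydantic), showing how ge/le bounds declared in field metadata can be enforced after dataclass initialization:

```python
from dataclasses import dataclass, field, fields

def config(cls):
    """Toy stand-in for the @config decorator: after normal dataclass
    init, enforce ge/le bounds stored in each field's metadata."""
    orig_post_init = getattr(cls, "__post_init__", None)

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            lo, hi = f.metadata.get("ge"), f.metadata.get("le")
            if lo is not None and value < lo:
                raise ValueError(f"{f.name}={value} violates ge={lo}")
            if hi is not None and value > hi:
                raise ValueError(f"{f.name}={value} violates le={hi}")
        if orig_post_init is not None:
            orig_post_init(self)

    cls.__post_init__ = __post_init__
    return dataclass(cls)

@config
class ToyCacheConfig:
    # Bounds mirror the kind of constraints pydantic.Field expresses.
    gpu_memory_utilization: float = field(
        default=0.9, metadata={"ge": 0.0, "le": 1.0})
    swap_space: float = field(default=4.0, metadata={"ge": 0.0})
```

With this in place, `ToyCacheConfig(gpu_memory_utilization=1.5)` raises a ValueError, while the defaults validate cleanly.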
| File | Primary Class(es) |
|---|---|
vllm/config/vllm.py | VllmConfig, OptimizationLevel, PerformanceMode |
vllm/config/model.py | ModelConfig |
vllm/config/cache.py | CacheConfig |
vllm/config/parallel.py | ParallelConfig, EPLBConfig |
vllm/config/scheduler.py | SchedulerConfig |
vllm/config/attention.py | AttentionConfig |
vllm/config/compilation.py | CompilationConfig, PassConfig, CompilationMode, CUDAGraphMode |
vllm/config/load.py | LoadConfig |
vllm/config/lora.py | LoRAConfig |
vllm/config/speculative.py | SpeculativeConfig |
vllm/config/multimodal.py | MultiModalConfig |
vllm/config/observability.py | ObservabilityConfig |
vllm/config/offload.py | OffloadConfig, UVAOffloadConfig, PrefetchOffloadConfig |
vllm/config/profiler.py | ProfilerConfig |
vllm/config/structured_outputs.py | StructuredOutputsConfig |
vllm/config/kv_transfer.py | KVTransferConfig |
vllm/config/kv_events.py | KVEventsConfig |
vllm/config/ec_transfer.py | ECTransferConfig |
vllm/config/weight_transfer.py | WeightTransferConfig |
vllm/config/device.py | DeviceConfig |
vllm/config/kernel.py | KernelConfig |
vllm/config/pooler.py | PoolerConfig |
Sources: vllm/config/__init__.py:1-130
VllmConfig is the single config object passed throughout the entire engine, workers, and execution pipeline. It aggregates every other config class as a typed field. Instantiating it with no arguments is valid in tests; all sub-configs have defaults.
VllmConfig composition — mapping concepts to code classes:
Sources: vllm/config/vllm.py:247-328
Key fields on VllmConfig:
| Field | Type | Default | Description |
|---|---|---|---|
model_config | ModelConfig | None | Model identity, dtype, tokenizer |
cache_config | CacheConfig | CacheConfig() | KV cache block allocation |
parallel_config | ParallelConfig | ParallelConfig() | TP/PP/DP degrees |
scheduler_config | SchedulerConfig | factory | Batch sizing, chunked prefill |
attention_config | AttentionConfig | AttentionConfig() | Backend selection |
compilation_config | CompilationConfig | CompilationConfig() | torch.compile, CUDA graph config |
lora_config | LoRAConfig \| None | None | LoRA adapter settings
speculative_config | SpeculativeConfig \| None | None | Speculative decoding
kv_transfer_config | KVTransferConfig \| None | None | Disaggregated KV transfer
optimization_level | OptimizationLevel | O2 | Compilation/graph optimization level |
performance_mode | PerformanceMode | "balanced" | Runtime performance strategy |
instance_id | str | set in __post_init__ | Per-instance unique ID |
Sources: vllm/config/vllm.py:246-329
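The aggregation pattern can be sketched with stub classes (hypothetical names; the real VllmConfig carries many more sub-configs). Every sub-config is a typed field with a default factory, which is why a bare `VllmConfig()` is valid, and instance_id is generated in __post_init__ when not supplied:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class CacheStub:
    block_size: int = 16
    gpu_memory_utilization: float = 0.9

@dataclass
class ParallelStub:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

@dataclass
class VllmConfigSketch:
    """Aggregate root: each sub-config is a typed field with a
    default factory, so instantiation with no arguments is valid."""
    cache_config: CacheStub = field(default_factory=CacheStub)
    parallel_config: ParallelStub = field(default_factory=ParallelStub)
    instance_id: str = ""

    def __post_init__(self):
        # Mirrors VllmConfig: a per-instance ID is filled in if unset.
        if not self.instance_id:
            self.instance_id = uuid.uuid4().hex[:8]
```

Using default factories (rather than a shared default instance) ensures each engine gets independent mutable sub-config objects.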
Defined in vllm/config/model.py:99, ModelConfig holds all information about the model to load: its identity, tokenizer settings, data type, context length, and quantization. The __post_init__ method fetches the hf_config from HuggingFace (or a local path) and populates derived attributes such as runner_type, hf_text_config, dtype (resolved from its string alias), and max_model_len.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
model | str | "Qwen/Qwen3-0.6B" | HF repo ID or local path |
tokenizer | str \| None | None (falls back to model) | Override tokenizer path
tokenizer_mode | TokenizerMode \| str | "auto" | "auto", "hf", "slow", "mistral"
dtype | ModelDType \| torch.dtype | "auto" | Resolved to torch.dtype in __post_init__
max_model_len | int | None | Auto-derived from HF config if unset
quantization | QuantizationMethods \| str \| None | None | Quantization method name
seed | int | 0 | Global RNG seed |
enforce_eager | bool | False | Disable CUDA graph capture |
trust_remote_code | bool | False | Allow custom model code from HF |
runner | RunnerOption | "auto" | "auto", "generate", "pooling", "draft" |
served_model_name | str \| list[str] \| None | None | Names exposed via API
hf_config | PretrainedConfig | (loaded in __post_init__) | Not set by user directly |
multimodal_config | MultiModalConfig \| None | (inferred) | Set if model is multimodal
pooler_config | PoolerConfig | None | None | Pooling settings for embedding models |
logprobs_mode | LogprobsMode | "raw_logprobs" | Content of returned logprobs |
hf_overrides | HfOverrides | {} | Dict or callable to patch HF config |
TokenizerMode, ModelDType, RunnerOption, ConvertOption, and LogprobsMode are all Literal type aliases defined at the top of vllm/config/model.py:76-86.
After __post_init__, ModelConfig also stores hf_text_config, encoder_config, model_arch_config, runner_type, convert_type, and _model_info.
Sources: vllm/config/model.py:99-600
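The "auto" dtype resolution mentioned above can be sketched as follows. This is an illustrative simplification using string stand-ins for torch.dtype to keep the example dependency-free; it assumes the documented convention that "auto" follows the checkpoint's dtype but runs float32 checkpoints in float16:

```python
def resolve_dtype(dtype: str, hf_torch_dtype: str) -> str:
    """Simplified sketch of dtype resolution in __post_init__.
    `dtype` is the user-supplied alias; `hf_torch_dtype` is the
    torch_dtype reported by the HF config."""
    aliases = {"half": "float16", "float16": "float16",
               "bfloat16": "bfloat16", "float": "float32",
               "float32": "float32"}
    if dtype == "auto":
        # "auto" keeps bf16/fp16 checkpoints as-is but runs fp32
        # checkpoints in fp16 (simplified).
        return "float16" if hf_torch_dtype == "float32" else hf_torch_dtype
    return aliases[dtype]
```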
Defined in vllm/config/cache.py:38, CacheConfig controls KV cache memory allocation and eviction policies.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
block_size | BlockSize | platform-set | Token block size; set by Platform.check_and_update_config() |
gpu_memory_utilization | float | 0.9 | Fraction of GPU VRAM reserved for the model executor |
swap_space | float | 4 | CPU swap space in GiB per GPU |
cache_dtype | CacheDType | "auto" | KV cache storage dtype: "auto", "fp8", "fp8_e4m3", "bfloat16" |
enable_prefix_caching | bool | True | Enable prefix/prompt caching |
prefix_caching_hash_algo | PrefixCachingHashAlgo | "sha256" | Hash algorithm for prefix cache keys |
num_gpu_blocks_override | int \| None | None | Force a specific number of GPU KV blocks
sliding_window | int \| None | None | Sliding window size; usually mirrored from ModelConfig
kv_offloading_size | float \| None | None | GiB of KV cache to offload to CPU/disk
kv_offloading_backend | KVOffloadingBackend | "native" | "native" or "lmcache" |
mamba_cache_dtype | MambaDType | "auto" | SSM state dtype for Mamba models |
mamba_cache_mode | MambaCacheMode | "none" | "all", "align", "none" |
calculate_kv_scales | bool | False | Compute per-block KV scales for FP8 |
BlockSize is Literal[1, 8, 16, 32, 64, 128, 256]. block_size has no static default and must be set by Platform.check_and_update_config() before use.
Sources: vllm/config/cache.py:38-200
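The interplay of gpu_memory_utilization, block size, and num_gpu_blocks_override can be illustrated with some back-of-the-envelope arithmetic. This is not vLLM's actual profiling code (which measures a real forward pass first); it only shows the shape of the calculation:

```python
from typing import Optional

def num_gpu_kv_blocks(free_gpu_bytes: int,
                      gpu_memory_utilization: float,
                      bytes_per_block: int,
                      num_gpu_blocks_override: Optional[int] = None) -> int:
    """Illustrative arithmetic: the engine carves the memory budget
    (fraction of free VRAM) into fixed-size KV cache blocks, unless
    the user forces an explicit block count."""
    if num_gpu_blocks_override is not None:
        return num_gpu_blocks_override
    budget = int(free_gpu_bytes * gpu_memory_utilization)
    return budget // bytes_per_block
```

For example, with 80 GiB free, the default 0.9 utilization, and 2 MiB per block, this yields 36864 blocks.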
Defined in vllm/config/parallel.py:93, ParallelConfig describes all distributed execution topology settings.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
tensor_parallel_size | int | 1 | Number of TP shards per pipeline stage |
pipeline_parallel_size | int | 1 | Number of pipeline stages |
data_parallel_size | int | 1 | Number of data-parallel engine replicas |
prefill_context_parallel_size | int | 1 | Context parallelism during prefill |
decode_context_parallel_size | int | 1 | Context parallelism during decode |
distributed_executor_backend | DistributedExecutorBackend \| None | None (auto) | "ray", "mp", "uni", "external_launcher"
enable_expert_parallel | bool | False | Use expert parallelism for MoE layers |
enable_eplb | bool | False | Enable expert parallel load balancing |
eplb_config | EPLBConfig | EPLBConfig() | EPLB window size, rebalance interval |
all2all_backend | All2AllBackend | "allgather_reducescatter" | MoE expert comm backend |
worker_cls | str | "auto" | Fully-qualified worker class name |
disable_custom_all_reduce | bool | False | Fall back to NCCL all-reduce |
master_addr | str | "127.0.0.1" | Master node address for multi-node MP |
master_port | int | 29501 | Master node port for multi-node MP |
world_size | int | (computed) | TP × PP; set in __post_init__ |
enable_dbo | bool | False | Dual batch overlap for microbatching |
EPLBConfig (vllm/config/parallel.py:50-90) is a nested config holding the EPLB rebalancing policy, window size, step interval, and number of redundant experts.
Sources: vllm/config/parallel.py:50-400
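The derived world_size field can be sketched with a minimal dataclass (a simplification, not the actual ParallelConfig): it is excluded from __init__ and computed as TP × PP in __post_init__:

```python
from dataclasses import dataclass, field

@dataclass
class ToyParallelConfig:
    """Sketch of world_size derivation: the field is init=False and
    filled in by __post_init__ from the TP and PP degrees."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    world_size: int = field(init=False, default=0)

    def __post_init__(self):
        # world_size counts workers per engine replica; data parallel
        # replicas each get their own world.
        self.world_size = (self.tensor_parallel_size
                           * self.pipeline_parallel_size)
```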
Defined in vllm/config/scheduler.py:25, SchedulerConfig governs how the scheduler batches requests and allocates computation across iterations.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
max_num_batched_tokens | int | 2048 | Max tokens processed per iteration |
max_num_seqs | int | 128 | Max concurrent sequences per iteration |
enable_chunked_prefill | bool | (platform-determined) | Split long prefills across iterations |
max_num_partial_prefills | int | 1 | Max simultaneous chunked prefill requests |
long_prefill_token_threshold | int | — | Tokens above which a prefill is "long" |
policy | SchedulerPolicy | "fcfs" | "fcfs" or "priority" |
async_scheduling | bool \| None | None | Overlap scheduling and execution; auto-detected
scheduler_cls | str \| type \| None | None | Custom scheduler class
stream_interval | int | — | Token streaming granularity |
disable_chunked_mm_input | bool | False | Disable chunked multimodal input processing |
disable_hybrid_kv_cache_manager | bool \| None | None | Force or disable hybrid KV manager
max_model_len and is_encoder_decoder are InitVar parameters consumed in __post_init__ to set defaults and validate other fields, but are not stored as attributes.
Sources: vllm/config/scheduler.py:25-200
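The InitVar pattern noted above can be demonstrated with a toy version (the field names are real, but the defaulting rule here is a hypothetical simplification): the value is accepted by __init__ and consumed in __post_init__, yet never stored on the instance:

```python
from dataclasses import dataclass, InitVar
from typing import Optional

@dataclass
class ToySchedulerConfig:
    """Sketch of the InitVar pattern: max_model_len flows through
    __init__ into __post_init__ but is not kept as an attribute."""
    max_model_len: InitVar[int] = 8192
    max_num_batched_tokens: Optional[int] = None

    def __post_init__(self, max_model_len: int):
        if self.max_num_batched_tokens is None:
            # Hypothetical rule: size the per-iteration token budget so
            # at least one full-length request fits.
            self.max_num_batched_tokens = max(2048, max_model_len)
```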
Defined in vllm/config/attention.py, AttentionConfig is a thin wrapper that selects the attention backend and stores per-layer overrides.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
backend | AttentionBackendEnum \| None | None (auto) | Override attention backend; auto-selected by platform if None
Backend auto-selection logic lives in the platform's check_and_update_config() hook. The resolved backend enum (AttentionBackendEnum) is defined in vllm/v1/attention/backends/registry.py. For full backend documentation, see 8.1.
Sources: vllm/engine/arg_utils.py:578, vllm/config/vllm.py:270-271
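The override-or-auto-select pattern can be sketched as follows. The enum values and capability rule here are illustrative only; the real selection logic in the platform hook considers many more factors (dtype, head size, installed kernels):

```python
from enum import Enum
from typing import Optional, Tuple

class ToyAttentionBackend(Enum):
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"
    TORCH_SDPA = "TORCH_SDPA"

def resolve_backend(user_backend: Optional[ToyAttentionBackend],
                    device_capability: Tuple[int, int]) -> ToyAttentionBackend:
    """Toy version of check_and_update_config(): honor an explicit
    user override, otherwise pick by GPU compute capability."""
    if user_backend is not None:
        return user_backend
    # Hypothetical rule: Ampere (8.0) and newer get FlashAttention.
    return (ToyAttentionBackend.FLASH_ATTN
            if device_capability >= (8, 0)
            else ToyAttentionBackend.TORCH_SDPA)
```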
Defined in vllm/config/compilation.py:336, CompilationConfig controls all torch.compile and CUDA graph capture behavior. It is covered in depth in 2.4; this section summarizes the most commonly used fields.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
mode | CompilationMode | None (→ VLLM_COMPILE for V1) | NONE, STOCK_TORCH_COMPILE, DYNAMO_TRACE_ONCE, VLLM_COMPILE |
cudagraph_mode | CUDAGraphMode | None (→ FULL_AND_PIECEWISE) | NONE, PIECEWISE, FULL, FULL_AND_PIECEWISE, FULL_DECODE_ONLY |
cudagraph_capture_sizes | list[int] \| None | None (auto) | Explicit batch sizes to capture CUDA graphs for
max_cudagraph_capture_size | int \| None | None | Upper bound for auto-generated capture sizes
custom_ops | list[str] | [] | Enable/disable specific custom ops ("+op" / "-op") |
splitting_ops | list[str] \| None | None | Ops used to split the graph for piecewise compilation
backend | str | "" (→ "inductor") | Inductor backend or qualified name |
pass_config | PassConfig | PassConfig() | Per-pass fusion flags |
cache_dir | str | "" (auto) | Directory for compiled artifact cache |
compile_mm_encoder | bool | False | Compile multimodal encoder as well |
PassConfig (vllm/config/compilation.py:101-270) exposes fine-grained flags like fuse_norm_quant, fuse_act_quant, fuse_allreduce_rms, enable_sp, and fuse_gemm_comms.
Sources: vllm/config/compilation.py:336-650
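The "+op" / "-op" convention in custom_ops can be sketched as a small lookup. This is a simplified stand-in for vLLM's actual enablement logic (precedence between explicit +/- entries and the "all"/"none" tokens is reduced here):

```python
from typing import List

def is_op_enabled(op: str, custom_ops: List[str],
                  default: bool = True) -> bool:
    """Sketch: "-name" disables a custom op, "+name" enables it,
    "all"/"none" flip everything, anything else keeps the default."""
    if f"-{op}" in custom_ops:
        return False
    if f"+{op}" in custom_ops:
        return True
    if "all" in custom_ops:
        return True
    if "none" in custom_ops:
        return False
    return default
```

So `["none", "+rms_norm"]` disables every custom op except rms_norm.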
Defined in vllm/config/speculative.py and referenced in vllm/config/vllm.py:276, SpeculativeConfig is None when speculative decoding is disabled. When enabled, key fields include:
| Field | Type | Notes |
|---|---|---|
method | str | Method type: "eagle", "eagle3", "ngram", "draft_model", etc. |
num_speculative_tokens | int \| None | Draft tokens to propose per step
disable_padded_drafter_batch | bool | Disables padding in draft model batches |
EagleModelTypes is a Literal type alias in vllm/config/speculative.py used in VllmConfig.__post_init__ to check async scheduling compatibility. For full speculative decoding documentation, see 4.5.
Sources: vllm/config/vllm.py:276-277, vllm/config/vllm.py:692-745
Defined in vllm/config/load.py, LoadConfig governs weight loading format and source.
| Field | Type | Default | Notes |
|---|---|---|---|
load_format | str \| LoadFormats | "auto" | "auto", "pt", "safetensors", "npcache", "dummy", "gguf", "bitsandbytes", "mistral"
download_dir | str \| None | None | Override HF cache directory
safetensors_load_strategy | str | "auto" | Strategy for loading safetensors files |
model_loader_extra_config | dict | {} | Passed to the model loader implementation |
ignore_patterns | str \| list[str] | [] | Glob patterns for weight files to skip
use_tqdm_on_load | bool | True | Show progress bar during weight loading |
Sources: vllm/engine/arg_utils.py:379-381, vllm/engine/arg_utils.py:751-768
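The ignore_patterns globs can be illustrated with stdlib fnmatch. This is a sketch of the idea, not vLLM's loader code, which applies the patterns during weight-file discovery:

```python
from fnmatch import fnmatch
from typing import List

def filter_weight_files(files: List[str],
                        ignore_patterns: List[str]) -> List[str]:
    """Keep only weight files that match no ignore pattern."""
    return [f for f in files
            if not any(fnmatch(f, pat) for pat in ignore_patterns)]
```

For instance, `ignore_patterns=["*.pt"]` skips PyTorch pickles when a repo also ships safetensors.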
Defined in vllm/config/lora.py, LoRAConfig is only present in VllmConfig when --enable-lora is set.
| Field | Type | Default | Notes |
|---|---|---|---|
max_loras | int | 1 | Maximum simultaneous LoRA adapters |
max_lora_rank | MaxLoRARanks | — | Maximum LoRA rank dimension |
max_cpu_loras | int \| None | None | CPU-cached adapters
lora_dtype | str \| torch.dtype \| None | None | LoRA weight dtype
fully_sharded_loras | bool | False | Shard LoRA across TP ranks |
Sources: vllm/engine/arg_utils.py:500-509
Defined in vllm/config/observability.py, controls metrics and tracing.
| Field | Type | Notes |
|---|---|---|
otlp_traces_endpoint | str \| None | OTLP endpoint for distributed tracing
collect_detailed_traces | list[DetailedTraceModules] \| None | Modules to trace in detail
kv_cache_metrics | bool | Enable per-request KV cache metrics |
enable_mfu_metrics | bool | Enable MFU (Model FLOP Utilization) metrics |
show_hidden_metrics_for_version | str \| None | Expose unstable metrics for a given version
Sources: vllm/engine/arg_utils.py:531-550
OptimizationLevel (vllm/config/vllm.py:64-77) is an IntEnum that provides a coarse knob controlling the startup-vs-performance trade-off by adjusting defaults in CompilationConfig and KernelConfig:
| Level | Compilation defaults | CUDA graph mode | Notes |
|---|---|---|---|
O0 | No fusion passes, no compile | NONE | Fastest startup; eager PyTorch only |
O1 | Norm/act fusion | PIECEWISE | Quick optimizations |
O2 | All O1 + allreduce fusion | FULL_AND_PIECEWISE | Default |
O3 | Same as O2 (currently) | FULL_AND_PIECEWISE | Reserved for future use |
These defaults are declared in OPTIMIZATION_LEVEL_TO_CONFIG (vllm/config/vllm.py:238-243) and applied via VllmConfig._apply_optimization_level_defaults() during __post_init__. User-provided values in CompilationConfig always take precedence.
PerformanceMode (vllm/config/vllm.py:79) is a Literal["balanced", "interactivity", "throughput"] that influences scheduling and kernel selection independently of OptimizationLevel.
Sources: vllm/config/vllm.py:64-243
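The "defaults fill in, user values win" merging described above can be sketched with plain dicts. The key names and per-level values here are illustrative, not vLLM's exact OPTIMIZATION_LEVEL_TO_CONFIG table:

```python
def apply_level_defaults(user_cfg: dict, level: int) -> dict:
    """Sketch of _apply_optimization_level_defaults(): level-derived
    defaults apply only where the user did not set a value."""
    LEVEL_DEFAULTS = {
        0: {"cudagraph_mode": "NONE", "fusion": False},
        1: {"cudagraph_mode": "PIECEWISE", "fusion": True},
        2: {"cudagraph_mode": "FULL_AND_PIECEWISE", "fusion": True},
        3: {"cudagraph_mode": "FULL_AND_PIECEWISE", "fusion": True},
    }
    merged = dict(LEVEL_DEFAULTS[level])
    merged.update(user_cfg)  # explicit user-provided values win
    return merged
```

So a user who sets cudagraph_mode explicitly keeps it even under O2, while the remaining fields follow the level.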
VllmConfig.__post_init__ (vllm/config/vllm.py:652) performs validation and reconciliation across multiple sub-configs after all fields are set. The sequence is:
VllmConfig __post_init__ validation flow:
Key cross-config constraints enforced here:
"mp", "uni", or "external_launcher".CacheConfig.kv_offloading_size is set, VllmConfig._post_init_kv_transfer_config() creates or patches KVTransferConfig to wire up the offloading backend.VllmConfig._get_quantization_config() resolves ModelConfig.quantization to a QuantizationConfig and validates GPU capability and dtype compatibility.LoRAConfig.verify_with_model_config() which checks that the model supports LoRA.Sources: vllm/config/vllm.py652-840
Each config class implements compute_hash() -> str, returning a short hex digest used to cache compiled artifacts. VllmConfig.compute_hash() (vllm/config/vllm.py:330-432) calls compute_hash() on each sub-config and hashes the concatenation.
Fields that are excluded from a config's hash are those that don't affect the computation graph structure — for example, ModelConfig.compute_hash() excludes tokenizer, seed, served_model_name, and logprobs_mode.
The resulting hash appears in the compiled artifact cache directory path, ensuring that different configurations produce separate caches.
Sources: vllm/config/vllm.py:330-432, vllm/config/model.py:311-366
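The pattern can be sketched with hashlib (an illustration of the idea, not vLLM's exact field handling): hash only the fields that can change the compiled graph, skip the excluded ones, and return a short digest:

```python
import hashlib
from typing import Set

def compute_hash(config: dict, excluded: Set[str]) -> str:
    """Hash a config's graph-affecting fields into a short hex
    digest, skipping excluded fields like seed or tokenizer."""
    items = sorted((k, repr(v)) for k, v in config.items()
                   if k not in excluded)
    return hashlib.sha256(repr(items).encode()).hexdigest()[:10]
```

Two configs differing only in an excluded field (say, seed) hash identically and therefore share a compiled-artifact cache directory, while changing a graph-affecting field such as dtype produces a new digest.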
The following diagram shows where config objects originate (from EngineArgs) and how they flow into the broader engine subsystems:
Sources: vllm/engine/arg_utils.py:361-640, vllm/config/vllm.py:246-328