This page documents all configuration dataclasses in vllm/config/, covering VllmConfig (the engine-wide root object), each subordinate configuration class, their key fields and defaults, and how they validate against one another during initialization.
For the CLI/YAML argument parsing pipeline that constructs these objects from user input, see 2.1. For environment variables that supplement configuration at runtime, see 2.3. For CompilationConfig and torch.compile integration in depth, see 2.4. For how AttentionConfig drives attention backend selection, see 8.1.
All config classes live in vllm/config/ and are re-exported from vllm/config/__init__.py. Each class uses the @config decorator from vllm/config/utils.py, which overlays pydantic validation (field validators, model validators) on a dataclass-style declaration. Fields declared with pydantic.Field support ge/le constraints, default factories, and metadata for documentation generation.
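The overlay pattern can be pictured with a dependency-free sketch. The decorator and field names below are illustrative stand-ins (the real @config decorator in vllm/config/utils.py uses pydantic), showing how ge/le bounds declared in field metadata can be enforced after dataclass initialization:

```python
from dataclasses import dataclass, field, fields

def config(cls):
    """Toy stand-in for the @config decorator: after normal dataclass
    init, enforce ge/le bounds stored in each field's metadata."""
    orig_post_init = getattr(cls, "__post_init__", None)

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            lo, hi = f.metadata.get("ge"), f.metadata.get("le")
            if lo is not None and value < lo:
                raise ValueError(f"{f.name}={value} violates ge={lo}")
            if hi is not None and value > hi:
                raise ValueError(f"{f.name}={value} violates le={hi}")
        if orig_post_init is not None:
            orig_post_init(self)

    cls.__post_init__ = __post_init__
    return dataclass(cls)

@config
class ToyCacheConfig:
    # Bounds mirror the kind of constraints pydantic.Field expresses.
    gpu_memory_utilization: float = field(
        default=0.9, metadata={"ge": 0.0, "le": 1.0})
    swap_space: float = field(default=4.0, metadata={"ge": 0.0})
```

With this in place, `ToyCacheConfig(gpu_memory_utilization=1.5)` raises a ValueError, while the defaults validate cleanly.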
| File | Primary Class(es) |
|---|---|
vllm/config/vllm.py | VllmConfig, OptimizationLevel, PerformanceMode |
vllm/config/model.py | ModelConfig |
vllm/config/cache.py | CacheConfig |
vllm/config/parallel.py | ParallelConfig, EPLBConfig |
vllm/config/scheduler.py | SchedulerConfig |
vllm/config/attention.py | AttentionConfig |
vllm/config/compilation.py | CompilationConfig, PassConfig, CompilationMode, CUDAGraphMode |
vllm/config/load.py | LoadConfig |
vllm/config/lora.py | LoRAConfig |
vllm/config/speculative.py | SpeculativeConfig |
vllm/config/multimodal.py | MultiModalConfig |
vllm/config/observability.py | ObservabilityConfig |
vllm/config/offload.py | OffloadConfig, UVAOffloadConfig, PrefetchOffloadConfig |
vllm/config/profiler.py | ProfilerConfig |
vllm/config/structured_outputs.py | StructuredOutputsConfig |
vllm/config/kv_transfer.py | KVTransferConfig |
vllm/config/kv_events.py | KVEventsConfig |
vllm/config/ec_transfer.py | ECTransferConfig |
vllm/config/weight_transfer.py | WeightTransferConfig |
vllm/config/device.py | DeviceConfig |
vllm/config/kernel.py | KernelConfig |
vllm/config/pooler.py | PoolerConfig |
Sources: vllm/config/__init__.py:1-130
VllmConfig is the single config object passed throughout the entire engine, workers, and execution pipeline. It aggregates every other config class as a typed field. Instantiating it with no arguments is valid in tests; all sub-configs have defaults.
VllmConfig composition — mapping concepts to code classes:
Sources: vllm/config/vllm.py:247-328
Key fields on VllmConfig:
| Field | Type | Default | Description |
|---|---|---|---|
model_config | ModelConfig | None | Model identity, dtype, tokenizer |
cache_config | CacheConfig | CacheConfig() | KV cache block allocation |
parallel_config | ParallelConfig | ParallelConfig() | TP/PP/DP degrees |
scheduler_config | SchedulerConfig | factory | Batch sizing, chunked prefill |
attention_config | AttentionConfig | AttentionConfig() | Backend selection |
compilation_config | CompilationConfig | CompilationConfig() | torch.compile, CUDA graph config |
lora_config | LoRAConfig \| None | None | LoRA adapter settings
speculative_config | SpeculativeConfig \| None | None | Speculative decoding
kv_transfer_config | KVTransferConfig \| None | None | Disaggregated KV transfer
optimization_level | OptimizationLevel | O2 | Compilation/graph optimization level |
performance_mode | PerformanceMode | "balanced" | Runtime performance strategy |
instance_id | str | set in __post_init__ | Per-instance unique ID |
Sources: vllm/config/vllm.py:246-329
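The aggregation pattern can be sketched with stub classes (hypothetical names; the real VllmConfig carries many more sub-configs). Every sub-config is a typed field with a default factory, which is why a bare `VllmConfig()` is valid, and instance_id is generated in __post_init__ when not supplied:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class CacheStub:
    block_size: int = 16
    gpu_memory_utilization: float = 0.9

@dataclass
class ParallelStub:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

@dataclass
class VllmConfigSketch:
    """Aggregate root: each sub-config is a typed field with a
    default factory, so instantiation with no arguments is valid."""
    cache_config: CacheStub = field(default_factory=CacheStub)
    parallel_config: ParallelStub = field(default_factory=ParallelStub)
    instance_id: str = ""

    def __post_init__(self):
        # Mirrors VllmConfig: a per-instance ID is filled in if unset.
        if not self.instance_id:
            self.instance_id = uuid.uuid4().hex[:8]
```

Using default factories (rather than a shared default instance) ensures each engine gets independent mutable sub-config objects.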
Defined in vllm/config/model.py:99, ModelConfig holds all information about the model to load: its identity, tokenizer settings, data type, context length, and quantization. The __post_init__ method fetches the hf_config from HuggingFace (or a local path) and populates derived attributes such as runner_type, hf_text_config, dtype (resolved from its string alias), and max_model_len.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
model | str | "Qwen/Qwen3-0.6B" | HF repo ID or local path |
tokenizer | str \| None | None (falls back to model) | Override tokenizer path
tokenizer_mode | TokenizerMode \| str | "auto" | "auto", "hf", "slow", "mistral"
dtype | ModelDType \| torch.dtype | "auto" | Resolved to torch.dtype in __post_init__
max_model_len | int | None | Auto-derived from HF config if unset
quantization | QuantizationMethods \| str \| None | None | Quantization method name
seed | int | 0 | Global RNG seed |
enforce_eager | bool | False | Disable CUDA graph capture |
trust_remote_code | bool | False | Allow custom model code from HF |
runner | RunnerOption | "auto" | "auto", "generate", "pooling", "draft" |
served_model_name | str \| list[str] \| None | None | Names exposed via API
hf_config | PretrainedConfig | (loaded in __post_init__) | Not set by user directly |
multimodal_config | MultiModalConfig \| None | (inferred) | Set if model is multimodal
pooler_config | PoolerConfig | None | None | Pooling settings for embedding models |
logprobs_mode | LogprobsMode | "raw_logprobs" | Content of returned logprobs |
hf_overrides | HfOverrides | {} | Dict or callable to patch HF config |
TokenizerMode, ModelDType, RunnerOption, ConvertOption, and LogprobsMode are all Literal type aliases defined at the top of vllm/config/model.py:76-86.
After __post_init__, ModelConfig also stores hf_text_config, encoder_config, model_arch_config, runner_type, convert_type, and _model_info.
Sources: vllm/config/model.py:99-600
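The "auto" dtype resolution mentioned above can be sketched as follows. This is an illustrative simplification using string stand-ins for torch.dtype to keep the example dependency-free; it assumes the documented convention that "auto" follows the checkpoint's dtype but runs float32 checkpoints in float16:

```python
def resolve_dtype(dtype: str, hf_torch_dtype: str) -> str:
    """Simplified sketch of dtype resolution in __post_init__.
    `dtype` is the user-supplied alias; `hf_torch_dtype` is the
    torch_dtype reported by the HF config."""
    aliases = {"half": "float16", "float16": "float16",
               "bfloat16": "bfloat16", "float": "float32",
               "float32": "float32"}
    if dtype == "auto":
        # "auto" keeps bf16/fp16 checkpoints as-is but runs fp32
        # checkpoints in fp16 (simplified).
        return "float16" if hf_torch_dtype == "float32" else hf_torch_dtype
    return aliases[dtype]
```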
Defined in vllm/config/cache.py:38, CacheConfig controls KV cache memory allocation and eviction policies.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
block_size | BlockSize | platform-set | Token block size; set by Platform.check_and_update_config() |
gpu_memory_utilization | float | 0.9 | Fraction of GPU VRAM reserved for the model executor |
swap_space | float | 4 | CPU swap space in GiB per GPU |
cache_dtype | CacheDType | "auto" | KV cache storage dtype: "auto", "fp8", "fp8_e4m3", "bfloat16" |
enable_prefix_caching | bool | True | Enable prefix/prompt caching |
prefix_caching_hash_algo | PrefixCachingHashAlgo | "sha256" | Hash algorithm for prefix cache keys |
num_gpu_blocks_override | int \| None | None | Force a specific number of GPU KV blocks
sliding_window | int \| None | None | Sliding window size; usually mirrored from ModelConfig
kv_offloading_size | float \| None | None | GiB of KV cache to offload to CPU/disk
kv_offloading_backend | KVOffloadingBackend | "native" | "native" or "lmcache" |
mamba_cache_dtype | MambaDType | "auto" | SSM state dtype for Mamba models |
mamba_cache_mode | MambaCacheMode | "none" | "all", "align", "none" |
calculate_kv_scales | bool | False | Compute per-block KV scales for FP8 |
BlockSize is Literal[1, 8, 16, 32, 64, 128, 256]. block_size has no static default and must be set by Platform.check_and_update_config() before use.
Sources: vllm/config/cache.py:38-200
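The interplay of gpu_memory_utilization, block size, and num_gpu_blocks_override can be illustrated with some back-of-the-envelope arithmetic. This is not vLLM's actual profiling code (which measures a real forward pass first); it only shows the shape of the calculation:

```python
from typing import Optional

def num_gpu_kv_blocks(free_gpu_bytes: int,
                      gpu_memory_utilization: float,
                      bytes_per_block: int,
                      num_gpu_blocks_override: Optional[int] = None) -> int:
    """Illustrative arithmetic: the engine carves the memory budget
    (fraction of free VRAM) into fixed-size KV cache blocks, unless
    the user forces an explicit block count."""
    if num_gpu_blocks_override is not None:
        return num_gpu_blocks_override
    budget = int(free_gpu_bytes * gpu_memory_utilization)
    return budget // bytes_per_block
```

For example, with 80 GiB free, the default 0.9 utilization, and 2 MiB per block, this yields 36864 blocks.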
Defined in vllm/config/parallel.py:93, ParallelConfig describes all distributed execution topology settings.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
tensor_parallel_size | int | 1 | Number of TP shards per pipeline stage |
pipeline_parallel_size | int | 1 | Number of pipeline stages |
data_parallel_size | int | 1 | Number of data-parallel engine replicas |
prefill_context_parallel_size | int | 1 | Context parallelism during prefill |
decode_context_parallel_size | int | 1 | Context parallelism during decode |
distributed_executor_backend | DistributedExecutorBackend \| None | None (auto) | "ray", "mp", "uni", "external_launcher"
enable_expert_parallel | bool | False | Use expert parallelism for MoE layers |
enable_eplb | bool | False | Enable expert parallel load balancing |
eplb_config | EPLBConfig | EPLBConfig() | EPLB window size, rebalance interval |
all2all_backend | All2AllBackend | "allgather_reducescatter" | MoE expert comm backend |
worker_cls | str | "auto" | Fully-qualified worker class name |
disable_custom_all_reduce | bool | False | Fall back to NCCL all-reduce |
master_addr | str | "127.0.0.1" | Master node address for multi-node MP |
master_port | int | 29501 | Master node port for multi-node MP |
world_size | int | (computed) | TP × PP; set in __post_init__ |
enable_dbo | bool | False | Dual batch overlap for microbatching |
EPLBConfig (vllm/config/parallel.py:50-90) is a nested config holding the EPLB rebalancing policy, window size, step interval, and number of redundant experts.
Sources: vllm/config/parallel.py:50-400
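The derived world_size field can be sketched with a minimal dataclass (a simplification, not the actual ParallelConfig): it is excluded from __init__ and computed as TP × PP in __post_init__:

```python
from dataclasses import dataclass, field

@dataclass
class ToyParallelConfig:
    """Sketch of world_size derivation: the field is init=False and
    filled in by __post_init__ from the TP and PP degrees."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    world_size: int = field(init=False, default=0)

    def __post_init__(self):
        # world_size counts workers per engine replica; data parallel
        # replicas each get their own world.
        self.world_size = (self.tensor_parallel_size
                           * self.pipeline_parallel_size)
```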
Defined in vllm/config/scheduler.py:25, SchedulerConfig governs how the scheduler batches requests and allocates computation across iterations.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
max_num_batched_tokens | int | 2048 | Max tokens processed per iteration |
max_num_seqs | int | 128 | Max concurrent sequences per iteration |
enable_chunked_prefill | bool | (platform-determined) | Split long prefills across iterations |
max_num_partial_prefills | int | 1 | Max simultaneous chunked prefill requests |
long_prefill_token_threshold | int | — | Tokens above which a prefill is "long" |
policy | SchedulerPolicy | "fcfs" | "fcfs" or "priority" |
async_scheduling | bool \| None | None | Overlap scheduling and execution; auto-detected
scheduler_cls | str \| type \| None | None | Custom scheduler class
stream_interval | int | — | Token streaming granularity |
disable_chunked_mm_input | bool | False | Disable chunked multimodal input processing |
disable_hybrid_kv_cache_manager | bool \| None | None | Force or disable hybrid KV manager
max_model_len and is_encoder_decoder are InitVar parameters consumed in __post_init__ to set defaults and validate other fields, but are not stored as attributes.
Sources: vllm/config/scheduler.py:25-200
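The InitVar pattern noted above can be demonstrated with a toy version (the field names are real, but the defaulting rule here is a hypothetical simplification): the value is accepted by __init__ and consumed in __post_init__, yet never stored on the instance:

```python
from dataclasses import dataclass, InitVar
from typing import Optional

@dataclass
class ToySchedulerConfig:
    """Sketch of the InitVar pattern: max_model_len flows through
    __init__ into __post_init__ but is not kept as an attribute."""
    max_model_len: InitVar[int] = 8192
    max_num_batched_tokens: Optional[int] = None

    def __post_init__(self, max_model_len: int):
        if self.max_num_batched_tokens is None:
            # Hypothetical rule: size the per-iteration token budget so
            # at least one full-length request fits.
            self.max_num_batched_tokens = max(2048, max_model_len)
```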
Defined in vllm/config/attention.py, AttentionConfig is a thin wrapper that selects the attention backend and stores per-layer overrides.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
backend | AttentionBackendEnum \| None | None (auto) | Override attention backend; auto-selected by platform if None
Backend auto-selection logic lives in the platform's check_and_update_config() hook. The resolved backend enum (AttentionBackendEnum) is defined in vllm/v1/attention/backends/registry.py. For full backend documentation, see 8.1.
Sources: vllm/engine/arg_utils.py:578, vllm/config/vllm.py:270-271
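The override-or-auto-select pattern can be sketched as follows. The enum values and capability rule here are illustrative only; the real selection logic in the platform hook considers many more factors (dtype, head size, installed kernels):

```python
from enum import Enum
from typing import Optional, Tuple

class ToyAttentionBackend(Enum):
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"
    TORCH_SDPA = "TORCH_SDPA"

def resolve_backend(user_backend: Optional[ToyAttentionBackend],
                    device_capability: Tuple[int, int]) -> ToyAttentionBackend:
    """Toy version of check_and_update_config(): honor an explicit
    user override, otherwise pick by GPU compute capability."""
    if user_backend is not None:
        return user_backend
    # Hypothetical rule: Ampere (8.0) and newer get FlashAttention.
    return (ToyAttentionBackend.FLASH_ATTN
            if device_capability >= (8, 0)
            else ToyAttentionBackend.TORCH_SDPA)
```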
Defined in vllm/config/compilation.py:336, CompilationConfig controls all torch.compile and CUDA graph capture behavior. It is covered in depth in 2.4; this section summarizes the most commonly used fields.
Key fields:
| Field | Type | Default | Notes |
|---|---|---|---|
mode | CompilationMode | None (→ VLLM_COMPILE for V1) | NONE, STOCK_TORCH_COMPILE, DYNAMO_TRACE_ONCE, VLLM_COMPILE |
cudagraph_mode | CUDAGraphMode | None (→ FULL_AND_PIECEWISE) | NONE, PIECEWISE, FULL, FULL_AND_PIECEWISE, FULL_DECODE_ONLY |
cudagraph_capture_sizes | list[int] \| None | None (auto) | Explicit batch sizes to capture CUDA graphs for
max_cudagraph_capture_size | int \| None | None | Upper bound for auto-generated capture sizes
custom_ops | list[str] | [] | Enable/disable specific custom ops ("+op" / "-op") |
splitting_ops | list[str] \| None | None | Ops used to split the graph for piecewise compilation
backend | str | "" (→ "inductor") | Inductor backend or qualified name |
pass_config | PassConfig | PassConfig() | Per-pass fusion flags |
cache_dir | str | "" (auto) | Directory for compiled artifact cache |
compile_mm_encoder | bool | False | Compile multimodal encoder as well |
PassConfig (vllm/config/compilation.py:101-270) exposes fine-grained flags like fuse_norm_quant, fuse_act_quant, fuse_allreduce_rms, enable_sp, and fuse_gemm_comms.
Sources: vllm/config/compilation.py:336-650
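The "+op" / "-op" convention in custom_ops can be sketched as a small lookup. This is a simplified stand-in for vLLM's actual enablement logic (precedence between explicit +/- entries and the "all"/"none" tokens is reduced here):

```python
from typing import List

def is_op_enabled(op: str, custom_ops: List[str],
                  default: bool = True) -> bool:
    """Sketch: "-name" disables a custom op, "+name" enables it,
    "all"/"none" flip everything, anything else keeps the default."""
    if f"-{op}" in custom_ops:
        return False
    if f"+{op}" in custom_ops:
        return True
    if "all" in custom_ops:
        return True
    if "none" in custom_ops:
        return False
    return default
```

So `["none", "+rms_norm"]` disables every custom op except rms_norm.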
Defined in vllm/config/speculative.py and referenced in vllm/config/vllm.py:276, SpeculativeConfig is None when speculative decoding is disabled. When enabled, key fields include:
| Field | Type | Notes |
|---|---|---|
method | str | Method type: "eagle", "eagle3", "ngram", "draft_model", etc. |
num_speculative_tokens | int \| None | Draft tokens to propose per step
disable_padded_drafter_batch | bool | Disables padding in draft model batches |
EagleModelTypes is a Literal type alias in vllm/config/speculative.py used in VllmConfig.__post_init__ to check async scheduling compatibility. For full speculative decoding documentation, see 4.5.
Sources: vllm/config/vllm.py:276-277, vllm/config/vllm.py:692-745
Defined in vllm/config/load.py, LoadConfig governs weight loading format and source.
| Field | Type | Default | Notes |
|---|---|---|---|
load_format | str \| LoadFormats | "auto" | "auto", "pt", "safetensors", "npcache", "dummy", "gguf", "bitsandbytes", "mistral"
download_dir | str \| None | None | Override HF cache directory
safetensors_load_strategy | str | "auto" | Strategy for loading safetensors files |
model_loader_extra_config | dict | {} | Passed to the model loader implementation |
ignore_patterns | str \| list[str] | [] | Glob patterns for weight files to skip
use_tqdm_on_load | bool | True | Show progress bar during weight loading |
Sources: vllm/engine/arg_utils.py:379-381, vllm/engine/arg_utils.py:751-768
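The ignore_patterns globs can be illustrated with stdlib fnmatch. This is a sketch of the idea, not vLLM's loader code, which applies the patterns during weight-file discovery:

```python
from fnmatch import fnmatch
from typing import List

def filter_weight_files(files: List[str],
                        ignore_patterns: List[str]) -> List[str]:
    """Keep only weight files that match no ignore pattern."""
    return [f for f in files
            if not any(fnmatch(f, pat) for pat in ignore_patterns)]
```

For instance, `ignore_patterns=["*.pt"]` skips PyTorch pickles when a repo also ships safetensors.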
Defined in vllm/config/lora.py, LoRAConfig is only present in VllmConfig when --enable-lora is set.
| Field | Type | Default | Notes |
|---|---|---|---|
max_loras | int | 1 | Maximum simultaneous LoRA adapters |
max_lora_rank | MaxLoRARanks | — | Maximum LoRA rank dimension |
max_cpu_loras | int \| None | None | CPU-cached adapters
lora_dtype | str \| torch.dtype \| None | None | LoRA weight dtype
fully_sharded_loras | bool | False | Shard LoRA across TP ranks |
Sources: vllm/engine/arg_utils.py:500-509
Defined in vllm/config/observability.py, controls metrics and tracing.
| Field | Type | Notes |
|---|---|---|
otlp_traces_endpoint | str \| None | OTLP endpoint for distributed tracing
collect_detailed_traces | list[DetailedTraceModules] \| None | Modules to trace in detail
kv_cache_metrics | bool | Enable per-request KV cache metrics |
enable_mfu_metrics | bool | Enable MFU (Model FLOP Utilization) metrics |
show_hidden_metrics_for_version | str \| None | Expose unstable metrics for a given version
Sources: vllm/engine/arg_utils.py:531-550
OptimizationLevel (vllm/config/vllm.py:64-77) is an IntEnum that provides a coarse knob controlling the startup-vs-performance trade-off by adjusting defaults in CompilationConfig and KernelConfig:
| Level | Compilation defaults | CUDA graph mode | Notes |
|---|---|---|---|
O0 | No fusion passes, no compile | NONE | Fastest startup; eager PyTorch only |
O1 | Norm/act fusion | PIECEWISE | Quick optimizations |
O2 | All O1 + allreduce fusion | FULL_AND_PIECEWISE | Default |
O3 | Same as O2 (currently) | FULL_AND_PIECEWISE | Reserved for future use |
These defaults are declared in OPTIMIZATION_LEVEL_TO_CONFIG (vllm/config/vllm.py:238-243) and applied via VllmConfig._apply_optimization_level_defaults() during __post_init__. User-provided values in CompilationConfig always take precedence.
PerformanceMode (vllm/config/vllm.py:79) is a Literal["balanced", "interactivity", "throughput"] that influences scheduling and kernel selection independently of OptimizationLevel.
Sources: vllm/config/vllm.py:64-243
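The "defaults fill in, user values win" merging described above can be sketched with plain dicts. The key names and per-level values here are illustrative, not vLLM's exact OPTIMIZATION_LEVEL_TO_CONFIG table:

```python
def apply_level_defaults(user_cfg: dict, level: int) -> dict:
    """Sketch of _apply_optimization_level_defaults(): level-derived
    defaults apply only where the user did not set a value."""
    LEVEL_DEFAULTS = {
        0: {"cudagraph_mode": "NONE", "fusion": False},
        1: {"cudagraph_mode": "PIECEWISE", "fusion": True},
        2: {"cudagraph_mode": "FULL_AND_PIECEWISE", "fusion": True},
        3: {"cudagraph_mode": "FULL_AND_PIECEWISE", "fusion": True},
    }
    merged = dict(LEVEL_DEFAULTS[level])
    merged.update(user_cfg)  # explicit user-provided values win
    return merged
```

So a user who sets cudagraph_mode explicitly keeps it even under O2, while the remaining fields follow the level.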
VllmConfig.__post_init__ (vllm/config/vllm.py:652) performs validation and reconciliation across multiple sub-configs after all fields are set. The sequence is:
VllmConfig __post_init__ validation flow:
Key cross-config constraints enforced here:
"mp", "uni", or "external_launcher".CacheConfig.kv_offloading_size is set, VllmConfig._post_init_kv_transfer_config() creates or patches KVTransferConfig to wire up the offloading backend.VllmConfig._get_quantization_config() resolves ModelConfig.quantization to a QuantizationConfig and validates GPU capability and dtype compatibility.LoRAConfig.verify_with_model_config() which checks that the model supports LoRA.Sources: vllm/config/vllm.py652-840
Each config class implements compute_hash() -> str, returning a short hex digest used to cache compiled artifacts. VllmConfig.compute_hash() (vllm/config/vllm.py:330-432) calls compute_hash() on each sub-config and hashes the concatenation.
Fields that are excluded from a config's hash are those that don't affect the computation graph structure — for example, ModelConfig.compute_hash() excludes tokenizer, seed, served_model_name, and logprobs_mode.
The resulting hash appears in the compiled artifact cache directory path, ensuring that different configurations produce separate caches.
Sources: vllm/config/vllm.py:330-432, vllm/config/model.py:311-366
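The pattern can be sketched with hashlib (an illustration of the idea, not vLLM's exact field handling): hash only the fields that can change the compiled graph, skip the excluded ones, and return a short digest:

```python
import hashlib
from typing import Set

def compute_hash(config: dict, excluded: Set[str]) -> str:
    """Hash a config's graph-affecting fields into a short hex
    digest, skipping excluded fields like seed or tokenizer."""
    items = sorted((k, repr(v)) for k, v in config.items()
                   if k not in excluded)
    return hashlib.sha256(repr(items).encode()).hexdigest()[:10]
```

Two configs differing only in an excluded field (say, seed) hash identically and therefore share a compiled-artifact cache directory, while changing a graph-affecting field such as dtype produces a new digest.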
The following diagram shows where config objects originate (from EngineArgs) and how they flow into the broader engine subsystems:
Sources: vllm/engine/arg_utils.py:361-640, vllm/config/vllm.py:246-328