This page documents vllm/envs.py, vLLM's centralized environment variable registry. It covers the module's structure, the access mechanism, helper utilities, and a categorized reference of all VLLM_* variables. For information on how these variables feed into strongly-typed configuration objects at engine startup, see Configuration Objects. For compilation-specific settings, see Compilation Configuration.
vllm/envs.py serves as the single authoritative source for all environment variable definitions in vLLM. Rather than scattering os.environ.get(...) calls throughout the codebase, every variable is declared in one place with its type, default value, and validation logic. The rest of the codebase reads variables through a module-level __getattr__ that lazily evaluates the appropriate callable from the environment_variables dict.
Usage pattern throughout the codebase:
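A self-contained mimic of that pattern (no vLLM install needed; `fake_envs` below is a stand-in for `vllm.envs`, with the dict reduced to one illustrative entry):

```python
import os
import sys
import types

# Stand-in module mirroring vllm/envs.py's structure (assumed, simplified).
_envs = types.ModuleType("fake_envs")
_envs.environment_variables = {
    "VLLM_LOGGING_LEVEL": lambda: os.environ.get("VLLM_LOGGING_LEVEL", "INFO").upper(),
}

def _module_getattr(name):
    # PEP 562 module-level __getattr__: resolve names via the dict of lambdas.
    if name in _envs.environment_variables:
        return _envs.environment_variables[name]()
    raise AttributeError(name)

_envs.__getattr__ = _module_getattr
sys.modules["fake_envs"] = _envs

# In vLLM code this line reads: import vllm.envs as envs
import fake_envs as envs

os.environ["VLLM_LOGGING_LEVEL"] = "debug"
print(envs.VLLM_LOGGING_LEVEL)  # "DEBUG" -- evaluated at access time, not import time
```

Because each access re-runs the lambda, changing the environment after import is immediately visible, which is why vLLM code imports the module rather than copying values out of it.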
The file is organized into three layers that work together:
Module structure diagram: vllm/envs.py
Sources: vllm/envs.py1-250 vllm/envs.py466-530
TYPE_CHECKING Block
vllm/envs.py14-248 contains a block guarded by if TYPE_CHECKING:. This block declares every variable with its Python type and default value. It is never executed at runtime; its sole purpose is to give type checkers and IDE tooling accurate type information when code writes import vllm.envs as envs and then accesses, e.g., envs.VLLM_CACHE_ROOT.
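The shape of that block, sketched with a few representative names (an illustrative excerpt, not the real 200-odd declarations):

```python
import os
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Never executed at runtime: these annotated assignments exist only so
    # type checkers and IDEs know the name and type behind each lazy attribute.
    VLLM_CACHE_ROOT: str = os.path.expanduser("~/.cache/vllm")
    VLLM_PORT: int | None = None
    VLLM_USE_MODELSCOPE: bool = False

# At runtime the block is skipped, so the names are absent from the module
# namespace and attribute access falls through to __getattr__ instead.
print("VLLM_CACHE_ROOT" in globals())  # False
```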
environment_variables Dict
vllm/envs.py473-881 defines the dict:
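Representative entries, sketched from the behaviors documented on this page (the real dict spans hundreds of variables; maybe_convert_int is simplified here):

```python
import os

def maybe_convert_int(value):
    # Simplified helper: unset stays None, strings parse to int.
    return None if value is None else int(value)

environment_variables = {
    # Zero-argument callables; each access re-reads os.environ.
    "VLLM_DP_RANK": lambda: int(os.environ.get("VLLM_DP_RANK", "0")),
    "VLLM_TARGET_DEVICE": lambda: os.environ.get("VLLM_TARGET_DEVICE", "cuda"),
    "VLLM_PORT": lambda: maybe_convert_int(os.environ.get("VLLM_PORT")),
}
```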
Each entry is a zero-argument callable. The lambdas are evaluated each time the variable is accessed, so environment changes after import are reflected.
__getattr__
The module defines a __getattr__ function that intercepts attribute access (e.g., envs.VLLM_CACHE_ROOT), looks up the name in environment_variables, and calls the associated lambda. This is what makes the import vllm.envs as envs pattern work.
Several utility functions are defined to reduce repetition in the variable lambdas.
Utility functions in vllm/envs.py
| Function | Location | Purpose |
|---|---|---|
get_default_cache_root() | vllm/envs.py250-254 | Returns XDG_CACHE_HOME or ~/.cache |
get_default_config_root() | vllm/envs.py257-261 | Returns XDG_CONFIG_HOME or ~/.config |
maybe_convert_int(value) | vllm/envs.py264-267 | Converts string to int or returns None |
maybe_convert_bool(value) | vllm/envs.py270-273 | Converts "0"/"1" string to bool or None |
disable_compile_cache() | vllm/envs.py276-277 | Reads VLLM_DISABLE_COMPILE_CACHE as bool |
use_aot_compile() | vllm/envs.py280-295 | Computes AOT compile default based on torch version |
env_with_choices(...) | vllm/envs.py298-340 | Validated single-value enum env var |
env_list_with_choices(...) | vllm/envs.py343-395 | Validated comma-separated list env var |
env_set_with_choices(...) | vllm/envs.py398-413 | Like env_list_with_choices but returns set |
get_vllm_port() | vllm/envs.py416-442 | Parses VLLM_PORT with Kubernetes URI detection |
get_env_or_set_default(...) | vllm/envs.py445-463 | Returns env var or generates+writes a default |
Sources: vllm/envs.py250-463
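Two of these helpers, sketched to match the behaviors in the table above (signatures simplified; the real implementations live at the cited line ranges):

```python
import os
from typing import Optional

def get_default_cache_root() -> str:
    # XDG_CACHE_HOME wins; otherwise fall back to ~/.cache
    return os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))

def maybe_convert_bool(value: Optional[str]) -> Optional[bool]:
    # "0"/"1" strings become bool; unset stays None so callers can
    # distinguish "not set" from an explicit False.
    return None if value is None else bool(int(value))
```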
env_with_choices is used for variables with a fixed set of allowed values. For example:
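A plausible sketch of the helper and one such declaration (signature and variable choice are illustrative; VLLM_KV_CACHE_LAYOUT's allowed values come from the table below):

```python
import os
from typing import Callable, Optional

def env_with_choices(
    name: str, default: Optional[str], choices: list[str]
) -> Callable[[], Optional[str]]:
    # Returns a lazy getter that validates the value against the allowed set.
    def _get() -> Optional[str]:
        value = os.environ.get(name, default)
        if value is not None and value not in choices:
            raise ValueError(
                f"Invalid value {value!r} for {name}. Valid options: {choices}."
            )
        return value
    return _get

# e.g. how a variable like VLLM_KV_CACHE_LAYOUT might be declared:
getter = env_with_choices("VLLM_KV_CACHE_LAYOUT", None, ["NHD", "HND"])
os.environ["VLLM_KV_CACHE_LAYOUT"] = "NHD"
print(getter())  # "NHD"
```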
If an invalid value is set, a ValueError is raised with a message listing the valid options.
How vllm/envs.py is consumed
Sources: vllm/model_executor/layers/quantization/fp8.py10-11 vllm/model_executor/layers/quantization/mxfp4.py8 vllm/model_executor/layers/fused_moe/layer.py11 vllm/model_executor/layers/fused_moe/config.py9 vllm/utils/flashinfer.py20 vllm/model_executor/layers/quantization/quark/quark_moe.py9
The following variables are read by setup.py or at package build time; they affect compilation, not inference.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_TARGET_DEVICE | str | "cuda" | Target hardware: cuda, rocm, cpu |
VLLM_MAIN_CUDA_VERSION | str | "12.9" | CUDA version string for build |
VLLM_FLOAT32_MATMUL_PRECISION | "highest"|"high"|"medium" | "highest" | torch.set_float32_matmul_precision mode in workers |
MAX_JOBS | str|None | None | Parallel compilation job count |
NVCC_THREADS | str|None | None | Threads per nvcc invocation |
VLLM_USE_PRECOMPILED | bool | False | Load precompiled .so binaries |
VLLM_SKIP_PRECOMPILED_VERSION_SUFFIX | bool | False | Omit +precompiled from version string |
VLLM_DOCKER_BUILD_CONTEXT | bool | False | Force precompiled in Docker context |
CMAKE_BUILD_TYPE | "Debug"|"Release"|"RelWithDebInfo"|None | None | CMake build type |
VERBOSE | bool | False | Verbose build output |
VLLM_CONFIG_ROOT | str | ~/.config/vllm | Config file root (also affects install paths) |
Sources: vllm/envs.py474-530
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CACHE_ROOT | str | ~/.cache/vllm | Root for all vLLM cache files; respects XDG_CACHE_HOME |
VLLM_ASSETS_CACHE | str | ~/.cache/vllm/assets | Downloaded assets cache |
VLLM_ASSETS_CACHE_MODEL_CLEAN | bool | False | Clean model files from assets cache on exit |
VLLM_XLA_CACHE_PATH | str | ~/.cache/vllm/xla_cache | XLA persistent cache (TPU only) |
VLLM_RPC_BASE_PATH | str | tempfile.gettempdir() | IPC socket base path for multiprocessing mode |
VLLM_LORA_RESOLVER_CACHE_DIR | str|None | None | Local directory for unrecognized LoRA adapters |
VLLM_TUNED_CONFIG_FOLDER | str|None | None | Folder for pre-tuned kernel configs |
Sources: vllm/envs.py531-545 vllm/envs.py739-750
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_HOST_IP | str | "" | IP of the current node in multi-node setups |
VLLM_PORT | int|None | None | Base communication port; incremented for additional ports |
LOCAL_RANK | int | 0 | Local rank within a node for GPU assignment |
CUDA_VISIBLE_DEVICES | str|None | None | GPU visibility control |
VLLM_NCCL_SO_PATH | str|None | None | Path to a specific NCCL .so file |
LD_LIBRARY_PATH | str|None | None | Fallback for NCCL library discovery |
VLLM_NCCL_INCLUDE_PATH | str|None | None | NCCL include path |
VLLM_PP_LAYER_PARTITION | str|None | None | Manual pipeline stage partition spec |
VLLM_DP_RANK | int | 0 | Data parallel rank |
VLLM_DP_RANK_LOCAL | int | -1 | Local data parallel rank |
VLLM_DP_SIZE | int | 1 | Data parallel world size |
VLLM_DP_MASTER_IP | str | "" | Master IP for DP coordination |
VLLM_DP_MASTER_PORT | int | 0 | Master port for DP coordination |
VLLM_WORKER_MULTIPROC_METHOD | "fork"|"spawn" | "fork" | Worker process spawn method |
VLLM_DISABLE_PYNCCL | bool | False | Disable PyNCCL (use NCCL via torch.distributed instead) |
VLLM_SKIP_P2P_CHECK | bool | False | Skip GPU peer-to-peer connectivity check |
VLLM_ALLREDUCE_USE_SYMM_MEM | bool | True | Use symmetric memory for all-reduce |
VLLM_ALLREDUCE_USE_FLASHINFER | bool | False | Use FlashInfer for all-reduce |
VLLM_USE_NCCL_SYMM_MEM | bool | False | Use NCCL symmetric memory |
VLLM_RINGBUFFER_WARNING_INTERVAL | int | 60 | Seconds between ring buffer full warnings |
VLLM_LOOPBACK_IP | str | "" | Loopback IP override |
Sources: vllm/envs.py540-627
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE | "auto"|"nccl"|"shm" | "auto" | Channel type for Ray Compiled DAG pipeline parallelism |
VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM | bool | False | GPU communication overlap in Ray Compiled DAG |
VLLM_USE_RAY_WRAPPED_PP_COMM | bool | True | Use vLLM's communicator wrapper with Ray Compiled DAG |
VLLM_RAY_PER_WORKER_GPUS | float | 1.0 | GPUs allocated per Ray worker |
VLLM_RAY_BUNDLE_INDICES | str | "" | Ray bundle index assignments |
VLLM_RAY_DP_PACK_STRATEGY | "strict"|"fill"|"span" | "strict" | DP packing strategy for Ray |
VLLM_RAY_EXTRA_ENV_VAR_PREFIXES_TO_COPY | str | "" | Env var prefixes to propagate to Ray workers |
VLLM_RAY_EXTRA_ENV_VARS_TO_COPY | str | "" | Specific env vars to propagate to Ray workers |
Sources: vllm/envs.py713-744
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CONFIGURE_LOGGING | bool | True | If False, vLLM does not configure logging at all |
VLLM_LOGGING_LEVEL | str | "INFO" | Default log level (uppercased) |
VLLM_LOGGING_PREFIX | str | "" | String prepended to all log messages |
VLLM_LOGGING_STREAM | str | "ext://sys.stdout" | Log output stream |
VLLM_LOGGING_CONFIG_PATH | str|None | None | Path to a JSON logging config file |
VLLM_LOGGING_COLOR | str | "auto" | Color output: "auto", "1" (always), "0" (never) |
NO_COLOR | bool | False | Standard ANSI color disable flag |
VLLM_LOG_STATS_INTERVAL | float | 10.0 | Seconds between stats log emissions |
VLLM_LOG_BATCHSIZE_INTERVAL | float | -1 | Seconds between batch size logs; -1 = disabled |
VLLM_TRACE_FUNCTION | int | 0 | Enable function call tracing when 1 |
VLLM_DEBUG_LOG_API_SERVER_RESPONSE | bool | False | Log full API server responses (debug) |
Sources: vllm/envs.py659-686
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_ENGINE_ITERATION_TIMEOUT_S | int | 60 | Seconds before an engine iteration is considered hung |
VLLM_ENGINE_READY_TIMEOUT_S | int | 600 | Seconds to wait for engine core to become ready |
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS | int | 300 | Timeout for execute_model calls |
VLLM_RPC_TIMEOUT | int | 10000 | ZeroMQ client timeout in milliseconds |
VLLM_HTTP_TIMEOUT_KEEP_ALIVE | int | 5 | HTTP keep-alive timeout in seconds |
VLLM_API_KEY | str|None | None | Bearer token for API authentication |
VLLM_KEEP_ALIVE_ON_ENGINE_DEATH | bool | False | Keep HTTP server alive after engine failure |
VLLM_V1_OUTPUT_PROC_CHUNK_SIZE | int | 128 | Output processing chunk size in v1 engine |
VLLM_ENABLE_V1_MULTIPROCESSING | bool | True | Enable multiprocessing in v1 engine |
VLLM_SERVER_DEV_MODE | bool | False | Enable developer mode on the server |
VLLM_ENABLE_RESPONSES_API_STORE | bool | False | Enable persistent Responses API storage |
VLLM_MQ_MAX_CHUNK_BYTES_MB | int | 16 | Max message queue chunk size in MB |
VLLM_DISABLE_REQUEST_ID_RANDOMIZATION | bool | False | Disable UUID request IDs (for testing) |
Sources: vllm/envs.py627-637 vllm/envs.py856-868
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_MODELSCOPE | bool | False | Load models from ModelScope instead of HuggingFace Hub |
VLLM_MODEL_REDIRECT_PATH | str|None | None | Redirect model paths (local override) |
VLLM_ALLOW_LONG_MAX_MODEL_LEN | bool | False | Allow max_model_len greater than model's config maximum |
S3_ACCESS_KEY_ID | str|None | None | S3 access key (for tensorizer) |
S3_SECRET_ACCESS_KEY | str|None | None | S3 secret key (for tensorizer) |
S3_ENDPOINT_URL | str|None | None | S3 endpoint URL (for tensorizer) |
Sources: vllm/envs.py555-646
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_IMAGE_FETCH_TIMEOUT | int | 5 | HTTP timeout (seconds) for image fetching |
VLLM_VIDEO_FETCH_TIMEOUT | int | 30 | HTTP timeout (seconds) for video fetching |
VLLM_AUDIO_FETCH_TIMEOUT | int | 10 | HTTP timeout (seconds) for audio fetching |
VLLM_MEDIA_URL_ALLOW_REDIRECTS | bool | True | Follow HTTP redirects for media URLs |
VLLM_MEDIA_LOADING_THREAD_COUNT | int | 8 | Thread pool size for media byte loading |
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB | int | 25 | Maximum audio file size for STT requests |
VLLM_VIDEO_LOADER_BACKEND | str | "opencv" | Video I/O backend: "opencv" or "identity" |
VLLM_MEDIA_CONNECTOR | str | "http" | Media connector implementation |
VLLM_MM_HASHER_ALGORITHM | "blake3"|"sha256"|"sha512" | "blake3" | Hash algorithm for multimodal content dedup |
Sources: vllm/envs.py751-810
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_MLA_DISABLE | bool | False | Force-disable Multi-head Latent Attention (MLA) |
VLLM_KV_CACHE_LAYOUT | "NHD"|"HND"|None | None | Force a specific KV cache layout |
VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE | bool | True | Allow chunked local attention with hybrid KV cache |
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | int | 394 * 1024 * 1024 | FlashInfer workspace buffer size in bytes |
VLLM_FLASHINFER_ALLREDUCE_BACKEND | "auto"|"trtllm"|"mnnvl" | "auto" | FlashInfer all-reduce backend selection |
Q_SCALE_CONSTANT | int | 200 | Query scale constant for attention quantization |
K_SCALE_CONSTANT | int | 200 | Key scale constant for attention quantization |
V_SCALE_CONSTANT | int | 100 | Value scale constant for attention quantization |
Sources: vllm/envs.py122-126 vllm/envs.py184-185
torch.compile
For full documentation of compilation behavior, see Compilation Configuration. The following variables directly influence that subsystem.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_DISABLE_COMPILE_CACHE | bool | False | Disable the compilation artifact cache |
VLLM_USE_AOT_COMPILE | bool | computed | Ahead-of-time compile during warmup |
VLLM_USE_BYTECODE_HOOK | bool | True | Enable bytecode hook in TorchCompileWithNoGuardsWrapper |
VLLM_FORCE_AOT_LOAD | bool | False | Require AOT artifacts to exist; hard error otherwise |
VLLM_USE_MEGA_AOT_ARTIFACT | bool | False | Load compiled models from mega AOT artifact |
VLLM_USE_STANDALONE_COMPILE | bool | True | Enable Inductor standalone compile |
VLLM_ENABLE_PREGRAD_PASSES | bool | False | Enable Inductor pre-grad passes (normally skipped) |
VLLM_PATTERN_MATCH_DEBUG | str|None | None | fx.Node name to debug in custom passes |
VLLM_DEBUG_DUMP_PATH | str|None | None | Directory to dump fx graphs |
VLLM_COMPILE_CACHE_SAVE_FORMAT | "binary"|"unpacked" | "binary" | Format for cached compile artifacts |
VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE | bool | True | Enable Inductor max autotune |
VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING | bool | True | Enable coordinate descent tuning |
VLLM_ENABLE_CUDAGRAPH_GC | bool | False | Enable CUDA graph garbage collection |
VLLM_USE_AOT_COMPILE has a computed default: it is True on torch >= 2.10.0 when disable_compile_cache() is False and vllm_is_batch_invariant() is False. See vllm/envs.py280-295
Sources: vllm/envs.py579-622 vllm/envs.py121-122
Quantization environment variable mapping
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_TRITON_AWQ | bool | False | Use Triton AWQ kernels instead of default |
VLLM_USE_DEEP_GEMM | bool | True | Enable DeepGEMM library for FP8 GEMMs |
VLLM_MOE_USE_DEEP_GEMM | bool | True | Enable DeepGEMM for MoE FP8 GEMMs |
VLLM_USE_DEEP_GEMM_E8M0 | bool | True | Enable E8M0 scale format in DeepGEMM |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | bool | True | Use TMA-aligned scales in DeepGEMM |
VLLM_DEEP_GEMM_WARMUP | "skip"|"full"|"relax" | "relax" | DeepGEMM JIT warmup strategy |
VLLM_MARLIN_USE_ATOMIC_ADD | bool | False | Use atomic add in Marlin kernel |
VLLM_MARLIN_INPUT_DTYPE | "int8"|"fp8"|None | None | Override Marlin input dtype |
VLLM_MXFP4_USE_MARLIN | bool|None | None | Force MXFP4 to use Marlin backend |
VLLM_NVFP4_GEMM_BACKEND | str|None | None | Override NVFP4 GEMM backend |
VLLM_USE_NVFP4_CT_EMULATIONS | bool | False | Use NVFP4 chip-level emulation |
VLLM_XGRAMMAR_CACHE_MB | int | 0 | xGrammar cache size in MB; 0 = disabled |
Q_SCALE_CONSTANT | int | 200 | Q attention quantization scale |
K_SCALE_CONSTANT | int | 200 | K attention quantization scale |
V_SCALE_CONSTANT | int | 100 | V attention quantization scale |
VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | bool | True | Use FlashInfer for block-scale FP8 GEMM |
VLLM_DEEPEPLL_NVFP4_DISPATCH | bool | False | NVFP4 dispatch via DeepEP LL |
Sources: vllm/envs.py94-100 vllm/envs.py145-168 vllm/envs.py153-161
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_FUSED_MOE_CHUNK_SIZE | int | 16384 | Token chunk size for fused MoE kernel |
VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | bool | True | Enable activation chunking in fused MoE |
VLLM_USE_FUSED_MOE_GROUPED_TOPK | bool | True | Enable grouped top-k in fused MoE |
VLLM_MOE_DP_CHUNK_SIZE | int | 256 | Max tokens per MoE data-parallel chunk |
VLLM_ENABLE_MOE_DP_CHUNK | bool | True | Enable MoE data-parallel chunking |
VLLM_RANDOMIZE_DP_DUMMY_INPUTS | bool | False | Randomize dummy inputs in DP (testing) |
VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | int | 163840 | Max tokens per expert for FP4 MoE kernels |
VLLM_DBO_COMM_SMS | int | 20 | SMs for DBO communication |
VLLM_DISABLE_SHARED_EXPERTS_STREAM | bool | False | Disable shared expert stream overlap |
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD | int | 256 | Token threshold for shared expert stream |
For variable usage by FusedMoE and FusedMoEConfig, see vllm/model_executor/layers/fused_moe/layer.py557 and vllm/model_executor/layers/fused_moe/config.py9
Sources: vllm/envs.py56-58 vllm/envs.py139-141 vllm/envs.py824-832
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_FLASHINFER_SAMPLER | bool|None | None | Force FlashInfer sampler on/off |
VLLM_HAS_FLASHINFER_CUBIN | bool | False | Override cubin detection (signals preinstalled cubin) |
VLLM_USE_FLASHINFER_MOE_FP16 | bool | False | Enable FlashInfer FP16 MoE kernel |
VLLM_USE_FLASHINFER_MOE_FP8 | bool | False | Enable FlashInfer FP8 MoE kernel |
VLLM_USE_FLASHINFER_MOE_FP4 | bool | False | Enable FlashInfer FP4 MoE kernel |
VLLM_USE_FLASHINFER_MOE_INT4 | bool | False | Enable FlashInfer INT4 MoE kernel |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | bool | False | Enable FlashInfer MXFP4/MXFP8 TRTLLM MoE (SM100) |
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | bool | False | Enable FlashInfer MXFP4/BF16 MoE (SM90) |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | bool | False | Enable FlashInfer MXFP4/MXFP8 CUTLASS MoE (SM100) |
VLLM_FLASHINFER_MOE_BACKEND | "throughput"|"latency"|"masked_gemm" | "latency" | FlashInfer MoE backend mode |
These variables are checked in get_mxfp4_backend() vllm/model_executor/layers/quantization/mxfp4.py108-183 and has_flashinfer_cubin() vllm/utils/flashinfer.py39-46
Sources: vllm/envs.py163-172 vllm/envs.py204-208
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE | int | 256 | Memory chunk size (MB) for sleeping memory on ROCm |
VLLM_ROCM_USE_AITER | bool | False | Master switch for AITER library on ROCm |
VLLM_ROCM_USE_AITER_PAGED_ATTN | bool | False | AITER paged attention |
VLLM_ROCM_USE_AITER_LINEAR | bool | True | AITER linear kernels |
VLLM_ROCM_USE_AITER_MOE | bool | True | AITER MoE kernels |
VLLM_ROCM_USE_AITER_RMSNORM | bool | True | AITER RMSNorm kernels |
VLLM_ROCM_USE_AITER_MLA | bool | True | AITER MLA kernels |
VLLM_ROCM_USE_AITER_MHA | bool | True | AITER MHA kernels |
VLLM_ROCM_USE_AITER_FP8BMM | bool | True | AITER FP8 batched MM |
VLLM_ROCM_USE_AITER_FP4BMM | bool | True | AITER FP4 batched MM |
VLLM_ROCM_USE_AITER_TRITON_ROPE | bool | True | AITER Triton RoPE |
VLLM_ROCM_USE_AITER_TRITON_GEMM | bool | True | AITER Triton GEMM |
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM | bool | False | AITER FP4 ASM GEMM |
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION | bool | False | AITER unified attention |
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | bool | False | AITER shared experts fusion |
VLLM_ROCM_USE_SKINNY_GEMM | bool | True | Use skinny GEMM kernels on ROCm |
VLLM_ROCM_FP8_PADDING | bool | True | Enable FP8 weight padding on ROCm |
VLLM_ROCM_MOE_PADDING | bool | True | Enable MoE weight padding on ROCm |
VLLM_ROCM_CUSTOM_PAGED_ATTN | bool | True | Use custom paged attention on ROCm |
VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT | bool | False | Shuffle KV cache layout on ROCm |
VLLM_ROCM_FP8_MFMA_PAGE_ATTN | bool | False | FP8 MFMA paged attention on ROCm |
VLLM_ROCM_QUICK_REDUCE_QUANTIZATION | "FP"|"INT8"|"INT6"|"INT4"|"NONE" | "NONE" | Quick reduce quantization scheme |
VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16 | bool | True | Cast BF16 to FP16 in quick reduce |
VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB | int|None | None | Max tensor size for quick reduce |
Sources: vllm/envs.py101-119 vllm/envs.py187-192
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_XLA_USE_SPMD | bool | False | Enable SPMD mode for TPU backend |
VLLM_XLA_CHECK_RECOMPILATION | bool | False | Assert on XLA recompilation after each step |
VLLM_XLA_CACHE_PATH | str | ~/.cache/vllm/xla_cache | XLA persistent cache directory |
VLLM_TPU_BUCKET_PADDING_GAP | int | 0 | TPU bucket padding gap for sequence lengths |
VLLM_TPU_MOST_MODEL_LEN | int|None | None | Most common model length hint for TPU |
VLLM_TPU_USING_PATHWAYS | bool | False | Using Pathways runtime for TPU |
Sources: vllm/envs.py810-823
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CPU_KVCACHE_SPACE | int|None | None | KV cache space in GB for CPU backend (default 4 GB) |
VLLM_CPU_OMP_THREADS_BIND | str | "auto" | OpenMP thread CPU binding spec (e.g., "0-31") |
VLLM_CPU_NUM_OF_RESERVED_CPU | int|None | None | CPU cores reserved from OMP threads |
VLLM_CPU_SGL_KERNEL | bool | False | Use SGL kernels (optimized for small batch) on CPU |
Sources: vllm/envs.py696-711
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_PLUGINS | list[str]|None | None | Comma-separated plugin names; None = all; "" = none |
VLLM_ALLOW_RUNTIME_LORA_UPDATING | bool | False | Enable hot-loading LoRA adapters at runtime |
VLLM_LORA_RESOLVER_CACHE_DIR | str|None | None | Local directory for LoRA adapter resolution |
VLLM_LORA_RESOLVER_HF_REPO_LIST | str|None | None | Comma-separated HF repos for LoRA resolution |
VLLM_LORA_DISABLE_PDL | bool | False | Disable PDL for LoRA |
Sources: vllm/envs.py862-880
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USAGE_STATS_SERVER | str | "https://stats.vllm.ai" | Stats reporting server URL |
VLLM_NO_USAGE_STATS | bool | False | Disable usage stats collection entirely |
VLLM_DO_NOT_TRACK | bool | False | Disable tracking; the standard DO_NOT_TRACK variable is honored as well
VLLM_USAGE_SOURCE | str | "production" | Tag for the usage stats source |
Sources: vllm/envs.py647-658
For context on the transfer system, see KV Cache Transfer and Disaggregated Serving.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_NIXL_SIDE_CHANNEL_HOST | str | "localhost" | NIXL side channel host for KV transfer |
VLLM_NIXL_SIDE_CHANNEL_PORT | int | 5600 | NIXL side channel port |
VLLM_NIXL_ABORT_REQUEST_TIMEOUT | int | 480 | Seconds before aborting a NIXL request |
VLLM_MOONCAKE_BOOTSTRAP_PORT | int | 8998 | Mooncake connector bootstrap port |
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT | int | 480 | Mooncake request abort timeout |
VLLM_MORIIO_CONNECTOR_READ_MODE | bool | False | Moriio connector read mode |
VLLM_MORIIO_QP_PER_TRANSFER | int | 1 | Moriio QPs per transfer |
VLLM_MORIIO_POST_BATCH_SIZE | int | -1 | Moriio post batch size |
VLLM_MORIIO_NUM_WORKERS | int | 1 | Moriio number of workers |
VLLM_DEEPEP_BUFFER_SIZE_MB | int | 1024 | DeepEP buffer size in MB |
VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE | bool | False | Force intra-node for DeepEP HT |
VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL | bool | False | Use MNNVL for DeepEP LL |
Sources: vllm/envs.py177-199 vllm/envs.py221-224
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_MSGPACK_ZERO_COPY_THRESHOLD | int | 256 | Byte threshold for zero-copy msgpack serialization |
VLLM_ALLOW_INSECURE_SERIALIZATION | bool | False | Allow insecure pickle-based serialization |
VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME | str | "VLLM_OBJECT_STORAGE_SHM_BUFFER" | Shared memory buffer name for object storage |
Sources: vllm/envs.py173-175 vllm/envs.py220
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_GC_DEBUG | str | "" | Garbage collection debug string |
VLLM_DEBUG_WORKSPACE | bool | False | Enable debug workspace |
VLLM_COMPUTE_NANS_IN_LOGITS | bool | False | Check model logits for NaNs (debugging)
VLLM_CUSTOM_SCOPES_FOR_PROFILING | bool | False | Enable custom profiling scopes |
VLLM_NVTX_SCOPES_FOR_PROFILING | bool | False | Enable NVTX profiling scopes |
VLLM_DEBUG_MFU_METRICS | bool | False | Debug model FLOPs utilization metrics |
VLLM_LOG_MODEL_INSPECTION | bool | False | Log model layer inspection |
VLLM_DISABLED_KERNELS | list[str] | [] | Kernel names to disable at runtime |
VLLM_USE_OINK_OPS | bool | False | Enable OINK ops |
VLLM_DISABLE_LOG_LOGO | bool | False | Suppress the vLLM logo at startup |
Sources: vllm/envs.py98-100 vllm/envs.py186 vllm/envs.py217-218
| Variable | Type | Default | Purpose |
|---|---|---|---|
CUDA_HOME | str|None | None | CUDA toolkit home directory |
VLLM_CUDART_SO_PATH | str|None | None | Path to libcudart.so |
VLLM_ENABLE_CUDA_COMPATIBILITY | bool | False | Enable CUDA compatibility mode |
VLLM_CUDA_COMPATIBILITY_PATH | str|None | None | Path for CUDA compatibility libraries |
VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY | bool | False | Disable pinned memory for weight offloading |
VLLM_WEIGHT_OFFLOADING_DISABLE_UVA | bool | False | Disable unified virtual addressing for weight offloading |
VLLM_USE_FBGEMM | bool | False | Enable FBGEMM library |
VLLM_SLEEP_WHEN_IDLE | bool | False | Put workers to sleep when idle |
VLLM_KV_EVENTS_USE_INT_BLOCK_HASHES | bool | True | Use integer block hashes for KV events |
VLLM_TOOL_PARSE_REGEX_TIMEOUT_SECONDS | int | 1 | Tool parser regex timeout |
VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY | bool | False | Retry on JSON parse errors in tool calls |
VLLM_ELASTIC_EP_SCALE_UP_LAUNCH | bool | False | Enable elastic EP scale-up launch |
VLLM_ELASTIC_EP_DRAIN_REQUESTS | bool | False | Drain requests on elastic EP scale events |
VLLM_V1_USE_OUTLINES_CACHE | bool | False | Use Outlines structured output cache |
Sources: vllm/envs.py131 vllm/envs.py182-184 vllm/envs.py199-214
VLLM_PORT Special Handling
VLLM_PORT has dedicated parsing logic in get_vllm_port() vllm/envs.py416-442. If VLLM_PORT is set to a URI string (a common Kubernetes service-discovery mishap), a ValueError with a descriptive message is raised rather than failing silently.
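A sketch of that documented behavior (not the exact implementation; the URI example is illustrative):

```python
import os
from typing import Optional

def get_vllm_port() -> Optional[int]:
    value = os.environ.get("VLLM_PORT")
    if value is None:
        return None
    if "://" in value:
        # Kubernetes injects e.g. VLLM_PORT="tcp://10.0.0.1:5600" when a
        # service is named "vllm"; surface that clearly instead of an
        # opaque int() parse failure.
        raise ValueError(
            f"VLLM_PORT is a URI ({value!r}), likely injected by Kubernetes "
            "service discovery; set it to a plain integer port instead."
        )
    return int(value)
```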
Sources: vllm/envs.py416-442
use_aot_compile() Logic
VLLM_USE_AOT_COMPILE has non-trivial default logic (vllm/envs.py280-295):
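A sketch of that default computation, assuming the rule stated earlier (torch >= 2.10.0, compile cache enabled, not batch-invariant). The torch_version parameter and the VLLM_BATCH_INVARIANT read inside vllm_is_batch_invariant() are paraphrased stand-ins, not the real signatures:

```python
import os

def disable_compile_cache() -> bool:
    return os.environ.get("VLLM_DISABLE_COMPILE_CACHE", "0") == "1"

def vllm_is_batch_invariant() -> bool:
    # Stand-in; in vLLM this inspects batch-invariant mode elsewhere.
    return os.environ.get("VLLM_BATCH_INVARIANT", "0") == "1"

def use_aot_compile(torch_version: str) -> bool:
    # An explicit VLLM_USE_AOT_COMPILE setting always wins; otherwise the
    # default is on for torch >= 2.10.0 when the compile cache is usable
    # and the run is not batch-invariant.
    explicit = os.environ.get("VLLM_USE_AOT_COMPILE")
    if explicit is not None:
        return explicit == "1"
    major_minor = tuple(int(p) for p in torch_version.split(".")[:2])
    return (
        major_minor >= (2, 10)
        and not disable_compile_cache()
        and not vllm_is_batch_invariant()
    )
```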
Sources: vllm/envs.py280-295
The comment block at vllm/envs.py466-469 contains # --8<-- [start:env-vars-definition]. This marker is read by the documentation generator to extract the environment variable list for the official docs. The variables between [start:env-vars-definition] and the corresponding end marker are automatically included in the published documentation at https://docs.vllm.ai/en/stable/serving/env_vars.html.
Sources: vllm/envs.py466-473