This page documents vllm/envs.py, vLLM's centralized environment variable registry. It covers the module's structure, the access mechanism, helper utilities, and a categorized reference of all VLLM_* variables. For information on how these variables feed into strongly-typed configuration objects at engine startup, see Configuration Objects. For compilation-specific settings, see Compilation Configuration.
vllm/envs.py serves as the single authoritative source for all environment variable definitions in vLLM. Rather than scattering os.environ.get(...) calls throughout the codebase, every variable is declared in one place with its type, default value, and validation logic. The rest of the codebase reads variables through a module-level __getattr__ that lazily evaluates the appropriate callable from the environment_variables dict.
Usage pattern throughout the codebase:
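A self-contained mimic of that pattern (no vLLM install needed; `fake_envs` below is a stand-in for `vllm.envs`, with the dict reduced to one illustrative entry):

```python
import os
import sys
import types

# Stand-in module mirroring vllm/envs.py's structure (assumed, simplified).
_envs = types.ModuleType("fake_envs")
_envs.environment_variables = {
    "VLLM_LOGGING_LEVEL": lambda: os.environ.get("VLLM_LOGGING_LEVEL", "INFO").upper(),
}

def _module_getattr(name):
    # PEP 562 module-level __getattr__: resolve names via the dict of lambdas.
    if name in _envs.environment_variables:
        return _envs.environment_variables[name]()
    raise AttributeError(name)

_envs.__getattr__ = _module_getattr
sys.modules["fake_envs"] = _envs

# In vLLM code this line reads: import vllm.envs as envs
import fake_envs as envs

os.environ["VLLM_LOGGING_LEVEL"] = "debug"
print(envs.VLLM_LOGGING_LEVEL)  # "DEBUG" -- evaluated at access time, not import time
```

Because each access re-runs the lambda, changing the environment after import is immediately visible, which is why vLLM code imports the module rather than copying values out of it.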
The file is organized into three layers that work together:
Module structure diagram: vllm/envs.py
Sources: vllm/envs.py1-250 vllm/envs.py466-530
TYPE_CHECKING Block
vllm/envs.py14-248 contains a block guarded by if TYPE_CHECKING:. This block declares every variable with its Python type and default value. It is never executed at runtime; its sole purpose is to give type checkers and IDE tooling accurate type information when code writes import vllm.envs as envs and then accesses, e.g., envs.VLLM_CACHE_ROOT.
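The shape of that block, sketched with a few representative names (an illustrative excerpt, not the real 200-odd declarations):

```python
import os
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Never executed at runtime: these annotated assignments exist only so
    # type checkers and IDEs know the name and type behind each lazy attribute.
    VLLM_CACHE_ROOT: str = os.path.expanduser("~/.cache/vllm")
    VLLM_PORT: int | None = None
    VLLM_USE_MODELSCOPE: bool = False

# At runtime the block is skipped, so the names are absent from the module
# namespace and attribute access falls through to __getattr__ instead.
print("VLLM_CACHE_ROOT" in globals())  # False
```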
environment_variables Dict
vllm/envs.py473-881 defines the dict:
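Representative entries, sketched from the behaviors documented on this page (the real dict spans hundreds of variables; maybe_convert_int is simplified here):

```python
import os

def maybe_convert_int(value):
    # Simplified helper: unset stays None, strings parse to int.
    return None if value is None else int(value)

environment_variables = {
    # Zero-argument callables; each access re-reads os.environ.
    "VLLM_DP_RANK": lambda: int(os.environ.get("VLLM_DP_RANK", "0")),
    "VLLM_TARGET_DEVICE": lambda: os.environ.get("VLLM_TARGET_DEVICE", "cuda"),
    "VLLM_PORT": lambda: maybe_convert_int(os.environ.get("VLLM_PORT")),
}
```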
Each entry is a zero-argument callable. The lambdas are evaluated each time the variable is accessed, so environment changes after import are reflected.
__getattr__
The module defines a __getattr__ function that intercepts attribute access (e.g., envs.VLLM_CACHE_ROOT), looks up the name in environment_variables, and calls the associated lambda. This is what makes the import vllm.envs as envs pattern work.
Several utility functions are defined to reduce repetition in the variable lambdas.
Utility functions in vllm/envs.py
| Function | Location | Purpose |
|---|---|---|
get_default_cache_root() | vllm/envs.py250-254 | Returns XDG_CACHE_HOME or ~/.cache |
get_default_config_root() | vllm/envs.py257-261 | Returns XDG_CONFIG_HOME or ~/.config |
maybe_convert_int(value) | vllm/envs.py264-267 | Converts string to int or returns None |
maybe_convert_bool(value) | vllm/envs.py270-273 | Converts "0"/"1" string to bool or None |
disable_compile_cache() | vllm/envs.py276-277 | Reads VLLM_DISABLE_COMPILE_CACHE as bool |
use_aot_compile() | vllm/envs.py280-295 | Computes AOT compile default based on torch version |
env_with_choices(...) | vllm/envs.py298-340 | Validated single-value enum env var |
env_list_with_choices(...) | vllm/envs.py343-395 | Validated comma-separated list env var |
env_set_with_choices(...) | vllm/envs.py398-413 | Like env_list_with_choices but returns set |
get_vllm_port() | vllm/envs.py416-442 | Parses VLLM_PORT with Kubernetes URI detection |
get_env_or_set_default(...) | vllm/envs.py445-463 | Returns env var or generates+writes a default |
Sources: vllm/envs.py250-463
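Two of these helpers, sketched to match the behaviors in the table above (signatures simplified; the real implementations live at the cited line ranges):

```python
import os
from typing import Optional

def get_default_cache_root() -> str:
    # XDG_CACHE_HOME wins; otherwise fall back to ~/.cache
    return os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))

def maybe_convert_bool(value: Optional[str]) -> Optional[bool]:
    # "0"/"1" strings become bool; unset stays None so callers can
    # distinguish "not set" from an explicit False.
    return None if value is None else bool(int(value))
```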
env_with_choices is used for variables with a fixed set of allowed values. For example:
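A plausible sketch of the helper and one such declaration (signature and variable choice are illustrative; VLLM_KV_CACHE_LAYOUT's allowed values come from the table below):

```python
import os
from typing import Callable, Optional

def env_with_choices(
    name: str, default: Optional[str], choices: list[str]
) -> Callable[[], Optional[str]]:
    # Returns a lazy getter that validates the value against the allowed set.
    def _get() -> Optional[str]:
        value = os.environ.get(name, default)
        if value is not None and value not in choices:
            raise ValueError(
                f"Invalid value {value!r} for {name}. Valid options: {choices}."
            )
        return value
    return _get

# e.g. how a variable like VLLM_KV_CACHE_LAYOUT might be declared:
getter = env_with_choices("VLLM_KV_CACHE_LAYOUT", None, ["NHD", "HND"])
os.environ["VLLM_KV_CACHE_LAYOUT"] = "NHD"
print(getter())  # "NHD"
```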
If an invalid value is set, a ValueError is raised with a message listing the valid options.
How vllm/envs.py is consumed
Sources: vllm/model_executor/layers/quantization/fp8.py10-11 vllm/model_executor/layers/quantization/mxfp4.py8 vllm/model_executor/layers/fused_moe/layer.py11 vllm/model_executor/layers/fused_moe/config.py9 vllm/utils/flashinfer.py20 vllm/model_executor/layers/quantization/quark/quark_moe.py9
The following variables are read by setup.py or at package build time; they affect compilation, not inference.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_TARGET_DEVICE | str | "cuda" | Target hardware: cuda, rocm, cpu |
VLLM_MAIN_CUDA_VERSION | str | "12.9" | CUDA version string for build |
VLLM_FLOAT32_MATMUL_PRECISION | "highest"|"high"|"medium" | "highest" | torch.set_float32_matmul_precision mode in workers |
MAX_JOBS | str|None | None | Parallel compilation job count |
NVCC_THREADS | str|None | None | Threads per nvcc invocation |
VLLM_USE_PRECOMPILED | bool | False | Load precompiled .so binaries |
VLLM_SKIP_PRECOMPILED_VERSION_SUFFIX | bool | False | Omit +precompiled from version string |
VLLM_DOCKER_BUILD_CONTEXT | bool | False | Force precompiled in Docker context |
CMAKE_BUILD_TYPE | "Debug"|"Release"|"RelWithDebInfo"|None | None | CMake build type |
VERBOSE | bool | False | Verbose build output |
VLLM_CONFIG_ROOT | str | ~/.config/vllm | Config file root (also affects install paths) |
Sources: vllm/envs.py474-530
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CACHE_ROOT | str | ~/.cache/vllm | Root for all vLLM cache files; respects XDG_CACHE_HOME |
VLLM_ASSETS_CACHE | str | ~/.cache/vllm/assets | Downloaded assets cache |
VLLM_ASSETS_CACHE_MODEL_CLEAN | bool | False | Clean model files from assets cache on exit |
VLLM_XLA_CACHE_PATH | str | ~/.cache/vllm/xla_cache | XLA persistent cache (TPU only) |
VLLM_RPC_BASE_PATH | str | tempfile.gettempdir() | IPC socket base path for multiprocessing mode |
VLLM_LORA_RESOLVER_CACHE_DIR | str|None | None | Local directory for unrecognized LoRA adapters |
VLLM_TUNED_CONFIG_FOLDER | str|None | None | Folder for pre-tuned kernel configs |
Sources: vllm/envs.py531-545 vllm/envs.py739-750
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_HOST_IP | str | "" | IP of the current node in multi-node setups |
VLLM_PORT | int|None | None | Base communication port; incremented for additional ports |
LOCAL_RANK | int | 0 | Local rank within a node for GPU assignment |
CUDA_VISIBLE_DEVICES | str|None | None | GPU visibility control |
VLLM_NCCL_SO_PATH | str|None | None | Path to a specific NCCL .so file |
LD_LIBRARY_PATH | str|None | None | Fallback for NCCL library discovery |
VLLM_NCCL_INCLUDE_PATH | str|None | None | NCCL include path |
VLLM_PP_LAYER_PARTITION | str|None | None | Manual pipeline stage partition spec |
VLLM_DP_RANK | int | 0 | Data parallel rank |
VLLM_DP_RANK_LOCAL | int | -1 | Local data parallel rank |
VLLM_DP_SIZE | int | 1 | Data parallel world size |
VLLM_DP_MASTER_IP | str | "" | Master IP for DP coordination |
VLLM_DP_MASTER_PORT | int | 0 | Master port for DP coordination |
VLLM_WORKER_MULTIPROC_METHOD | "fork"|"spawn" | "fork" | Worker process spawn method |
VLLM_DISABLE_PYNCCL | bool | False | Disable PyNCCL (use NCCL via torch.distributed instead) |
VLLM_SKIP_P2P_CHECK | bool | False | Skip GPU peer-to-peer connectivity check |
VLLM_ALLREDUCE_USE_SYMM_MEM | bool | True | Use symmetric memory for all-reduce |
VLLM_ALLREDUCE_USE_FLASHINFER | bool | False | Use FlashInfer for all-reduce |
VLLM_USE_NCCL_SYMM_MEM | bool | False | Use NCCL symmetric memory |
VLLM_RINGBUFFER_WARNING_INTERVAL | int | 60 | Seconds between ring buffer full warnings |
VLLM_LOOPBACK_IP | str | "" | Loopback IP override |
Sources: vllm/envs.py540-627
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE | "auto"|"nccl"|"shm" | "auto" | Channel type for Ray Compiled DAG pipeline parallelism |
VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM | bool | False | GPU communication overlap in Ray Compiled DAG |
VLLM_USE_RAY_WRAPPED_PP_COMM | bool | True | Use vLLM's communicator wrapper with Ray Compiled DAG |
VLLM_RAY_PER_WORKER_GPUS | float | 1.0 | GPUs allocated per Ray worker |
VLLM_RAY_BUNDLE_INDICES | str | "" | Ray bundle index assignments |
VLLM_RAY_DP_PACK_STRATEGY | "strict"|"fill"|"span" | "strict" | DP packing strategy for Ray |
VLLM_RAY_EXTRA_ENV_VAR_PREFIXES_TO_COPY | str | "" | Env var prefixes to propagate to Ray workers |
VLLM_RAY_EXTRA_ENV_VARS_TO_COPY | str | "" | Specific env vars to propagate to Ray workers |
Sources: vllm/envs.py713-744
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CONFIGURE_LOGGING | bool | True | If False, vLLM does not configure logging at all |
VLLM_LOGGING_LEVEL | str | "INFO" | Default log level (uppercased) |
VLLM_LOGGING_PREFIX | str | "" | String prepended to all log messages |
VLLM_LOGGING_STREAM | str | "ext://sys.stdout" | Log output stream |
VLLM_LOGGING_CONFIG_PATH | str|None | None | Path to a JSON logging config file |
VLLM_LOGGING_COLOR | str | "auto" | Color output: "auto", "1" (always), "0" (never) |
NO_COLOR | bool | False | Standard ANSI color disable flag |
VLLM_LOG_STATS_INTERVAL | float | 10.0 | Seconds between stats log emissions |
VLLM_LOG_BATCHSIZE_INTERVAL | float | -1 | Seconds between batch size logs; -1 = disabled |
VLLM_TRACE_FUNCTION | int | 0 | Enable function call tracing when 1 |
VLLM_DEBUG_LOG_API_SERVER_RESPONSE | bool | False | Log full API server responses (debug) |
Sources: vllm/envs.py659-686
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_ENGINE_ITERATION_TIMEOUT_S | int | 60 | Seconds before an engine iteration is considered hung |
VLLM_ENGINE_READY_TIMEOUT_S | int | 600 | Seconds to wait for engine core to become ready |
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS | int | 300 | Timeout for execute_model calls |
VLLM_RPC_TIMEOUT | int | 10000 | ZeroMQ client timeout in milliseconds |
VLLM_HTTP_TIMEOUT_KEEP_ALIVE | int | 5 | HTTP keep-alive timeout in seconds |
VLLM_API_KEY | str|None | None | Bearer token for API authentication |
VLLM_KEEP_ALIVE_ON_ENGINE_DEATH | bool | False | Keep HTTP server alive after engine failure |
VLLM_V1_OUTPUT_PROC_CHUNK_SIZE | int | 128 | Output processing chunk size in v1 engine |
VLLM_ENABLE_V1_MULTIPROCESSING | bool | True | Enable multiprocessing in v1 engine |
VLLM_SERVER_DEV_MODE | bool | False | Enable developer mode on the server |
VLLM_ENABLE_RESPONSES_API_STORE | bool | False | Enable persistent Responses API storage |
VLLM_MQ_MAX_CHUNK_BYTES_MB | int | 16 | Max message queue chunk size in MB |
VLLM_DISABLE_REQUEST_ID_RANDOMIZATION | bool | False | Disable UUID request IDs (for testing) |
Sources: vllm/envs.py627-637 vllm/envs.py856-868
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_MODELSCOPE | bool | False | Load models from ModelScope instead of HuggingFace Hub |
VLLM_MODEL_REDIRECT_PATH | str|None | None | Redirect model paths (local override) |
VLLM_ALLOW_LONG_MAX_MODEL_LEN | bool | False | Allow max_model_len greater than model's config maximum |
S3_ACCESS_KEY_ID | str|None | None | S3 access key (for tensorizer) |
S3_SECRET_ACCESS_KEY | str|None | None | S3 secret key (for tensorizer) |
S3_ENDPOINT_URL | str|None | None | S3 endpoint URL (for tensorizer) |
Sources: vllm/envs.py555-646
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_IMAGE_FETCH_TIMEOUT | int | 5 | HTTP timeout (seconds) for image fetching |
VLLM_VIDEO_FETCH_TIMEOUT | int | 30 | HTTP timeout (seconds) for video fetching |
VLLM_AUDIO_FETCH_TIMEOUT | int | 10 | HTTP timeout (seconds) for audio fetching |
VLLM_MEDIA_URL_ALLOW_REDIRECTS | bool | True | Follow HTTP redirects for media URLs |
VLLM_MEDIA_LOADING_THREAD_COUNT | int | 8 | Thread pool size for media byte loading |
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB | int | 25 | Maximum audio file size for STT requests |
VLLM_VIDEO_LOADER_BACKEND | str | "opencv" | Video I/O backend: "opencv" or "identity" |
VLLM_MEDIA_CONNECTOR | str | "http" | Media connector implementation |
VLLM_MM_HASHER_ALGORITHM | "blake3"|"sha256"|"sha512" | "blake3" | Hash algorithm for multimodal content dedup |
Sources: vllm/envs.py751-810
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_MLA_DISABLE | bool | False | Force-disable Multi-head Latent Attention (MLA) |
VLLM_KV_CACHE_LAYOUT | "NHD"|"HND"|None | None | Force a specific KV cache layout |
VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE | bool | True | Allow chunked local attention with hybrid KV cache |
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | int | 394 * 1024 * 1024 | FlashInfer workspace buffer size in bytes |
VLLM_FLASHINFER_ALLREDUCE_BACKEND | "auto"|"trtllm"|"mnnvl" | "auto" | FlashInfer all-reduce backend selection |
Q_SCALE_CONSTANT | int | 200 | Query scale constant for attention quantization |
K_SCALE_CONSTANT | int | 200 | Key scale constant for attention quantization |
V_SCALE_CONSTANT | int | 100 | Value scale constant for attention quantization |
Sources: vllm/envs.py122-126 vllm/envs.py184-185
torch.compile
For full documentation of compilation behavior, see Compilation Configuration. The following variables directly influence that subsystem.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_DISABLE_COMPILE_CACHE | bool | False | Disable the compilation artifact cache |
VLLM_USE_AOT_COMPILE | bool | computed | Ahead-of-time compile during warmup |
VLLM_USE_BYTECODE_HOOK | bool | True | Enable bytecode hook in TorchCompileWithNoGuardsWrapper |
VLLM_FORCE_AOT_LOAD | bool | False | Require AOT artifacts to exist; hard error otherwise |
VLLM_USE_MEGA_AOT_ARTIFACT | bool | False | Load compiled models from mega AOT artifact |
VLLM_USE_STANDALONE_COMPILE | bool | True | Enable Inductor standalone compile |
VLLM_ENABLE_PREGRAD_PASSES | bool | False | Enable Inductor pre-grad passes (normally skipped) |
VLLM_PATTERN_MATCH_DEBUG | str|None | None | fx.Node name to debug in custom passes |
VLLM_DEBUG_DUMP_PATH | str|None | None | Directory to dump fx graphs |
VLLM_COMPILE_CACHE_SAVE_FORMAT | "binary"|"unpacked" | "binary" | Format for cached compile artifacts |
VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE | bool | True | Enable Inductor max autotune |
VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING | bool | True | Enable coordinate descent tuning |
VLLM_ENABLE_CUDAGRAPH_GC | bool | False | Enable CUDA graph garbage collection |
VLLM_USE_AOT_COMPILE has a computed default: it is True on torch >= 2.10.0 when disable_compile_cache() is False and vllm_is_batch_invariant() is False. See vllm/envs.py280-295
Sources: vllm/envs.py579-622 vllm/envs.py121-122
Quantization environment variable mapping
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_TRITON_AWQ | bool | False | Use Triton AWQ kernels instead of default |
VLLM_USE_DEEP_GEMM | bool | True | Enable DeepGEMM library for FP8 GEMMs |
VLLM_MOE_USE_DEEP_GEMM | bool | True | Enable DeepGEMM for MoE FP8 GEMMs |
VLLM_USE_DEEP_GEMM_E8M0 | bool | True | Enable E8M0 scale format in DeepGEMM |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | bool | True | Use TMA-aligned scales in DeepGEMM |
VLLM_DEEP_GEMM_WARMUP | "skip"|"full"|"relax" | "relax" | DeepGEMM JIT warmup strategy |
VLLM_MARLIN_USE_ATOMIC_ADD | bool | False | Use atomic add in Marlin kernel |
VLLM_MARLIN_INPUT_DTYPE | "int8"|"fp8"|None | None | Override Marlin input dtype |
VLLM_MXFP4_USE_MARLIN | bool|None | None | Force MXFP4 to use Marlin backend |
VLLM_NVFP4_GEMM_BACKEND | str|None | None | Override NVFP4 GEMM backend |
VLLM_USE_NVFP4_CT_EMULATIONS | bool | False | Use NVFP4 chip-level emulation |
VLLM_XGRAMMAR_CACHE_MB | int | 0 | xGrammar cache size in MB; 0 = disabled |
Q_SCALE_CONSTANT | int | 200 | Q attention quantization scale |
K_SCALE_CONSTANT | int | 200 | K attention quantization scale |
V_SCALE_CONSTANT | int | 100 | V attention quantization scale |
VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | bool | True | Use FlashInfer for block-scale FP8 GEMM |
VLLM_DEEPEPLL_NVFP4_DISPATCH | bool | False | NVFP4 dispatch via DeepEP LL |
Sources: vllm/envs.py94-100 vllm/envs.py145-168 vllm/envs.py153-161
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_FUSED_MOE_CHUNK_SIZE | int | 16384 | Token chunk size for fused MoE kernel |
VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | bool | True | Enable activation chunking in fused MoE |
VLLM_USE_FUSED_MOE_GROUPED_TOPK | bool | True | Enable grouped top-k in fused MoE |
VLLM_MOE_DP_CHUNK_SIZE | int | 256 | Max tokens per MoE data-parallel chunk |
VLLM_ENABLE_MOE_DP_CHUNK | bool | True | Enable MoE data-parallel chunking |
VLLM_RANDOMIZE_DP_DUMMY_INPUTS | bool | False | Randomize dummy inputs in DP (testing) |
VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | int | 163840 | Max tokens per expert for FP4 MoE kernels |
VLLM_DBO_COMM_SMS | int | 20 | SMs for DBO communication |
VLLM_DISABLE_SHARED_EXPERTS_STREAM | bool | False | Disable shared expert stream overlap |
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD | int | 256 | Token threshold for shared expert stream |
For variable usage by FusedMoE and FusedMoEConfig, see vllm/model_executor/layers/fused_moe/layer.py557 and vllm/model_executor/layers/fused_moe/config.py9
Sources: vllm/envs.py56-58 vllm/envs.py139-141 vllm/envs.py824-832
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USE_FLASHINFER_SAMPLER | bool|None | None | Force FlashInfer sampler on/off |
VLLM_HAS_FLASHINFER_CUBIN | bool | False | Override cubin detection (signals preinstalled cubin) |
VLLM_USE_FLASHINFER_MOE_FP16 | bool | False | Enable FlashInfer FP16 MoE kernel |
VLLM_USE_FLASHINFER_MOE_FP8 | bool | False | Enable FlashInfer FP8 MoE kernel |
VLLM_USE_FLASHINFER_MOE_FP4 | bool | False | Enable FlashInfer FP4 MoE kernel |
VLLM_USE_FLASHINFER_MOE_INT4 | bool | False | Enable FlashInfer INT4 MoE kernel |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | bool | False | Enable FlashInfer MXFP4/MXFP8 TRTLLM MoE (SM100) |
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | bool | False | Enable FlashInfer MXFP4/BF16 MoE (SM90) |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | bool | False | Enable FlashInfer MXFP4/MXFP8 CUTLASS MoE (SM100) |
VLLM_FLASHINFER_MOE_BACKEND | "throughput"|"latency"|"masked_gemm" | "latency" | FlashInfer MoE backend mode |
These variables are checked in get_mxfp4_backend() vllm/model_executor/layers/quantization/mxfp4.py108-183 and has_flashinfer_cubin() vllm/utils/flashinfer.py39-46
Sources: vllm/envs.py163-172 vllm/envs.py204-208
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE | int | 256 | Memory chunk size (MB) for sleeping memory on ROCm |
VLLM_ROCM_USE_AITER | bool | False | Master switch for AITER library on ROCm |
VLLM_ROCM_USE_AITER_PAGED_ATTN | bool | False | AITER paged attention |
VLLM_ROCM_USE_AITER_LINEAR | bool | True | AITER linear kernels |
VLLM_ROCM_USE_AITER_MOE | bool | True | AITER MoE kernels |
VLLM_ROCM_USE_AITER_RMSNORM | bool | True | AITER RMSNorm kernels |
VLLM_ROCM_USE_AITER_MLA | bool | True | AITER MLA kernels |
VLLM_ROCM_USE_AITER_MHA | bool | True | AITER MHA kernels |
VLLM_ROCM_USE_AITER_FP8BMM | bool | True | AITER FP8 batched MM |
VLLM_ROCM_USE_AITER_FP4BMM | bool | True | AITER FP4 batched MM |
VLLM_ROCM_USE_AITER_TRITON_ROPE | bool | True | AITER Triton RoPE |
VLLM_ROCM_USE_AITER_TRITON_GEMM | bool | True | AITER Triton GEMM |
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM | bool | False | AITER FP4 ASM GEMM |
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION | bool | False | AITER unified attention |
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | bool | False | AITER shared experts fusion |
VLLM_ROCM_USE_SKINNY_GEMM | bool | True | Use skinny GEMM kernels on ROCm |
VLLM_ROCM_FP8_PADDING | bool | True | Enable FP8 weight padding on ROCm |
VLLM_ROCM_MOE_PADDING | bool | True | Enable MoE weight padding on ROCm |
VLLM_ROCM_CUSTOM_PAGED_ATTN | bool | True | Use custom paged attention on ROCm |
VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT | bool | False | Shuffle KV cache layout on ROCm |
VLLM_ROCM_FP8_MFMA_PAGE_ATTN | bool | False | FP8 MFMA paged attention on ROCm |
VLLM_ROCM_QUICK_REDUCE_QUANTIZATION | "FP"|"INT8"|"INT6"|"INT4"|"NONE" | "NONE" | Quick reduce quantization scheme |
VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16 | bool | True | Cast BF16 to FP16 in quick reduce |
VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB | int|None | None | Max tensor size for quick reduce |
Sources: vllm/envs.py101-119 vllm/envs.py187-192
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_XLA_USE_SPMD | bool | False | Enable SPMD mode for TPU backend |
VLLM_XLA_CHECK_RECOMPILATION | bool | False | Assert on XLA recompilation after each step |
VLLM_XLA_CACHE_PATH | str | ~/.cache/vllm/xla_cache | XLA persistent cache directory |
VLLM_TPU_BUCKET_PADDING_GAP | int | 0 | TPU bucket padding gap for sequence lengths |
VLLM_TPU_MOST_MODEL_LEN | int|None | None | Most common model length hint for TPU |
VLLM_TPU_USING_PATHWAYS | bool | False | Using Pathways runtime for TPU |
Sources: vllm/envs.py810-823
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_CPU_KVCACHE_SPACE | int|None | None | KV cache space in GB for CPU backend (default 4 GB) |
VLLM_CPU_OMP_THREADS_BIND | str | "auto" | OpenMP thread CPU binding spec (e.g., "0-31") |
VLLM_CPU_NUM_OF_RESERVED_CPU | int|None | None | CPU cores reserved from OMP threads |
VLLM_CPU_SGL_KERNEL | bool | False | Use SGL kernels (optimized for small batch) on CPU |
Sources: vllm/envs.py696-711
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_PLUGINS | list[str]|None | None | Comma-separated plugin names; None = all; "" = none |
VLLM_ALLOW_RUNTIME_LORA_UPDATING | bool | False | Enable hot-loading LoRA adapters at runtime |
VLLM_LORA_RESOLVER_CACHE_DIR | str|None | None | Local directory for LoRA adapter resolution |
VLLM_LORA_RESOLVER_HF_REPO_LIST | str|None | None | Comma-separated HF repos for LoRA resolution |
VLLM_LORA_DISABLE_PDL | bool | False | Disable PDL for LoRA |
Sources: vllm/envs.py862-880
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_USAGE_STATS_SERVER | str | "https://stats.vllm.ai" | Stats reporting server URL |
VLLM_NO_USAGE_STATS | bool | False | Disable usage stats collection entirely |
VLLM_DO_NOT_TRACK | bool | False | Disable tracking; the standard DO_NOT_TRACK variable is honored as well
VLLM_USAGE_SOURCE | str | "production" | Tag for the usage stats source |
Sources: vllm/envs.py647-658
For context on the transfer system, see KV Cache Transfer and Disaggregated Serving.
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_NIXL_SIDE_CHANNEL_HOST | str | "localhost" | NIXL side channel host for KV transfer |
VLLM_NIXL_SIDE_CHANNEL_PORT | int | 5600 | NIXL side channel port |
VLLM_NIXL_ABORT_REQUEST_TIMEOUT | int | 480 | Seconds before aborting a NIXL request |
VLLM_MOONCAKE_BOOTSTRAP_PORT | int | 8998 | Mooncake connector bootstrap port |
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT | int | 480 | Mooncake request abort timeout |
VLLM_MORIIO_CONNECTOR_READ_MODE | bool | False | Moriio connector read mode |
VLLM_MORIIO_QP_PER_TRANSFER | int | 1 | Moriio QPs per transfer |
VLLM_MORIIO_POST_BATCH_SIZE | int | -1 | Moriio post batch size |
VLLM_MORIIO_NUM_WORKERS | int | 1 | Moriio number of workers |
VLLM_DEEPEP_BUFFER_SIZE_MB | int | 1024 | DeepEP buffer size in MB |
VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE | bool | False | Force intra-node for DeepEP HT |
VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL | bool | False | Use MNNVL for DeepEP LL |
Sources: vllm/envs.py177-199 vllm/envs.py221-224
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_MSGPACK_ZERO_COPY_THRESHOLD | int | 256 | Byte threshold for zero-copy msgpack serialization |
VLLM_ALLOW_INSECURE_SERIALIZATION | bool | False | Allow insecure pickle-based serialization |
VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME | str | "VLLM_OBJECT_STORAGE_SHM_BUFFER" | Shared memory buffer name for object storage |
Sources: vllm/envs.py173-175 vllm/envs.py220
| Variable | Type | Default | Purpose |
|---|---|---|---|
VLLM_GC_DEBUG | str | "" | Garbage collection debug string |
VLLM_DEBUG_WORKSPACE | bool | False | Enable debug workspace |
VLLM_COMPUTE_NANS_IN_LOGITS | bool | False | Check model logits for NaNs (debugging)
VLLM_CUSTOM_SCOPES_FOR_PROFILING | bool | False | Enable custom profiling scopes |
VLLM_NVTX_SCOPES_FOR_PROFILING | bool | False | Enable NVTX profiling scopes |
VLLM_DEBUG_MFU_METRICS | bool | False | Debug model FLOPs utilization metrics |
VLLM_LOG_MODEL_INSPECTION | bool | False | Log model layer inspection |
VLLM_DISABLED_KERNELS | list[str] | [] | Kernel names to disable at runtime |
VLLM_USE_OINK_OPS | bool | False | Enable OINK ops |
VLLM_DISABLE_LOG_LOGO | bool | False | Suppress the vLLM logo at startup |
Sources: vllm/envs.py98-100 vllm/envs.py186 vllm/envs.py217-218
| Variable | Type | Default | Purpose |
|---|---|---|---|
CUDA_HOME | str|None | None | CUDA toolkit home directory |
VLLM_CUDART_SO_PATH | str|None | None | Path to libcudart.so |
VLLM_ENABLE_CUDA_COMPATIBILITY | bool | False | Enable CUDA compatibility mode |
VLLM_CUDA_COMPATIBILITY_PATH | str|None | None | Path for CUDA compatibility libraries |
VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY | bool | False | Disable pinned memory for weight offloading |
VLLM_WEIGHT_OFFLOADING_DISABLE_UVA | bool | False | Disable unified virtual addressing for weight offloading |
VLLM_USE_FBGEMM | bool | False | Enable FBGEMM library |
VLLM_SLEEP_WHEN_IDLE | bool | False | Put workers to sleep when idle |
VLLM_KV_EVENTS_USE_INT_BLOCK_HASHES | bool | True | Use integer block hashes for KV events |
VLLM_TOOL_PARSE_REGEX_TIMEOUT_SECONDS | int | 1 | Tool parser regex timeout |
VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY | bool | False | Retry on JSON parse errors in tool calls |
VLLM_ELASTIC_EP_SCALE_UP_LAUNCH | bool | False | Enable elastic EP scale-up launch |
VLLM_ELASTIC_EP_DRAIN_REQUESTS | bool | False | Drain requests on elastic EP scale events |
VLLM_V1_USE_OUTLINES_CACHE | bool | False | Use Outlines structured output cache |
Sources: vllm/envs.py131 vllm/envs.py182-184 vllm/envs.py199-214
VLLM_PORT Special Handling
VLLM_PORT has dedicated parsing logic in get_vllm_port() vllm/envs.py416-442. If VLLM_PORT is set to a URI string (a common Kubernetes service-discovery mishap), a ValueError with a descriptive message is raised rather than failing silently.
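A sketch of that documented behavior (not the exact implementation; the URI example is illustrative):

```python
import os
from typing import Optional

def get_vllm_port() -> Optional[int]:
    value = os.environ.get("VLLM_PORT")
    if value is None:
        return None
    if "://" in value:
        # Kubernetes injects e.g. VLLM_PORT="tcp://10.0.0.1:5600" when a
        # service is named "vllm"; surface that clearly instead of an
        # opaque int() parse failure.
        raise ValueError(
            f"VLLM_PORT is a URI ({value!r}), likely injected by Kubernetes "
            "service discovery; set it to a plain integer port instead."
        )
    return int(value)
```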
Sources: vllm/envs.py416-442
use_aot_compile() Logic
VLLM_USE_AOT_COMPILE has non-trivial default logic (vllm/envs.py280-295):
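A sketch of that default computation, assuming the rule stated earlier (torch >= 2.10.0, compile cache enabled, not batch-invariant). The torch_version parameter and the VLLM_BATCH_INVARIANT read inside vllm_is_batch_invariant() are paraphrased stand-ins, not the real signatures:

```python
import os

def disable_compile_cache() -> bool:
    return os.environ.get("VLLM_DISABLE_COMPILE_CACHE", "0") == "1"

def vllm_is_batch_invariant() -> bool:
    # Stand-in; in vLLM this inspects batch-invariant mode elsewhere.
    return os.environ.get("VLLM_BATCH_INVARIANT", "0") == "1"

def use_aot_compile(torch_version: str) -> bool:
    # An explicit VLLM_USE_AOT_COMPILE setting always wins; otherwise the
    # default is on for torch >= 2.10.0 when the compile cache is usable
    # and the run is not batch-invariant.
    explicit = os.environ.get("VLLM_USE_AOT_COMPILE")
    if explicit is not None:
        return explicit == "1"
    major_minor = tuple(int(p) for p in torch_version.split(".")[:2])
    return (
        major_minor >= (2, 10)
        and not disable_compile_cache()
        and not vllm_is_batch_invariant()
    )
```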
Sources: vllm/envs.py280-295
The comment block at vllm/envs.py466-469 contains # --8<-- [start:env-vars-definition]. This marker is read by the documentation generator to extract the environment variable list for the official docs. The variables between [start:env-vars-definition] and the corresponding end marker are automatically included in the published documentation at https://docs.vllm.ai/en/stable/serving/env_vars.html.
Sources: vllm/envs.py466-473