This page documents how vLLM is built, packaged, and deployed. It covers the Python packaging configuration, the CMake build system for CUDA/HIP extensions, Docker image construction, and dependency management.
For information about environment variables that affect runtime behavior, see Environment Variables System. For information about torch.compile integration and compilation modes, see Compilation Configuration. For platform-specific runtime details (CUDA, ROCm, CPU, TPU), see Platform Support.
vLLM has two distinct build phases:
1. CMake compilation of the C++/CUDA extension modules into shared libraries (.so files). This is the expensive step that requires CUDA/ROCm toolchains and handles architecture-specific code generation.
2. setuptools build that packages the Python source along with the compiled .so files into a distributable wheel.

The Docker build further separates these phases into parallel stages to minimize rebuild time.
Sources: setup.py1-50 CMakeLists.txt1-30 docker/Dockerfile1-50
pyproject.toml is the authoritative packaging configuration.
| Field | Value |
|---|---|
| Package name | vllm |
| Build backend | setuptools.build_meta |
| Versioning | setuptools-scm (derived from git tags) |
| Python requires | >=3.10,<3.14 |
| Console entrypoint | vllm = "vllm.entrypoints.cli.main:main" |
| License | Apache-2.0 |
Build-time dependencies (pyproject.toml3-13):
cmake>=3.26.1
ninja
packaging>=24.2
setuptools>=77.0.3,<81.0.0
setuptools-scm>=8.0
torch==2.10.0
wheel
jinja2
grpcio-tools==1.78.0
Plugin entry points (pyproject.toml45-48):
- lora_filesystem_resolver — vllm.plugins.lora_resolvers.filesystem_resolver
- lora_hf_hub_resolver — vllm.plugins.lora_resolvers.hf_hub_resolver

Sources: pyproject.toml1-55
setup.py orchestrates the build by bridging Python's setuptools with CMake. Key classes:
| Class | Purpose |
|---|---|
| CMakeExtension | Declares a C++ extension backed by a CMake project instead of raw source files |
| cmake_build_ext | Custom build_ext command; invokes cmake configure and build steps |
| precompiled_build_ext | Skips extension compilation; used when pre-built .so files are injected via VLLM_PRECOMPILED_WHEEL_LOCATION |
| precompiled_wheel_utils | Fetches and extracts .so files from a previously-built wheel (nightly builds) |
| BuildPyAndGenerateGrpc | Extends build_py to compile vllm/grpc/vllm_engine.proto into _pb2.py stubs |
| DevelopAndGenerateGrpc | Same, for pip install -e (editable) mode |
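The CMakeExtension pattern can be sketched as follows; the constructor arguments here are illustrative assumptions, not vLLM's exact signature:

```python
from setuptools import Extension


class CMakeExtension(Extension):
    """Sketch of a CMake-backed extension declaration.

    The constructor details are illustrative; the real class lives in
    vLLM's setup.py.
    """

    def __init__(self, name: str, cmake_lists_dir: str = ".") -> None:
        # An empty source list tells setuptools not to compile anything
        # itself; the custom build_ext command drives CMake instead.
        super().__init__(name, sources=[])
        self.cmake_lists_dir = cmake_lists_dir
```

The key idea is that setuptools sees an extension with no sources, so all compilation is delegated to the custom build_ext command.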
Target device detection (setup.py43-64):
The VLLM_TARGET_DEVICE environment variable controls what gets compiled. If unset, setup.py auto-detects:
- torch.version.hip is not None → "rocm"
- torch.version.cuda is not None → "cuda"
- otherwise → "cpu"

Compiler caching (setup.py236-249):
cmake_build_ext.configure checks for sccache first, then ccache. If found, it adds -DCMAKE_C_COMPILER_LAUNCHER, -DCMAKE_CXX_COMPILER_LAUNCHER, -DCMAKE_CUDA_COMPILER_LAUNCHER, and -DCMAKE_HIP_COMPILER_LAUNCHER to the CMake invocation.
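A minimal sketch of that launcher preference, using the same flag names; the helper name is hypothetical:

```python
import shutil


def compiler_launcher_args() -> list[str]:
    """Sketch of the launcher detection described above: prefer sccache,
    fall back to ccache, otherwise add no launcher flags."""
    launcher = next(
        (tool for tool in ("sccache", "ccache") if shutil.which(tool)), None
    )
    if launcher is None:
        return []
    # One launcher flag per compiled language, mirroring the flags listed above.
    return [
        f"-DCMAKE_{lang}_COMPILER_LAUNCHER={launcher}"
        for lang in ("C", "CXX", "CUDA", "HIP")
    ]
```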
Job parallelism (setup.py172-209):
compute_num_jobs reads MAX_JOBS (env) and NVCC_THREADS (env) to determine build concurrency. When NVCC_THREADS is set, num_jobs is reduced proportionally to avoid overloading the system.
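The MAX_JOBS / NVCC_THREADS interaction can be sketched as below; the proportional-reduction rule is an assumption that matches the prose above, not a copy of setup.py's exact arithmetic:

```python
import os


def compute_num_jobs() -> int:
    """Illustrative sketch of MAX_JOBS / NVCC_THREADS handling."""
    max_jobs = os.environ.get("MAX_JOBS")
    num_jobs = int(max_jobs) if max_jobs else (os.cpu_count() or 1)

    nvcc_threads = int(os.environ.get("NVCC_THREADS", "1"))
    if nvcc_threads > 1:
        # Each nvcc process spawns several internal threads, so run fewer
        # parallel jobs to keep the total thread count roughly constant.
        num_jobs = max(1, num_jobs // nvcc_threads)
    return num_jobs
```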
Sources: setup.py27-400
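The target-device auto-detection described earlier in this section can be sketched as follows; the helper name is hypothetical, but the fallback order matches the list above:

```python
import os


def detect_target_device() -> str:
    """Sketch of VLLM_TARGET_DEVICE selection: the environment variable
    wins; otherwise probe the installed torch build."""
    device = os.environ.get("VLLM_TARGET_DEVICE")
    if device:
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.version.hip is not None:
        return "rocm"
    if torch.version.cuda is not None:
        return "cuda"
    return "cpu"
```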
The top-level CMakeLists.txt builds all C++/CUDA extensions. The cmake/utils.cmake file provides shared macro and function definitions.
Primary extension: _C
Registered as torch.ops._C at runtime. Contains the majority of GPU kernels.
Core source files (CMakeLists.txt282-306):
| Source file | Content |
|---|---|
| csrc/cache_kernels.cu | KV cache copy/swap operations |
| csrc/attention/paged_attention_v1.cu | PagedAttention v1 |
| csrc/attention/paged_attention_v2.cu | PagedAttention v2 |
| csrc/pos_encoding_kernels.cu | Rotary embedding |
| csrc/layernorm_kernels.cu | RMS norm |
| csrc/sampler.cu | Token sampling |
| csrc/quantization/gptq/q_gemm.cu | GPTQ GEMM |
| csrc/torch_bindings.cpp | PyTorch op registration (TORCH_LIBRARY_EXPAND) |
CUDA-only additions include CUTLASS-backed scaled matrix multiplies, AWQ GEMM kernels, Marlin quantization kernels, and sparse GEMM (CMakeLists.txt341-350).
Secondary extension: _moe_C
Built from csrc/moe/, contains MoE-specific GPU operations. Registered separately via csrc/moe/torch_bindings.cpp.
Extension: cumem_allocator
Built from csrc/cumem_allocator.cpp, links against CUDA::cuda_driver (or amdhip64 for ROCm). Handles custom CUDA memory allocator registration.
External extension: vllm_flash_attn
Built via a CMake ExternalProject or FetchContent for the vllm-flash-attn package. Python stubs are copied to vllm/vllm_flash_attn/ during the build.
CUDA architecture sets (CMakeLists.txt93-103):
CUDA 13.0+: 7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0
CUDA 12.8+: 7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0
Older: 7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0
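These version-gated defaults amount to a small lookup; the helper below is hypothetical and just encodes the lists above:

```python
def default_cuda_arch_list(cuda_version: tuple[int, int]) -> str:
    """Illustrative encoding of the version-gated defaults above
    (CMakeLists.txt93-103)."""
    if cuda_version >= (13, 0):
        return "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0"
    if cuda_version >= (12, 8):
        return "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0"
    return "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0"
```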
HIP/ROCm architecture set (CMakeLists.txt40):
gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
Kernel files are selectively compiled for subsets of these architectures using set_gencode_flags_for_srcs (defined in cmake/utils.cmake). For example, Marlin kernels only compile for 8.0+PTX and above, and FP8 Marlin only for 8.9 and 12.0.
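As a rough sketch of what an arch-list intersection does — note that the real cuda_archs_loose_intersection is "loose" (it also handles +PTX suffixes and arch variants), while this strict version only shows the basic idea:

```python
def arch_intersection(a: str, b: str) -> str:
    """Strict intersection of two semicolon-separated arch lists,
    preserving the order of the first list. Illustrative only."""
    wanted = set(b.split(";"))
    return ";".join(arch for arch in a.split(";") if arch in wanted)
```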
| Function/Macro | Purpose |
|---|---|
| find_python_from_executable | Locates the Python interpreter matching a specific path |
| append_cmake_prefix_path | Adds PyTorch's CMake prefix to CMAKE_PREFIX_PATH |
| get_torch_gpu_compiler_flags | Retrieves GPU compilation flags from the active PyTorch installation |
| clear_cuda_arches | Strips architecture flags from CMAKE_CUDA_FLAGS for per-file control |
| extract_unique_cuda_archs_ascending | Parses the arch list from flags into a sorted list |
| cuda_archs_loose_intersection | Computes the intersection of two arch lists |
| set_gencode_flags_for_srcs | Assigns per-file CUDA gencode flags |
| define_extension_target | Defines a CMake target for a PyTorch extension library |
Sources: cmake/utils.cmake1-200 CMakeLists.txt1-600
CMake extension build process
Sources: CMakeLists.txt1-300 setup.py159-360
| File | Purpose |
|---|---|
| requirements/common.txt | Runtime deps shared across all platforms |
| requirements/cuda.txt | CUDA platform: includes common.txt, adds torch, flashinfer-python, ray |
| requirements/rocm.txt | ROCm platform: includes common.txt, adds AMD-specific packages |
| requirements/build.txt | Build-time only: cmake, ninja, setuptools, torch, grpcio-tools |
| requirements/test.txt | Full pinned test dependency lockfile (auto-generated by uv pip compile) |
| requirements/test.in | Source file for test.txt (human-maintained) |
| requirements/rocm-build.txt | ROCm build-time deps, points to ROCm PyTorch index |
| requirements/rocm-test.txt | ROCm test deps |
| requirements/nightly_torch_test.txt | Test deps for nightly PyTorch builds |
| requirements/kv_connectors.txt | Optional KV connector deps (lmcache, nixl) |
From requirements/cuda.txt and docker/Dockerfile:
| Package | Pinned Version |
|---|---|
| torch | 2.10.0 |
| torchvision | 0.25.0 |
| torchaudio | 2.10.0 |
| flashinfer-python | 0.6.4 |
| numba | 0.61.2 |
docker/versions.json is auto-generated from ARG defaults in docker/Dockerfile. To update a version:
1. Edit the ARG default in docker/Dockerfile
2. Run python tools/generate_versions_json.py
3. Use docker buildx bake -f docker/docker-bake.hcl -f docker/versions.json for bake-based builds

Selected key entries from requirements/common.txt:
| Package | Role |
|---|---|
| transformers >=4.56.0,<5 | Model configs and tokenizers |
| fastapi[standard] >=0.115.0 | HTTP API server |
| xgrammar ==0.1.29 | Structured output grammar engine |
| outlines_core ==0.2.11 | Structured output (outlines backend) |
| compressed-tensors ==0.13.0 | Quantization format support |
| pyzmq >=25.0.0 | IPC between engine processes |
| msgspec | Zero-copy serialization |
| ray[cgraph] >=2.48.0 | Distributed execution (in cuda.txt) |
| ninja | Required by Triton JIT compilation |
| flashinfer-python | FlashInfer attention backend |
Sources: requirements/common.txt1-60 requirements/cuda.txt1-18 requirements/build.txt1-14 docker/versions.json1-50
The main docker/Dockerfile uses a multi-stage build to maximize layer caching and allow parallel compilation steps.
| Argument | Default | Purpose |
|---|---|---|
| CUDA_VERSION | 12.9.1 | CUDA version for base image selection |
| PYTHON_VERSION | 3.12 | Python version to install |
| BUILD_BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 | Compilation base (ubuntu20.04 for glibc compat) |
| FINAL_BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 | Runtime base |
| PYTORCH_NIGHTLY | unset | If 1, installs nightly PyTorch |
| torch_cuda_arch_list | 7.0 7.5 8.0 8.9 9.0 10.0 12.0 | Architectures compiled into kernels |
| max_jobs | 2 | Ninja parallel jobs |
| nvcc_threads | 8 | nvcc internal thread count |
| USE_SCCACHE | unset | Enables Mozilla sccache for build caching |
| VLLM_USE_PRECOMPILED | unset | Injects pre-built csrc wheel instead of recompiling |
| INSTALL_KV_CONNECTORS | false | Installs optional KV connector packages (nixl, lmcache) |
| FLASHINFER_VERSION | 0.6.4 | FlashInfer cubin/jit-cache version |
| DEEPGEMM_GIT_REF | 477618cd | DeepGEMM commit to build |
| DEEPEP_COMMIT_HASH | 73b6ea4 | DeepEP commit to build |
Docker Dockerfile multi-stage build
Sources: docker/Dockerfile90-820
base stage (docker/Dockerfile92-188):
- Installs uv for fast package management
- Creates the /opt/venv Python virtual environment
- Runs uv pip install -r requirements/cuda.txt
- Writes torch_lib_versions.txt for version pinning in downstream stages

csrc-build stage (docker/Dockerfile191-308):

- Copies only pyproject.toml, setup.py, CMakeLists.txt, cmake/, csrc/, vllm/envs.py, and vllm/__init__.py
- Runs python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38
- Uses sccache when enabled (USE_SCCACHE=1)
- Sets SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0+csrc.build (version doesn't matter; only .so files are used)
- Falls back to ccache, stored at /root/.cache/ccache

extensions-build stage (docker/Dockerfile311-352):

- Runs in parallel with csrc-build (both derive from base)
- Builds DeepGEMM via tools/install_deepgemm.sh, targeting TORCH_CUDA_ARCH_LIST="9.0a 10.0a"
- Builds the EP kernels via tools/ep_kernels/install_python_libraries.sh

build stage (docker/Dockerfile354-437):

- Copies the .whl from csrc-build into /precompiled-wheels/
- Sets VLLM_PRECOMPILED_WHEEL_LOCATION to point at the csrc wheel
- Runs setup.py bdist_wheel again; this time the precompiled_build_ext class skips recompilation and extracts the .so files from the pre-built wheel
- Runs check-wheel-size.py to validate wheel size stays under VLLM_MAX_SIZE_MB (default 500 MB)
- Collects the wheels from extensions-build for later installation

vllm-base stage (docker/Dockerfile486-693):

- Starts from FINAL_BASE_IMAGE (nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04)
- Installs cuda-nvcc, cuda-cudart, and cuda-nvrtc packages for JIT compilation at runtime (FlashInfer, DeepGEMM, EP kernels all JIT-compile)
- Installs libnccl-dev (required by pynccl_allocator.py)
- Installs flashinfer-cubin + flashinfer-jit-cache
- Installs gdrcopy, bitsandbytes, timm, runai-model-streamer
- Installs the vLLM wheel produced by the build stage

Sources: docker/Dockerfile90-820
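Conceptually, the precompiled-wheel step in the build stage just pulls the compiled libraries out of the csrc wheel (a wheel is a zip archive). A sketch with an illustrative helper name — the real logic lives in setup.py's precompiled_wheel_utils:

```python
import zipfile


def extract_compiled_libs(wheel_path: str, dest: str) -> list[str]:
    """Extract only the compiled .so members of a previously built wheel.

    Illustrative sketch; not vLLM's actual implementation.
    """
    extracted: list[str] = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for member in wheel.namelist():
            if member.endswith(".so"):
                wheel.extract(member, dest)
                extracted.append(member)
    return extracted
```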
The ROCm build uses two separate Dockerfiles: a base image builder and a vLLM builder.
docker/Dockerfile.rocm_base builds the ROCm base image from scratch. It compiles:
| Component | Source |
|---|---|
| PyTorch | github.com/ROCm/pytorch.git at pinned commit |
| Triton | github.com/ROCm/triton.git at pinned commit |
| TorchVision | github.com/pytorch/vision.git at pinned tag |
| TorchAudio | github.com/pytorch/audio.git at pinned tag |
| FlashAttention | github.com/Dao-AILab/flash-attention.git |
| AITER | github.com/ROCm/aiter.git (AMD inference library) |
The base ROCm architecture list is set via PYTORCH_ROCM_ARCH:
gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
docker/Dockerfile.rocm builds vLLM for ROCm. It has two modes controlled by REMOTE_VLLM:
- REMOTE_VLLM=0: copies local source (COPY ./ vllm/)
- REMOTE_VLLM=1: clones from VLLM_REPO at VLLM_BRANCH

Build step (docker/Dockerfile.rocm101-104):
python3 -m pip install -r requirements/rocm.txt
python3 setup.py clean --all
python3 setup.py bdist_wheel --dist-dir=dist
setup.py auto-detects ROCm (VLLM_TARGET_DEVICE="rocm") from torch.version.hip.
The ROCm Dockerfile also optionally builds RIXL (ROCm's analog to NIXL for KV cache transfer) and UCX for network communication.
| Aspect | CUDA | ROCm |
|---|---|---|
| GPU language | CUDA | HIP |
| Compiler | nvcc via CUDA_HOME | hipcc / clang++ via ROCM_PATH |
| Architecture variable | TORCH_CUDA_ARCH_LIST | PYTORCH_ROCM_ARCH |
| Debug flags | Standard | Adds -O0 -ggdb3 for Debug builds |
| Arch override | clear_cuda_arches + extraction | override_gpu_arches |
| CUTLASS | Fetched from GitHub | Not used (ROCm uses hipBLAS/AITER) |
Sources: docker/Dockerfile.rocm1-200 docker/Dockerfile.rocm_base1-100 CMakeLists.txt120-172
| Variable | Source | Effect |
|---|---|---|
| VLLM_TARGET_DEVICE | setup.py, CMakeLists.txt | "cuda", "rocm", "cpu", "empty" |
| MAX_JOBS | setup.py:175 | Ninja parallel job count |
| NVCC_THREADS | setup.py:196 | nvcc internal thread count |
| CMAKE_BUILD_TYPE | setup.py:225 | Release, RelWithDebInfo, Debug |
| TORCH_CUDA_ARCH_LIST | CMake, Docker ARG | Semicolon-separated CUDA arch list |
| VLLM_CUTLASS_SRC_DIR | CMakeLists.txt:315 | Override CUTLASS source path |
| FETCHCONTENT_BASE_DIR | CMakeLists.txt:263 | Cache dir for CMake FetchContent downloads |
| VLLM_USE_PRECOMPILED | setup.py | Use pre-built .so files from wheel |
| VLLM_PRECOMPILED_WHEEL_LOCATION | setup.py | Path to csrc-only wheel |
| USE_SCCACHE | Docker ARG | Enable Mozilla sccache |
| CCACHE_DIR | Docker ENV | ccache directory (/root/.cache/ccache) |
| PYTORCH_NIGHTLY | Docker ARG | Install nightly PyTorch |
Sources: setup.py42-65 docker/Dockerfile235-295
Artifact flow from source to runtime
Sources: docker/Dockerfile263-693 setup.py299-400
For local development builds (without Docker):
# Install build deps
pip install -r requirements/build.txt
# Build and install in editable mode (compiles C++ extensions)
pip install -e . --no-build-isolation
# Or for CPU-only (no CUDA compilation)
VLLM_TARGET_DEVICE=cpu pip install -e . --no-build-isolation
The editable install invokes cmake_build_ext via setup.py, which:
1. Creates a build/ temp directory
2. Runs cmake configure with detected Python and PyTorch paths
3. Runs cmake --install to copy .so files into the vllm/ directory

After a successful build, subsequent pip install -e . runs skip CMake re-configuration if the CMakeLists.txt directory hasn't changed (tracked in cmake_build_ext.did_config).
The vllm CLI entry point (vllm serve, vllm run, etc.) is installed via [project.scripts] in pyproject.toml, resolving to vllm.entrypoints.cli.main:main.
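The "module:attr" resolution that setuptools performs for such console-script specs can be sketched with stdlib tools; the helper name here is illustrative:

```python
import importlib


def resolve_entrypoint(spec: str):
    """Resolve a 'module:attr' console-script spec, the same shape as
    pyproject.toml's "vllm.entrypoints.cli.main:main".

    Illustrative sketch of what setuptools-generated launchers do.
    """
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)
```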
For detailed installation instructions including uv usage and prerequisites, see Installation and Setup.
Sources: pyproject.toml42-43 setup.py214-298