This page documents how vLLM is built, packaged, and deployed. It covers the Python packaging configuration, the CMake build system for CUDA/HIP extensions, Docker image construction, and dependency management.
For information about environment variables that affect runtime behavior, see Environment Variables System. For information about torch.compile integration and compilation modes, see Compilation Configuration. For platform-specific runtime details (CUDA, ROCm, CPU, TPU), see Platform Support.
vLLM has two distinct build phases:
1. CMake compilation of the C++/CUDA extension modules into shared libraries (.so files). This is the expensive step that requires CUDA/ROCm toolchains and handles architecture-specific code generation.
2. setuptools build that packages the Python source along with the compiled .so files into a distributable wheel.

The Docker build further separates these phases into parallel stages to minimize rebuild time.
Sources: setup.py1-50 CMakeLists.txt1-30 docker/Dockerfile1-50
pyproject.toml is the authoritative packaging configuration.
| Field | Value |
|---|---|
| Package name | vllm |
| Build backend | setuptools.build_meta |
| Versioning | setuptools-scm (derived from git tags) |
| Python requires | >=3.10,<3.14 |
| Console entrypoint | vllm = "vllm.entrypoints.cli.main:main" |
| License | Apache-2.0 |
Build-time dependencies (pyproject.toml3-13):
cmake>=3.26.1
ninja
packaging>=24.2
setuptools>=77.0.3,<81.0.0
setuptools-scm>=8.0
torch==2.10.0
wheel
jinja2
grpcio-tools==1.78.0
Plugin entry points (pyproject.toml45-48):
- lora_filesystem_resolver — vllm.plugins.lora_resolvers.filesystem_resolver
- lora_hf_hub_resolver — vllm.plugins.lora_resolvers.hf_hub_resolver

Sources: pyproject.toml1-55
setup.py orchestrates the build by bridging Python's setuptools with CMake. Key classes:
| Class | Purpose |
|---|---|
| CMakeExtension | Declares a C++ extension backed by a CMake project instead of raw source files |
| cmake_build_ext | Custom build_ext command; invokes cmake configure and build steps |
| precompiled_build_ext | Skips extension compilation; used when pre-built .so files are injected via VLLM_PRECOMPILED_WHEEL_LOCATION |
| precompiled_wheel_utils | Fetches and extracts .so files from a previously-built wheel (nightly builds) |
| BuildPyAndGenerateGrpc | Extends build_py to compile vllm/grpc/vllm_engine.proto into _pb2.py stubs |
| DevelopAndGenerateGrpc | Same, for pip install -e (editable) mode |
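The CMakeExtension pattern can be sketched as follows; the constructor arguments here are illustrative assumptions, not vLLM's exact signature:

```python
from setuptools import Extension


class CMakeExtension(Extension):
    """Sketch of a CMake-backed extension declaration.

    The constructor details are illustrative; the real class lives in
    vLLM's setup.py.
    """

    def __init__(self, name: str, cmake_lists_dir: str = ".") -> None:
        # An empty source list tells setuptools not to compile anything
        # itself; the custom build_ext command drives CMake instead.
        super().__init__(name, sources=[])
        self.cmake_lists_dir = cmake_lists_dir
```

The key idea is that setuptools sees an extension with no sources, so all compilation is delegated to the custom build_ext command.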
Target device detection (setup.py43-64):
The VLLM_TARGET_DEVICE environment variable controls what gets compiled. If unset, setup.py auto-detects:
- torch.version.hip is not None → "rocm"
- torch.version.cuda is not None → "cuda"
- otherwise → "cpu"

Compiler caching (setup.py236-249):
cmake_build_ext.configure checks for sccache first, then ccache. If found, it adds -DCMAKE_C_COMPILER_LAUNCHER, -DCMAKE_CXX_COMPILER_LAUNCHER, -DCMAKE_CUDA_COMPILER_LAUNCHER, and -DCMAKE_HIP_COMPILER_LAUNCHER to the CMake invocation.
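A minimal sketch of that launcher preference, using the same flag names; the helper name is hypothetical:

```python
import shutil


def compiler_launcher_args() -> list[str]:
    """Sketch of the launcher detection described above: prefer sccache,
    fall back to ccache, otherwise add no launcher flags."""
    launcher = next(
        (tool for tool in ("sccache", "ccache") if shutil.which(tool)), None
    )
    if launcher is None:
        return []
    # One launcher flag per compiled language, mirroring the flags listed above.
    return [
        f"-DCMAKE_{lang}_COMPILER_LAUNCHER={launcher}"
        for lang in ("C", "CXX", "CUDA", "HIP")
    ]
```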
Job parallelism (setup.py172-209):
compute_num_jobs reads MAX_JOBS (env) and NVCC_THREADS (env) to determine build concurrency. When NVCC_THREADS is set, num_jobs is reduced proportionally to avoid overloading the system.
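The MAX_JOBS / NVCC_THREADS interaction can be sketched as below; the proportional-reduction rule is an assumption that matches the prose above, not a copy of setup.py's exact arithmetic:

```python
import os


def compute_num_jobs() -> int:
    """Illustrative sketch of MAX_JOBS / NVCC_THREADS handling."""
    max_jobs = os.environ.get("MAX_JOBS")
    num_jobs = int(max_jobs) if max_jobs else (os.cpu_count() or 1)

    nvcc_threads = int(os.environ.get("NVCC_THREADS", "1"))
    if nvcc_threads > 1:
        # Each nvcc process spawns several internal threads, so run fewer
        # parallel jobs to keep the total thread count roughly constant.
        num_jobs = max(1, num_jobs // nvcc_threads)
    return num_jobs
```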
Sources: setup.py27-400
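The target-device auto-detection described earlier in this section can be sketched as follows; the helper name is hypothetical, but the fallback order matches the list above:

```python
import os


def detect_target_device() -> str:
    """Sketch of VLLM_TARGET_DEVICE selection: the environment variable
    wins; otherwise probe the installed torch build."""
    device = os.environ.get("VLLM_TARGET_DEVICE")
    if device:
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.version.hip is not None:
        return "rocm"
    if torch.version.cuda is not None:
        return "cuda"
    return "cpu"
```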
The top-level CMakeLists.txt builds all C++/CUDA extensions. The cmake/utils.cmake file provides shared macro and function definitions.
Primary extension: _C
Registered as torch.ops._C at runtime. Contains the majority of GPU kernels.
Core source files (CMakeLists.txt282-306):
| Source file | Content |
|---|---|
| csrc/cache_kernels.cu | KV cache copy/swap operations |
| csrc/attention/paged_attention_v1.cu | PagedAttention v1 |
| csrc/attention/paged_attention_v2.cu | PagedAttention v2 |
| csrc/pos_encoding_kernels.cu | Rotary embedding |
| csrc/layernorm_kernels.cu | RMS norm |
| csrc/sampler.cu | Token sampling |
| csrc/quantization/gptq/q_gemm.cu | GPTQ GEMM |
| csrc/torch_bindings.cpp | PyTorch op registration (TORCH_LIBRARY_EXPAND) |
CUDA-only additions include CUTLASS-backed scaled matrix multiplies, AWQ GEMM kernels, Marlin quantization kernels, and sparse GEMM (CMakeLists.txt341-350).
Secondary extension: _moe_C
Built from csrc/moe/, contains MoE-specific GPU operations. Registered separately via csrc/moe/torch_bindings.cpp.
Extension: cumem_allocator
Built from csrc/cumem_allocator.cpp, links against CUDA::cuda_driver (or amdhip64 for ROCm). Handles custom CUDA memory allocator registration.
External extension: vllm_flash_attn
Built via a CMake ExternalProject or FetchContent for the vllm-flash-attn package. Python stubs are copied to vllm/vllm_flash_attn/ during the build.
CUDA architecture sets (CMakeLists.txt93-103):
CUDA 13.0+: 7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0
CUDA 12.8+: 7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0
Older: 7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0
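These version-gated defaults amount to a small lookup; the helper below is hypothetical and just encodes the lists above:

```python
def default_cuda_arch_list(cuda_version: tuple[int, int]) -> str:
    """Illustrative encoding of the version-gated defaults above
    (CMakeLists.txt93-103)."""
    if cuda_version >= (13, 0):
        return "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0"
    if cuda_version >= (12, 8):
        return "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0"
    return "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0"
```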
HIP/ROCm architecture set (CMakeLists.txt40):
gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
Kernel files are selectively compiled for subsets of these architectures using set_gencode_flags_for_srcs (defined in cmake/utils.cmake). For example, Marlin kernels only compile for 8.0+PTX and above, and FP8 Marlin only for 8.9 and 12.0.
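As a rough sketch of what an arch-list intersection does — note that the real cuda_archs_loose_intersection is "loose" (it also handles +PTX suffixes and arch variants), while this strict version only shows the basic idea:

```python
def arch_intersection(a: str, b: str) -> str:
    """Strict intersection of two semicolon-separated arch lists,
    preserving the order of the first list. Illustrative only."""
    wanted = set(b.split(";"))
    return ";".join(arch for arch in a.split(";") if arch in wanted)
```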
| Function/Macro | Purpose |
|---|---|
| find_python_from_executable | Locates the Python interpreter matching a specific path |
| append_cmake_prefix_path | Adds PyTorch's CMake prefix to CMAKE_PREFIX_PATH |
| get_torch_gpu_compiler_flags | Retrieves GPU compilation flags from the active PyTorch installation |
| clear_cuda_arches | Strips architecture flags from CMAKE_CUDA_FLAGS for per-file control |
| extract_unique_cuda_archs_ascending | Parses the arch list from flags into a sorted list |
| cuda_archs_loose_intersection | Computes the intersection of two arch lists |
| set_gencode_flags_for_srcs | Assigns per-file CUDA gencode flags |
| define_extension_target | Defines a CMake target for a PyTorch extension library |
Sources: cmake/utils.cmake1-200 CMakeLists.txt1-600
CMake extension build process
Sources: CMakeLists.txt1-300 setup.py159-360
| File | Purpose |
|---|---|
| requirements/common.txt | Runtime deps shared across all platforms |
| requirements/cuda.txt | CUDA platform: includes common.txt, adds torch, flashinfer-python, ray |
| requirements/rocm.txt | ROCm platform: includes common.txt, adds AMD-specific packages |
| requirements/build.txt | Build-time only: cmake, ninja, setuptools, torch, grpcio-tools |
| requirements/test.txt | Full pinned test dependency lockfile (auto-generated by uv pip compile) |
| requirements/test.in | Source file for test.txt (human-maintained) |
| requirements/rocm-build.txt | ROCm build-time deps, points to ROCm PyTorch index |
| requirements/rocm-test.txt | ROCm test deps |
| requirements/nightly_torch_test.txt | Test deps for nightly PyTorch builds |
| requirements/kv_connectors.txt | Optional KV connector deps (lmcache, nixl) |
From requirements/cuda.txt and docker/Dockerfile:
| Package | Pinned Version |
|---|---|
| torch | 2.10.0 |
| torchvision | 0.25.0 |
| torchaudio | 2.10.0 |
| flashinfer-python | 0.6.4 |
| numba | 0.61.2 |
docker/versions.json is auto-generated from ARG defaults in docker/Dockerfile. To update a version:
1. Edit the ARG default in docker/Dockerfile
2. Run python tools/generate_versions_json.py
3. Use docker buildx bake -f docker/docker-bake.hcl -f docker/versions.json for bake-based builds

Selected key entries from requirements/common.txt:
| Package | Role |
|---|---|
| transformers >=4.56.0,<5 | Model configs and tokenizers |
| fastapi[standard] >=0.115.0 | HTTP API server |
| xgrammar ==0.1.29 | Structured output grammar engine |
| outlines_core ==0.2.11 | Structured output (outlines backend) |
| compressed-tensors ==0.13.0 | Quantization format support |
| pyzmq >=25.0.0 | IPC between engine processes |
| msgspec | Zero-copy serialization |
| ray[cgraph] >=2.48.0 | Distributed execution (in cuda.txt) |
| ninja | Required by Triton JIT compilation |
| flashinfer-python | FlashInfer attention backend |
Sources: requirements/common.txt1-60 requirements/cuda.txt1-18 requirements/build.txt1-14 docker/versions.json1-50
The main docker/Dockerfile uses a multi-stage build to maximize layer caching and allow parallel compilation steps.
| Argument | Default | Purpose |
|---|---|---|
| CUDA_VERSION | 12.9.1 | CUDA version for base image selection |
| PYTHON_VERSION | 3.12 | Python version to install |
| BUILD_BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 | Compilation base (ubuntu20.04 for glibc compat) |
| FINAL_BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 | Runtime base |
| PYTORCH_NIGHTLY | unset | If 1, installs nightly PyTorch |
| torch_cuda_arch_list | 7.0 7.5 8.0 8.9 9.0 10.0 12.0 | Architectures compiled into kernels |
| max_jobs | 2 | Ninja parallel jobs |
| nvcc_threads | 8 | nvcc internal thread count |
| USE_SCCACHE | unset | Enables Mozilla sccache for build caching |
| VLLM_USE_PRECOMPILED | unset | Injects pre-built csrc wheel instead of recompiling |
| INSTALL_KV_CONNECTORS | false | Installs optional KV connector packages (nixl, lmcache) |
| FLASHINFER_VERSION | 0.6.4 | FlashInfer cubin/jit-cache version |
| DEEPGEMM_GIT_REF | 477618cd | DeepGEMM commit to build |
| DEEPEP_COMMIT_HASH | 73b6ea4 | DeepEP commit to build |
Docker Dockerfile multi-stage build
Sources: docker/Dockerfile90-820
base stage (docker/Dockerfile92-188):
- Installs uv for fast package management
- Creates the /opt/venv Python virtual environment
- Runs uv pip install -r requirements/cuda.txt
- Writes torch_lib_versions.txt for version pinning in downstream stages

csrc-build stage (docker/Dockerfile191-308):

- Copies only pyproject.toml, setup.py, CMakeLists.txt, cmake/, csrc/, vllm/envs.py, and vllm/__init__.py
- Runs python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38
- Uses sccache when enabled (USE_SCCACHE=1)
- Sets SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0+csrc.build (version doesn't matter; only .so files are used)
- Falls back to ccache, stored at /root/.cache/ccache

extensions-build stage (docker/Dockerfile311-352):

- Runs in parallel with csrc-build (both derive from base)
- Builds DeepGEMM via tools/install_deepgemm.sh, targeting TORCH_CUDA_ARCH_LIST="9.0a 10.0a"
- Builds the EP kernels via tools/ep_kernels/install_python_libraries.sh

build stage (docker/Dockerfile354-437):

- Copies the .whl from csrc-build into /precompiled-wheels/
- Sets VLLM_PRECOMPILED_WHEEL_LOCATION to point at the csrc wheel
- Runs setup.py bdist_wheel again; this time the precompiled_build_ext class skips recompilation and extracts the .so files from the pre-built wheel
- Runs check-wheel-size.py to validate wheel size stays under VLLM_MAX_SIZE_MB (default 500 MB)
- Collects the wheels from extensions-build for later installation

vllm-base stage (docker/Dockerfile486-693):

- Starts from FINAL_BASE_IMAGE (nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04)
- Installs cuda-nvcc, cuda-cudart, and cuda-nvrtc packages for JIT compilation at runtime (FlashInfer, DeepGEMM, EP kernels all JIT-compile)
- Installs libnccl-dev (required by pynccl_allocator.py)
- Installs flashinfer-cubin + flashinfer-jit-cache
- Installs gdrcopy, bitsandbytes, timm, runai-model-streamer
- Installs the vLLM wheel produced by the build stage

Sources: docker/Dockerfile90-820
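Conceptually, the precompiled-wheel step in the build stage just pulls the compiled libraries out of the csrc wheel (a wheel is a zip archive). A sketch with an illustrative helper name — the real logic lives in setup.py's precompiled_wheel_utils:

```python
import zipfile


def extract_compiled_libs(wheel_path: str, dest: str) -> list[str]:
    """Extract only the compiled .so members of a previously built wheel.

    Illustrative sketch; not vLLM's actual implementation.
    """
    extracted: list[str] = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for member in wheel.namelist():
            if member.endswith(".so"):
                wheel.extract(member, dest)
                extracted.append(member)
    return extracted
```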
The ROCm build uses two separate Dockerfiles: a base image builder and a vLLM builder.
docker/Dockerfile.rocm_base builds the ROCm base image from scratch. It compiles:
| Component | Source |
|---|---|
| PyTorch | github.com/ROCm/pytorch.git at pinned commit |
| Triton | github.com/ROCm/triton.git at pinned commit |
| TorchVision | github.com/pytorch/vision.git at pinned tag |
| TorchAudio | github.com/pytorch/audio.git at pinned tag |
| FlashAttention | github.com/Dao-AILab/flash-attention.git |
| AITER | github.com/ROCm/aiter.git (AMD inference library) |
The base ROCm architecture list is set via PYTORCH_ROCM_ARCH:
gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
docker/Dockerfile.rocm builds vLLM for ROCm. It has two modes controlled by REMOTE_VLLM:
- REMOTE_VLLM=0: copies local source (COPY ./ vllm/)
- REMOTE_VLLM=1: clones from VLLM_REPO at VLLM_BRANCH

Build step (docker/Dockerfile.rocm101-104):
python3 -m pip install -r requirements/rocm.txt
python3 setup.py clean --all
python3 setup.py bdist_wheel --dist-dir=dist
setup.py auto-detects ROCm (VLLM_TARGET_DEVICE="rocm") from torch.version.hip.
The ROCm Dockerfile also optionally builds RIXL (ROCm's analog to NIXL for KV cache transfer) and UCX for network communication.
| Aspect | CUDA | ROCm |
|---|---|---|
| GPU language | CUDA | HIP |
| Compiler | nvcc via CUDA_HOME | hipcc / clang++ via ROCM_PATH |
| Architecture variable | TORCH_CUDA_ARCH_LIST | PYTORCH_ROCM_ARCH |
| Debug flags | Standard | Adds -O0 -ggdb3 for Debug builds |
| Arch override | clear_cuda_arches + extraction | override_gpu_arches |
| CUTLASS | Fetched from GitHub | Not used (ROCm uses hipBLAS/AITER) |
Sources: docker/Dockerfile.rocm1-200 docker/Dockerfile.rocm_base1-100 CMakeLists.txt120-172
| Variable | Source | Effect |
|---|---|---|
| VLLM_TARGET_DEVICE | setup.py, CMakeLists.txt | "cuda", "rocm", "cpu", "empty" |
| MAX_JOBS | setup.py:175 | Ninja parallel job count |
| NVCC_THREADS | setup.py:196 | nvcc internal thread count |
| CMAKE_BUILD_TYPE | setup.py:225 | Release, RelWithDebInfo, Debug |
| TORCH_CUDA_ARCH_LIST | CMake, Docker ARG | Semicolon-separated CUDA arch list |
| VLLM_CUTLASS_SRC_DIR | CMakeLists.txt:315 | Override CUTLASS source path |
| FETCHCONTENT_BASE_DIR | CMakeLists.txt:263 | Cache dir for CMake FetchContent downloads |
| VLLM_USE_PRECOMPILED | setup.py | Use pre-built .so files from wheel |
| VLLM_PRECOMPILED_WHEEL_LOCATION | setup.py | Path to csrc-only wheel |
| USE_SCCACHE | Docker ARG | Enable Mozilla sccache |
| CCACHE_DIR | Docker ENV | ccache directory (/root/.cache/ccache) |
| PYTORCH_NIGHTLY | Docker ARG | Install nightly PyTorch |
Sources: setup.py42-65 docker/Dockerfile235-295
Artifact flow from source to runtime
Sources: docker/Dockerfile263-693 setup.py299-400
For local development builds (without Docker):
# Install build deps
pip install -r requirements/build.txt
# Build and install in editable mode (compiles C++ extensions)
pip install -e . --no-build-isolation
# Or for CPU-only (no CUDA compilation)
VLLM_TARGET_DEVICE=cpu pip install -e . --no-build-isolation
The editable install invokes cmake_build_ext via setup.py, which:
1. Creates a build/ temp directory
2. Runs cmake configure with detected Python and PyTorch paths
3. Runs cmake --install to copy .so files into the vllm/ directory

After a successful build, subsequent pip install -e . runs skip CMake re-configuration if the CMakeLists.txt directory hasn't changed (tracked in cmake_build_ext.did_config).
The vllm CLI entry point (vllm serve, vllm run, etc.) is installed via [project.scripts] in pyproject.toml, resolving to vllm.entrypoints.cli.main:main.
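The "module:attr" resolution that setuptools performs for such console-script specs can be sketched with stdlib tools; the helper name here is illustrative:

```python
import importlib


def resolve_entrypoint(spec: str):
    """Resolve a 'module:attr' console-script spec, the same shape as
    pyproject.toml's "vllm.entrypoints.cli.main:main".

    Illustrative sketch of what setuptools-generated launchers do.
    """
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)
```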
For detailed installation instructions including uv usage and prerequisites, see Installation and Setup.
Sources: pyproject.toml42-43 setup.py214-298