This page documents vLLM's multi-platform support system, covering the Platform abstraction layer and the concrete implementations for CUDA, ROCm, CPU, TPU, and XPU. It describes how vLLM adapts its behavior to each hardware target, including attention backend selection, configuration validation, memory management, and distributed communication.
For details on how attention backends are implemented per-platform, see Attention Backends. For how configuration objects interact with platform checks, see Configuration Objects.
vLLM runs on several hardware targets. Rather than scattering hardware-specific logic throughout the codebase, all per-platform behavior is encapsulated in classes that inherit from a common Platform base class defined in vllm/platforms/interface.py. The active platform is a singleton used throughout the engine at runtime.
The supported platforms and their corresponding files are:
| Platform | Class | File |
|---|---|---|
| NVIDIA CUDA | CudaPlatform (NvmlCudaPlatform / NonNvmlCudaPlatform) | vllm/platforms/cuda.py |
| AMD ROCm | RocmPlatform | vllm/platforms/rocm.py |
| CPU | CpuPlatform | vllm/platforms/cpu.py |
| Intel XPU | XPUPlatform | vllm/platforms/xpu.py |
| Google TPU | TpuPlatform | vllm/platforms/tpu.py |
| Out-of-tree / Unspecified | UnspecifiedPlatform | vllm/platforms/interface.py |
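The dispatch pattern behind this table can be sketched as follows. This is a minimal sketch with simplified class bodies; the real singleton resolution logic lives in vllm/platforms/__init__.py and is not reproduced here:

```python
from enum import Enum


class PlatformEnum(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    TPU = "tpu"
    XPU = "xpu"
    CPU = "cpu"
    OOT = "oot"
    UNSPECIFIED = "unspecified"


class Platform:
    """Minimal sketch of the base class: identity checks compare _enum."""
    _enum = PlatformEnum.UNSPECIFIED
    device_name = "unspecified"

    def is_cuda(self) -> bool:
        return self._enum == PlatformEnum.CUDA

    def is_cpu(self) -> bool:
        return self._enum == PlatformEnum.CPU


class CpuPlatform(Platform):
    _enum = PlatformEnum.CPU
    device_name = "cpu"


# The engine holds one active platform instance and queries it everywhere.
current_platform = CpuPlatform()
print(current_platform.is_cpu())   # True
print(current_platform.is_cuda())  # False
```

Because every identity check goes through `_enum`, adding an out-of-tree platform only requires subclassing `Platform` and setting the class attributes.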
Diagram: Platform Class Hierarchy
Sources: vllm/platforms/interface.py100-723 vllm/platforms/cuda.py112-721 vllm/platforms/rocm.py309-826 vllm/platforms/cpu.py72-510 vllm/platforms/xpu.py30-310
The Platform base class in vllm/platforms/interface.py100-718 defines the interface that all concrete platforms must implement or can optionally override.
| Attribute | Purpose |
|---|---|
device_name | Human-readable name (e.g., "cuda", "rocm") |
device_type | PyTorch device type string (e.g., "cuda", "cpu") |
dispatch_key | PyTorch dispatcher key (e.g., "CUDA", "CPU", "XPU") |
dist_backend | Distributed backend (e.g., "nccl", "gloo", "xccl") |
device_control_env_var | Env var controlling device visibility (e.g., CUDA_VISIBLE_DEVICES) |
ray_device_key | Resource key for Ray scheduling (e.g., "GPU") |
ray_noset_device_env_vars | Env vars that prevent Ray from overriding device visibility |
supported_quantization | List of supported quantization format strings |
simple_compile_backend | torch.compile backend for standalone functions (default: "inductor") |
The following describes the primary interface methods on Platform:
Diagram: Platform Interface Methods and Their Roles
Sources: vllm/platforms/interface.py191-718
PlatformEnum and DeviceCapability

PlatformEnum vllm/platforms/interface.py36-46 is a simple Python enum with values CUDA, ROCM, TPU, XPU, CPU, OOT, and UNSPECIFIED. Platform identity checks (is_cuda(), is_rocm(), etc.) compare self._enum against these values.
DeviceCapability vllm/platforms/interface.py58-97 is a NamedTuple with major and minor integer fields. It supports comparison operators directly, making capability checks like has_device_capability(80) (meaning SM 8.0 / compute capability 8.0) straightforward.
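A NamedTuple with integer fields compares lexicographically out of the box, which is exactly what capability checks need. A minimal sketch (the real class in interface.py carries additional helpers, omitted here, and the real check is a method on Platform):

```python
from typing import NamedTuple


class DeviceCapability(NamedTuple):
    major: int
    minor: int


def has_device_capability(current: DeviceCapability, required: int) -> bool:
    # 80 means SM 8.0: split the two-digit code into (major, minor).
    req = DeviceCapability(major=required // 10, minor=required % 10)
    return current >= req  # tuple comparison: (8, 6) >= (8, 0)


ampere = DeviceCapability(major=8, minor=6)
print(has_device_capability(ampere, 80))  # True  (SM 8.6 >= SM 8.0)
print(has_device_capability(ampere, 90))  # False (SM 8.6 <  SM 9.0)
```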
Sources: vllm/platforms/interface.py36-97
The CUDA platform is split into three classes in vllm/platforms/cuda.py:
- CudaPlatformBase — shared CUDA logic
- NvmlCudaPlatform — uses pynvml for device queries without initializing the CUDA context
- NonNvmlCudaPlatform — fallback using torch.cuda APIs

The active CudaPlatform alias is resolved at module load time:
NVML (via pynvml) is preferred because it queries device properties without initializing the CUDA context. This is important when Ray workers need to set CUDA_VISIBLE_DEVICES after module import. If pynvml.nvmlInit() fails (e.g., on Jetson), NonNvmlCudaPlatform falls back to torch.cuda APIs.
NvmlCudaPlatform wraps NVML calls with a with_nvml_context decorator vllm/platforms/cuda.py100-109 that initializes and shuts down NVML around each call.
CudaPlatformBase.get_attn_backend_cls() vllm/platforms/cuda.py342-412 uses a priority-ordered backend list from _get_backend_priorities() vllm/platforms/cuda.py48-97
Backend priorities differ by device generation:
| Condition | Non-MLA Priority Order | MLA Priority Order |
|---|---|---|
| Blackwell (SM 10.x) | FlashInfer → FlashAttention → Triton → FlexAttention | FlashInfer MLA → CutlassMLA → FlashAttnMLA → FlashMLA → Triton MLA |
| Other CUDA GPUs | FlashAttention → FlashInfer → Triton → FlexAttention | FlashAttnMLA → FlashMLA → FlashInfer MLA → Triton MLA |
Each candidate backend's validate_configuration() classmethod is called to check feasibility before it is selected.
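The selection loop amounts to walking the priority list and taking the first backend whose validation passes. A hedged sketch (class and method bodies are simplified stand-ins; in the real code validate_configuration() inspects the model and cache configuration):

```python
class Backend:
    name = "base"
    supported = True

    @classmethod
    def validate_configuration(cls) -> list[str]:
        """Return reasons the backend is unusable (empty list = usable)."""
        return [] if cls.supported else [f"{cls.name} unavailable here"]


class FlashInfer(Backend):
    name = "FLASHINFER"
    supported = False  # e.g. the package is not installed


class FlashAttention(Backend):
    name = "FLASH_ATTN"
    supported = True


def select_backend(priorities: list[type[Backend]]) -> type[Backend]:
    # First candidate with no validation failures wins.
    for candidate in priorities:
        if not candidate.validate_configuration():
            return candidate
    raise ValueError("no usable attention backend")


chosen = select_backend([FlashInfer, FlashAttention])
print(chosen.name)  # FLASH_ATTN
```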
check_and_update_config() vllm/platforms/cuda.py168-297 automatically sets cache_config.block_size based on the chosen MLA backend:
| Backend | Block Size |
|---|---|
| FLASHMLA | 64 |
| CUTLASS_MLA | 128 |
| FLASHINFER_MLA | 64 (or 32) |
| FLASHMLA_SPARSE | 64 |
FP8 is available on CUDA devices with compute capability ≥ 8.9 (Ada Lovelace and newer, including Hopper):
Sources: vllm/platforms/cuda.py1-721
RocmPlatform vllm/platforms/rocm.py309-826 targets AMD GPUs via the HIP/ROCm software stack. Although device_type is "cuda" (ROCm exposes a CUDA-compatible API surface), device_name is "rocm" and _enum is PlatformEnum.ROCM.
The ROCm platform resolves the GCN architecture string once at module load:
_get_gcn_arch() vllm/platforms/rocm.py124-139 first queries via amdsmi (AMD System Management Interface, no CUDA init required), falling back to torch.cuda.get_device_properties. Several boolean flags are derived from _GCN_ARCH:
| Flag | Meaning |
|---|---|
_ON_GFX9 | gfx9 family (MI200/MI300 series) |
_ON_GFX942 | MI300X/MI325X exactly |
_ON_GFX950 | MI350 series |
_ON_MI3XX | MI300 or MI350 series |
_ON_GFX1X | RDNA3/RDNA4 (gfx11xx/gfx12xx) |
These flags drive backend selection and capability checks without requiring a live CUDA context.
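The flag derivation can be sketched as a pure function of the architecture string. This is a simplified sketch: the exact substring checks and GPU-family groupings in rocm.py may differ:

```python
def derive_arch_flags(gcn_arch: str) -> dict[str, bool]:
    """Derive capability flags from a GCN arch string such as 'gfx942'."""
    on_gfx942 = gcn_arch.startswith("gfx942")
    on_gfx950 = gcn_arch.startswith("gfx950")
    return {
        "_ON_GFX9": gcn_arch.startswith("gfx9"),
        "_ON_GFX942": on_gfx942,
        "_ON_GFX950": on_gfx950,
        "_ON_MI3XX": on_gfx942 or on_gfx950,
        "_ON_GFX1X": gcn_arch.startswith(("gfx11", "gfx12")),
    }


flags = derive_arch_flags("gfx942")          # MI300X/MI325X
print(flags["_ON_GFX9"], flags["_ON_MI3XX"])  # True True
print(derive_arch_flags("gfx1100")["_ON_GFX1X"])  # True (RDNA3)
```

Computing these once at module load keeps every later check a cheap dictionary/attribute lookup rather than a device query.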
The with_amdsmi_context decorator vllm/platforms/rocm.py98-107 wraps amdsmi_init() / amdsmi_shut_down() calls around any function that queries AMD device info. It is used for get_device_name() and is_fully_connected().
is_fully_connected() vllm/platforms/rocm.py548-565 checks for XGMI (1-hop, type 2) connectivity between physical GPU pairs using amdsmi_topo_get_link_type.
RocmPlatform.get_attn_backend_cls() vllm/platforms/rocm.py353-469 uses an explicit priority chain controlled by environment variables and architecture flags:
Diagram: ROCm Attention Backend Selection Logic
Sources: vllm/platforms/rocm.py353-469
ROCm FP8 support is architecture-dependent:
| Method | Behavior |
|---|---|
supports_fp8() | True for gfx94x (MI300), gfx95x (MI350), gfx12x (RDNA4) |
is_fp8_fnuz() | True for gfx94x only (uses float8_e4m3fnuz instead of float8_e4m3fn) |
fp8_dtype() | Returns torch.float8_e4m3fnuz on MI300, else torch.float8_e4m3fn |
supports_mx() | True for gfx95x (MI350) |
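The architecture-to-dtype mapping above can be sketched with strings standing in for the torch dtypes (torch.float8_e4m3fnuz / torch.float8_e4m3fn); the gfx-prefix checks are assumptions simplified from the table:

```python
def supports_fp8(gcn_arch: str) -> bool:
    # MI300 (gfx94x), MI350 (gfx95x), and RDNA4 (gfx12x) expose FP8.
    return gcn_arch.startswith(("gfx94", "gfx95", "gfx12"))


def is_fp8_fnuz(gcn_arch: str) -> bool:
    # Only gfx94x uses the FNUZ encoding of FP8.
    return gcn_arch.startswith("gfx94")


def fp8_dtype(gcn_arch: str) -> str:
    # Strings stand in for torch.float8_e4m3fnuz / torch.float8_e4m3fn.
    return "float8_e4m3fnuz" if is_fp8_fnuz(gcn_arch) else "float8_e4m3fn"


print(fp8_dtype("gfx942"))     # float8_e4m3fnuz (MI300)
print(fp8_dtype("gfx950"))     # float8_e4m3fn   (MI350)
print(supports_fp8("gfx90a"))  # False (MI200 has no FP8)
```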
apply_config_platform_defaults() vllm/platforms/rocm.py585-631 appends to compilation_config.custom_ops based on which AITER operations are enabled:
| AITER Feature | Custom Op Added |
|---|---|
| is_rmsnorm_enabled() + CUDA graphs active | +rms_norm |
| is_linear_fp8_enabled() | +quant_fp8 |
| is_fused_moe_enabled() | +grouped_topk |
| is_triton_rotary_embed_enabled() | +rotary_embedding |
| (always) | +sparse_attn_indexer |
check_and_update_config() vllm/platforms/rocm.py633-678 enforces several constraints:
- Context parallelism (decode_context_parallel_size > 1 or prefill_context_parallel_size > 1) is incompatible with full CUDA graphs; mode is downgraded to PIECEWISE.
- If VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION is set, cache_config.block_size is forced to 64; otherwise it defaults to 16.
- parallel_config.worker_cls defaults to "vllm.v1.worker.gpu_worker.Worker".

At module import time, _sync_hip_cuda_env_vars() vllm/platforms/rocm.py69-90 ensures HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are consistent, raising ValueError on genuine conflicts.
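The import-time visibility sync can be sketched as follows; the function operates on a plain dict here instead of os.environ, and the conflict semantics are simplified relative to the source:

```python
def sync_hip_cuda_env_vars(env: dict) -> None:
    """Keep HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES consistent.

    If only one is set, mirror it into the other; if both are set to
    different values, that is a genuine conflict.
    """
    hip = env.get("HIP_VISIBLE_DEVICES")
    cuda = env.get("CUDA_VISIBLE_DEVICES")
    if hip is not None and cuda is not None and hip != cuda:
        raise ValueError(
            f"conflicting device visibility: HIP={hip!r} CUDA={cuda!r}")
    if hip is not None:
        env["CUDA_VISIBLE_DEVICES"] = hip
    elif cuda is not None:
        env["HIP_VISIBLE_DEVICES"] = cuda


env = {"HIP_VISIBLE_DEVICES": "0,1"}
sync_hip_cuda_env_vars(env)
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1
```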
Sources: vllm/platforms/rocm.py1-826
CpuPlatform vllm/platforms/cpu.py72-510 runs inference on host CPUs using gloo for distributed communication.
The CPU backend exclusively uses CPU_ATTN:
MLA and sparse attention are explicitly unsupported and raise NotImplementedError. vllm/platforms/cpu.py128-140
get_device_total_memory() vllm/platforms/cpu.py143-166 determines KV cache space from VLLM_CPU_KVCACHE_SPACE. If unset, it defaults to 50% of total NUMA-node memory divided by the number of NUMA nodes.
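The described default can be sketched as arithmetic over the environment; this is a sketch of the formula as stated above (GiB units for the env var are an assumption), not the exact accounting in cpu.py:

```python
def cpu_kv_cache_bytes(total_memory: int, num_numa_nodes: int,
                       env: dict) -> int:
    """KV cache budget in bytes.

    VLLM_CPU_KVCACHE_SPACE is assumed to be specified in GiB; when unset,
    default to 50% of total memory divided by the NUMA node count.
    """
    space_gib = env.get("VLLM_CPU_KVCACHE_SPACE")
    if space_gib is not None:
        return int(space_gib) * (1 << 30)
    return int(total_memory * 0.5 / num_numa_nodes)


# 128 GiB machine, 2 NUMA nodes, env unset -> 32 GiB default.
print(cpu_kv_cache_bytes(128 << 30, 2, {}))  # 34359738368
# Explicit override of 4 GiB.
print(cpu_kv_cache_bytes(128 << 30, 2, {"VLLM_CPU_KVCACHE_SPACE": "4"}))
```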
Data type support varies by CPU architecture vllm/platforms/cpu.py81-121:
| Architecture | Supported dtypes |
|---|---|
| x86, aarch64 | bfloat16, float16, float32 |
| PowerPC | bfloat16, float32 |
| Apple Silicon (ARM/macOS, BF16 FEAT) | bfloat16, float16, float32 |
| Apple Silicon (ARM/macOS, no BF16) | float16, float32 |
| RISC-V | float32 only (scheduler bug workaround) |
check_and_update_config() vllm/platforms/cpu.py180-363 applies the following:
- cache_config.block_size = 128 if not specified. Warns if not a multiple of 32.
- Cascade attention is disabled (model_config.disable_cascade_attn = True).
- The distributed executor backend defaults to "mp" (multiprocessing) for world_size > 1.
- worker_cls = "vllm.v1.worker.cpu_worker.CPUWorker".
- The worker start method is forced via VLLM_WORKER_MULTIPROC_METHOD=spawn.

CpuPlatform provides two utility methods for CPU topology discovery:

- get_allowed_cpu_core_node_list() vllm/platforms/cpu.py364-404 — parses lscpu output to find allowed logical CPUs and NUMA nodes.
- discover_numa_topology() vllm/platforms/cpu.py406-458 — discovers the NUMA node→physical core mapping for KV transfer thread reservation.

import_kernels() vllm/platforms/cpu.py488-509 selects between vllm._C (AVX-512), vllm._C_AVX2, or vllm._C based on CpuArchEnum and AVX-512 availability.
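The kernel-module choice can be sketched as a small dispatch function; the module names come from the source, but the exact dispatch condition is an assumption (simplified from CpuArchEnum):

```python
def pick_kernel_module(arch: str, avx512_supported: bool) -> str:
    """Pick the compiled CPU kernel extension to import.

    Sketch only: the real import_kernels() consults CpuArchEnum and the
    runtime AVX-512 check before importing the chosen module.
    """
    if arch == "x86":
        # AVX-512 builds live in vllm._C; vllm._C_AVX2 is the fallback.
        return "vllm._C" if avx512_supported else "vllm._C_AVX2"
    # Non-x86 architectures load the generic build.
    return "vllm._C"


print(pick_kernel_module("x86", True))      # vllm._C
print(pick_kernel_module("x86", False))     # vllm._C_AVX2
print(pick_kernel_module("aarch64", True))  # vllm._C
```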
Sources: vllm/platforms/cpu.py1-510
XPUPlatform vllm/platforms/xpu.py30-310 targets Intel GPUs via the XPU backend. It uses xccl for distributed communication and ZE_AFFINITY_MASK as its device control environment variable.
get_attn_backend_cls() vllm/platforms/xpu.py48-87 forces the KV cache layout to "NHD" via set_kv_cache_layout("NHD") and selects backends as follows:
- MLA in use → TRITON_MLA
- TRITON_ATTN → TRITON_ATTN
- float32 dtype → falls back to TRITON_ATTN (FlashAttention doesn't support FP32 on XPU)
- FLASH_ATTN or default → FLASH_ATTN

check_and_update_config() vllm/platforms/xpu.py160-222 disables CUDA graphs (CUDAGraphMode.NONE) when XPU graph support is unavailable (supports_xpu_graph() returns False), and falls back to PIECEWISE graph mode when FlashAttention is selected (FMHA sycl-tla kernels cannot be captured).

Other XPU-specific behaviors:

- fp8_dtype() returns torch.float8_e4m3fn (Intel XPU does not use FNUZ).
- Unsupported bfloat16 (raises ValueError).
- is_pin_memory_available() returns True.
- Distributed communication uses XpuCommunicator (requires xccl).

Sources: vllm/platforms/xpu.py1-310
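The XPU backend choice described above can be sketched as a small decision function; backend names come from the source, but treating backends and dtypes as plain strings (and omitting the forced "NHD" KV cache layout) is a simplification:

```python
from typing import Optional


def select_xpu_backend(requested: Optional[str], use_mla: bool,
                       dtype: str) -> str:
    """Sketch of XPU attention backend selection."""
    if use_mla:
        return "TRITON_MLA"
    if requested == "TRITON_ATTN":
        return "TRITON_ATTN"
    if dtype == "float32":
        # FlashAttention does not support FP32 on XPU.
        return "TRITON_ATTN"
    return "FLASH_ATTN"


print(select_xpu_backend(None, use_mla=True, dtype="bfloat16"))   # TRITON_MLA
print(select_xpu_backend(None, use_mla=False, dtype="float32"))   # TRITON_ATTN
print(select_xpu_backend(None, use_mla=False, dtype="bfloat16"))  # FLASH_ATTN
```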
TpuPlatform vllm/platforms/tpu.py1-20 delegates entirely to the tpu_inference external package:
If tpu_inference is not installed, an error is logged and the platform is unavailable.
Sources: vllm/platforms/tpu.py1-20
Diagram: Platform Capability Matrix
Sources: vllm/platforms/cuda.py112-123 vllm/platforms/rocm.py309-322 vllm/platforms/cpu.py72-79 vllm/platforms/xpu.py30-40
| Capability | CUDA | ROCm | CPU | XPU |
|---|---|---|---|---|
| FP8 support | ≥ SM 8.9 | gfx94x/95x/12x | ❌ | ✅ |
| FP8 FNUZ variant | ❌ | gfx94x only | ❌ | ❌ |
| MX types | ❌ | gfx95x | ❌ | ❌ |
| Custom allreduce | ✅ | MI300/MI350 only | ❌ | ❌ |
| Hybrid KV cache | ✅ | ✅ | ✅ | ✅ |
| Static graph mode | ✅ | ✅ | ❌ | ✅ |
| Pin memory | ✅ (not WSL) | ✅ | ❌ | ✅ |
| MLA attention | ✅ | ✅ | ❌ | ✅ (Triton MLA) |
| Sparse attention | ✅ | ✅ | ❌ | ❌ |
| BF16 | ≥ SM 8.0 | ≥ capability 8.0 | Most archs | Not A770 |
Diagram: Platform Hooks in Engine Startup
Sources: vllm/platforms/interface.py380-443 vllm/platforms/cuda.py168-297 vllm/platforms/rocm.py585-678 vllm/platforms/cpu.py180-363