This page documents vLLM's multi-platform support system, covering the Platform abstraction layer and the concrete implementations for CUDA, ROCm, CPU, TPU, and XPU. It describes how vLLM adapts its behavior to each hardware target, including attention backend selection, configuration validation, memory management, and distributed communication.
For details on how attention backends are implemented per-platform, see Attention Backends. For how configuration objects interact with platform checks, see Configuration Objects.
vLLM runs on several hardware targets. Rather than scattering hardware-specific logic throughout the codebase, all per-platform behavior is encapsulated in classes that inherit from a common Platform base class defined in vllm/platforms/interface.py. The active platform is a singleton used throughout the engine at runtime.
The supported platforms and their corresponding files are:
| Platform | Class | File |
|---|---|---|
| NVIDIA CUDA | CudaPlatform (NvmlCudaPlatform / NonNvmlCudaPlatform) | vllm/platforms/cuda.py |
| AMD ROCm | RocmPlatform | vllm/platforms/rocm.py |
| CPU | CpuPlatform | vllm/platforms/cpu.py |
| Intel XPU | XPUPlatform | vllm/platforms/xpu.py |
| Google TPU | TpuPlatform | vllm/platforms/tpu.py |
| Out-of-tree / Unspecified | UnspecifiedPlatform | vllm/platforms/interface.py |
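The dispatch pattern behind this table can be sketched as follows. This is a minimal sketch with simplified class bodies; the real singleton resolution logic lives in vllm/platforms/__init__.py and is not reproduced here:

```python
from enum import Enum


class PlatformEnum(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    TPU = "tpu"
    XPU = "xpu"
    CPU = "cpu"
    OOT = "oot"
    UNSPECIFIED = "unspecified"


class Platform:
    """Minimal sketch of the base class: identity checks compare _enum."""
    _enum = PlatformEnum.UNSPECIFIED
    device_name = "unspecified"

    def is_cuda(self) -> bool:
        return self._enum == PlatformEnum.CUDA

    def is_cpu(self) -> bool:
        return self._enum == PlatformEnum.CPU


class CpuPlatform(Platform):
    _enum = PlatformEnum.CPU
    device_name = "cpu"


# The engine holds one active platform instance and queries it everywhere.
current_platform = CpuPlatform()
print(current_platform.is_cpu())   # True
print(current_platform.is_cuda())  # False
```

Because every identity check goes through `_enum`, adding an out-of-tree platform only requires subclassing `Platform` and setting the class attributes.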
Diagram: Platform Class Hierarchy
Sources: vllm/platforms/interface.py100-723 vllm/platforms/cuda.py112-721 vllm/platforms/rocm.py309-826 vllm/platforms/cpu.py72-510 vllm/platforms/xpu.py30-310
The Platform base class in vllm/platforms/interface.py100-718 defines the interface that all concrete platforms must implement or can optionally override.
| Attribute | Purpose |
|---|---|
device_name | Human-readable name (e.g., "cuda", "rocm") |
device_type | PyTorch device type string (e.g., "cuda", "cpu") |
dispatch_key | PyTorch dispatcher key (e.g., "CUDA", "CPU", "XPU") |
dist_backend | Distributed backend (e.g., "nccl", "gloo", "xccl") |
device_control_env_var | Env var controlling device visibility (e.g., CUDA_VISIBLE_DEVICES) |
ray_device_key | Resource key for Ray scheduling (e.g., "GPU") |
ray_noset_device_env_vars | Env vars that prevent Ray from overriding device visibility |
supported_quantization | List of supported quantization format strings |
simple_compile_backend | torch.compile backend for standalone functions (default: "inductor") |
The following describes the primary interface methods on Platform:
Diagram: Platform Interface Methods and Their Roles
Sources: vllm/platforms/interface.py191-718
PlatformEnum and DeviceCapability

PlatformEnum vllm/platforms/interface.py36-46 is a simple Python enum with values CUDA, ROCM, TPU, XPU, CPU, OOT, and UNSPECIFIED. Platform identity checks (is_cuda(), is_rocm(), etc.) compare self._enum against these values.
DeviceCapability vllm/platforms/interface.py58-97 is a NamedTuple with major and minor integer fields. It supports comparison operators directly, making capability checks like has_device_capability(80) (meaning SM 8.0 / compute capability 8.0) straightforward.
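A NamedTuple with integer fields compares lexicographically out of the box, which is exactly what capability checks need. A minimal sketch (the real class in interface.py carries additional helpers, omitted here, and the real check is a method on Platform):

```python
from typing import NamedTuple


class DeviceCapability(NamedTuple):
    major: int
    minor: int


def has_device_capability(current: DeviceCapability, required: int) -> bool:
    # 80 means SM 8.0: split the two-digit code into (major, minor).
    req = DeviceCapability(major=required // 10, minor=required % 10)
    return current >= req  # tuple comparison: (8, 6) >= (8, 0)


ampere = DeviceCapability(major=8, minor=6)
print(has_device_capability(ampere, 80))  # True  (SM 8.6 >= SM 8.0)
print(has_device_capability(ampere, 90))  # False (SM 8.6 <  SM 9.0)
```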
Sources: vllm/platforms/interface.py36-97
The CUDA platform is split into three classes in vllm/platforms/cuda.py:
- CudaPlatformBase — shared CUDA logic
- NvmlCudaPlatform — uses pynvml for device queries without initializing the CUDA context
- NonNvmlCudaPlatform — fallback using torch.cuda APIs

The active CudaPlatform alias is resolved at module load time:
NVML (via pynvml) is preferred because it queries device properties without initializing the CUDA context. This is important when Ray workers need to set CUDA_VISIBLE_DEVICES after module import. If pynvml.nvmlInit() fails (e.g., on Jetson), NonNvmlCudaPlatform falls back to torch.cuda APIs.
NvmlCudaPlatform wraps NVML calls with a with_nvml_context decorator vllm/platforms/cuda.py100-109 that initializes and shuts down NVML around each call.
CudaPlatformBase.get_attn_backend_cls() vllm/platforms/cuda.py342-412 uses a priority-ordered backend list from _get_backend_priorities() vllm/platforms/cuda.py48-97
Backend priorities differ by device generation:
| Condition | Non-MLA Priority Order | MLA Priority Order |
|---|---|---|
| Blackwell (SM 10.x) | FlashInfer → FlashAttention → Triton → FlexAttention | FlashInfer MLA → CutlassMLA → FlashAttnMLA → FlashMLA → Triton MLA |
| Other CUDA GPUs | FlashAttention → FlashInfer → Triton → FlexAttention | FlashAttnMLA → FlashMLA → FlashInfer MLA → Triton MLA |
Each candidate backend's validate_configuration() classmethod is called to check feasibility before it is selected.
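The selection loop amounts to walking the priority list and taking the first backend whose validation passes. A hedged sketch (class and method bodies are simplified stand-ins; in the real code validate_configuration() inspects the model and cache configuration):

```python
class Backend:
    name = "base"
    supported = True

    @classmethod
    def validate_configuration(cls) -> list[str]:
        """Return reasons the backend is unusable (empty list = usable)."""
        return [] if cls.supported else [f"{cls.name} unavailable here"]


class FlashInfer(Backend):
    name = "FLASHINFER"
    supported = False  # e.g. the package is not installed


class FlashAttention(Backend):
    name = "FLASH_ATTN"
    supported = True


def select_backend(priorities: list[type[Backend]]) -> type[Backend]:
    # First candidate with no validation failures wins.
    for candidate in priorities:
        if not candidate.validate_configuration():
            return candidate
    raise ValueError("no usable attention backend")


chosen = select_backend([FlashInfer, FlashAttention])
print(chosen.name)  # FLASH_ATTN
```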
check_and_update_config() vllm/platforms/cuda.py168-297 automatically sets cache_config.block_size based on the chosen MLA backend:
| Backend | Block Size |
|---|---|
| FLASHMLA | 64 |
| CUTLASS_MLA | 128 |
| FLASHINFER_MLA | 64 (or 32) |
| FLASHMLA_SPARSE | 64 |
FP8 is available on CUDA devices with compute capability ≥ 8.9 (Ada Lovelace and newer, including Hopper):
Sources: vllm/platforms/cuda.py1-721
RocmPlatform vllm/platforms/rocm.py309-826 targets AMD GPUs via the HIP/ROCm software stack. Although device_type is "cuda" (ROCm exposes a CUDA-compatible API surface), device_name is "rocm" and _enum is PlatformEnum.ROCM.
The ROCm platform resolves the GCN architecture string once at module load:
_get_gcn_arch() vllm/platforms/rocm.py124-139 first queries via amdsmi (AMD System Management Interface, no CUDA init required), falling back to torch.cuda.get_device_properties. Several boolean flags are derived from _GCN_ARCH:
| Flag | Meaning |
|---|---|
_ON_GFX9 | gfx9 family (MI200/MI300 series) |
_ON_GFX942 | MI300X/MI325X exactly |
_ON_GFX950 | MI350 series |
_ON_MI3XX | MI300 or MI350 series |
_ON_GFX1X | RDNA3/RDNA4 (gfx11xx/gfx12xx) |
These flags drive backend selection and capability checks without requiring a live CUDA context.
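The flag derivation can be sketched as a pure function of the architecture string. This is a simplified sketch: the exact substring checks and GPU-family groupings in rocm.py may differ:

```python
def derive_arch_flags(gcn_arch: str) -> dict[str, bool]:
    """Derive capability flags from a GCN arch string such as 'gfx942'."""
    on_gfx942 = gcn_arch.startswith("gfx942")
    on_gfx950 = gcn_arch.startswith("gfx950")
    return {
        "_ON_GFX9": gcn_arch.startswith("gfx9"),
        "_ON_GFX942": on_gfx942,
        "_ON_GFX950": on_gfx950,
        "_ON_MI3XX": on_gfx942 or on_gfx950,
        "_ON_GFX1X": gcn_arch.startswith(("gfx11", "gfx12")),
    }


flags = derive_arch_flags("gfx942")          # MI300X/MI325X
print(flags["_ON_GFX9"], flags["_ON_MI3XX"])  # True True
print(derive_arch_flags("gfx1100")["_ON_GFX1X"])  # True (RDNA3)
```

Computing these once at module load keeps every later check a cheap dictionary/attribute lookup rather than a device query.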
The with_amdsmi_context decorator vllm/platforms/rocm.py98-107 wraps amdsmi_init() / amdsmi_shut_down() calls around any function that queries AMD device info. It is used for get_device_name() and is_fully_connected().
is_fully_connected() vllm/platforms/rocm.py548-565 checks for XGMI (1-hop, type 2) connectivity between physical GPU pairs using amdsmi_topo_get_link_type.
RocmPlatform.get_attn_backend_cls() vllm/platforms/rocm.py353-469 uses an explicit priority chain controlled by environment variables and architecture flags:
Diagram: ROCm Attention Backend Selection Logic
Sources: vllm/platforms/rocm.py353-469
ROCm FP8 support is architecture-dependent:
| Method | Behavior |
|---|---|
supports_fp8() | True for gfx94x (MI300), gfx95x (MI350), gfx12x (RDNA4) |
is_fp8_fnuz() | True for gfx94x only (uses float8_e4m3fnuz instead of float8_e4m3fn) |
fp8_dtype() | Returns torch.float8_e4m3fnuz on MI300, else torch.float8_e4m3fn |
supports_mx() | True for gfx95x (MI350) |
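The architecture-to-dtype mapping above can be sketched with strings standing in for the torch dtypes (torch.float8_e4m3fnuz / torch.float8_e4m3fn); the gfx-prefix checks are assumptions simplified from the table:

```python
def supports_fp8(gcn_arch: str) -> bool:
    # MI300 (gfx94x), MI350 (gfx95x), and RDNA4 (gfx12x) expose FP8.
    return gcn_arch.startswith(("gfx94", "gfx95", "gfx12"))


def is_fp8_fnuz(gcn_arch: str) -> bool:
    # Only gfx94x uses the FNUZ encoding of FP8.
    return gcn_arch.startswith("gfx94")


def fp8_dtype(gcn_arch: str) -> str:
    # Strings stand in for torch.float8_e4m3fnuz / torch.float8_e4m3fn.
    return "float8_e4m3fnuz" if is_fp8_fnuz(gcn_arch) else "float8_e4m3fn"


print(fp8_dtype("gfx942"))     # float8_e4m3fnuz (MI300)
print(fp8_dtype("gfx950"))     # float8_e4m3fn   (MI350)
print(supports_fp8("gfx90a"))  # False (MI200 has no FP8)
```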
apply_config_platform_defaults() vllm/platforms/rocm.py585-631 appends to compilation_config.custom_ops based on which AITER operations are enabled:
| AITER Feature | Custom Op Added |
|---|---|
| is_rmsnorm_enabled() + CUDA graphs active | +rms_norm |
| is_linear_fp8_enabled() | +quant_fp8 |
| is_fused_moe_enabled() | +grouped_topk |
| is_triton_rotary_embed_enabled() | +rotary_embedding |
| (always) | +sparse_attn_indexer |
check_and_update_config() vllm/platforms/rocm.py633-678 enforces several constraints:
- Context parallelism (decode_context_parallel_size > 1 or prefill_context_parallel_size > 1) is incompatible with full CUDA graphs; mode is downgraded to PIECEWISE.
- If VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION is set, cache_config.block_size is forced to 64; otherwise it defaults to 16.
- parallel_config.worker_cls defaults to "vllm.v1.worker.gpu_worker.Worker".

At module import time, _sync_hip_cuda_env_vars() vllm/platforms/rocm.py69-90 ensures HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are consistent, raising ValueError on genuine conflicts.
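The import-time visibility sync can be sketched as follows; the function operates on a plain dict here instead of os.environ, and the conflict semantics are simplified relative to the source:

```python
def sync_hip_cuda_env_vars(env: dict) -> None:
    """Keep HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES consistent.

    If only one is set, mirror it into the other; if both are set to
    different values, that is a genuine conflict.
    """
    hip = env.get("HIP_VISIBLE_DEVICES")
    cuda = env.get("CUDA_VISIBLE_DEVICES")
    if hip is not None and cuda is not None and hip != cuda:
        raise ValueError(
            f"conflicting device visibility: HIP={hip!r} CUDA={cuda!r}")
    if hip is not None:
        env["CUDA_VISIBLE_DEVICES"] = hip
    elif cuda is not None:
        env["HIP_VISIBLE_DEVICES"] = cuda


env = {"HIP_VISIBLE_DEVICES": "0,1"}
sync_hip_cuda_env_vars(env)
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1
```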
Sources: vllm/platforms/rocm.py1-826
CpuPlatform vllm/platforms/cpu.py72-510 runs inference on host CPUs using gloo for distributed communication.
The CPU backend exclusively uses CPU_ATTN:
MLA and sparse attention are explicitly unsupported and raise NotImplementedError. vllm/platforms/cpu.py128-140
get_device_total_memory() vllm/platforms/cpu.py143-166 determines KV cache space from VLLM_CPU_KVCACHE_SPACE. If unset, it defaults to 50% of total NUMA-node memory divided by the number of NUMA nodes.
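The described default can be sketched as arithmetic over the environment; this is a sketch of the formula as stated above (GiB units for the env var are an assumption), not the exact accounting in cpu.py:

```python
def cpu_kv_cache_bytes(total_memory: int, num_numa_nodes: int,
                       env: dict) -> int:
    """KV cache budget in bytes.

    VLLM_CPU_KVCACHE_SPACE is assumed to be specified in GiB; when unset,
    default to 50% of total memory divided by the NUMA node count.
    """
    space_gib = env.get("VLLM_CPU_KVCACHE_SPACE")
    if space_gib is not None:
        return int(space_gib) * (1 << 30)
    return int(total_memory * 0.5 / num_numa_nodes)


# 128 GiB machine, 2 NUMA nodes, env unset -> 32 GiB default.
print(cpu_kv_cache_bytes(128 << 30, 2, {}))  # 34359738368
# Explicit override of 4 GiB.
print(cpu_kv_cache_bytes(128 << 30, 2, {"VLLM_CPU_KVCACHE_SPACE": "4"}))
```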
Data type support varies by CPU architecture vllm/platforms/cpu.py81-121:
| Architecture | Supported dtypes |
|---|---|
| x86, aarch64 | bfloat16, float16, float32 |
| PowerPC | bfloat16, float32 |
| Apple Silicon (ARM/macOS, BF16 FEAT) | bfloat16, float16, float32 |
| Apple Silicon (ARM/macOS, no BF16) | float16, float32 |
| RISC-V | float32 only (scheduler bug workaround) |
check_and_update_config() vllm/platforms/cpu.py180-363 applies the following:
- cache_config.block_size = 128 if not specified. Warns if not a multiple of 32.
- Cascade attention is disabled (model_config.disable_cascade_attn = True).
- The distributed executor backend defaults to "mp" (multiprocessing) for world_size > 1.
- worker_cls = "vllm.v1.worker.cpu_worker.CPUWorker".
- The worker start method is forced via VLLM_WORKER_MULTIPROC_METHOD=spawn.

CpuPlatform provides two utility methods for CPU topology discovery:

- get_allowed_cpu_core_node_list() vllm/platforms/cpu.py364-404 — parses lscpu output to find allowed logical CPUs and NUMA nodes.
- discover_numa_topology() vllm/platforms/cpu.py406-458 — discovers the NUMA node→physical core mapping for KV transfer thread reservation.

import_kernels() vllm/platforms/cpu.py488-509 selects between vllm._C (AVX-512), vllm._C_AVX2, or vllm._C based on CpuArchEnum and AVX-512 availability.
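The kernel-module choice can be sketched as a small dispatch function; the module names come from the source, but the exact dispatch condition is an assumption (simplified from CpuArchEnum):

```python
def pick_kernel_module(arch: str, avx512_supported: bool) -> str:
    """Pick the compiled CPU kernel extension to import.

    Sketch only: the real import_kernels() consults CpuArchEnum and the
    runtime AVX-512 check before importing the chosen module.
    """
    if arch == "x86":
        # AVX-512 builds live in vllm._C; vllm._C_AVX2 is the fallback.
        return "vllm._C" if avx512_supported else "vllm._C_AVX2"
    # Non-x86 architectures load the generic build.
    return "vllm._C"


print(pick_kernel_module("x86", True))      # vllm._C
print(pick_kernel_module("x86", False))     # vllm._C_AVX2
print(pick_kernel_module("aarch64", True))  # vllm._C
```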
Sources: vllm/platforms/cpu.py1-510
XPUPlatform vllm/platforms/xpu.py30-310 targets Intel GPUs via the XPU backend. It uses xccl for distributed communication and ZE_AFFINITY_MASK as its device control environment variable.
get_attn_backend_cls() vllm/platforms/xpu.py48-87 forces the KV cache layout to "NHD" via set_kv_cache_layout("NHD") and selects backends as follows:
- MLA in use → TRITON_MLA
- TRITON_ATTN → TRITON_ATTN
- float32 dtype → falls back to TRITON_ATTN (FlashAttention doesn't support FP32 on XPU)
- FLASH_ATTN or default → FLASH_ATTN

check_and_update_config() vllm/platforms/xpu.py160-222 disables CUDA graphs (CUDAGraphMode.NONE) when XPU graph support is unavailable (supports_xpu_graph() returns False), and falls back to PIECEWISE graph mode when FlashAttention is selected (FMHA sycl-tla kernels cannot be captured).

Other XPU-specific behaviors:

- fp8_dtype() returns torch.float8_e4m3fn (Intel XPU does not use FNUZ).
- Unsupported bfloat16 (raises ValueError).
- is_pin_memory_available() returns True.
- Distributed communication uses XpuCommunicator (requires xccl).

Sources: vllm/platforms/xpu.py1-310
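The XPU backend choice described above can be sketched as a small decision function; backend names come from the source, but treating backends and dtypes as plain strings (and omitting the forced "NHD" KV cache layout) is a simplification:

```python
from typing import Optional


def select_xpu_backend(requested: Optional[str], use_mla: bool,
                       dtype: str) -> str:
    """Sketch of XPU attention backend selection."""
    if use_mla:
        return "TRITON_MLA"
    if requested == "TRITON_ATTN":
        return "TRITON_ATTN"
    if dtype == "float32":
        # FlashAttention does not support FP32 on XPU.
        return "TRITON_ATTN"
    return "FLASH_ATTN"


print(select_xpu_backend(None, use_mla=True, dtype="bfloat16"))   # TRITON_MLA
print(select_xpu_backend(None, use_mla=False, dtype="float32"))   # TRITON_ATTN
print(select_xpu_backend(None, use_mla=False, dtype="bfloat16"))  # FLASH_ATTN
```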
TpuPlatform vllm/platforms/tpu.py1-20 delegates entirely to the tpu_inference external package:
If tpu_inference is not installed, an error is logged and the platform is unavailable.
Sources: vllm/platforms/tpu.py1-20
Diagram: Platform Capability Matrix
Sources: vllm/platforms/cuda.py112-123 vllm/platforms/rocm.py309-322 vllm/platforms/cpu.py72-79 vllm/platforms/xpu.py30-40
| Capability | CUDA | ROCm | CPU | XPU |
|---|---|---|---|---|
| FP8 support | ≥ SM 8.9 | gfx94x/95x/12x | ❌ | ✅ |
| FP8 FNUZ variant | ❌ | gfx94x only | ❌ | ❌ |
| MX types | ❌ | gfx95x | ❌ | ❌ |
| Custom allreduce | ✅ | MI300/MI350 only | ❌ | ❌ |
| Hybrid KV cache | ✅ | ✅ | ✅ | ✅ |
| Static graph mode | ✅ | ✅ | ❌ | ✅ |
| Pin memory | ✅ (not WSL) | ✅ | ❌ | ✅ |
| MLA attention | ✅ | ✅ | ❌ | ✅ (Triton MLA) |
| Sparse attention | ✅ | ✅ | ❌ | ❌ |
| BF16 | ≥ SM 8.0 | ≥ capability 8.0 | Most archs | Not A770 |
Diagram: Platform Hooks in Engine Startup
Sources: vllm/platforms/interface.py380-443 vllm/platforms/cuda.py168-297 vllm/platforms/rocm.py585-678 vllm/platforms/cpu.py180-363