This document describes vLLM's support for non-GPU hardware platforms: CPU (with Intel IPEX), TPU (Google Tensor Processing Units), and XPU (Intel Data Center GPUs). These platform implementations provide specialized backends for attention, memory management, and model execution on alternative hardware architectures.
For CUDA and ROCm GPU platform implementations, see pages 10.2 and 10.3. For the platform abstraction interface and selection mechanism, see page 10.1.
All platform implementations inherit from the abstract Platform class defined in vllm/platforms/interface.py. Each platform must implement device-specific methods and provide configuration for attention backends, distributed communication, and worker classes.
Sources: vllm/platforms/interface.py:100-692, vllm/platforms/cpu.py:71-422, vllm/platforms/tpu.py:37-296, vllm/platforms/xpu.py:24-256
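The contract described above can be sketched as follows. This is a simplified illustration, not the real interface: the actual Platform class in vllm/platforms/interface.py has many more attributes and methods, and the names here are pared down for clarity.

```python
# Hypothetical sketch of the Platform contract; simplified from
# vllm/platforms/interface.py for illustration only.
import enum
from abc import ABC, abstractmethod


class PlatformEnum(enum.Enum):
    CPU = enum.auto()
    TPU = enum.auto()
    XPU = enum.auto()


class Platform(ABC):
    _enum: PlatformEnum
    device_type: str      # e.g. "cpu", "xpu"
    dispatch_key: str     # PyTorch dispatch key, e.g. "CPU"
    dist_backend: str     # torch.distributed backend, e.g. "gloo"

    @abstractmethod
    def get_attn_backend_cls(self, selected_backend, head_size, dtype) -> str:
        """Return the dotted path of the attention backend class."""

    @abstractmethod
    def check_and_update_config(self, vllm_config) -> None:
        """Apply platform-specific configuration constraints in place."""


class MyCpuPlatform(Platform):
    _enum = PlatformEnum.CPU
    device_type = "cpu"
    dispatch_key = "CPU"
    dist_backend = "gloo"

    def get_attn_backend_cls(self, selected_backend, head_size, dtype) -> str:
        # CPU ignores the requested backend and always uses CPU attention.
        return "vllm.v1.attention.backends.cpu_attn.CPUAttentionBackend"

    def check_and_update_config(self, vllm_config) -> None:
        pass  # real platforms rewrite block size, worker class, etc.
```

A concrete platform is then just a subclass that fills in the class attributes and overrides the abstract methods.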
The CpuPlatform class provides support for CPU-based inference with optional Intel Extension for PyTorch (IPEX) acceleration. It is designed primarily for development, testing, and deployment scenarios where GPU resources are unavailable.
| Property | Value |
|---|---|
| Enum | PlatformEnum.CPU |
| Device Type | "cpu" |
| Dispatch Key | "CPU" |
| Distributed Backend | "gloo" |
| Device Control Env Var | CPU_VISIBLE_MEMORY_NODES |
| Worker Class | vllm.v1.worker.cpu_worker.CPUWorker |
Sources: vllm/platforms/cpu.py:71-78
CPU platform support for data types varies by architecture:
- torch.bfloat16, torch.float16, torch.float32
- torch.bfloat16, torch.float32
- torch.float16, torch.float32
- torch.float32 only (workaround for a scheduler bug)

Sources: vllm/platforms/cpu.py:79-121
CPU platform uses the CPU_ATTN backend exclusively. This backend is selected regardless of user configuration, as other backends are GPU-specific.
Sources: vllm/platforms/cpu.py:127-138
CPU memory management uses a different strategy than GPU platforms:
The KV cache size is taken from the VLLM_CPU_KVCACHE_SPACE environment variable (in GiB), or defaults to 50% of available memory per NUMA node.

Sources: vllm/platforms/cpu.py:141-164, vllm/platforms/cpu.py:184-211
The check_and_update_config() method applies CPU-specific configuration changes:
| Configuration | Update |
|---|---|
| Block Size | Set to 128 if not specified |
| Cache Dtype | Force to "auto" (no FP8 KV cache support) |
| Async Scheduling | Set scheduler_config.async_scheduling = False |
| Distributed Backend | Force to "mp" (multiprocessing) if world_size > 1 |
| Worker Class | Set to vllm.v1.worker.cpu_worker.CPUWorker |
| CUDA Graphs | Set cudagraph_capture_sizes = [] (not supported) |
| Compilation Mode | Convert VLLM_COMPILE to DYNAMO_TRACE_ONCE with inductor backend (or eager in VLLM_CPU_CI_ENV) |
| DBO (Dual-Batch Overlap) | Disabled |
| Cascade Attention | Disabled (model_config.disable_cascade_attn = True) |
| MLA | Disable chunked prefill and adjust max_num_batched_tokens |
Sources: vllm/platforms/cpu.py:180-263
The CPU platform sets several environment variables during initialization.

Sources: vllm/platforms/cpu.py:268-342
On Linux systems, CPU platform provides utilities to manage CPU core allocation across NUMA nodes.
CPU core topology helpers in CpuPlatform
The LogicalCPUInfo dataclass vllm/platforms/cpu.py42-68 tracks CPU topology with fields id, physical_core, and numa_node.
discover_numa_topology() vllm/platforms/cpu.py407-458 is a separate utility that inspects /sys/devices/system/node and /sys/devices/system/cpu to find per-NUMA-node physical core sets, used by the NIXL KV connector to pin cores for start_kv_load() during disaggregated prefilling.
Sources: vllm/platforms/cpu.py:42-68, vllm/platforms/cpu.py:365-404, vllm/platforms/cpu.py:407-458
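A key building block of this sysfs-based discovery is expanding the kernel's CPU-list notation (the format used in files like /sys/devices/system/node/nodeN/cpulist). The sketch below parses strings rather than reading /sys, so it stays self-contained; the helper names are illustrative, not vLLM's.

```python
# Hedged sketch of sysfs NUMA discovery: expand kernel cpulist strings
# (e.g. "0-3,8,10-11") and group CPU ids per NUMA node.
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a sysfs CPU list like '0-3,8' into explicit CPU ids."""
    cpus: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(node_cpulists: dict[int, str]) -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids, from per-node 'cpulist' file contents."""
    return {node: parse_cpu_list(spec) for node, spec in node_cpulists.items()}
```

The real discover_numa_topology() additionally cross-references /sys/devices/system/cpu to distinguish physical cores from SMT siblings.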
CPU platform uses a specialized communicator for distributed operations:
- Communicator class: vllm.distributed.device_communicators.cpu_communicator.CpuCommunicator
- Custom all-reduce is not used (use_custom_allreduce() returns False)

Sources: vllm/platforms/cpu.py:405-409
- Some capability checks return True but with a limited implementation
- is_pin_memory_available() returns False
- Only the "mp" distributed executor backend is supported
- Several unsupported features raise NotImplementedError

Sources: vllm/platforms/cpu.py:127-138, vllm/platforms/cpu.py:397-421
The TpuPlatform class is provided by the external tpu_inference package. The file vllm/platforms/tpu.py is a thin shim that imports TpuPlatform from that package at runtime.
The tpu_inference package must be installed separately to use TPU inference in vLLM. All platform logic (attention backend selection, device communication, configuration updates, etc.) is implemented inside that package.
Sources: vllm/platforms/tpu.py:1-21
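The shim pattern described above amounts to a deferred import with a clear failure message. The sketch below makes the importer injectable so it can be exercised without TPU hardware; the import path and error wording are assumptions for illustration, not vLLM's exact code.

```python
# Hedged sketch of the tpu.py shim: import TpuPlatform from the external
# tpu_inference package at runtime, or fail with an actionable message.
def load_tpu_platform(importer=None):
    if importer is None:
        def importer():
            # Hypothetical import path; the real one is set by tpu_inference.
            from tpu_inference.platform import TpuPlatform
            return TpuPlatform
    try:
        return importer()
    except ImportError as e:
        raise ImportError(
            "TPU support requires the external 'tpu_inference' package; "
            "install it to run vLLM on TPU.") from e
```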
Key properties (as provided by tpu_inference):

| Property | Value |
|---|---|
| Enum | PlatformEnum.TPU |
| Attention Backend | Pallas (JAX/XLA kernel language) |
| Distributed Backend | "gloo" |
| Compilation Backend | "openxla" |
| Worker Class | vllm.v1.worker.tpu_worker.TPUWorker |
| Inference Mode | torch.no_grad() (XLA does not support inference_mode) |
Notable behavioral constraints imposed by the TPU platform include:
- The bfloat16 dtype is efficient on TPU hardware.
- Compilation uses DYNAMO_TRACE_ONCE with the openxla backend.
- Synchronous weight loading (use_sync_weight_loader() returns True) is required for XLA operation ordering.

Sources: vllm/platforms/tpu.py:1-21
The XPUPlatform class provides support for Intel Data Center GPUs (formerly known as Intel Arc GPUs for data centers) using Intel Extension for PyTorch (IPEX).
| Property | Value |
|---|---|
| Enum | PlatformEnum.XPU |
| Device Type | "xpu" |
| Dispatch Key | "XPU" |
| Distributed Backend | "xccl" (Intel oneCCL, via PyTorch's xccl backend) |
| Device Control Env Var | ZE_AFFINITY_MASK |
| Worker Class | vllm.v1.worker.xpu_worker.XPUWorker |
| Ray Device Key | "GPU" |
Sources: vllm/platforms/xpu.py:30-39
The XPU platform does not import the standard vllm._C module (CUDA-specific kernels); it imports only the MoE kernels.

Sources: vllm/platforms/xpu.py:36-39
XPU supports Flash Attention, Triton, and Triton MLA backends. Flash Attention is the default for standard attention; MLA automatically uses Triton MLA.
XPU get_attn_backend_cls() dispatch logic
Important: XPU attention kernels only support the NHD KV cache layout (dimensions ordered as num_tokens × num_heads × head_dim). set_kv_cache_layout("NHD") is called unconditionally at the top of get_attn_backend_cls().
Sources: vllm/platforms/xpu.py:48-87
The check_and_update_config() method applies XPU-specific settings:
| Configuration | Update | Reason |
|---|---|---|
| Block Size | Set to 64 if not specified | Optimized for chunked prefill on XPU |
| Compile Sizes | Set to [] if None | Initialize compilation configuration |
| Default Attention Backend | Set to FLASH_ATTN if not specified | Flash Attention is XPU's primary backend |
| CUDA Graph Mode | NONE if supports_xpu_graph() is False or world_size_across_dp > 1; otherwise PIECEWISE when backend is FLASH_ATTN | Flash Attention SYCL-TLA kernels cannot be captured in full graph mode |
| Worker Class | Set to vllm.v1.worker.xpu_worker.XPUWorker if "auto" | Allows custom workers (e.g., vllm-omni workers) |
| LoRA Config | Set compilation mode to NONE if LoRA enabled | LoRA not compatible with compilation |
| KV Transfer Config | Set enable_permute_local_kv = True | Required for KV transfer |
| MLA | Disable chunked prefill if MLA enabled | MLA requires special handling on non-GPU platforms |
Sources: vllm/platforms/xpu.py:160-222
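The CUDA-graph-mode row in the table above encodes a conditional worth spelling out. The sketch below mirrors that rule; the behavior for non-Flash-Attention backends when XPU graphs are available is an assumption, since the table only specifies the FLASH_ATTN case.

```python
# Illustrative encoding of the XPU CUDA-graph-mode rule from the table.
def xpu_cudagraph_mode(supports_xpu_graph: bool,
                       world_size_across_dp: int,
                       backend: str) -> str:
    if not supports_xpu_graph or world_size_across_dp > 1:
        return "NONE"
    if backend == "FLASH_ATTN":
        # Full-graph capture is unsupported for the SYCL-TLA kernels,
        # so piecewise capture is used instead.
        return "PIECEWISE"
    return "NONE"  # assumed fallback for other backends
```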
XPU provides methods to query device properties such as the device name and total device memory.

Sources: vllm/platforms/xpu.py:117-131
XPU supports both native XPU and Triton-based LoRA implementations.

Sources: vllm/platforms/xpu.py:121-126
XPU generally supports standard floating-point types, with specific restrictions (for example, bfloat16 is rejected on the Arc A770). For FP8, XPU uses the OCP standard dtype (e4m3fn).
The is_data_center_gpu() helper checks get_device_name() for the string "data center gpu", enabling data-center-specific code paths.
Sources: vllm/platforms/xpu.py:243-250
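The detection described above is a simple substring check. The sketch below takes the device name as an argument so it runs anywhere; case-insensitive matching is an assumption about the real helper, which reads the name via get_device_name().

```python
# Hedged sketch of the is_data_center_gpu() check described above.
def is_data_center_gpu(device_name: str) -> bool:
    # Substring match on the reported device name; case handling is assumed.
    return "data center gpu" in device_name.lower()
```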
XPU uses PyTorch's XPU memory functions. Block transfer operations are implemented similarly to CUDA.

Sources: vllm/platforms/xpu.py:194-255
- Communicator class: vllm.distributed.device_communicators.xpu_communicator.XpuCommunicator
- Distributed backend: xccl (Intel oneCCL, via PyTorch's xccl backend)
- Not usable if xccl is not available in the current PyTorch build

Sources: vllm/platforms/xpu.py:252-261
Like TPU, XPU uses torch.no_grad() instead of torch.inference_mode():
Sources: vllm/platforms/xpu.py:134-135
- get_device_capability() returns None; the XPU capability format differs from CUDA's major/minor scheme
- Some unsupported operations raise NotImplementedError
- Compilation mode is set to NONE when LoRA adapters are active
- check_if_supports_dtype() raises ValueError for bfloat16 on the Arc A770
- MLA requires a max_num_batched_tokens adjustment
- XPU graph capture requires supports_xpu_graph() (a PyTorch version check); data-parallel multi-instance setups disable XPU graphs

Sources: vllm/platforms/xpu.py:48-222
The following table summarizes key differences between the three alternative platforms:
| Feature | CPU | TPU | XPU |
|---|---|---|---|
| Attention Backend | CPU_ATTN | Pallas (via tpu_inference) | FLASH_ATTN / TRITON_ATTN / TRITON_MLA |
| Distributed Backend | gloo | gloo | xccl |
| Worker Class | CPUWorker | TPUWorker | XPUWorker |
| Compilation | DYNAMO_TRACE_ONCE (inductor) | DYNAMO_TRACE_ONCE (openxla) | PIECEWISE (when XPU graphs available) |
| Static Graph Mode | No | No | Yes (CUDAGraphWrapper) |
| Block Size Default | 128 | Computed by Pallas | 64 |
| KV Cache Layout | Any | Any | NHD only |
| FP8 dtype | Not supported | e4m3fn | e4m3fn |
| Hybrid KV Cache | Yes | No | Yes |
| Pin Memory | No | No | Yes |
| MLA Support | No (raises error) | Via Pallas backend | TRITON_MLA (chunked prefill disabled) |
| Sparse Attention | No | No | No |
| Custom AllReduce | No | No | No |
| Inference Mode | torch.no_grad() | torch.no_grad() | torch.no_grad() |
| Device Control Env | CPU_VISIBLE_MEMORY_NODES | via tpu_inference | ZE_AFFINITY_MASK |
Sources: vllm/platforms/cpu.py:72-509, vllm/platforms/tpu.py:1-21, vllm/platforms/xpu.py:30-309
Platform selection happens automatically at vLLM initialization based on available hardware and environment variables. The selection process is managed by the platform detection system (see Platform Abstraction).
Once a platform is selected, its check_and_update_config() method is called to apply platform-specific configuration constraints and optimizations.
Sources: vllm/platforms/interface.py:404-414
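The two-step flow above can be sketched as: pick a platform name from the detected hardware, with CPU as the universal fallback. The probe set and the relative priority between TPU and XPU are illustrative assumptions (a real machine has at most one of them); the actual detection logic lives in the platform detection system.

```python
# Hedged sketch of platform selection: accelerators win over CPU, which
# is the universal fallback. Priority order here is an assumption.
def select_platform(available: set) -> str:
    for name in ("tpu", "xpu"):
        if name in available:
            return name
    return "cpu"  # CPU is always available as a fallback
```

After selection, the chosen platform's check_and_update_config() is invoked once on the engine configuration, as described above.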
Each platform returns specific class paths for key components:
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.cpu_attn.CPUAttentionBackend |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_cpu.PunicaWrapperCPU |
| Device Communicator | vllm.distributed.device_communicators.cpu_communicator.CpuCommunicator |
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.pallas.PallasAttentionBackend |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_tpu.PunicaWrapperTPU |
| Device Communicator | vllm.distributed.device_communicators.tpu_communicator.TpuCommunicator |
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.flash_attn.FlashAttentionBackend (default) |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_xpu.PunicaWrapperXPU (default) or vllm.lora.punica_wrapper.punica_gpu.PunicaWrapperGPU (with Triton) |
| Device Communicator | vllm.distributed.device_communicators.xpu_communicator.XpuCommunicator |
Sources: vllm/platforms/cpu.py:127-409, vllm/platforms/tpu.py:58-222, vllm/platforms/xpu.py:42-211