This document describes vLLM's support for non-GPU hardware platforms: CPU (with Intel IPEX), TPU (Google Tensor Processing Units), and XPU (Intel Data Center GPUs). These platform implementations provide specialized backends for attention, memory management, and model execution on alternative hardware architectures.
For CUDA and ROCm GPU platform implementations, see pages 10.2 and 10.3. For the platform abstraction interface and selection mechanism, see page 10.1.
All platform implementations inherit from the abstract Platform class defined in vllm/platforms/interface.py. Each platform must implement device-specific methods and provide configuration for attention backends, distributed communication, and worker classes.
Sources: vllm/platforms/interface.py:100-692, vllm/platforms/cpu.py:71-422, vllm/platforms/tpu.py:37-296, vllm/platforms/xpu.py:24-256
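The contract described above can be sketched as follows. This is a simplified illustration, not the real interface: the actual Platform class in vllm/platforms/interface.py has many more attributes and methods, and the names here are pared down for clarity.

```python
# Hypothetical sketch of the Platform contract; simplified from
# vllm/platforms/interface.py for illustration only.
import enum
from abc import ABC, abstractmethod


class PlatformEnum(enum.Enum):
    CPU = enum.auto()
    TPU = enum.auto()
    XPU = enum.auto()


class Platform(ABC):
    _enum: PlatformEnum
    device_type: str      # e.g. "cpu", "xpu"
    dispatch_key: str     # PyTorch dispatch key, e.g. "CPU"
    dist_backend: str     # torch.distributed backend, e.g. "gloo"

    @abstractmethod
    def get_attn_backend_cls(self, selected_backend, head_size, dtype) -> str:
        """Return the dotted path of the attention backend class."""

    @abstractmethod
    def check_and_update_config(self, vllm_config) -> None:
        """Apply platform-specific configuration constraints in place."""


class MyCpuPlatform(Platform):
    _enum = PlatformEnum.CPU
    device_type = "cpu"
    dispatch_key = "CPU"
    dist_backend = "gloo"

    def get_attn_backend_cls(self, selected_backend, head_size, dtype) -> str:
        # CPU ignores the requested backend and always uses CPU attention.
        return "vllm.v1.attention.backends.cpu_attn.CPUAttentionBackend"

    def check_and_update_config(self, vllm_config) -> None:
        pass  # real platforms rewrite block size, worker class, etc.
```

A concrete platform is then just a subclass that fills in the class attributes and overrides the abstract methods.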
The CpuPlatform class provides support for CPU-based inference with optional Intel Extension for PyTorch (IPEX) acceleration. It is designed primarily for development, testing, and deployment scenarios where GPU resources are unavailable.
| Property | Value |
|---|---|
| Enum | PlatformEnum.CPU |
| Device Type | "cpu" |
| Dispatch Key | "CPU" |
| Distributed Backend | "gloo" |
| Device Control Env Var | CPU_VISIBLE_MEMORY_NODES |
| Worker Class | vllm.v1.worker.cpu_worker.CPUWorker |
Sources: vllm/platforms/cpu.py:71-78
CPU platform support for data types varies by architecture:
- torch.bfloat16, torch.float16, torch.float32
- torch.bfloat16, torch.float32
- torch.float16, torch.float32
- torch.float32 only (workaround for a scheduler bug)

Sources: vllm/platforms/cpu.py:79-121
CPU platform uses the CPU_ATTN backend exclusively. This backend is selected regardless of user configuration, as other backends are GPU-specific.
Sources: vllm/platforms/cpu.py:127-138
CPU memory management uses a different strategy than GPU platforms:
The KV cache size is taken from the VLLM_CPU_KVCACHE_SPACE environment variable (in GiB), or defaults to 50% of available memory per NUMA node.

Sources: vllm/platforms/cpu.py:141-164, vllm/platforms/cpu.py:184-211
The check_and_update_config() method applies CPU-specific configuration changes:
| Configuration | Update |
|---|---|
| Block Size | Set to 128 if not specified |
| Cache Dtype | Force to "auto" (no FP8 KV cache support) |
| Async Scheduling | Set scheduler_config.async_scheduling = False |
| Distributed Backend | Force to "mp" (multiprocessing) if world_size > 1 |
| Worker Class | Set to vllm.v1.worker.cpu_worker.CPUWorker |
| CUDA Graphs | Set cudagraph_capture_sizes = [] (not supported) |
| Compilation Mode | Convert VLLM_COMPILE to DYNAMO_TRACE_ONCE with inductor backend (or eager in VLLM_CPU_CI_ENV) |
| DBO (Dual-Batch Overlap) | Disabled |
| Cascade Attention | Disabled (model_config.disable_cascade_attn = True) |
| MLA | Disable chunked prefill and adjust max_num_batched_tokens |
Sources: vllm/platforms/cpu.py:180-263
The CPU platform sets several environment variables during initialization.

Sources: vllm/platforms/cpu.py:268-342
On Linux systems, CPU platform provides utilities to manage CPU core allocation across NUMA nodes.
CPU core topology helpers in CpuPlatform
The LogicalCPUInfo dataclass vllm/platforms/cpu.py42-68 tracks CPU topology with fields id, physical_core, and numa_node.
discover_numa_topology() vllm/platforms/cpu.py407-458 is a separate utility that inspects /sys/devices/system/node and /sys/devices/system/cpu to find per-NUMA-node physical core sets, used by the NIXL KV connector to pin cores for start_kv_load() during disaggregated prefilling.
Sources: vllm/platforms/cpu.py:42-68, vllm/platforms/cpu.py:365-404, vllm/platforms/cpu.py:407-458
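A key building block of this sysfs-based discovery is expanding the kernel's CPU-list notation (the format used in files like /sys/devices/system/node/nodeN/cpulist). The sketch below parses strings rather than reading /sys, so it stays self-contained; the helper names are illustrative, not vLLM's.

```python
# Hedged sketch of sysfs NUMA discovery: expand kernel cpulist strings
# (e.g. "0-3,8,10-11") and group CPU ids per NUMA node.
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a sysfs CPU list like '0-3,8' into explicit CPU ids."""
    cpus: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(node_cpulists: dict[int, str]) -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids, from per-node 'cpulist' file contents."""
    return {node: parse_cpu_list(spec) for node, spec in node_cpulists.items()}
```

The real discover_numa_topology() additionally cross-references /sys/devices/system/cpu to distinguish physical cores from SMT siblings.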
CPU platform uses a specialized communicator for distributed operations:
- Communicator class: vllm.distributed.device_communicators.cpu_communicator.CpuCommunicator
- Custom all-reduce is not used (use_custom_allreduce() returns False)

Sources: vllm/platforms/cpu.py:405-409
- Some capability checks return True but with a limited implementation
- is_pin_memory_available() returns False
- Only the "mp" distributed executor backend is supported
- Several unsupported features raise NotImplementedError

Sources: vllm/platforms/cpu.py:127-138, vllm/platforms/cpu.py:397-421
The TpuPlatform class is provided by the external tpu_inference package. The file vllm/platforms/tpu.py is a thin shim that imports TpuPlatform from that package at runtime.
The tpu_inference package must be installed separately to use TPU inference in vLLM. All platform logic (attention backend selection, device communication, configuration updates, etc.) is implemented inside that package.
Sources: vllm/platforms/tpu.py:1-21
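The shim pattern described above amounts to a deferred import with a clear failure message. The sketch below makes the importer injectable so it can be exercised without TPU hardware; the import path and error wording are assumptions for illustration, not vLLM's exact code.

```python
# Hedged sketch of the tpu.py shim: import TpuPlatform from the external
# tpu_inference package at runtime, or fail with an actionable message.
def load_tpu_platform(importer=None):
    if importer is None:
        def importer():
            # Hypothetical import path; the real one is set by tpu_inference.
            from tpu_inference.platform import TpuPlatform
            return TpuPlatform
    try:
        return importer()
    except ImportError as e:
        raise ImportError(
            "TPU support requires the external 'tpu_inference' package; "
            "install it to run vLLM on TPU.") from e
```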
Key properties (as provided by tpu_inference):

| Property | Value |
|---|---|
| Enum | PlatformEnum.TPU |
| Attention Backend | Pallas (JAX/XLA kernel language) |
| Distributed Backend | "gloo" |
| Compilation Backend | "openxla" |
| Worker Class | vllm.v1.worker.tpu_worker.TPUWorker |
| Inference Mode | torch.no_grad() (XLA does not support inference_mode) |
Notable behavioral constraints imposed by the TPU platform include:
- The bfloat16 dtype is efficient on TPU hardware.
- Compilation uses DYNAMO_TRACE_ONCE with the openxla backend.
- Synchronous weight loading (use_sync_weight_loader() returns True) is required for XLA operation ordering.

Sources: vllm/platforms/tpu.py:1-21
The XPUPlatform class provides support for Intel Data Center GPUs (formerly known as Intel Arc GPUs for data centers) using Intel Extension for PyTorch (IPEX).
| Property | Value |
|---|---|
| Enum | PlatformEnum.XPU |
| Device Type | "xpu" |
| Dispatch Key | "XPU" |
| Distributed Backend | "xccl" (Intel oneCCL, via PyTorch's xccl backend) |
| Device Control Env Var | ZE_AFFINITY_MASK |
| Worker Class | vllm.v1.worker.xpu_worker.XPUWorker |
| Ray Device Key | "GPU" |
Sources: vllm/platforms/xpu.py:30-39
The XPU platform does not import the standard vllm._C module (CUDA-specific kernels); it imports only the MoE kernels.

Sources: vllm/platforms/xpu.py:36-39
XPU supports Flash Attention, Triton, and Triton MLA backends. Flash Attention is the default for standard attention; MLA automatically uses Triton MLA.
XPU get_attn_backend_cls() dispatch logic
Important: XPU attention kernels only support the NHD KV cache layout (dimensions ordered as num_tokens × num_heads × head_dim). set_kv_cache_layout("NHD") is called unconditionally at the top of get_attn_backend_cls().
Sources: vllm/platforms/xpu.py:48-87
The check_and_update_config() method applies XPU-specific settings:
| Configuration | Update | Reason |
|---|---|---|
| Block Size | Set to 64 if not specified | Optimized for chunked prefill on XPU |
| Compile Sizes | Set to [] if None | Initialize compilation configuration |
| Default Attention Backend | Set to FLASH_ATTN if not specified | Flash Attention is XPU's primary backend |
| CUDA Graph Mode | NONE if supports_xpu_graph() is False or world_size_across_dp > 1; otherwise PIECEWISE when backend is FLASH_ATTN | Flash Attention SYCL-TLA kernels cannot be captured in full graph mode |
| Worker Class | Set to vllm.v1.worker.xpu_worker.XPUWorker if "auto" | Allows custom workers (e.g., vllm-omni workers) |
| LoRA Config | Set compilation mode to NONE if LoRA enabled | LoRA not compatible with compilation |
| KV Transfer Config | Set enable_permute_local_kv = True | Required for KV transfer |
| MLA | Disable chunked prefill if MLA enabled | MLA requires special handling on non-GPU platforms |
Sources: vllm/platforms/xpu.py:160-222
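The CUDA-graph-mode row in the table above encodes a conditional worth spelling out. The sketch below mirrors that rule; the behavior for non-Flash-Attention backends when XPU graphs are available is an assumption, since the table only specifies the FLASH_ATTN case.

```python
# Illustrative encoding of the XPU CUDA-graph-mode rule from the table.
def xpu_cudagraph_mode(supports_xpu_graph: bool,
                       world_size_across_dp: int,
                       backend: str) -> str:
    if not supports_xpu_graph or world_size_across_dp > 1:
        return "NONE"
    if backend == "FLASH_ATTN":
        # Full-graph capture is unsupported for the SYCL-TLA kernels,
        # so piecewise capture is used instead.
        return "PIECEWISE"
    return "NONE"  # assumed fallback for other backends
```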
XPU provides methods to query device properties such as the device name and total device memory.

Sources: vllm/platforms/xpu.py:117-131
XPU supports both native XPU and Triton-based LoRA implementations.

Sources: vllm/platforms/xpu.py:121-126
XPU generally supports standard floating-point types, with specific restrictions (for example, bfloat16 is rejected on the Arc A770). For FP8, XPU uses the OCP standard dtype (e4m3fn).
The is_data_center_gpu() helper checks get_device_name() for the string "data center gpu", enabling data-center-specific code paths.
Sources: vllm/platforms/xpu.py:243-250
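The detection described above is a simple substring check. The sketch below takes the device name as an argument so it runs anywhere; case-insensitive matching is an assumption about the real helper, which reads the name via get_device_name().

```python
# Hedged sketch of the is_data_center_gpu() check described above.
def is_data_center_gpu(device_name: str) -> bool:
    # Substring match on the reported device name; case handling is assumed.
    return "data center gpu" in device_name.lower()
```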
XPU uses PyTorch's XPU memory functions. Block transfer operations are implemented similarly to CUDA.

Sources: vllm/platforms/xpu.py:194-255
- Communicator class: vllm.distributed.device_communicators.xpu_communicator.XpuCommunicator
- Distributed backend: xccl (Intel oneCCL, via PyTorch's xccl backend)
- Not usable if xccl is not available in the current PyTorch build

Sources: vllm/platforms/xpu.py:252-261
Like TPU, XPU uses torch.no_grad() instead of torch.inference_mode():
Sources: vllm/platforms/xpu.py:134-135
- get_device_capability() returns None; the XPU capability format differs from CUDA's major/minor scheme
- Some unsupported operations raise NotImplementedError
- Compilation mode is set to NONE when LoRA adapters are active
- check_if_supports_dtype() raises ValueError for bfloat16 on the Arc A770
- MLA requires a max_num_batched_tokens adjustment
- XPU graph capture requires supports_xpu_graph() (a PyTorch version check); data-parallel multi-instance setups disable XPU graphs

Sources: vllm/platforms/xpu.py:48-222
The following table summarizes key differences between the three alternative platforms:
| Feature | CPU | TPU | XPU |
|---|---|---|---|
| Attention Backend | CPU_ATTN | Pallas (via tpu_inference) | FLASH_ATTN / TRITON_ATTN / TRITON_MLA |
| Distributed Backend | gloo | gloo | xccl |
| Worker Class | CPUWorker | TPUWorker | XPUWorker |
| Compilation | DYNAMO_TRACE_ONCE (inductor) | DYNAMO_TRACE_ONCE (openxla) | PIECEWISE (when XPU graphs available) |
| Static Graph Mode | No | No | Yes (CUDAGraphWrapper) |
| Block Size Default | 128 | Computed by Pallas | 64 |
| KV Cache Layout | Any | Any | NHD only |
| FP8 dtype | Not supported | e4m3fn | e4m3fn |
| Hybrid KV Cache | Yes | No | Yes |
| Pin Memory | No | No | Yes |
| MLA Support | No (raises error) | Via Pallas backend | TRITON_MLA (chunked prefill disabled) |
| Sparse Attention | No | No | No |
| Custom AllReduce | No | No | No |
| Inference Mode | torch.no_grad() | torch.no_grad() | torch.no_grad() |
| Device Control Env | CPU_VISIBLE_MEMORY_NODES | via tpu_inference | ZE_AFFINITY_MASK |
Sources: vllm/platforms/cpu.py:72-509, vllm/platforms/tpu.py:1-21, vllm/platforms/xpu.py:30-309
Platform selection happens automatically at vLLM initialization based on available hardware and environment variables. The selection process is managed by the platform detection system (see Platform Abstraction).
Once a platform is selected, its check_and_update_config() method is called to apply platform-specific configuration constraints and optimizations.
Sources: vllm/platforms/interface.py:404-414
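The two-step flow above can be sketched as: pick a platform name from the detected hardware, with CPU as the universal fallback. The probe set and the relative priority between TPU and XPU are illustrative assumptions (a real machine has at most one of them); the actual detection logic lives in the platform detection system.

```python
# Hedged sketch of platform selection: accelerators win over CPU, which
# is the universal fallback. Priority order here is an assumption.
def select_platform(available: set) -> str:
    for name in ("tpu", "xpu"):
        if name in available:
            return name
    return "cpu"  # CPU is always available as a fallback
```

After selection, the chosen platform's check_and_update_config() is invoked once on the engine configuration, as described above.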
Each platform returns specific class paths for key components:
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.cpu_attn.CPUAttentionBackend |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_cpu.PunicaWrapperCPU |
| Device Communicator | vllm.distributed.device_communicators.cpu_communicator.CpuCommunicator |
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.pallas.PallasAttentionBackend |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_tpu.PunicaWrapperTPU |
| Device Communicator | vllm.distributed.device_communicators.tpu_communicator.TpuCommunicator |
| Component | Class Path |
|---|---|
| Attention Backend | vllm.v1.attention.backends.flash_attn.FlashAttentionBackend (default) |
| Punica Wrapper | vllm.lora.punica_wrapper.punica_xpu.PunicaWrapperXPU (default) or vllm.lora.punica_wrapper.punica_gpu.PunicaWrapperGPU (with Triton) |
| Device Communicator | vllm.distributed.device_communicators.xpu_communicator.XpuCommunicator |
Sources: vllm/platforms/cpu.py:127-409, vllm/platforms/tpu.py:58-222, vllm/platforms/xpu.py:42-211