This page documents llama.cpp's multi-GPU and distributed inference capabilities: how model weights and computations are split across multiple GPUs, either to run models larger than a single GPU's VRAM or to accelerate inference through parallelization.
For single-GPU backend implementations, see GPU and Accelerator Backends. For backend scheduling and memory management, see Backend System and Registration.
llama.cpp supports two primary approaches to multi-device inference:
Diagram: Multi-GPU and Distributed Inference Architecture
Sources: ggml/src/ggml-cuda/ggml-cuda.cu776-1043 README.md289
llama.cpp uses a tensor_split array to determine how to distribute model weights across available GPUs. The split ratios define the proportion of model layers each device should handle.
Key Data Structures:
| Structure | Purpose | Location |
|---|---|---|
| ggml_cuda_device_info::default_tensor_split | Default split ratios per device | ggml/src/ggml-cuda/ggml-cuda.cu231-232 |
| ggml_backend_cuda_split_buffer_type_context::tensor_split | Per-buffer split configuration | ggml/src/ggml-cuda/ggml-cuda.cu777-788 |
Diagram: Tensor Split Ratio Calculation Flow
The default split is calculated proportionally to each GPU's VRAM:
device_vram / total_vram

Sources: ggml/src/ggml-cuda/ggml-cuda.cu195-313
Split buffers are a specialized buffer type that manages memory allocation and operations across multiple devices.
Diagram: Split Buffer Type Structure
Tensors are split along their row dimension (first dimension in row-major order):
Split Calculation Algorithm:
Diagram: Row-Based Splitting Algorithm
Key functions:
get_row_rounding() - Determines alignment requirements based on tensor type

Sources: ggml/src/ggml-cuda/ggml-cuda.cu778-803 ggml/src/ggml-cuda/ggml-cuda.cu890-955
Matrix multiplication operations are distributed across GPUs based on the split configuration:
Diagram: Multi-GPU Matrix Multiplication Flow
The mul_mat_q path handles quantized weights split across devices:
Key Components:
| Component | Purpose | File |
|---|---|---|
| ggml_cuda_mul_mat_q | Main entry point for batched quantized matmul | ggml/src/ggml-cuda/mmq.cu71-218 |
| ggml_cuda_op_mul_mat_q | Per-device entry point for split execution | ggml/src/ggml-cuda/mmq.cu220-260 |
| mmq_args | Arguments structure encoding row range and strides | ggml/src/ggml-cuda/mmq.cu150-156 |
| mul_mat_q_case&lt;TYPE&gt; | Template specialization per quantization type | ggml/src/ggml-cuda/mmq.cu6-68 |
Split Execution:
ggml_cuda_op_mul_mat_q is called per-device with explicit row_low and row_high parameters:
- Input activations (src1) are broadcast to all devices
- nrows_dst on the main device equals the full output size; on other devices it equals only the local row range

Sources: ggml/src/ggml-cuda/mmq.cu71-218 ggml/src/ggml-cuda/mmq.cu220-260
The float path uses different kernels optimized for FP16/FP32:
Architecture-Specific Kernels:
| Architecture | Tile Size | Warp Config | File Reference |
|---|---|---|---|
| Volta (SM70) | 32×8 | Special mirrored layout | ggml/src/ggml-cuda/mmf.cuh69-73 |
| Turing+ (SM75+) | 16×8 | Tensor core MMA | ggml/src/ggml-cuda/mmf.cuh74-78 |
| AMD MFMA | 16×8 | MFMA instructions | ggml/src/ggml-cuda/mmf.cuh63-67 |
| AMD WMMA | 16×8 | WMMA instructions | ggml/src/ggml-cuda/mmf.cuh58-62 |
Sources: ggml/src/ggml-cuda/mmf.cu13-131 ggml/src/ggml-cuda/mmf.cuh48-165
When multiple GPUs need to share data, llama.cpp can use direct peer-to-peer (P2P) memory access:
Diagram: Peer-to-Peer Access Flow
Key Functions:
- cudaDeviceCanAccessPeer() - Check if P2P is supported between two devices
- cudaDeviceEnablePeerAccess() - Enable P2P for a device pair
- cudaMemcpyPeerAsync() - Direct GPU-to-GPU copy

Conditional Compilation:

The GGML_CUDA_NO_PEER_COPY macro disables P2P copying at compile time.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu643-662 ggml/src/ggml-cuda/ggml-cuda.cu1156-1234
When a tensor is allocated in a split buffer, it's divided across devices:
Diagram: Split Tensor Initialization Sequence
Sources: ggml/src/ggml-cuda/ggml-cuda.cu890-955
Set Tensor (Host → GPU Split):
Diagram: Set Tensor Operation Flow
Get Tensor (GPU Split → Host):
Similar flow but copies from multiple devices to host, concatenating results.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu960-1033
These flags are parsed via common/arg.cpp and stored in common_params:
| Flag | Short | Description |
|---|---|---|
| --gpu-layers N | -ngl N | Number of model layers to offload to GPU(s). Set to a large value to offload all layers. |
| --tensor-split RATIO | -ts RATIO | Fraction of model layers to split across GPUs. Format: comma-separated ratios, e.g. 3,1 for 75%/25%. Values do not need to sum to 1. |
| --main-gpu N | -mg N | Index of the primary GPU device that aggregates final results (default: 0). |
| --rpc SERVERS |  | Comma-separated list of RPC server endpoints for distributed inference (e.g. host1:50052,host2:50052). |
Examples:
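Illustrative invocations using the flags above; the binary name and model path are placeholders for a local build.

```sh
# Offload all layers and split weights 75%/25% across two GPUs
./llama-cli -m model.gguf -ngl 99 --tensor-split 3,1

# Same split, but aggregate final results on GPU 1 instead of GPU 0
./llama-cli -m model.gguf -ngl 99 -ts 3,1 -mg 1

# Restrict which GPUs are visible before launching
CUDA_VISIBLE_DEVICES=0,2 ./llama-cli -m model.gguf -ngl 99
```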
Sources: README.md289
Environment Variables:
| Variable | Purpose | Default |
|---|---|---|
| CUDA_VISIBLE_DEVICES | Restrict visible GPUs | All devices |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Use CUDA Unified Memory | Disabled |
Runtime Configuration:
- Split ratios are set via --tensor-split / -ts
- The maximum device count is GGML_CUDA_MAX_DEVICES (defined in ggml/src/ggml-cuda/common.cuh)

Split Buffer Creation:
The public API for creating split buffer types is ggml_backend_cuda_split_buffer_type(tensor_split), where tensor_split is a float[GGML_CUDA_MAX_DEVICES] array of ratios.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1309-1387 ggml/src/ggml-cuda/common.cuh151
Each device maintains its own memory pool, created per-device via ggml_backend_cuda_context::new_pool_for_device:
Pool Types:
Legacy Pool (ggml_cuda_pool_leg):
VMM Pool (ggml_cuda_pool_vmm):
- Uses the CUDA virtual memory management API (cuMemCreate, cuMemMap, cuMemSetAccess)
- Selected when vmm_granularity > 0

Sources: ggml/src/ggml-cuda/ggml-cuda.cu323-538 ggml/src/ggml-cuda/ggml-cuda.cu530-538
The RPC backend (ggml_backend_rpc) enables distributed inference across network-connected machines. The server-side binary is built from tools/rpc/ and the client is a GGML backend registered like any other hardware backend.
RPC System Architecture
Diagram: RPC Backend and ggml_backend_sched Integration
Key Properties:
| Property | Detail |
|---|---|
| Interface | ggml_backend — same interface as local backends |
| Transport | Binary protocol over TCP |
| Server binary | tools/rpc/rpc-server |
| Registration | ggml_backend_rpc_add_device(host, port) |
| Multiple servers | Supported — each server appears as a separate backend device |
| Mixed local+remote | Supported via ggml_backend_sched |
Usage:
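Illustrative commands, assuming a local build; host addresses, the port, and the model path are placeholders.

```sh
# On each worker machine: start an RPC server
./rpc-server -p 50052

# On the client: distribute layers across two remote servers
./llama-cli -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052
```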
Sources: README.md289 tools/rpc
Split Buffer Overhead:
| Operation | Overhead | Mitigation |
|---|---|---|
| Tensor Initialization | Minimal | One-time cost at load |
| Set/Get Tensor | 2-3× slower | P2P access when available |
| Compute | ~5% | Dominated by kernel time |
Factors Affecting Split Performance:
Recommendation:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu195-313
Different quantization types have varying memory access patterns:
Quantization Support Matrix:
| Type | MMQ Support | Recommended for Multi-GPU |
|---|---|---|
| Q4_0, Q4_1 | ✓ | Yes - Low memory overhead |
| Q5_0, Q5_1 | ✓ | Yes - Good balance |
| Q8_0 | ✓ | Yes - Fast dequantization |
| Q2_K, Q3_K | ✓ | Yes - Maximum memory savings |
| Q4_K, Q5_K, Q6_K | ✓ | Yes - Best quality/size ratio |
| IQ1_S, IQ2_XXS, etc. | ✓ | Yes - Extreme compression |
| MXFP4 | ✓ (Blackwell+) | Yes - Native FP4 support |
Sources: ggml/src/ggml-cuda/mmq.cu262-366
Debug Information:
The split buffer context tracks:
Common Issues:
- P2P copies failing or disabled: check GGML_CUDA_NO_PEER_COPY and system topology

Key Metrics:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu547-564
When working with split tensors programmatically:
Checking if a tensor is split:
Getting device for a row: The split buffer automatically routes operations to the correct device based on row ranges.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1285-1308
Applications can override default splits by creating custom buffer types:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1363-1387
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1-98