This page documents llama.cpp's multi-GPU and distributed inference capabilities: how model weights and computations are split across multiple GPUs, either to run models larger than a single GPU's VRAM or to accelerate inference through parallelization.
For single-GPU backend implementations, see GPU and Accelerator Backends. For backend scheduling and memory management, see Backend System and Registration.
llama.cpp supports two primary approaches to multi-device inference:
Diagram: Multi-GPU and Distributed Inference Architecture
Sources: ggml/src/ggml-cuda/ggml-cuda.cu776-1043 README.md289
llama.cpp uses a tensor_split array to determine how to distribute model weights across available GPUs. The split ratios define the proportion of model layers each device should handle.
Key Data Structures:
| Structure | Purpose | Location |
|---|---|---|
| ggml_cuda_device_info::default_tensor_split | Default split ratios per device | ggml/src/ggml-cuda/ggml-cuda.cu231-232 |
| ggml_backend_cuda_split_buffer_type_context::tensor_split | Per-buffer split configuration | ggml/src/ggml-cuda/ggml-cuda.cu777-788 |
Diagram: Tensor Split Ratio Calculation Flow
The default split is calculated proportionally to each GPU's VRAM:
device_vram / total_vram

Sources: ggml/src/ggml-cuda/ggml-cuda.cu195-313
Split buffers are a specialized buffer type that manages memory allocation and operations across multiple devices.
Diagram: Split Buffer Type Structure
Tensors are split along their row dimension (first dimension in row-major order):
Split Calculation Algorithm:
Diagram: Row-Based Splitting Algorithm
Key functions:
get_row_rounding() - Determines alignment requirements based on tensor type

Sources: ggml/src/ggml-cuda/ggml-cuda.cu778-803 ggml/src/ggml-cuda/ggml-cuda.cu890-955
Matrix multiplication operations are distributed across GPUs based on the split configuration:
Diagram: Multi-GPU Matrix Multiplication Flow
The mul_mat_q path handles quantized weights split across devices:
Key Components:
| Component | Purpose | File |
|---|---|---|
| ggml_cuda_mul_mat_q | Main entry point for batched quantized matmul | ggml/src/ggml-cuda/mmq.cu71-218 |
| ggml_cuda_op_mul_mat_q | Per-device entry point for split execution | ggml/src/ggml-cuda/mmq.cu220-260 |
| mmq_args | Arguments structure encoding row range and strides | ggml/src/ggml-cuda/mmq.cu150-156 |
| mul_mat_q_case&lt;TYPE&gt; | Template specialization per quantization type | ggml/src/ggml-cuda/mmq.cu6-68 |
Split Execution:
ggml_cuda_op_mul_mat_q is called per-device with explicit row_low and row_high parameters:
- Input activations (src1) are broadcast to all devices
- nrows_dst on the main device equals the full output size; on other devices it equals only the local row range

Sources: ggml/src/ggml-cuda/mmq.cu71-218 ggml/src/ggml-cuda/mmq.cu220-260
The float path uses different kernels optimized for FP16/FP32:
Architecture-Specific Kernels:
| Architecture | Tile Size | Warp Config | File Reference |
|---|---|---|---|
| Volta (SM70) | 32×8 | Special mirrored layout | ggml/src/ggml-cuda/mmf.cuh69-73 |
| Turing+ (SM75+) | 16×8 | Tensor core MMA | ggml/src/ggml-cuda/mmf.cuh74-78 |
| AMD MFMA | 16×8 | MFMA instructions | ggml/src/ggml-cuda/mmf.cuh63-67 |
| AMD WMMA | 16×8 | WMMA instructions | ggml/src/ggml-cuda/mmf.cuh58-62 |
Sources: ggml/src/ggml-cuda/mmf.cu13-131 ggml/src/ggml-cuda/mmf.cuh48-165
When multiple GPUs need to share data, llama.cpp can use direct peer-to-peer (P2P) memory access:
Diagram: Peer-to-Peer Access Flow
Key Functions:
- cudaDeviceCanAccessPeer() - Check if P2P is supported between two devices
- cudaDeviceEnablePeerAccess() - Enable P2P for a device pair
- cudaMemcpyPeerAsync() - Direct GPU-to-GPU copy

Conditional Compilation:

The GGML_CUDA_NO_PEER_COPY macro disables P2P copying at compile time.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu643-662 ggml/src/ggml-cuda/ggml-cuda.cu1156-1234
When a tensor is allocated in a split buffer, it's divided across devices:
Diagram: Split Tensor Initialization Sequence
Sources: ggml/src/ggml-cuda/ggml-cuda.cu890-955
Set Tensor (Host → GPU Split):
Diagram: Set Tensor Operation Flow
Get Tensor (GPU Split → Host):
Similar flow but copies from multiple devices to host, concatenating results.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu960-1033
These flags are parsed via common/arg.cpp and stored in common_params:
| Flag | Short | Description |
|---|---|---|
| --gpu-layers N | -ngl N | Number of model layers to offload to GPU(s). Set to a large value to offload all layers. |
| --tensor-split RATIO | -ts RATIO | Fraction of model layers to split across GPUs. Format: comma-separated ratios, e.g. 3,1 for 75%/25%. Values do not need to sum to 1. |
| --main-gpu N | -mg N | Index of the primary GPU device that aggregates final results (default: 0). |
| --rpc SERVERS |  | Comma-separated list of RPC server endpoints for distributed inference (e.g. host1:50052,host2:50052). |
Examples:
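Illustrative invocations using the flags above; the binary name and model path are placeholders for a local build.

```sh
# Offload all layers and split weights 75%/25% across two GPUs
./llama-cli -m model.gguf -ngl 99 --tensor-split 3,1

# Same split, but aggregate final results on GPU 1 instead of GPU 0
./llama-cli -m model.gguf -ngl 99 -ts 3,1 -mg 1

# Restrict which GPUs are visible before launching
CUDA_VISIBLE_DEVICES=0,2 ./llama-cli -m model.gguf -ngl 99
```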
Sources: README.md289
Environment Variables:
| Variable | Purpose | Default |
|---|---|---|
| CUDA_VISIBLE_DEVICES | Restrict visible GPUs | All devices |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Use CUDA Unified Memory | Disabled |
Runtime Configuration:
- Split ratios are set via --tensor-split / -ts
- The maximum device count is GGML_CUDA_MAX_DEVICES (defined in ggml/src/ggml-cuda/common.cuh)

Split Buffer Creation:
The public API for creating split buffer types is ggml_backend_cuda_split_buffer_type(tensor_split), where tensor_split is a float[GGML_CUDA_MAX_DEVICES] array of ratios.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1309-1387 ggml/src/ggml-cuda/common.cuh151
Each device maintains its own memory pool, created per-device via ggml_backend_cuda_context::new_pool_for_device:
Pool Types:
Legacy Pool (ggml_cuda_pool_leg):
VMM Pool (ggml_cuda_pool_vmm):
- Uses the CUDA virtual memory management API (cuMemCreate, cuMemMap, cuMemSetAccess)
- Selected when vmm_granularity > 0

Sources: ggml/src/ggml-cuda/ggml-cuda.cu323-538 ggml/src/ggml-cuda/ggml-cuda.cu530-538
The RPC backend (ggml_backend_rpc) enables distributed inference across network-connected machines. The server-side binary is built from tools/rpc/ and the client is a GGML backend registered like any other hardware backend.
RPC System Architecture
Diagram: RPC Backend and ggml_backend_sched Integration
Key Properties:
| Property | Detail |
|---|---|
| Interface | ggml_backend — same interface as local backends |
| Transport | Binary protocol over TCP |
| Server binary | tools/rpc/rpc-server |
| Registration | ggml_backend_rpc_add_device(host, port) |
| Multiple servers | Supported — each server appears as a separate backend device |
| Mixed local+remote | Supported via ggml_backend_sched |
Usage:
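Illustrative commands, assuming a local build; host addresses, the port, and the model path are placeholders.

```sh
# On each worker machine: start an RPC server
./rpc-server -p 50052

# On the client: distribute layers across two remote servers
./llama-cli -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052
```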
Sources: README.md289 tools/rpc
Split Buffer Overhead:
| Operation | Overhead | Mitigation |
|---|---|---|
| Tensor Initialization | Minimal | One-time cost at load |
| Set/Get Tensor | 2-3× slower | P2P access when available |
| Compute | ~5% | Dominated by kernel time |
Factors Affecting Split Performance:
Recommendation:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu195-313
Different quantization types have varying memory access patterns:
Quantization Support Matrix:
| Type | MMQ Support | Recommended for Multi-GPU |
|---|---|---|
| Q4_0, Q4_1 | ✓ | Yes - Low memory overhead |
| Q5_0, Q5_1 | ✓ | Yes - Good balance |
| Q8_0 | ✓ | Yes - Fast dequantization |
| Q2_K, Q3_K | ✓ | Yes - Maximum memory savings |
| Q4_K, Q5_K, Q6_K | ✓ | Yes - Best quality/size ratio |
| IQ1_S, IQ2_XXS, etc. | ✓ | Yes - Extreme compression |
| MXFP4 | ✓ (Blackwell+) | Yes - Native FP4 support |
Sources: ggml/src/ggml-cuda/mmq.cu262-366
Debug Information:
The split buffer context tracks:
Common Issues:
- P2P copies failing or disabled: check GGML_CUDA_NO_PEER_COPY and system topology

Key Metrics:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu547-564
When working with split tensors programmatically:
Checking if a tensor is split:
Getting device for a row: The split buffer automatically routes operations to the correct device based on row ranges.
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1285-1308
Applications can override default splits by creating custom buffer types:
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1363-1387
Sources: ggml/src/ggml-cuda/ggml-cuda.cu1-98