The Vulkan backend provides cross-platform GPU acceleration for GGML tensor operations using the Vulkan API. It enables compute shader-based inference on GPUs from multiple vendors (NVIDIA, AMD, Intel) across Windows, Linux, macOS (via MoltenVK), Android, and other platforms that support Vulkan.
For vendor-specific backends optimized for particular GPU families, see CUDA Backend (NVIDIA), Metal Backend (Apple), and HIP and Other GPU Backends. For the backend abstraction system that coordinates multiple backends, see Backend System and Registration.
The Vulkan backend is structured as a compute-focused implementation that compiles GLSL shaders to SPIR-V and executes them via Vulkan compute pipelines. The backend integrates with GGML's backend abstraction system and provides implementations for matrix multiplication, quantization, element-wise operations, and specialized operations like flash attention.
Diagram: Vulkan Backend Component Architecture
The backend is organized into several key layers: the backend interface implementation, device management with queue and pipeline resources, a shader compilation system, and memory management for GPU buffers.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp116-872 ggml/src/ggml-vulkan/ggml-vulkan.cpp568-872
vk_device_struct represents a single Vulkan physical+logical device. It is reference-counted via std::shared_ptr<vk_device_struct> (vk_device). Key fields:
| Field | Type | Purpose |
|---|---|---|
physical_device | vk::PhysicalDevice | Physical device handle |
device | vk::Device | Logical device |
compute_queue | vk_queue | Compute + graphics queue |
transfer_queue | vk_queue | DMA transfer queue (may equal compute) |
single_queue | bool | True when compute and transfer share a queue |
dsl | vk::DescriptorSetLayout | Shared descriptor set layout |
all_pipelines | vector<vk_pipeline_ref> | Weak refs to all allocated pipelines |
mutex | recursive_mutex | Guards pipeline compilation and submission |
fence | vk::Fence | Per-device synchronization fence |
sync_staging | vk_buffer | Staging buffer for synchronous transfers |
architecture | vk_device_architecture | Detected GPU microarchitecture |
pinned_memory | vector<tuple<void*, size_t, vk_buffer>> | Imported host memory pages |
vk_queue holds a vk::Queue handle and an associated vk_command_pool. vk_command_pool wraps a vk::CommandPool plus a vector<vk::CommandBuffer> pool (indexed by cmd_buffer_idx), allocated with VK_COMMAND_POOL_CREATE_TRANSIENT_BIT.
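The buffer-recycling scheme can be sketched on the CPU; `FakeCmdBuffer`, `acquire`, and `reset` are illustrative stand-ins (the real pool hands out vk::CommandBuffer handles and resets them through the Vulkan API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for vk::CommandBuffer; real code allocates from the
// Vulkan device. This sketch shows only the index-recycling idea: buffers are
// handed out sequentially via cmd_buffer_idx and all become reusable at once
// after a pool reset (analogous to vkResetCommandPool on a transient pool).
struct FakeCmdBuffer { int id; };

struct CmdPool {
    std::vector<FakeCmdBuffer> buffers;  // grows on demand, never shrinks
    size_t cmd_buffer_idx = 0;           // next buffer to hand out
    int next_id = 0;

    FakeCmdBuffer& acquire() {
        if (cmd_buffer_idx == buffers.size()) {
            buffers.push_back(FakeCmdBuffer{next_id++});  // allocate a new one
        }
        return buffers[cmd_buffer_idx++];
    }
    void reset() { cmd_buffer_idx = 0; }  // every buffer is reusable again
};
```

Reusing command buffers this way avoids per-submission allocation, which is why the pool is created with the transient bit.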
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp192-226 ggml/src/ggml-vulkan/ggml-vulkan.cpp570-877
The get_device_architecture() function maps a vk::PhysicalDevice to a vk_device_architecture enum value by examining vendor ID, available extensions, and device properties:
Architecture Detection Logic
vk_device_architecture | Vendor | Detection Criteria |
|---|---|---|
AMD_GCN | AMD | maxSubgroupSize == minSubgroupSize == 64 |
AMD_RDNA1 | AMD | wavefrontsPerSimd == 20, min=32/max=64 |
AMD_RDNA2 | AMD | RDNA wavefront, no int dot 4x8 mixed acceleration |
AMD_RDNA3 | AMD | integerDotProduct4x8BitPackedMixedSignednessAccelerated |
INTEL_XE2 | Intel | minSubgroupSize == 16 (SIMD16 vs legacy SIMD8) |
NVIDIA_PRE_TURING | NVIDIA | No VK_KHR_cooperative_matrix extension |
NVIDIA_TURING | NVIDIA | Cooperative matrix + shaderWarpsPerSM == 32 |
OTHER | Any | Default / extensions absent |
Extensions queried during AMD detection: VK_AMD_shader_core_properties, VK_KHR_shader_integer_dot_product, VK_EXT_subgroup_size_control. Extensions queried during NVIDIA detection: VK_KHR_cooperative_matrix, VK_NV_shader_sm_builtins.
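As a rough illustration, the AMD branch of the decision tree can be written as a CPU-testable function; the field names and thresholds follow the detection criteria in the table above, not the exact source:

```cpp
#include <cassert>
#include <cstdint>

enum class Arch { AMD_GCN, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3, OTHER };

// Simplified sketch of the AMD branch of get_device_architecture().
struct AmdProps {
    uint32_t min_subgroup_size;
    uint32_t max_subgroup_size;
    uint32_t wavefronts_per_simd;        // from VK_AMD_shader_core_properties
    bool int_dot_4x8_mixed_accelerated;  // from VK_KHR_shader_integer_dot_product
};

Arch detect_amd(const AmdProps& p) {
    // GCN: fixed wave64 (min == max == 64)
    if (p.min_subgroup_size == 64 && p.max_subgroup_size == 64) return Arch::AMD_GCN;
    // RDNA: wave32/wave64 capable
    if (p.min_subgroup_size == 32 && p.max_subgroup_size == 64) {
        if (p.wavefronts_per_simd == 20) return Arch::AMD_RDNA1;
        return p.int_dot_4x8_mixed_accelerated ? Arch::AMD_RDNA2 == Arch::AMD_RDNA2
                   ? Arch::AMD_RDNA3 : Arch::AMD_RDNA2
                   : Arch::AMD_RDNA2;
    }
    return Arch::OTHER;
}
```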
Diagram: get_device_architecture() Decision Tree
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp250-371
Key boolean fields populated on vk_device_struct during device initialization:
| Field | Purpose |
|---|---|
coopmat_support, coopmat2 | Cooperative matrix (tensor core) paths |
coopmat_acc_f32_support, coopmat_acc_f16_support | Accumulator precision available |
coopmat_bf16_support, coopmat_int_support | BF16 and integer cooperative matrix |
coopmat1_fa_support | CoopMat1 flash attention path available |
bf16 | BF16 scalar shader type support |
integer_dot_product | Integer dot product instructions for quantized matmul |
shader_64b_indexing | 64-bit buffer indexing (large tensors) |
subgroup_arithmetic, subgroup_shuffle, subgroup_ballot, subgroup_clustered | Subgroup operation availability |
flash_attention_fp16 | FP16 flash attention path enabled |
uma | Unified memory architecture (zero-copy host buffers) |
external_memory_host | VK_EXT_external_memory_host for pinned memory |
pipeline_robustness | Robust pipeline access |
add_rms_fusion | Fused add+RMS norm enabled |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp580-648
vk_pipeline_struct wraps a single compute pipeline. Key fields:
| Field | Type | Purpose |
|---|---|---|
shader_module | vk::ShaderModule | Compiled SPIR-V module |
layout | vk::PipelineLayout | Push constant + descriptor set layout |
pipeline | vk::Pipeline | Vulkan compute pipeline handle |
push_constant_size | uint32_t | Size of push constants block |
parameter_count | uint32_t | Number of descriptor bindings |
wg_denoms | array<uint32_t, 3> | Workgroup size denominators for dispatch |
needed | atomic<bool> | Set true when pipeline is required |
compiled | atomic<bool> | Set true once compilation completes |
register_count | uint32_t | From pipeline executable properties |
next | vk_pipeline | Linked list for 64-bit indexing variant |
vk_matmul_pipeline_struct groups six related vk_pipeline handles (l, m, s + aligned variants). vk_matmul_pipeline2 pairs f32acc and f16acc vk_matmul_pipeline instances.
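Dispatch sizing from wg_denoms follows the usual ceiling-division pattern; this sketch assumes each denominator gives the number of output elements one workgroup covers per dimension:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Number of workgroups to dispatch per dimension: ceil(extent / denominator).
// Illustrative helper; the backend computes this when recording the dispatch.
std::array<uint32_t, 3> dispatch_counts(std::array<uint32_t, 3> extent,
                                        std::array<uint32_t, 3> wg_denoms) {
    std::array<uint32_t, 3> wg{};
    for (int i = 0; i < 3; ++i) {
        wg[i] = (extent[i] + wg_denoms[i] - 1) / wg_denoms[i];  // ceiling division
    }
    return wg;
}
```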
Diagram: vk_pipeline_struct Lazy Compilation Flow
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp122-165
Matrix multiplication uses specialized pipeline variants based on matrix dimensions and data types. Each vk_matmul_pipeline_struct holds l, m, s (large/medium/small) and a_l, a_m, a_s (aligned) vk_pipeline handles. vk_matmul_pipeline2 bundles f32acc and f16acc accumulator variants.
Field on vk_device_struct | Type | Purpose |
|---|---|---|
pipeline_matmul_f32 | vk_matmul_pipeline | FP32 × FP32 → FP32 |
pipeline_matmul_f32_f16 | vk_matmul_pipeline | FP32 × FP16 → FP32 |
pipeline_matmul_bf16 | vk_matmul_pipeline | BF16 × BF16 → FP32 |
pipeline_matmul_f16 | vk_matmul_pipeline2 | FP16 × FP16, f32acc + f16acc |
pipeline_dequant_mul_mat_mat[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Fused dequant+matmul, all quant types |
pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Integer dot product (Q8_1 B matrix) |
pipeline_matmul_id_f32 | vk_matmul_pipeline | Expert-parallel FP32 (MoE) |
pipeline_dequant_mul_mat_mat_id[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Expert-parallel dequant+matmul |
pipeline_matmul_split_k_reduce | vk_pipeline | Reduce split-K partial sums |
pipeline_quantize_q8_1_x4 | vk_pipeline | Quantize activations to Q8_1 |
The l/m/s suffixes select tile size configurations (large/medium/small). The a_* variants require aligned memory layouts and enable vectorized loads.
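A hypothetical selector illustrates the idea; the real heuristics depend on device limits and per-shader alignment values, so the thresholds below are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

enum class Tile { S, M, L };

// Illustrative variant selection: small matrices waste work on large tiles,
// so pick a tile class from the output dimensions. Thresholds are made up.
Tile pick_tile(uint32_t m, uint32_t n) {
    if (m <= 32 || n <= 32) return Tile::S;  // small tail matrices
    if (m <= 64 || n <= 64) return Tile::M;
    return Tile::L;
}

// Aligned (a_*) variants require the K extent (and strides) to be a multiple
// of the shader's load alignment, enabling vectorized loads.
bool use_aligned(uint32_t k, uint32_t align) { return k % align == 0; }
```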
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp656-680
For GPUs supporting cooperative matrices, the backend compiles additional pipeline variants:
- CoopMat1 (coopmat=true, coopmat2=false): uses GL_KHR_cooperative_matrix. Compiled when GGML_VULKAN_COOPMAT_GLSLC_SUPPORT is defined. Shader source: mul_mm.comp with #define COOPMAT. Device capability tracked by coopmat_support, coopmat_m/n/k, coopmat_acc_f32_support, and coopmat_acc_f16_support.
- CoopMat2 (coopmat2=true): uses GL_NV_cooperative_matrix2 with tensor layout intrinsics (coopMatLoadTensorNV, coopMatStoreTensorNV). Compiled when GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT is defined. Shader source: mul_mm_cm2.comp. Device capability tracked by coopmat2.

Each path provides f32acc (default) and f16acc accumulator variants; f16acc is only used when the device reports coopmat_acc_f16_support. The matmul_shaders() function in vulkan-shaders-gen.cpp controls which combinations are compiled via the coopmat and coopmat2 boolean parameters.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp427-621 ggml/src/ggml-vulkan/ggml-vulkan.cpp627-645 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp16-25 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp10-14
Shader compilation is driven at build time by the vulkan-shaders-gen executable (from vulkan-shaders-gen.cpp). The string_to_spv() function submits individual compilations; string_to_spv_func() invokes glslc as a child process.
Diagram: Shader Build Pipeline (vulkan-shaders-gen.cpp → ggml-vulkan-shaders.hpp)
Key behaviors:
- Shaders requiring cooperative matrix 2 (_cm2 in name) use --target-env=vulkan1.3; all others use vulkan1.2.
- Optimization (-O, spirv-opt) is disabled for coopmat, bf16, and rope shaders to avoid known compiler bugs.
- Dependency files (.d) are generated for incremental CMake rebuilds.
- Compilation runs up to min(16, hardware_concurrency) concurrent glslc processes via the compile slot semaphore.

Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp296-425
The matmul_shaders() function in vulkan-shaders-gen.cpp generates matrix multiplication variants. Key combinatorial dimensions:
| Dimension | Values |
|---|---|
| Input A Type | f32, f16, bf16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_k, q3_k, q4_k, q5_k, q6_k, iq1_s, iq1_m, iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq3_s, iq4_xs, iq4_nl, mxfp4 |
| Input B Type | f32, f16; optionally q8_1 for integer dot product paths |
| Accumulation | f32acc, f16acc |
| Implementation | Scalar FP32, Scalar FP16, CoopMat1 (cm1), CoopMat2 (cm2) |
| Alignment | Aligned (ALIGNED=1), Unaligned |
| Expert Parallel (MUL_MAT_ID) | MatMulIdType::NONE, DEFAULT, SUBGROUP |
The type_names vector in vulkan-shaders-gen.cpp lists all 23 supported types. For each type, both _f32 and _f16 B-matrix variants are generated. Integer dot product (mul_mmq.comp, enabled by GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT) only applies to legacy quants, k-quants, and mxfp4.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp45-69 ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp427-620
Specialization constants allow a single compiled SPIR-V module to serve multiple configurations. For mul_mm.comp:
| Constant ID | Name | Purpose |
|---|---|---|
| 0 | BLOCK_SIZE | Total workgroup thread count |
| 1 | BM | Output tile M size per workgroup |
| 2 | BN | Output tile N size per workgroup |
| 3 | BK | K-dimension chunk (only for quant types) |
| 4 | WM | Warp tile M (WMMA sub-tile) |
| 5 | WN | Warp tile N |
| 6 | WMITER | Warp M iterations |
| 7 | TM | Thread tile M |
| 8 | TN | Thread tile N |
| 9 | TK | K tile for cooperative matrix |
| 10 | WARP | Warp (subgroup) size |
For mul_mm_cm2.comp, constant IDs 0–3 cover BLOCK_SIZE, BM, BN, BK, plus constant ID 4 (enable_smaller_matrices) which activates BN/2 and BN/4 tail-handling paths for the last N tiles.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp103-112 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp28-35
Diagram: Compute Operation Dispatch Sequence
Each tensor operation selects an appropriate pipeline, binds buffers via descriptor sets, configures push constants with operation parameters, and dispatches workgroups.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp1-100
The mul_mm.comp kernel implements a tiled blocked matrix multiplication:
- Each workgroup computes a BM × BN output tile, indexed by ir and ic from gl_WorkGroupID.
- Threads cooperatively load BM × BK (A) and BN × BK (B) tiles into the buf_a and buf_b shared memory arrays.
- Each thread accumulates a TM × TN result tile in sums[] registers.
- The K loop iterates in steps of BK, advancing the pos_a and pos_b pointers.
- Finally, threads write sums[] to the output buffer, bounds-checking against p.M and p.N.

For COOPMAT variants (mul_mm.comp with #define COOPMAT), the kernel uses coopMatLoad, coopMatMulAdd, and coopMatStore with subgroup-scoped cooperative matrices (gl_ScopeSubgroup, tiles TM × TK and TK × TN).
For mul_mm_cm2.comp, workgroup-scoped tensors (gl_ScopeWorkgroup, tiles BM × BK and BK × BN) are loaded via coopMatLoadTensorNV with optional dequantization decode functions (DECODEFUNCA). The fast path detects aligned strides and avoids boundary clamping.
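The tiling scheme can be checked against a plain CPU reference, with loops standing in for workgroups and threads (`matmul_tiled` is illustrative, not backend code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU reference of the tiling scheme: the output is covered by BM x BN tiles,
// and each tile accumulates over the K dimension in BK-sized chunks, mirroring
// the shared-memory staging loop in the shader. The std::min bounds checks
// play the role of the shader's clamping against p.M and p.N.
std::vector<float> matmul_tiled(const std::vector<float>& A,  // M x K, row-major
                                const std::vector<float>& B,  // K x N, row-major
                                int M, int N, int K,
                                int BM, int BN, int BK) {
    std::vector<float> C(M * N, 0.0f);
    for (int ir = 0; ir < M; ir += BM)                   // one "workgroup" per tile
    for (int ic = 0; ic < N; ic += BN)
    for (int kk = 0; kk < K; kk += BK)                   // K-chunk loop
    for (int i = ir; i < std::min(ir + BM, M); ++i)      // bounds-checked writes
    for (int j = ic; j < std::min(ic + BN, N); ++j)
    for (int k = kk; k < std::min(kk + BK, K); ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}
```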
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp140-458 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp199-560
The backend supports all GGML quantization formats through two complementary pipelines: the dequantization pipeline (copy_from_quant.comp) converts quantized blocks back to floating point, and the quantization pipeline (copy_to_quant.comp) converts floating-point data into quantized blocks (for example, quantizing activations to Q8_1 for the integer-dot matmul path).
Diagram: Quantization Data Flow
Dequantization is often fused with matrix multiplication to avoid materializing the full dequantized matrix. Standalone dequantization is used for operations that don't support fused dequantization.
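As an illustration of the Q8_1 target format used for activations, here is a simplified block quantizer following the GGML scheme (32 values per block); FP32 fields are used here instead of the packed FP16 layout GGML actually stores:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of Q8_1-style block quantization: one scale per 32 values, plus the
// precomputed sum term that integer-dot kernels consume. Field packing and
// half-precision storage are omitted for clarity.
struct BlockQ8_1 {
    float d;        // scale
    float s;        // d * sum(qs), consumed by q8_1 dot products
    int8_t qs[32];  // quantized values
};

BlockQ8_1 quantize_q8_1(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8_1 b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        b.qs[i] = (int8_t)std::lround(x[i] * id);  // round to nearest int8
        sum += b.qs[i];
    }
    b.s = b.d * sum;
    return b;
}

float dequantize_q8_1(const BlockQ8_1& b, int i) { return b.qs[i] * b.d; }
```

The round-trip error is bounded by half the scale, which is why per-block scales keep quantization loss small.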
Sources: ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp1-297 ggml/src/ggml-vulkan/vulkan-shaders/copy_from_quant.comp1-52
Matrix-vector operations use optimized kernels in mul_mat_vec.comp:
| Mode | Description |
|---|---|
| Default | Uses shared memory reduction |
| Subgroup | Uses subgroup shuffle operations |
| Subgroup No Shmem | Uses only subgroup operations without shared memory |
The kernel processes multiple rows in parallel and uses different reduction strategies based on hardware subgroup capabilities.
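The default shared-memory reduction can be modeled on the CPU as a halving tree; each array slot stands in for one thread's partial dot product, and each outer step corresponds to a barrier in the shader:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree reduction over BLOCK_SIZE partial sums (size must be a power of two).
// Each halving step mirrors one barrier-separated round in the shader's
// shared-memory reduction; subgroup variants replace this with shuffles.
float block_reduce_sum(std::vector<float> partial) {
    for (size_t stride = partial.size() / 2; stride > 0; stride /= 2) {
        for (size_t tid = 0; tid < stride; ++tid) {
            partial[tid] += partial[tid + stride];
        }
    }
    return partial[0];
}
```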
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp1-170
Flash attention pipelines are keyed by vk_fa_pipeline_state in the pipeline_flash_attn_f32_f16[GGML_TYPE_COUNT] map on vk_device_struct. There are three code paths, selected by the FaCodePath enum:
FaCodePath Enum Value | Source File | Extension Required |
|---|---|---|
FA_SCALAR | flash_attn.comp | None (subgroup ops optional) |
FA_COOPMAT1 | flash_attn_cm1.comp | GL_KHR_cooperative_matrix |
FA_COOPMAT2 | flash_attn_cm2.comp | GL_NV_cooperative_matrix2 |
All three paths share push constants and specialization constants defined in flash_attn_base.glsl. vk_fa_pipeline_state captures every dimension that requires a distinct pipeline:
vk_fa_pipeline_state Field | flash_attn_base.glsl Constant ID | Description |
|---|---|---|
HSK, HSV | 3, 4 | Head size for keys / values |
Br, Bc | 1, 2 | Query tile rows / KV tile cols |
D_split, row_split | 6, 7 | Thread tiling dimensions |
shmem_staging | 9 (SHMEM_STAGING) | Stage K/V through shared memory |
path | — | FA_SCALAR/FA_COOPMAT1/FA_COOPMAT2 |
workgroup_size, subgroup_size | 0, 8 | Dispatch dimensions |
aligned | 5 (Clamp) | Boundary clamping needed |
f32acc | — | FP32 vs FP16 accumulator |
flags | 10 | USE_MASK_OPT, MASK_ENABLE, LOGIT_SOFTCAP |
limit_occupancy_shmem | 11 | Shared memory occupancy limiter |
Supported key/value tensor types per path: FA_SCALAR and FA_COOPMAT1 support f16, q4_0, q8_0, and f32; FA_COOPMAT2 additionally supports all types enumerated in type_names (see vulkan-shaders-gen.cpp).
A separate pipeline_flash_attn_split_k_reduce handles the split-K reduction pass (flash_attn_split_k_reduce.comp), summing partial O, L, and M values from multiple split-K tiles.
Mask Optimization Pre-pass
pipeline_fa_mask_opt (keyed by {Br, Bc}) runs flash_attn_mask_opt.comp as a pre-pass. For each Br × Bc mask tile it writes a 2-bit code (MASK_OPT_ALL_NEG_INF = 1, MASK_OPT_ALL_ZERO = 2) into an output buffer. The main flash attention shader then uses these codes to skip tiles entirely.
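The per-tile classification can be sketched as follows; the code values mirror the names above, while `classify_mask_tile` itself is an illustrative helper:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

constexpr unsigned MASK_OPT_ALL_NEG_INF = 1;
constexpr unsigned MASK_OPT_ALL_ZERO    = 2;

// Scan one Br x Bc mask tile and emit its 2-bit code. An all -inf tile lets
// the main shader skip the tile entirely; an all-zero tile lets it skip the
// mask addition. Mixed tiles take the full path.
unsigned classify_mask_tile(const std::vector<float>& tile) {
    bool all_neg_inf = true, all_zero = true;
    for (float m : tile) {
        if (!(std::isinf(m) && m < 0)) all_neg_inf = false;
        if (m != 0.0f) all_zero = false;
    }
    if (all_neg_inf) return MASK_OPT_ALL_NEG_INF;
    if (all_zero)    return MASK_OPT_ALL_ZERO;
    return 0;
}
```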
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp399-421 ggml/src/ggml-vulkan/ggml-vulkan.cpp827-831 ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp625-676 ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl1-66 ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_mask_opt.comp1-55
Buffer allocation abstracts Vulkan memory management:
Diagram: Buffer Allocation Flow
The vk_buffer_struct wraps Vulkan buffer and memory objects, tracking size, memory properties, and mapped pointers.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp888-907
The vk_subbuffer structure provides lightweight views into larger buffers:
Subbuffers enable efficient descriptor set binding without creating separate buffer objects.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp909-918
The backend manages different memory types based on usage:
| Memory Type | Flags | Usage |
|---|---|---|
| Device-Local | DEVICE_LOCAL_BIT | GPU-resident tensors |
| Host-Visible | HOST_VISIBLE_BIT \| HOST_COHERENT_BIT | Staging buffers for host-device transfers |
| Host-Cached | HOST_VISIBLE_BIT \| HOST_CACHED_BIT | Buffers read back by the CPU |
UMA (Unified Memory Architecture) systems use host-visible device-local memory for zero-copy access.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp568-872 (memory property selection logic)
The device maintains a pinned_memory vector tracking imported host memory:
This enables external memory import on systems supporting VK_EXT_external_memory_host.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp834
Each vk_queue has an associated vk_command_pool:
Diagram: Command Pool Structure
Command pools are marked with VK_COMMAND_POOL_CREATE_TRANSIENT_BIT for short-lived command buffers that are reset frequently.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp192-201 ggml/src/ggml-vulkan/ggml-vulkan.cpp874-886
The backend uses a global queue_mutex to prevent simultaneous submissions to the same queue:
This prevents race conditions when multiple backend contexts share the same device.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp206
Expert-parallel matrix multiplication (MUL_MAT_ID) dispatches work based on expert assignment:
Diagram: Expert-Parallel Matrix Multiplication Flow
Workgroups process tokens assigned to a specific expert. The load_row_ids function uses subgroup ballots to efficiently filter and load token-to-expert assignments into shared memory.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp69-72 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp130-192
The backend recognizes and fuses four graph patterns in MoE routing. Each pattern is defined as a list of ggml_op sequences plus edge constraints (src-to-node relationships) that must hold for the match to apply.
topk_moe_mode Value | Fused GGML Op Sequence | pipeline_topk_moe Index |
|---|---|---|
TOPK_MOE_EARLY_SOFTMAX_NORM | SOFT_MAX → RESHAPE → ARGSORT → VIEW → GET_ROWS → RESHAPE → SUM_ROWS → CLAMP → DIV → RESHAPE | 0 |
TOPK_MOE_SIGMOID_NORM_BIAS | UNARY(sigmoid) → RESHAPE → ADD → ARGSORT → VIEW → GET_ROWS → RESHAPE → SUM_ROWS → CLAMP → DIV → RESHAPE | 1 |
TOPK_MOE_EARLY_SOFTMAX | SOFT_MAX → RESHAPE → ARGSORT → VIEW → GET_ROWS | 2 |
TOPK_MOE_LATE_SOFTMAX | ARGSORT → VIEW → GET_ROWS → RESHAPE → SOFT_MAX → RESHAPE | 3 |
Edge constraints are encoded in topk_moe_*_edges constexpr arrays as {dst_node, src_slot, src_node} triples. The pipeline_topk_moe[num_topk_moe_pipelines][2] array provides two sub-variants: index [0] uses a specialization constant for n_experts, index [1] uses a push constant.
These fusions reduce kernel launch overhead and intermediate memory traffic in MoE models like Mixtral and DeepSeek.
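Sequence-plus-edge-constraint matching can be modeled like this; `Node`, `Edge`, and `match_pattern` are illustrative stand-ins for the backend's graph structures:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Op { SOFT_MAX, RESHAPE, ARGSORT, VIEW, GET_ROWS };

struct Node { Op op; int src[2] = {-1, -1}; };  // src holds producer node indices
struct Edge { int dst_node, src_slot, src_node; };

// Ops are matched positionally against the candidate node list, then each
// {dst_node, src_slot, src_node} triple verifies that dst's input in that
// slot really is the earlier node — both must hold for the fusion to apply.
bool match_pattern(const std::vector<Node>& g, const std::vector<Op>& ops,
                   const std::vector<Edge>& edges) {
    if (g.size() < ops.size()) return false;
    for (size_t i = 0; i < ops.size(); ++i)
        if (g[i].op != ops[i]) return false;
    for (const Edge& e : edges)
        if (g[e.dst_node].src[e.src_slot] != e.src_node) return false;
    return true;
}
```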
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp459-555
The Vulkan backend is enabled with the GGML_VULKAN=ON CMake option.
The build system locates the Vulkan SDK via find_package(Vulkan REQUIRED) and links against Vulkan libraries.
The shader generator runs at build time:
- CMake builds the vulkan-shaders-gen executable.
- vulkan-shaders-gen invokes glslc to compile each GLSL shader to SPIR-V.
- The SPIR-V binaries are embedded in ggml-vulkan-shaders.hpp as byte arrays.

The generator also creates dependency files (.d) tracking shader dependencies.
This ensures incremental rebuilds when shaders change.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp340-348
The Vulkan backend implements ggml_backend_buffer_type_i:
| Function | Implementation |
|---|---|
get_name | Returns device name |
alloc_buffer | Allocates device memory |
get_alignment | Returns buffer alignment (16B) |
get_max_size | Returns max_buffer_size |
get_alloc_size | Computes allocation size for tensor |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp232-239
The backend implements ggml_backend_i callbacks:
| Callback | Purpose |
|---|---|
get_name | Returns "Vulkan" + device index |
free | Destroys context and resources |
set_tensor_async | Asynchronous tensor upload |
get_tensor_async | Asynchronous tensor download |
cpy_tensor_async | Device-to-device copy |
synchronize | Wait for operations to complete |
graph_compute | Execute computation graph |
supports_op | Check operation support |
offload_op | Determine if op should use backend |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp232-239 (interface definition)
Compiled pipelines are expensive to create. The backend maintains a registry of all pipelines and lazily compiles them on first use:

- The needed flag marks pipelines required by the workload.
- The compiled flag prevents redundant compilation.

Optimal workgroup sizes vary by architecture; the backend uses the subgroup_size property to select appropriate workgroup configurations.
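The needed/compiled bookkeeping can be modeled with two atomic flags; `compile_needed` is an illustrative pass with the actual SPIR-V work elided:

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <vector>

// Minimal model of the handshake: a planning pass flags the pipelines a graph
// requires, then a compile pass builds only those. The atomic exchange on
// `compiled` guarantees the expensive work happens at most once per pipeline
// even if several threads race through the registry.
struct Pipeline {
    std::atomic<bool> needed{false};
    std::atomic<bool> compiled{false};
};

int compile_needed(std::vector<std::shared_ptr<Pipeline>>& all) {
    int compiled_now = 0;
    for (auto& p : all) {
        if (p->needed.load() && !p->compiled.exchange(true)) {
            ++compiled_now;  // actual SPIR-V -> vk::Pipeline creation goes here
        }
    }
    return compiled_now;
}
```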
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp591
Buffers use 16-byte alignment for optimal memory access:
Aligned access patterns enable vectorized loads and coalesced memory transactions.
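Allocation sizes are rounded up with the standard align-up computation (valid when the alignment is a power of two); 16 here matches the backend's buffer alignment:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of align (align must be a power of two).
size_t align_up(size_t n, size_t align) { return (n + align - 1) & ~(align - 1); }
```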
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp96
Large matrix multiplications use split-K parallelization, dividing the K dimension across multiple workgroups and reducing results in a separate kernel:
Diagram: Split-K Matrix Multiplication
The pipeline_matmul_split_k_reduce kernel sums partial results from split-K computation.
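The reduce pass amounts to an elementwise sum over the partial buffers; this CPU analogue assumes each split wrote a full M×N partial result:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum k_splits partial result buffers elementwise into the final output,
// mirroring what the split-K reduce kernel does across workgroups.
std::vector<float> split_k_reduce(const std::vector<std::vector<float>>& partials) {
    std::vector<float> out(partials.at(0).size(), 0.0f);
    for (const auto& part : partials)
        for (size_t i = 0; i < out.size(); ++i) out[i] += part[i];
    return out;
}
```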
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp674
The backend enforces limits based on device capabilities:
- max_buffer_size: maximum single buffer size
- max_memory_allocation_size: maximum single memory allocation

Large tensors may require splitting or tiling.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp574-575
Vulkan compute operations are asynchronous. The backend provides synchronization primitives:
- ggml_vk_synchronize(): wait for all outstanding operations to complete

Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp919-932
Descriptor pools have fixed sizes. Complex graphs may exhaust descriptor sets, requiring pool recreation or larger pool sizes.