The Vulkan backend provides cross-platform GPU acceleration for GGML tensor operations using the Vulkan API. It enables compute shader-based inference on GPUs from multiple vendors (NVIDIA, AMD, Intel) across Windows, Linux, macOS (via MoltenVK), Android, and other platforms that support Vulkan.
For vendor-specific backends optimized for particular GPU families, see CUDA Backend (NVIDIA), Metal Backend (Apple), and HIP and Other GPU Backends. For the backend abstraction system that coordinates multiple backends, see Backend System and Registration.
The Vulkan backend is structured as a compute-focused implementation that compiles GLSL shaders to SPIR-V and executes them via Vulkan compute pipelines. The backend integrates with GGML's backend abstraction system and provides implementations for matrix multiplication, quantization, element-wise operations, and specialized operations like flash attention.
Diagram: Vulkan Backend Component Architecture
The backend is organized into several key layers: the backend interface implementation, device management with queue and pipeline resources, a shader compilation system, and memory management for GPU buffers.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp116-872 ggml/src/ggml-vulkan/ggml-vulkan.cpp568-872
vk_device_struct represents a single Vulkan physical+logical device. It is reference-counted via std::shared_ptr<vk_device_struct> (vk_device). Key fields:
| Field | Type | Purpose |
|---|---|---|
physical_device | vk::PhysicalDevice | Physical device handle |
device | vk::Device | Logical device |
compute_queue | vk_queue | Compute + graphics queue |
transfer_queue | vk_queue | DMA transfer queue (may equal compute) |
single_queue | bool | True when compute and transfer share a queue |
dsl | vk::DescriptorSetLayout | Shared descriptor set layout |
all_pipelines | vector<vk_pipeline_ref> | Weak refs to all allocated pipelines |
mutex | recursive_mutex | Guards pipeline compilation and submission |
fence | vk::Fence | Per-device synchronization fence |
sync_staging | vk_buffer | Staging buffer for synchronous transfers |
architecture | vk_device_architecture | Detected GPU microarchitecture |
pinned_memory | vector<tuple<void*, size_t, vk_buffer>> | Imported host memory pages |
vk_queue holds a vk::Queue handle and an associated vk_command_pool. vk_command_pool wraps a vk::CommandPool plus a vector<vk::CommandBuffer> pool (indexed by cmd_buffer_idx), allocated with VK_COMMAND_POOL_CREATE_TRANSIENT_BIT.
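The buffer-recycling scheme can be sketched on the CPU; `FakeCmdBuffer`, `acquire`, and `reset` are illustrative stand-ins (the real pool hands out vk::CommandBuffer handles and resets them through the Vulkan API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for vk::CommandBuffer; real code allocates from the
// Vulkan device. This sketch shows only the index-recycling idea: buffers are
// handed out sequentially via cmd_buffer_idx and all become reusable at once
// after a pool reset (analogous to vkResetCommandPool on a transient pool).
struct FakeCmdBuffer { int id; };

struct CmdPool {
    std::vector<FakeCmdBuffer> buffers;  // grows on demand, never shrinks
    size_t cmd_buffer_idx = 0;           // next buffer to hand out
    int next_id = 0;

    FakeCmdBuffer& acquire() {
        if (cmd_buffer_idx == buffers.size()) {
            buffers.push_back(FakeCmdBuffer{next_id++});  // allocate a new one
        }
        return buffers[cmd_buffer_idx++];
    }
    void reset() { cmd_buffer_idx = 0; }  // every buffer is reusable again
};
```

Reusing command buffers this way avoids per-submission allocation, which is why the pool is created with the transient bit.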
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp192-226 ggml/src/ggml-vulkan/ggml-vulkan.cpp570-877
The get_device_architecture() function maps a vk::PhysicalDevice to a vk_device_architecture enum value by examining vendor ID, available extensions, and device properties:
Architecture Detection Logic
vk_device_architecture | Vendor | Detection Criteria |
|---|---|---|
AMD_GCN | AMD | maxSubgroupSize == minSubgroupSize == 64 |
AMD_RDNA1 | AMD | wavefrontsPerSimd == 20, min=32/max=64 |
AMD_RDNA2 | AMD | RDNA wavefront, no int dot 4x8 mixed acceleration |
AMD_RDNA3 | AMD | integerDotProduct4x8BitPackedMixedSignednessAccelerated |
INTEL_XE2 | Intel | minSubgroupSize == 16 (SIMD16 vs legacy SIMD8) |
NVIDIA_PRE_TURING | NVIDIA | No VK_KHR_cooperative_matrix extension |
NVIDIA_TURING | NVIDIA | Cooperative matrix + shaderWarpsPerSM == 32 |
OTHER | Any | Default / extensions absent |
Extensions queried during AMD detection: VK_AMD_shader_core_properties, VK_KHR_shader_integer_dot_product, VK_EXT_subgroup_size_control. Extensions queried during NVIDIA detection: VK_KHR_cooperative_matrix, VK_NV_shader_sm_builtins.
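As a rough illustration, the AMD branch of the decision tree can be written as a CPU-testable function; the field names and thresholds follow the detection criteria in the table above, not the exact source:

```cpp
#include <cassert>
#include <cstdint>

enum class Arch { AMD_GCN, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3, OTHER };

// Simplified sketch of the AMD branch of get_device_architecture().
struct AmdProps {
    uint32_t min_subgroup_size;
    uint32_t max_subgroup_size;
    uint32_t wavefronts_per_simd;        // from VK_AMD_shader_core_properties
    bool int_dot_4x8_mixed_accelerated;  // from VK_KHR_shader_integer_dot_product
};

Arch detect_amd(const AmdProps& p) {
    // GCN: fixed wave64 (min == max == 64)
    if (p.min_subgroup_size == 64 && p.max_subgroup_size == 64) return Arch::AMD_GCN;
    // RDNA: wave32/wave64 capable
    if (p.min_subgroup_size == 32 && p.max_subgroup_size == 64) {
        if (p.wavefronts_per_simd == 20) return Arch::AMD_RDNA1;
        return p.int_dot_4x8_mixed_accelerated ? Arch::AMD_RDNA2 == Arch::AMD_RDNA2
                   ? Arch::AMD_RDNA3 : Arch::AMD_RDNA2
                   : Arch::AMD_RDNA2;
    }
    return Arch::OTHER;
}
```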
Diagram: get_device_architecture() Decision Tree
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp250-371
Key boolean fields populated on vk_device_struct during device initialization:
| Field | Purpose |
|---|---|
coopmat_support, coopmat2 | Cooperative matrix (tensor core) paths |
coopmat_acc_f32_support, coopmat_acc_f16_support | Accumulator precision available |
coopmat_bf16_support, coopmat_int_support | BF16 and integer cooperative matrix |
coopmat1_fa_support | CoopMat1 flash attention path available |
bf16 | BF16 scalar shader type support |
integer_dot_product | Integer dot product instructions for quantized matmul |
shader_64b_indexing | 64-bit buffer indexing (large tensors) |
subgroup_arithmetic, subgroup_shuffle, subgroup_ballot, subgroup_clustered | Subgroup operation availability |
flash_attention_fp16 | FP16 flash attention path enabled |
uma | Unified memory architecture (zero-copy host buffers) |
external_memory_host | VK_EXT_external_memory_host for pinned memory |
pipeline_robustness | Robust pipeline access |
add_rms_fusion | Fused add+RMS norm enabled |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp580-648
vk_pipeline_struct wraps a single compute pipeline. Key fields:
| Field | Type | Purpose |
|---|---|---|
shader_module | vk::ShaderModule | Compiled SPIR-V module |
layout | vk::PipelineLayout | Push constant + descriptor set layout |
pipeline | vk::Pipeline | Vulkan compute pipeline handle |
push_constant_size | uint32_t | Size of push constants block |
parameter_count | uint32_t | Number of descriptor bindings |
wg_denoms | array<uint32_t, 3> | Workgroup size denominators for dispatch |
needed | atomic<bool> | Set true when pipeline is required |
compiled | atomic<bool> | Set true once compilation completes |
register_count | uint32_t | From pipeline executable properties |
next | vk_pipeline | Linked list for 64-bit indexing variant |
vk_matmul_pipeline_struct groups six related vk_pipeline handles (l, m, s + aligned variants). vk_matmul_pipeline2 pairs f32acc and f16acc vk_matmul_pipeline instances.
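Dispatch sizing from wg_denoms follows the usual ceiling-division pattern; this sketch assumes each denominator gives the number of output elements one workgroup covers per dimension:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Number of workgroups to dispatch per dimension: ceil(extent / denominator).
// Illustrative helper; the backend computes this when recording the dispatch.
std::array<uint32_t, 3> dispatch_counts(std::array<uint32_t, 3> extent,
                                        std::array<uint32_t, 3> wg_denoms) {
    std::array<uint32_t, 3> wg{};
    for (int i = 0; i < 3; ++i) {
        wg[i] = (extent[i] + wg_denoms[i] - 1) / wg_denoms[i];  // ceiling division
    }
    return wg;
}
```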
Diagram: vk_pipeline_struct Lazy Compilation Flow
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp122-165
Matrix multiplication uses specialized pipeline variants based on matrix dimensions and data types. Each vk_matmul_pipeline_struct holds l, m, s (large/medium/small) and a_l, a_m, a_s (aligned) vk_pipeline handles. vk_matmul_pipeline2 bundles f32acc and f16acc accumulator variants.
Field on vk_device_struct | Type | Purpose |
|---|---|---|
pipeline_matmul_f32 | vk_matmul_pipeline | FP32 × FP32 → FP32 |
pipeline_matmul_f32_f16 | vk_matmul_pipeline | FP32 × FP16 → FP32 |
pipeline_matmul_bf16 | vk_matmul_pipeline | BF16 × BF16 → FP32 |
pipeline_matmul_f16 | vk_matmul_pipeline2 | FP16 × FP16, f32acc + f16acc |
pipeline_dequant_mul_mat_mat[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Fused dequant+matmul, all quant types |
pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Integer dot product (Q8_1 B matrix) |
pipeline_matmul_id_f32 | vk_matmul_pipeline | Expert-parallel FP32 (MoE) |
pipeline_dequant_mul_mat_mat_id[GGML_TYPE_COUNT] | vk_matmul_pipeline2 | Expert-parallel dequant+matmul |
pipeline_matmul_split_k_reduce | vk_pipeline | Reduce split-K partial sums |
pipeline_quantize_q8_1_x4 | vk_pipeline | Quantize activations to Q8_1 |
The l/m/s suffixes select tile size configurations (large/medium/small). The a_* variants require aligned memory layouts and enable vectorized loads.
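A hypothetical selector illustrates the idea; the real heuristics depend on device limits and per-shader alignment values, so the thresholds below are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

enum class Tile { S, M, L };

// Illustrative variant selection: small matrices waste work on large tiles,
// so pick a tile class from the output dimensions. Thresholds are made up.
Tile pick_tile(uint32_t m, uint32_t n) {
    if (m <= 32 || n <= 32) return Tile::S;  // small tail matrices
    if (m <= 64 || n <= 64) return Tile::M;
    return Tile::L;
}

// Aligned (a_*) variants require the K extent (and strides) to be a multiple
// of the shader's load alignment, enabling vectorized loads.
bool use_aligned(uint32_t k, uint32_t align) { return k % align == 0; }
```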
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp656-680
For GPUs supporting cooperative matrices, the backend compiles additional pipeline variants:
- CoopMat1 (coopmat=true, coopmat2=false): uses GL_KHR_cooperative_matrix. Compiled when GGML_VULKAN_COOPMAT_GLSLC_SUPPORT is defined. Shader source: mul_mm.comp with #define COOPMAT. Device capability tracked by coopmat_support, coopmat_m/n/k, coopmat_acc_f32_support, and coopmat_acc_f16_support.
- CoopMat2 (coopmat2=true): uses GL_NV_cooperative_matrix2 with tensor layout intrinsics (coopMatLoadTensorNV, coopMatStoreTensorNV). Compiled when GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT is defined. Shader source: mul_mm_cm2.comp. Device capability tracked by coopmat2.

Each path provides f32acc (default) and f16acc accumulator variants; f16acc is only used when the device reports coopmat_acc_f16_support. The matmul_shaders() function in vulkan-shaders-gen.cpp controls which combinations are compiled via the coopmat and coopmat2 boolean parameters.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp427-621 ggml/src/ggml-vulkan/ggml-vulkan.cpp627-645 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp16-25 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp10-14
Shader compilation is driven at build time by the vulkan-shaders-gen executable (from vulkan-shaders-gen.cpp). The string_to_spv() function submits individual compilations; string_to_spv_func() invokes glslc as a child process.
Diagram: Shader Build Pipeline (vulkan-shaders-gen.cpp → ggml-vulkan-shaders.hpp)
Key behaviors:
- Shaders requiring cooperative matrix 2 (_cm2 in name) use --target-env=vulkan1.3; all others use vulkan1.2.
- Optimization (-O, spirv-opt) is disabled for coopmat, bf16, and rope shaders to avoid known compiler bugs.
- Dependency files (.d) are generated for incremental CMake rebuilds.
- Compilation runs up to min(16, hardware_concurrency) concurrent glslc processes via the compile slot semaphore.

Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp296-425
The matmul_shaders() function in vulkan-shaders-gen.cpp generates matrix multiplication variants. Key combinatorial dimensions:
| Dimension | Values |
|---|---|
| Input A Type | f32, f16, bf16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_k, q3_k, q4_k, q5_k, q6_k, iq1_s, iq1_m, iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq3_s, iq4_xs, iq4_nl, mxfp4 |
| Input B Type | f32, f16; optionally q8_1 for integer dot product paths |
| Accumulation | f32acc, f16acc |
| Implementation | Scalar FP32, Scalar FP16, CoopMat1 (cm1), CoopMat2 (cm2) |
| Alignment | Aligned (ALIGNED=1), Unaligned |
| Expert Parallel (MUL_MAT_ID) | MatMulIdType::NONE, DEFAULT, SUBGROUP |
The type_names vector in vulkan-shaders-gen.cpp lists all 23 supported types. For each type, both _f32 and _f16 B-matrix variants are generated. Integer dot product (mul_mmq.comp, enabled by GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT) only applies to legacy quants, k-quants, and mxfp4.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp45-69 ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp427-620
Specialization constants allow a single compiled SPIR-V module to serve multiple configurations. For mul_mm.comp:
| Constant ID | Name | Purpose |
|---|---|---|
| 0 | BLOCK_SIZE | Total workgroup thread count |
| 1 | BM | Output tile M size per workgroup |
| 2 | BN | Output tile N size per workgroup |
| 3 | BK | K-dimension chunk (only for quant types) |
| 4 | WM | Warp tile M (WMMA sub-tile) |
| 5 | WN | Warp tile N |
| 6 | WMITER | Warp M iterations |
| 7 | TM | Thread tile M |
| 8 | TN | Thread tile N |
| 9 | TK | K tile for cooperative matrix |
| 10 | WARP | Warp (subgroup) size |
For mul_mm_cm2.comp, constant IDs 0–3 cover BLOCK_SIZE, BM, BN, BK, plus constant ID 4 (enable_smaller_matrices) which activates BN/2 and BN/4 tail-handling paths for the last N tiles.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp103-112 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp28-35
Diagram: Compute Operation Dispatch Sequence
Each tensor operation selects an appropriate pipeline, binds buffers via descriptor sets, configures push constants with operation parameters, and dispatches workgroups.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp1-100
The mul_mm.comp kernel implements a tiled blocked matrix multiplication:
- Each workgroup computes a BM × BN output tile, indexed by ir and ic from gl_WorkGroupID.
- Threads cooperatively load BM × BK (A) and BN × BK (B) tiles into the buf_a and buf_b shared memory arrays.
- Each thread accumulates a TM × TN result tile in sums[] registers.
- The K loop iterates in steps of BK, advancing the pos_a and pos_b pointers.
- Finally, threads write sums[] to the output buffer, bounds-checking against p.M and p.N.

For COOPMAT variants (mul_mm.comp with #define COOPMAT), the kernel uses coopMatLoad, coopMatMulAdd, and coopMatStore with subgroup-scoped cooperative matrices (gl_ScopeSubgroup, tiles TM × TK and TK × TN).
For mul_mm_cm2.comp, workgroup-scoped tensors (gl_ScopeWorkgroup, tiles BM × BK and BK × BN) are loaded via coopMatLoadTensorNV with optional dequantization decode functions (DECODEFUNCA). The fast path detects aligned strides and avoids boundary clamping.
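The tiling scheme can be checked against a plain CPU reference, with loops standing in for workgroups and threads (`matmul_tiled` is illustrative, not backend code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU reference of the tiling scheme: the output is covered by BM x BN tiles,
// and each tile accumulates over the K dimension in BK-sized chunks, mirroring
// the shared-memory staging loop in the shader. The std::min bounds checks
// play the role of the shader's clamping against p.M and p.N.
std::vector<float> matmul_tiled(const std::vector<float>& A,  // M x K, row-major
                                const std::vector<float>& B,  // K x N, row-major
                                int M, int N, int K,
                                int BM, int BN, int BK) {
    std::vector<float> C(M * N, 0.0f);
    for (int ir = 0; ir < M; ir += BM)                   // one "workgroup" per tile
    for (int ic = 0; ic < N; ic += BN)
    for (int kk = 0; kk < K; kk += BK)                   // K-chunk loop
    for (int i = ir; i < std::min(ir + BM, M); ++i)      // bounds-checked writes
    for (int j = ic; j < std::min(ic + BN, N); ++j)
    for (int k = kk; k < std::min(kk + BK, K); ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}
```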
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp140-458 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp199-560
The backend supports all GGML quantization formats through two complementary pipelines: the dequantization pipeline (copy_from_quant.comp) converts quantized blocks back to floating point, and the quantization pipeline (copy_to_quant.comp) converts floating-point data into quantized blocks (for example, quantizing activations to Q8_1 for the integer-dot matmul path).
Diagram: Quantization Data Flow
Dequantization is often fused with matrix multiplication to avoid materializing the full dequantized matrix. Standalone dequantization is used for operations that don't support fused dequantization.
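As an illustration of the Q8_1 target format used for activations, here is a simplified block quantizer following the GGML scheme (32 values per block); FP32 fields are used here instead of the packed FP16 layout GGML actually stores:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of Q8_1-style block quantization: one scale per 32 values, plus the
// precomputed sum term that integer-dot kernels consume. Field packing and
// half-precision storage are omitted for clarity.
struct BlockQ8_1 {
    float d;        // scale
    float s;        // d * sum(qs), consumed by q8_1 dot products
    int8_t qs[32];  // quantized values
};

BlockQ8_1 quantize_q8_1(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8_1 b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        b.qs[i] = (int8_t)std::lround(x[i] * id);  // round to nearest int8
        sum += b.qs[i];
    }
    b.s = b.d * sum;
    return b;
}

float dequantize_q8_1(const BlockQ8_1& b, int i) { return b.qs[i] * b.d; }
```

The round-trip error is bounded by half the scale, which is why per-block scales keep quantization loss small.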
Sources: ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp1-297 ggml/src/ggml-vulkan/vulkan-shaders/copy_from_quant.comp1-52
Matrix-vector operations use optimized kernels in mul_mat_vec.comp:
| Mode | Description |
|---|---|
| Default | Uses shared memory reduction |
| Subgroup | Uses subgroup shuffle operations |
| Subgroup No Shmem | Uses only subgroup operations without shared memory |
The kernel processes multiple rows in parallel and uses different reduction strategies based on hardware subgroup capabilities.
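The default shared-memory reduction can be modeled on the CPU as a halving tree; each array slot stands in for one thread's partial dot product, and each outer step corresponds to a barrier in the shader:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree reduction over BLOCK_SIZE partial sums (size must be a power of two).
// Each halving step mirrors one barrier-separated round in the shader's
// shared-memory reduction; subgroup variants replace this with shuffles.
float block_reduce_sum(std::vector<float> partial) {
    for (size_t stride = partial.size() / 2; stride > 0; stride /= 2) {
        for (size_t tid = 0; tid < stride; ++tid) {
            partial[tid] += partial[tid + stride];
        }
    }
    return partial[0];
}
```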
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp1-170
Flash attention pipelines are keyed by vk_fa_pipeline_state in the pipeline_flash_attn_f32_f16[GGML_TYPE_COUNT] map on vk_device_struct. There are three code paths, selected by the FaCodePath enum:
FaCodePath Enum Value | Source File | Extension Required |
|---|---|---|
FA_SCALAR | flash_attn.comp | None (subgroup ops optional) |
FA_COOPMAT1 | flash_attn_cm1.comp | GL_KHR_cooperative_matrix |
FA_COOPMAT2 | flash_attn_cm2.comp | GL_NV_cooperative_matrix2 |
All three paths share push constants and specialization constants defined in flash_attn_base.glsl. vk_fa_pipeline_state captures every dimension that requires a distinct pipeline:
vk_fa_pipeline_state Field | flash_attn_base.glsl Constant ID | Description |
|---|---|---|
HSK, HSV | 3, 4 | Head size for keys / values |
Br, Bc | 1, 2 | Query tile rows / KV tile cols |
D_split, row_split | 6, 7 | Thread tiling dimensions |
shmem_staging | 9 (SHMEM_STAGING) | Stage K/V through shared memory |
path | — | FA_SCALAR/FA_COOPMAT1/FA_COOPMAT2 |
workgroup_size, subgroup_size | 0, 8 | Dispatch dimensions |
aligned | 5 (Clamp) | Boundary clamping needed |
f32acc | — | FP32 vs FP16 accumulator |
flags | 10 | USE_MASK_OPT, MASK_ENABLE, LOGIT_SOFTCAP |
limit_occupancy_shmem | 11 | Shared memory occupancy limiter |
Supported key/value tensor types per path: FA_SCALAR and FA_COOPMAT1 support f16, q4_0, q8_0, and f32; FA_COOPMAT2 additionally supports all types enumerated in type_names (see vulkan-shaders-gen.cpp).
A separate pipeline_flash_attn_split_k_reduce handles the split-K reduction pass (flash_attn_split_k_reduce.comp), summing partial O, L, and M values from multiple split-K tiles.
Mask Optimization Pre-pass
pipeline_fa_mask_opt (keyed by {Br, Bc}) runs flash_attn_mask_opt.comp as a pre-pass. For each Br × Bc mask tile it writes a 2-bit code (MASK_OPT_ALL_NEG_INF = 1, MASK_OPT_ALL_ZERO = 2) into an output buffer. The main flash attention shader then uses these codes to skip tiles entirely.
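The per-tile classification can be sketched as follows; the code values mirror the names above, while `classify_mask_tile` itself is an illustrative helper:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

constexpr unsigned MASK_OPT_ALL_NEG_INF = 1;
constexpr unsigned MASK_OPT_ALL_ZERO    = 2;

// Scan one Br x Bc mask tile and emit its 2-bit code. An all -inf tile lets
// the main shader skip the tile entirely; an all-zero tile lets it skip the
// mask addition. Mixed tiles take the full path.
unsigned classify_mask_tile(const std::vector<float>& tile) {
    bool all_neg_inf = true, all_zero = true;
    for (float m : tile) {
        if (!(std::isinf(m) && m < 0)) all_neg_inf = false;
        if (m != 0.0f) all_zero = false;
    }
    if (all_neg_inf) return MASK_OPT_ALL_NEG_INF;
    if (all_zero)    return MASK_OPT_ALL_ZERO;
    return 0;
}
```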
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp399-421 ggml/src/ggml-vulkan/ggml-vulkan.cpp827-831 ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp625-676 ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl1-66 ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_mask_opt.comp1-55
Buffer allocation abstracts Vulkan memory management:
Diagram: Buffer Allocation Flow
The vk_buffer_struct wraps Vulkan buffer and memory objects, tracking size, memory properties, and mapped pointers.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp888-907
The vk_subbuffer structure provides lightweight views into larger buffers:
Subbuffers enable efficient descriptor set binding without creating separate buffer objects.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp909-918
The backend manages different memory types based on usage:
| Memory Type | Flags | Usage |
|---|---|---|
| Device-Local | DEVICE_LOCAL_BIT | GPU-resident tensors |
| Host-Visible | HOST_VISIBLE_BIT \| HOST_COHERENT_BIT | Staging buffers for host-device transfers |
| Host-Cached | HOST_VISIBLE_BIT \| HOST_CACHED_BIT | Buffers read back by the CPU |
UMA (Unified Memory Architecture) systems use host-visible device-local memory for zero-copy access.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp568-872 (memory property selection logic)
The device maintains a pinned_memory vector tracking imported host memory:
This enables external memory import on systems supporting VK_EXT_external_memory_host.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp834
Each vk_queue has an associated vk_command_pool:
Diagram: Command Pool Structure
Command pools are marked with VK_COMMAND_POOL_CREATE_TRANSIENT_BIT for short-lived command buffers that are reset frequently.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp192-201 ggml/src/ggml-vulkan/ggml-vulkan.cpp874-886
The backend uses a global queue_mutex to prevent simultaneous submissions to the same queue:
This prevents race conditions when multiple backend contexts share the same device.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp206
Expert-parallel matrix multiplication (MUL_MAT_ID) dispatches work based on expert assignment:
Diagram: Expert-Parallel Matrix Multiplication Flow
Workgroups process tokens assigned to a specific expert. The load_row_ids function uses subgroup ballots to efficiently filter and load token-to-expert assignments into shared memory.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp69-72 ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp130-192
The backend recognizes and fuses four graph patterns in MoE routing. Each pattern is defined as a list of ggml_op sequences plus edge constraints (src-to-node relationships) that must hold for the match to apply.
topk_moe_mode Value | Fused GGML Op Sequence | pipeline_topk_moe Index |
|---|---|---|
TOPK_MOE_EARLY_SOFTMAX_NORM | SOFT_MAX → RESHAPE → ARGSORT → VIEW → GET_ROWS → RESHAPE → SUM_ROWS → CLAMP → DIV → RESHAPE | 0 |
TOPK_MOE_SIGMOID_NORM_BIAS | UNARY(sigmoid) → RESHAPE → ADD → ARGSORT → VIEW → GET_ROWS → RESHAPE → SUM_ROWS → CLAMP → DIV → RESHAPE | 1 |
TOPK_MOE_EARLY_SOFTMAX | SOFT_MAX → RESHAPE → ARGSORT → VIEW → GET_ROWS | 2 |
TOPK_MOE_LATE_SOFTMAX | ARGSORT → VIEW → GET_ROWS → RESHAPE → SOFT_MAX → RESHAPE | 3 |
Edge constraints are encoded in topk_moe_*_edges constexpr arrays as {dst_node, src_slot, src_node} triples. The pipeline_topk_moe[num_topk_moe_pipelines][2] array provides two sub-variants: index [0] uses a specialization constant for n_experts, index [1] uses a push constant.
These fusions reduce kernel launch overhead and intermediate memory traffic in MoE models like Mixtral and DeepSeek.
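Sequence-plus-edge-constraint matching can be modeled like this; `Node`, `Edge`, and `match_pattern` are illustrative stand-ins for the backend's graph structures:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Op { SOFT_MAX, RESHAPE, ARGSORT, VIEW, GET_ROWS };

struct Node { Op op; int src[2] = {-1, -1}; };  // src holds producer node indices
struct Edge { int dst_node, src_slot, src_node; };

// Ops are matched positionally against the candidate node list, then each
// {dst_node, src_slot, src_node} triple verifies that dst's input in that
// slot really is the earlier node — both must hold for the fusion to apply.
bool match_pattern(const std::vector<Node>& g, const std::vector<Op>& ops,
                   const std::vector<Edge>& edges) {
    if (g.size() < ops.size()) return false;
    for (size_t i = 0; i < ops.size(); ++i)
        if (g[i].op != ops[i]) return false;
    for (const Edge& e : edges)
        if (g[e.dst_node].src[e.src_slot] != e.src_node) return false;
    return true;
}
```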
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp459-555
The Vulkan backend is enabled with the GGML_VULKAN=ON CMake option.
The build system locates the Vulkan SDK via find_package(Vulkan REQUIRED) and links against Vulkan libraries.
The shader generator runs at build time:
- CMake builds the vulkan-shaders-gen executable.
- vulkan-shaders-gen invokes glslc to compile each GLSL shader to SPIR-V.
- The SPIR-V binaries are embedded in ggml-vulkan-shaders.hpp as byte arrays.

The generator also creates dependency files (.d) tracking shader dependencies.
This ensures incremental rebuilds when shaders change.
Sources: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp340-348
The Vulkan backend implements ggml_backend_buffer_type_i:
| Function | Implementation |
|---|---|
get_name | Returns device name |
alloc_buffer | Allocates device memory |
get_alignment | Returns buffer alignment (16B) |
get_max_size | Returns max_buffer_size |
get_alloc_size | Computes allocation size for tensor |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp232-239
The backend implements ggml_backend_i callbacks:
| Callback | Purpose |
|---|---|
get_name | Returns "Vulkan" + device index |
free | Destroys context and resources |
set_tensor_async | Asynchronous tensor upload |
get_tensor_async | Asynchronous tensor download |
cpy_tensor_async | Device-to-device copy |
synchronize | Wait for operations to complete |
graph_compute | Execute computation graph |
supports_op | Check operation support |
offload_op | Determine if op should use backend |
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp232-239 (interface definition)
Compiled pipelines are expensive to create. The backend maintains a registry of all pipelines and lazily compiles them on first use:

- The needed flag marks pipelines required by the workload.
- The compiled flag prevents redundant compilation.

Optimal workgroup sizes vary by architecture; the backend uses the subgroup_size property to select appropriate workgroup configurations.
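The needed/compiled bookkeeping can be modeled with two atomic flags; `compile_needed` is an illustrative pass with the actual SPIR-V work elided:

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <vector>

// Minimal model of the handshake: a planning pass flags the pipelines a graph
// requires, then a compile pass builds only those. The atomic exchange on
// `compiled` guarantees the expensive work happens at most once per pipeline
// even if several threads race through the registry.
struct Pipeline {
    std::atomic<bool> needed{false};
    std::atomic<bool> compiled{false};
};

int compile_needed(std::vector<std::shared_ptr<Pipeline>>& all) {
    int compiled_now = 0;
    for (auto& p : all) {
        if (p->needed.load() && !p->compiled.exchange(true)) {
            ++compiled_now;  // actual SPIR-V -> vk::Pipeline creation goes here
        }
    }
    return compiled_now;
}
```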
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp591
Buffers use 16-byte alignment for optimal memory access:
Aligned access patterns enable vectorized loads and coalesced memory transactions.
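Allocation sizes are rounded up with the standard align-up computation (valid when the alignment is a power of two); 16 here matches the backend's buffer alignment:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of align (align must be a power of two).
size_t align_up(size_t n, size_t align) { return (n + align - 1) & ~(align - 1); }
```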
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp96
Large matrix multiplications use split-K parallelization, dividing the K dimension across multiple workgroups and reducing results in a separate kernel:
Diagram: Split-K Matrix Multiplication
The pipeline_matmul_split_k_reduce kernel sums partial results from split-K computation.
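The reduce pass amounts to an elementwise sum over the partial buffers; this CPU analogue assumes each split wrote a full M×N partial result:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum k_splits partial result buffers elementwise into the final output,
// mirroring what the split-K reduce kernel does across workgroups.
std::vector<float> split_k_reduce(const std::vector<std::vector<float>>& partials) {
    std::vector<float> out(partials.at(0).size(), 0.0f);
    for (const auto& part : partials)
        for (size_t i = 0; i < out.size(); ++i) out[i] += part[i];
    return out;
}
```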
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp674
The backend enforces limits based on device capabilities:
- max_buffer_size: maximum single buffer size
- max_memory_allocation_size: maximum single memory allocation

Large tensors may require splitting or tiling.
Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp574-575
Vulkan compute operations are asynchronous. The backend provides synchronization primitives:
- ggml_vk_synchronize(): wait for all outstanding operations to complete

Sources: ggml/src/ggml-vulkan/ggml-vulkan.cpp919-932
Descriptor pools have fixed sizes. Complex graphs may exhaust descriptor sets, requiring pool recreation or larger pool sizes.