This document describes the Vulkan-based GPU acceleration system in ncnn. It covers the initialization and management of Vulkan instances and devices, the command recording and execution model, GPU memory allocation and CPU-GPU data transfer mechanisms, and how Vulkan compute layers are implemented.
For information about the core inference runtime that uses this GPU system, see Core Runtime Architecture. For details on CPU-specific layer optimizations, see CPU Layer Implementations. For platform-specific SIMD optimizations, see Platform-Specific Optimizations.
The Vulkan system provides GPU acceleration through compute shaders. The system is organized into four primary subsystems:
Vulkan System Component Map
Diagram: Vulkan System — Classes and Source Files
Sources: src/gpu.h1-548 src/command.h1-118 src/allocator.h208-397 src/pipeline.h1-65
The Vulkan instance is created once globally and shared across all GPU operations. The instance manages physical device enumeration and extension loading.
Diagram: Vulkan Instance Initialization Flow
The global instance holder is defined in src/gpu.cpp37-78:
| Global Variable | Type | Purpose |
|---|---|---|
| g_instance | __ncnn_vulkan_instance_holder | Stores VkInstance and initialization state |
| g_gpu_count | int | Number of available GPUs |
| g_default_gpu_index | int | Index of the default GPU |
| g_gpu_infos[] | GpuInfo*[32] | Array of GPU capability info |
| g_default_vkdev[] | VulkanDevice*[32] | Array of default device instances |
Sources: src/gpu.cpp32-89 src/gpu.h15-30
Each physical device's capabilities are queried and stored in a GpuInfo object. The GpuInfoPrivate class src/gpu.cpp253-408 performs comprehensive capability detection:
Diagram: GPU Capability Query Process
Key capability categories queried:
Sources: src/gpu.cpp253-859 src/gpu.h184-414
The VulkanDevice class src/gpu.h423-546 wraps a logical device and provides device-specific operations:
Diagram: VulkanDevice Initialization
The device provides key methods:
| Method | Purpose |
|---|---|
| compile_shader_module() | Compiles SPIR-V to VkShaderModule src/gpu.cpp2746-2846 |
| create_descriptorset_layout() | Creates descriptor set layout src/gpu.cpp2848-2925 |
| create_pipeline_layout() | Creates pipeline layout src/gpu.cpp2927-2990 |
| create_pipeline() | Creates compute pipeline src/gpu.cpp2992-3080 |
| acquire_queue() / reclaim_queue() | Queue management for multi-threading src/gpu.cpp3225-3296 |
| acquire_blob_allocator() / reclaim_blob_allocator() | Allocator management src/gpu.cpp3298-3365 |
| convert_packing() | Utility for converting element packing src/gpu.cpp3580-3805 |
Sources: src/gpu.cpp2562-3805 src/gpu.h423-546
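The element packing that convert_packing() handles on the GPU can be illustrated with a host-side equivalent: repacking a w×h×c fp32 tensor from elempack=1 (planar channels) to elempack=4 (four channels interleaved per element). This is a simplified sketch of the layout transform, not ncnn's implementation, and the helper name is hypothetical:

```cpp
#include <cassert>
#include <vector>

// Repack from elempack=1 (planar channels) to elempack=4 (groups of four
// channels interleaved per spatial element). Assumes c is a multiple of 4;
// ncnn pads the channel tail in the general case.
std::vector<float> pack1to4(const std::vector<float>& src, int w, int h, int c)
{
    const int area = w * h;
    std::vector<float> dst(src.size());
    for (int q = 0; q < c / 4; q++)     // destination channel-group index
        for (int i = 0; i < area; i++)  // spatial position
            for (int k = 0; k < 4; k++) // lane within the vec4
                dst[(q * area + i) * 4 + k] = src[(q * 4 + k) * area + i];
    return dst;
}
```

Packed layouts let a shader load four channels with one vec4 fetch, which is why most Vulkan layers provide pack1/pack4/pack8 pipeline variants.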
Two command-recorder classes are provided in src/command.h1-118:
- VkCompute src/command.h22-88: Used during inference. Records both compute dispatches and CPU↔GPU data transfers. Maintains upload/download staging buffers across the lifetime of a command recording.
- VkTransfer src/command.h90-111: Used at model load time (weight upload only). A simpler, upload-only recorder called from each layer's upload_model() method. Does not support download or compute dispatch.

VkCompute provides a high-level interface for recording GPU operations. It maintains a command pool, command buffer, and fence internally through VkComputePrivate src/command.cpp13-189
VkCompute Command Buffer Lifecycle
Diagram: VkCompute Command Recording and Submission Flow
Sources: src/command.cpp13-356 src/command.h22-88
Data transfer between CPU and GPU involves staging buffers for devices without host-visible device memory.
Upload Path src/command.cpp358-432:
Upload Path src/command.cpp358-432:

- Allocate a staging buffer from VkStagingAllocator
- memcpy source data to the staging buffer's mapped memory
- Record convert_packing() to transfer staging → destination with optimal element packing
- Track the staging buffer in the upload_staging_buffers vector

Download Path src/command.cpp434-586:

- Record convert_packing() to transfer source → staging with element unpacking
- Defer the host-side copy (memcpy after submit)
- After submit_and_wait(), copy staging to the output Mat

Diagram: CPU-GPU Data Transfer Paths
Sources: src/command.cpp358-586
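The deferred-download pattern can be modeled with two recorded-operation lists: GPU work runs at submit, and host-side copies run only after the fence wait. This is a toy model of the control flow, not the Vulkan code; the Recorder type is hypothetical:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model of VkCompute's delayed download: GPU operations and post-submit
// host copies are recorded separately, then replayed in order.
struct Recorder
{
    std::vector<std::function<void()> > gpu_ops;         // stand-in for command buffer
    std::vector<std::function<void()> > post_submit_ops; // staging -> Mat memcpy

    void submit_and_wait()
    {
        for (auto& op : gpu_ops) op();          // stand-in for queue submit + fence wait
        for (auto& op : post_submit_ops) op();  // host copies only after GPU finished
        gpu_ops.clear();
        post_submit_ops.clear();
    }
};
```

The key property is ordering: even though a download's memcpy is recorded during forward(), it never executes before the GPU work it depends on has completed.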
Recording a pipeline execution involves binding the pipeline, updating descriptors, pushing constants, and dispatching workgroups.
For devices with VK_KHR_push_descriptor src/command.cpp1268-1328:
For devices without push descriptors src/command.cpp1330-1439:
Descriptors are allocated from pools and recorded as delayed operations, then executed in submit_and_wait().
Workgroup Size Calculation src/command.cpp1197-1233:
The dispatch logic determines the number of workgroups from the output dimensions and the pipeline's local size:
group_count_x = (w + local_size_x - 1) / local_size_x
group_count_y = (h + local_size_y - 1) / local_size_y
group_count_z = (c + local_size_z - 1) / local_size_z
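The ceil-division above can be written directly as a standalone restatement of the formula (illustrative code, not ncnn's):

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts for a compute dispatch: enough groups to cover the
// output extent in each dimension, rounding up.
struct Dispatch { uint32_t x, y, z; };

Dispatch group_count(int w, int h, int c,
                     int local_size_x, int local_size_y, int local_size_z)
{
    return Dispatch{
        static_cast<uint32_t>((w + local_size_x - 1) / local_size_x),
        static_cast<uint32_t>((h + local_size_y - 1) / local_size_y),
        static_cast<uint32_t>((c + local_size_z - 1) / local_size_z)
    };
}
```

Because the counts round up, the shader must bounds-check gl_GlobalInvocationID against w/h/c to skip the overhanging invocations.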
Sources: src/command.cpp1142-1439
The submit_and_wait() method src/command.cpp1441-1607 handles command submission and synchronization:
Diagram: Command Submission and Wait Flow
The method also handles benchmark query results if NCNN_BENCHMARK is enabled src/command.cpp1467-1482
Sources: src/command.cpp1441-1607 src/command.cpp1609-1685
The Vulkan memory allocators follow a common interface defined by VkAllocator src/allocator.h263-295:
Diagram: Vulkan Allocator Hierarchy
Sources: src/allocator.h208-397 src/allocator.cpp349-367
VkBlobAllocator src/allocator.cpp596-1012 manages memory in large blocks and sub-allocates from them:
Diagram: VkBlobAllocator Buffer Allocation
Memory Type Selection src/gpu.cpp3082-3176:
The allocator prefers memory types with specific properties:
| Required | Preferred | Preferred Not |
|---|---|---|
| DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED |
| - | DEVICE_LOCAL | - |
This allows direct GPU access while supporting mapped access on integrated GPUs.
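The table above can be modeled as a scoring pass over the device's memory types: reject types missing the required bits, then prefer types that have the preferred bits and lack the undesired ones. The flag values below match VkMemoryPropertyFlagBits, but the scoring itself is an illustrative sketch, not ncnn's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Subset of VkMemoryPropertyFlagBits.
const uint32_t DEVICE_LOCAL = 0x1;
const uint32_t HOST_VISIBLE = 0x2;
const uint32_t HOST_CACHED  = 0x8;

// Pick the memory type index that has all `required` bits, then scores
// highest for having `preferred` bits and lacking `preferred_not` bits.
int find_memory_type(const std::vector<uint32_t>& type_flags,
                     uint32_t required, uint32_t preferred, uint32_t preferred_not)
{
    int best = -1;
    int best_score = -1;
    for (int i = 0; i < (int)type_flags.size(); i++)
    {
        uint32_t f = type_flags[i];
        if ((f & required) != required)
            continue; // hard requirement not met
        int score = 0;
        if (f & preferred) score += 2;       // reward preferred property
        if (!(f & preferred_not)) score += 1; // reward absence of undesired one
        if (score > best_score) { best_score = score; best = i; }
    }
    return best; // -1 if no type satisfies the requirements
}
```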
Block Management src/allocator.cpp608-1012:
- buffer_blocks: Vector of allocated VkBufferMemory* blocks
- buffer_budgets: List of free (offset, size) ranges per block
- image_memory_blocks: Separate vector for image memory
- image_memory_budgets: Free ranges for image allocations

Allocation aligns to:

- buffer_offset_alignment: From VkPhysicalDeviceLimits src/allocator.cpp611
- bind_memory_offset_alignment: Buffer-image granularity src/allocator.cpp612

Sources: src/allocator.cpp596-1012
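Both alignment requirements reduce to the same primitive: rounding a sub-allocation offset up to a power-of-two boundary. A minimal sketch:

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// Vulkan alignment values (minStorageBufferOffsetAlignment,
// bufferImageGranularity) are guaranteed to be powers of two,
// which makes the bitmask form valid.
std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

A sub-allocator applies this to every free-range offset before handing it out, so a descriptor bound at that offset satisfies the device limits.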
VkStagingAllocator src/allocator.cpp1014-1299 is optimized for frequent allocations during data transfer:
Key Features:
- Buffer reuse pattern similar to PoolAllocator

Diagram: Staging Allocator Lifecycle
Sources: src/allocator.cpp1014-1299
VkWeightAllocator src/allocator.cpp1301-1616 is designed for immutable model weight storage:
Configuration Options:
Prefer Host Memory: For use_weights_in_host_memory option src/allocator.cpp1316-1338
Device Local: Default mode
Block Strategy:
Unlike blob allocator, weight allocator creates dedicated blocks per allocation, avoiding fragmentation from mixed-size weight tensors src/allocator.cpp1396-1529
Sources: src/allocator.cpp1301-1616
The Pipeline class src/pipeline.h18-65 wraps a Vulkan compute pipeline and provides workgroup size configuration. Internally it holds a PipelinePrivate with:
- VkShaderModule shader_module
- VkDescriptorSetLayout descriptorset_layout
- VkPipelineLayout pipeline_layout
- VkPipeline pipeline
- VkDescriptorUpdateTemplateKHR descriptor_update_template
- ShaderInfo shader_info — binding types and push constant count
- local_size_x/y/z, subgroup_size

Pipeline Creation Flow
Diagram: Pipeline Creation via PipelineCache
Sources: src/pipeline.cpp35-237 src/pipeline.h18-65
The set_optimal_local_size_xyz() method src/pipeline.cpp65-103 configures the shader's local workgroup dimensions:
Default Approach src/pipeline.cpp76-82:
Adjustment for Subgroup Size src/pipeline.cpp132-203:
The adjust_xyz() function ensures local_size_x × local_size_y × local_size_z is a multiple of the subgroup size (typically 4-128):
Strategy:
- If z==1 and y==1: adjust x only
- If z==1 and x==1: adjust y only
- If z==1: adjust x and y
- If y==1 and x==1: adjust z only
- If y==1: adjust x and z
- If x==1: adjust y and z
- Else: adjust x, y, and z
This ensures efficient subgroup utilization while minimizing wasted invocations src/pipeline.cpp132-203
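One way to realize this strategy is to round the free dimension up to a subgroup-size multiple when only one dimension is adjustable, and otherwise grow a dimension until the product divides evenly. The sketch below illustrates the idea; it is not ncnn's adjust_xyz() and collapses the multi-dimension cases into a single fallback:

```cpp
#include <cassert>

// Round n up to the next multiple of m.
static int round_up(int n, int m) { return (n + m - 1) / m * m; }

// Adjust local workgroup size so x*y*z is a multiple of subgroup_size,
// touching only dimensions that are not fixed at 1 where possible.
void adjust_xyz_sketch(int& x, int& y, int& z, int subgroup_size)
{
    if ((x * y * z) % subgroup_size == 0)
        return; // already a multiple, nothing to do
    if (z == 1 && y == 1)      x = round_up(x, subgroup_size);
    else if (z == 1 && x == 1) y = round_up(y, subgroup_size);
    else if (y == 1 && x == 1) z = round_up(z, subgroup_size);
    else
        // general fallback: grow x until the product divides evenly
        while ((x * y * z) % subgroup_size != 0) x++;
}
```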
Sources: src/pipeline.cpp65-217
Shaders are compiled from GLSL to SPIR-V using the embedded glslang compiler.
Compilation Functions src/gpu.cpp3807-4237:
Diagram: Shader Compilation Pipeline
Preprocessor Defines src/gpu.cpp3929-4008:
The compiler adds defines based on device capabilities and Option settings:
| Define | Condition |
|---|---|
| NCNN_fp16_storage | opt.use_fp16_storage && info.support_fp16_storage() |
| NCNN_fp16_arithmetic | opt.use_fp16_arithmetic && info.support_fp16_arithmetic() |
| NCNN_int8_storage | opt.use_int8_storage && info.support_int8_storage() |
| NCNN_int8_arithmetic | opt.use_int8_arithmetic && info.support_int8_arithmetic() |
| NCNN_fp16_packed | opt.use_fp16_packed && info.support_fp16_packed() |
| NCNN_int8_packed | opt.use_int8_packed && info.support_int8_packed() |
| NCNN_image_shader | opt.use_tensor_storage |
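The gating pattern in the table — a define is emitted only when both the Option requests a feature and the device supports it — can be modeled as simple string assembly. The Opt/Info structs below are simplified stand-ins, not ncnn's real types, and only two of the defines are shown:

```cpp
#include <cassert>
#include <string>

struct Opt  { bool use_fp16_storage; bool use_fp16_arithmetic; };
struct Info { bool support_fp16_storage; bool support_fp16_arithmetic; };

// Emit the NCNN_* defines that both the Option requests
// and the device capability report supports.
std::string build_defines(const Opt& opt, const Info& info)
{
    std::string s;
    if (opt.use_fp16_storage && info.support_fp16_storage)
        s += "#define NCNN_fp16_storage 1\n";
    if (opt.use_fp16_arithmetic && info.support_fp16_arithmetic)
        s += "#define NCNN_fp16_arithmetic 1\n";
    return s;
}
```

Requiring both sides means a user can request fp16 globally while the backend silently falls back to fp32 on devices that lack the extension.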
Layer Shader Registry src/gpu.cpp91-103:
Pre-compiled shaders are embedded as hex data in the binary:
- layer_shader_registry[]: Array of shader SPIR-V data
- layer_shader_registry_entry_count: Number of built-in shaders
- Shaders are generated from .comp files

Sources: src/gpu.cpp3807-4237 src/gpu.cpp91-103
The PipelineCache class (referenced but implementation in src/pipelinecache.cpp) manages compiled pipelines and shader modules to avoid recompilation:
Cache Key Components:
The cache stores:
- VkShaderModule: Compiled shader
- VkDescriptorSetLayout: Descriptor layout
- VkPipelineLayout: Pipeline layout
- VkPipeline: Compute pipeline
- VkDescriptorUpdateTemplateKHR: Descriptor update template
- ShaderInfo: Binding types and constant info

Sources: src/pipeline.cpp221-237
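Conceptually the cache is a map keyed on everything that affects compilation, so identical requests reuse one pipeline. A minimal sketch with a hypothetical key (the real key also covers specialization constants and option bits beyond what is shown):

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical cache key: shader type index, option bits, local_size x/y/z.
typedef std::tuple<int, unsigned int, int, int, int> PipelineKey;

struct PipelineCacheSketch
{
    std::map<PipelineKey, int> entries; // int stands in for VkPipeline etc.

    // Return the cached entry, or "compile" and insert one on first use.
    int get(const PipelineKey& key)
    {
        std::map<PipelineKey, int>::iterator it = entries.find(key);
        if (it != entries.end())
            return it->second; // cache hit: no recompilation
        int handle = (int)entries.size() + 1; // pretend-compile
        entries[key] = handle;
        return handle;
    }
};
```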
Vulkan layers implement the forward pass using GLSL compute shaders compiled to SPIR-V at runtime. Each *_vulkan layer class follows a common lifecycle:
Vulkan Layer Lifecycle
Diagram: Vulkan Layer Implementation Pattern
Sources: src/layer/vulkan/convolution_vulkan.cpp1-100 src/layer/vulkan/innerproduct_vulkan.cpp1-30 src/layer/vulkan/pooling_vulkan.cpp1-30
AbsVal_vulkan is a representative example of the simplest Vulkan layer pattern — an in-place element-wise operation.
Class structure: The class holds one Pipeline* member per element-packing variant: pipeline_absval (elempack=1), pipeline_absval_pack4 (elempack=4), and pipeline_absval_pack8 (elempack=8 for fp16 packed). The choice of variant is determined by the element pack of the input VkMat at forward time.
create_pipeline(): Instantiates one Pipeline per active packing variant. Each pipeline calls set_optimal_local_size_xyz() and create(shader_type_index, opt, specializations) referencing a shader type from the LayerShaderType enum. Specialization constants for simple element-wise operations are typically empty; shape-dependent layers encode dimensions and strides here.
forward_inplace(VkMat&, VkCompute&, Option&): Selects the correct pipeline based on bottom_top_blob.elempack, prepares a std::vector<VkMat> for bindings, and calls cmd.record_pipeline(). The single binding is the in-out buffer. No host-side output allocation is needed.
Shader design: The corresponding .comp shader uses specialization constant IDs 233, 234, 235 for local workgroup size x/y/z, accesses dimensions via the psc(w/h/c) push constant macro, and uses buffer_ld1/buffer_st1 macros that resolve to the correct precision at compile time via NCNN_fp16_storage and related defines.
Sources: src/layer/vulkan/absval_vulkan.cpp1-189 src/layer/vulkan/absval_vulkan.h1-50
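As a host-side reference for what the shader computes: the operation is simply |x| applied to every element in place, regardless of packing variant (packing changes only memory layout, not the per-element math). Illustrative code, not ncnn's CPU layer:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host reference of the absval operation: |x| in place.
void absval_reference(std::vector<float>& v)
{
    for (size_t i = 0; i < v.size(); i++)
        v[i] = std::fabs(v[i]);
}
```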
Common shader macros are conditionally injected by the GLSL compiler based on device capabilities and Option settings.
Precision type aliases:
- afp: Arithmetic type — float or float16_t depending on NCNN_fp16_arithmetic
- sfp: Storage type — float16_t, int8_t, bfloat16_t, or float
- afpvec4, sfpvec4: Corresponding packed vector types

Buffer access macros:

- buffer_ld1(data, i) / buffer_st1(data, i, v): Scalar load/store with precision conversion
- buffer_ld4(data, i) / buffer_st4(data, i, v): 4-element vector load/store

Push constant access:

- psc(name): Expands to the named push constant field (w, h, c, cstep, etc.) — automatically generated from the layer's push constant struct

These macros allow a single shader source file to compile correctly across all precision modes. The glslang preprocessor injects the relevant defines during compile_spirv_module() based on the Option settings and device support flags src/gpu.cpp3929-4008
Sources: src/gpu.cpp3807-4237
ncnn's Vulkan backend supports multiple precision modes configured through Option:
| Precision | Storage Size | Usage | Performance |
|---|---|---|---|
| FP32 | 4 bytes | Default, highest accuracy | Baseline |
| FP16 storage | 2 bytes | Reduced memory, FP32 arithmetic | 2× memory reduction |
| FP16 packed | 2 bytes | Reduced memory, FP16 arithmetic | 2× memory + compute speedup |
| INT8 | 1 byte | Quantized inference | 4× memory reduction |
| BF16 | 2 bytes | Mixed precision training/inference | 2× memory reduction |
Option fields controlling GPU precision src/option.h90-96:
| Field | Meaning |
|---|---|
| use_fp16_packed | Pack 8 FP16 values per vec8 register |
| use_fp16_storage | Store in FP16, compute in FP32 |
| use_fp16_arithmetic | Both store and compute in FP16 |
| use_int8_packed | Pack 8 INT8 values per vec8 register |
| use_int8_storage | Store in INT8 |
| use_int8_arithmetic | Both store and compute in INT8 |
GpuInfo device capability predicates src/gpu.h280-288:
support_fp16_packed(), support_fp16_storage(), support_fp16_arithmetic(), support_int8_packed(), support_int8_storage(), support_int8_arithmetic() — all queried from extension features during create_gpu_instance().
Sources: src/option.h90-96 src/gpu.h280-288
Format conversion happens automatically during upload/download based on GPU type:
Discrete GPU (type == 0):

- Packing/precision conversion runs on the CPU before upload and after download

Integrated GPU (type != 0):

- Upload records convert_packing() on the GPU src/command.cpp421-431
- Download records convert_packing() on the GPU src/command.cpp452-469

This optimization leverages CPU SIMD for discrete GPUs (PCIe bandwidth limited) and GPU parallelism for integrated GPUs (shared memory).
Sources: src/command.cpp358-586
To avoid GPU driver timeouts, VkCompute tracks pending work and submits when a threshold is exceeded:
Threshold Calculation src/net.cpp252-273:
The threshold represents the approximate number of elements processed. High-end GPUs can handle larger batches without timeout.
Sources: src/net.cpp250-298
Proper memory alignment is critical for Vulkan performance:
Buffer Offset Alignment src/allocator.cpp611:
VkPhysicalDeviceLimits::minStorageBufferOffsetAlignmentNon-Coherent Memory src/allocator.cpp379-420:
nonCoherentAtomSizeBuffer-Image Granularity src/allocator.cpp612:
bufferImageGranularitySources: src/allocator.cpp369-595
Modern GPUs benefit from subgroup (warp/wavefront) operations:
Subgroup Size Configuration src/pipeline.cpp105-112:
Workgroup Adjustment src/pipeline.cpp132-203:
The adjust_xyz() function ensures local workgroup size is a multiple of subgroup size for efficient execution.
Subgroup Features src/gpu.h263-271:
Shaders can use subgroup operations when opt.use_subgroup_ops is enabled and device supports them.
Sources: src/pipeline.cpp105-203 src/gpu.h263-271 src/option.h32