This document describes the Vulkan-based GPU acceleration system in ncnn. It covers the initialization and management of Vulkan instances and devices, the command recording and execution model, GPU memory allocation and CPU-GPU data transfer mechanisms, and how Vulkan compute layers are implemented.
For information about the core inference runtime that uses this GPU system, see Core Runtime Architecture. For details on CPU-specific layer optimizations, see CPU Layer Implementations. For platform-specific SIMD optimizations, see Platform-Specific Optimizations.
The Vulkan system provides GPU acceleration through compute shaders. The system is organized into four primary subsystems:
Vulkan System Component Map
Diagram: Vulkan System — Classes and Source Files
Sources: src/gpu.h1-548 src/command.h1-118 src/allocator.h208-397 src/pipeline.h1-65
The Vulkan instance is created once globally and shared across all GPU operations. The instance manages physical device enumeration and extension loading.
Diagram: Vulkan Instance Initialization Flow
The global instance holder is defined in src/gpu.cpp37-78:
| Global Variable | Type | Purpose |
|---|---|---|
| g_instance | __ncnn_vulkan_instance_holder | Stores VkInstance and initialization state |
| g_gpu_count | int | Number of available GPUs |
| g_default_gpu_index | int | Index of the default GPU |
| g_gpu_infos[] | GpuInfo*[32] | Array of GPU capability info |
| g_default_vkdev[] | VulkanDevice*[32] | Array of default device instances |
Sources: src/gpu.cpp32-89 src/gpu.h15-30
Each physical device's capabilities are queried and stored in a GpuInfo object. The GpuInfoPrivate class src/gpu.cpp253-408 performs comprehensive capability detection:
Diagram: GPU Capability Query Process
Key capability categories queried:
Sources: src/gpu.cpp253-859 src/gpu.h184-414
The VulkanDevice class src/gpu.h423-546 wraps a logical device and provides device-specific operations:
Diagram: VulkanDevice Initialization
The device provides key methods:
| Method | Purpose |
|---|---|
| compile_shader_module() | Compiles SPIR-V to VkShaderModule src/gpu.cpp2746-2846 |
| create_descriptorset_layout() | Creates descriptor set layout src/gpu.cpp2848-2925 |
| create_pipeline_layout() | Creates pipeline layout src/gpu.cpp2927-2990 |
| create_pipeline() | Creates compute pipeline src/gpu.cpp2992-3080 |
| acquire_queue() / reclaim_queue() | Queue management for multi-threading src/gpu.cpp3225-3296 |
| acquire_blob_allocator() / reclaim_blob_allocator() | Allocator management src/gpu.cpp3298-3365 |
| convert_packing() | Utility for converting element packing src/gpu.cpp3580-3805 |
Sources: src/gpu.cpp2562-3805 src/gpu.h423-546
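The element packing that convert_packing() handles on the GPU can be illustrated with a host-side equivalent: repacking a w×h×c fp32 tensor from elempack=1 (planar channels) to elempack=4 (four channels interleaved per element). This is a simplified sketch of the layout transform, not ncnn's implementation, and the helper name is hypothetical:

```cpp
#include <cassert>
#include <vector>

// Repack from elempack=1 (planar channels) to elempack=4 (groups of four
// channels interleaved per spatial element). Assumes c is a multiple of 4;
// ncnn pads the channel tail in the general case.
std::vector<float> pack1to4(const std::vector<float>& src, int w, int h, int c)
{
    const int area = w * h;
    std::vector<float> dst(src.size());
    for (int q = 0; q < c / 4; q++)     // destination channel-group index
        for (int i = 0; i < area; i++)  // spatial position
            for (int k = 0; k < 4; k++) // lane within the vec4
                dst[(q * area + i) * 4 + k] = src[(q * 4 + k) * area + i];
    return dst;
}
```

Packed layouts let a shader load four channels with one vec4 fetch, which is why most Vulkan layers provide pack1/pack4/pack8 pipeline variants.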
Two command-recorder classes are provided in src/command.h1-118:
- VkCompute src/command.h22-88: Used during inference. Records both compute dispatches and CPU↔GPU data transfers. Maintains upload/download staging buffers across the lifetime of a command recording.
- VkTransfer src/command.h90-111: Used at model load time (weight upload only). A simpler, upload-only recorder called from each layer's upload_model() method. Does not support download or compute dispatch.

VkCompute provides a high-level interface for recording GPU operations. It maintains a command pool, command buffer, and fence internally through VkComputePrivate src/command.cpp13-189
VkCompute Command Buffer Lifecycle
Diagram: VkCompute Command Recording and Submission Flow
Sources: src/command.cpp13-356 src/command.h22-88
Data transfer between CPU and GPU involves staging buffers for devices without host-visible device memory.
Upload Path src/command.cpp358-432:
Upload Path src/command.cpp358-432:

- Allocate a staging buffer from VkStagingAllocator
- memcpy source data to the staging buffer's mapped memory
- Record convert_packing() to transfer staging → destination with optimal element packing
- Track the staging buffer in the upload_staging_buffers vector

Download Path src/command.cpp434-586:

- Record convert_packing() to transfer source → staging with element unpacking
- Defer the host-side copy (memcpy after submit)
- After submit_and_wait(), copy staging to the output Mat

Diagram: CPU-GPU Data Transfer Paths
Sources: src/command.cpp358-586
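The deferred-download pattern can be modeled with two recorded-operation lists: GPU work runs at submit, and host-side copies run only after the fence wait. This is a toy model of the control flow, not the Vulkan code; the Recorder type is hypothetical:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model of VkCompute's delayed download: GPU operations and post-submit
// host copies are recorded separately, then replayed in order.
struct Recorder
{
    std::vector<std::function<void()> > gpu_ops;         // stand-in for command buffer
    std::vector<std::function<void()> > post_submit_ops; // staging -> Mat memcpy

    void submit_and_wait()
    {
        for (auto& op : gpu_ops) op();          // stand-in for queue submit + fence wait
        for (auto& op : post_submit_ops) op();  // host copies only after GPU finished
        gpu_ops.clear();
        post_submit_ops.clear();
    }
};
```

The key property is ordering: even though a download's memcpy is recorded during forward(), it never executes before the GPU work it depends on has completed.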
Recording a pipeline execution involves binding the pipeline, updating descriptors, pushing constants, and dispatching workgroups.
For devices with VK_KHR_push_descriptor src/command.cpp1268-1328:
For devices without push descriptors src/command.cpp1330-1439:
Descriptors are allocated from pools and recorded as delayed operations, then executed in submit_and_wait().
Workgroup Size Calculation src/command.cpp1197-1233:
The dispatch logic determines the number of workgroups from the output dimensions and the pipeline's local size:
group_count_x = (w + local_size_x - 1) / local_size_x
group_count_y = (h + local_size_y - 1) / local_size_y
group_count_z = (c + local_size_z - 1) / local_size_z
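The ceil-division above can be written directly as a standalone restatement of the formula (illustrative code, not ncnn's):

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts for a compute dispatch: enough groups to cover the
// output extent in each dimension, rounding up.
struct Dispatch { uint32_t x, y, z; };

Dispatch group_count(int w, int h, int c,
                     int local_size_x, int local_size_y, int local_size_z)
{
    return Dispatch{
        static_cast<uint32_t>((w + local_size_x - 1) / local_size_x),
        static_cast<uint32_t>((h + local_size_y - 1) / local_size_y),
        static_cast<uint32_t>((c + local_size_z - 1) / local_size_z)
    };
}
```

Because the counts round up, the shader must bounds-check gl_GlobalInvocationID against w/h/c to skip the overhanging invocations.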
Sources: src/command.cpp1142-1439
The submit_and_wait() method src/command.cpp1441-1607 handles command submission and synchronization:
Diagram: Command Submission and Wait Flow
The method also handles benchmark query results if NCNN_BENCHMARK is enabled src/command.cpp1467-1482
Sources: src/command.cpp1441-1607 src/command.cpp1609-1685
The Vulkan memory allocators follow a common interface defined by VkAllocator src/allocator.h263-295:
Diagram: Vulkan Allocator Hierarchy
Sources: src/allocator.h208-397 src/allocator.cpp349-367
VkBlobAllocator src/allocator.cpp596-1012 manages memory in large blocks and sub-allocates from them:
Diagram: VkBlobAllocator Buffer Allocation
Memory Type Selection src/gpu.cpp3082-3176:
The allocator prefers memory types with specific properties:
| Required | Preferred | Preferred Not |
|---|---|---|
| DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED |
| - | DEVICE_LOCAL | - |
This allows direct GPU access while supporting mapped access on integrated GPUs.
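The table above can be modeled as a scoring pass over the device's memory types: reject types missing the required bits, then prefer types that have the preferred bits and lack the undesired ones. The flag values below match VkMemoryPropertyFlagBits, but the scoring itself is an illustrative sketch, not ncnn's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Subset of VkMemoryPropertyFlagBits.
const uint32_t DEVICE_LOCAL = 0x1;
const uint32_t HOST_VISIBLE = 0x2;
const uint32_t HOST_CACHED  = 0x8;

// Pick the memory type index that has all `required` bits, then scores
// highest for having `preferred` bits and lacking `preferred_not` bits.
int find_memory_type(const std::vector<uint32_t>& type_flags,
                     uint32_t required, uint32_t preferred, uint32_t preferred_not)
{
    int best = -1;
    int best_score = -1;
    for (int i = 0; i < (int)type_flags.size(); i++)
    {
        uint32_t f = type_flags[i];
        if ((f & required) != required)
            continue; // hard requirement not met
        int score = 0;
        if (f & preferred) score += 2;       // reward preferred property
        if (!(f & preferred_not)) score += 1; // reward absence of undesired one
        if (score > best_score) { best_score = score; best = i; }
    }
    return best; // -1 if no type satisfies the requirements
}
```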
Block Management src/allocator.cpp608-1012:
- buffer_blocks: Vector of allocated VkBufferMemory* blocks
- buffer_budgets: List of free (offset, size) ranges per block
- image_memory_blocks: Separate vector for image memory
- image_memory_budgets: Free ranges for image allocations

Allocation aligns to:

- buffer_offset_alignment: From VkPhysicalDeviceLimits src/allocator.cpp611
- bind_memory_offset_alignment: Buffer-image granularity src/allocator.cpp612

Sources: src/allocator.cpp596-1012
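Both alignment requirements reduce to the same primitive: rounding a sub-allocation offset up to a power-of-two boundary. A minimal sketch:

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// Vulkan alignment values (minStorageBufferOffsetAlignment,
// bufferImageGranularity) are guaranteed to be powers of two,
// which makes the bitmask form valid.
std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

A sub-allocator applies this to every free-range offset before handing it out, so a descriptor bound at that offset satisfies the device limits.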
VkStagingAllocator src/allocator.cpp1014-1299 is optimized for frequent allocations during data transfer:
Key Features:
- Buffer reuse pattern similar to PoolAllocator

Diagram: Staging Allocator Lifecycle
Sources: src/allocator.cpp1014-1299
VkWeightAllocator src/allocator.cpp1301-1616 is designed for immutable model weight storage:
Configuration Options:
Prefer Host Memory: For use_weights_in_host_memory option src/allocator.cpp1316-1338
Device Local: Default mode
Block Strategy:
Unlike blob allocator, weight allocator creates dedicated blocks per allocation, avoiding fragmentation from mixed-size weight tensors src/allocator.cpp1396-1529
Sources: src/allocator.cpp1301-1616
The Pipeline class src/pipeline.h18-65 wraps a Vulkan compute pipeline and provides workgroup size configuration. Internally it holds a PipelinePrivate with:
- VkShaderModule shader_module
- VkDescriptorSetLayout descriptorset_layout
- VkPipelineLayout pipeline_layout
- VkPipeline pipeline
- VkDescriptorUpdateTemplateKHR descriptor_update_template
- ShaderInfo shader_info — binding types and push constant count
- local_size_x/y/z, subgroup_size

Pipeline Creation Flow
Diagram: Pipeline Creation via PipelineCache
Sources: src/pipeline.cpp35-237 src/pipeline.h18-65
The set_optimal_local_size_xyz() method src/pipeline.cpp65-103 configures the shader's local workgroup dimensions:
Default Approach src/pipeline.cpp76-82:
Adjustment for Subgroup Size src/pipeline.cpp132-203:
The adjust_xyz() function ensures local_size_x × local_size_y × local_size_z is a multiple of the subgroup size (typically 4-128):
Strategy:
- If z==1 and y==1: adjust x only
- If z==1 and x==1: adjust y only
- If z==1: adjust x and y
- If y==1 and x==1: adjust z only
- If y==1: adjust x and z
- If x==1: adjust y and z
- Else: adjust x, y, and z
This ensures efficient subgroup utilization while minimizing wasted invocations src/pipeline.cpp132-203
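One way to realize this strategy is to round the free dimension up to a subgroup-size multiple when only one dimension is adjustable, and otherwise grow a dimension until the product divides evenly. The sketch below illustrates the idea; it is not ncnn's adjust_xyz() and collapses the multi-dimension cases into a single fallback:

```cpp
#include <cassert>

// Round n up to the next multiple of m.
static int round_up(int n, int m) { return (n + m - 1) / m * m; }

// Adjust local workgroup size so x*y*z is a multiple of subgroup_size,
// touching only dimensions that are not fixed at 1 where possible.
void adjust_xyz_sketch(int& x, int& y, int& z, int subgroup_size)
{
    if ((x * y * z) % subgroup_size == 0)
        return; // already a multiple, nothing to do
    if (z == 1 && y == 1)      x = round_up(x, subgroup_size);
    else if (z == 1 && x == 1) y = round_up(y, subgroup_size);
    else if (y == 1 && x == 1) z = round_up(z, subgroup_size);
    else
        // general fallback: grow x until the product divides evenly
        while ((x * y * z) % subgroup_size != 0) x++;
}
```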
Sources: src/pipeline.cpp65-217
Shaders are compiled from GLSL to SPIR-V using the embedded glslang compiler.
Compilation Functions src/gpu.cpp3807-4237:
Diagram: Shader Compilation Pipeline
Preprocessor Defines src/gpu.cpp3929-4008:
The compiler adds defines based on device capabilities and Option settings:
| Define | Condition |
|---|---|
| NCNN_fp16_storage | opt.use_fp16_storage && info.support_fp16_storage() |
| NCNN_fp16_arithmetic | opt.use_fp16_arithmetic && info.support_fp16_arithmetic() |
| NCNN_int8_storage | opt.use_int8_storage && info.support_int8_storage() |
| NCNN_int8_arithmetic | opt.use_int8_arithmetic && info.support_int8_arithmetic() |
| NCNN_fp16_packed | opt.use_fp16_packed && info.support_fp16_packed() |
| NCNN_int8_packed | opt.use_int8_packed && info.support_int8_packed() |
| NCNN_image_shader | opt.use_tensor_storage |
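The gating pattern in the table — a define is emitted only when both the Option requests a feature and the device supports it — can be modeled as simple string assembly. The Opt/Info structs below are simplified stand-ins, not ncnn's real types, and only two of the defines are shown:

```cpp
#include <cassert>
#include <string>

struct Opt  { bool use_fp16_storage; bool use_fp16_arithmetic; };
struct Info { bool support_fp16_storage; bool support_fp16_arithmetic; };

// Emit the NCNN_* defines that both the Option requests
// and the device capability report supports.
std::string build_defines(const Opt& opt, const Info& info)
{
    std::string s;
    if (opt.use_fp16_storage && info.support_fp16_storage)
        s += "#define NCNN_fp16_storage 1\n";
    if (opt.use_fp16_arithmetic && info.support_fp16_arithmetic)
        s += "#define NCNN_fp16_arithmetic 1\n";
    return s;
}
```

Requiring both sides means a user can request fp16 globally while the backend silently falls back to fp32 on devices that lack the extension.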
Layer Shader Registry src/gpu.cpp91-103:
Pre-compiled shaders are embedded as hex data in the binary:
- layer_shader_registry[]: Array of shader SPIR-V data
- layer_shader_registry_entry_count: Number of built-in shaders
- Shaders are generated from .comp files

Sources: src/gpu.cpp3807-4237 src/gpu.cpp91-103
The PipelineCache class (referenced but implementation in src/pipelinecache.cpp) manages compiled pipelines and shader modules to avoid recompilation:
Cache Key Components:
The cache stores:
- VkShaderModule: Compiled shader
- VkDescriptorSetLayout: Descriptor layout
- VkPipelineLayout: Pipeline layout
- VkPipeline: Compute pipeline
- VkDescriptorUpdateTemplateKHR: Descriptor update template
- ShaderInfo: Binding types and constant info

Sources: src/pipeline.cpp221-237
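Conceptually the cache is a map keyed on everything that affects compilation, so identical requests reuse one pipeline. A minimal sketch with a hypothetical key (the real key also covers specialization constants and option bits beyond what is shown):

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical cache key: shader type index, option bits, local_size x/y/z.
typedef std::tuple<int, unsigned int, int, int, int> PipelineKey;

struct PipelineCacheSketch
{
    std::map<PipelineKey, int> entries; // int stands in for VkPipeline etc.

    // Return the cached entry, or "compile" and insert one on first use.
    int get(const PipelineKey& key)
    {
        std::map<PipelineKey, int>::iterator it = entries.find(key);
        if (it != entries.end())
            return it->second; // cache hit: no recompilation
        int handle = (int)entries.size() + 1; // pretend-compile
        entries[key] = handle;
        return handle;
    }
};
```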
Vulkan layers implement the forward pass using GLSL compute shaders compiled to SPIR-V at runtime. Each *_vulkan layer class follows a common lifecycle:
Vulkan Layer Lifecycle
Diagram: Vulkan Layer Implementation Pattern
Sources: src/layer/vulkan/convolution_vulkan.cpp1-100 src/layer/vulkan/innerproduct_vulkan.cpp1-30 src/layer/vulkan/pooling_vulkan.cpp1-30
AbsVal_vulkan is a representative example of the simplest Vulkan layer pattern — an in-place element-wise operation.
Class structure: The class holds one Pipeline* member per element-packing variant: pipeline_absval (elempack=1), pipeline_absval_pack4 (elempack=4), and pipeline_absval_pack8 (elempack=8 for fp16 packed). The choice of variant is determined by the element pack of the input VkMat at forward time.
create_pipeline(): Instantiates one Pipeline per active packing variant. Each pipeline calls set_optimal_local_size_xyz() and create(shader_type_index, opt, specializations) referencing a shader type from the LayerShaderType enum. Specialization constants for simple element-wise operations are typically empty; shape-dependent layers encode dimensions and strides here.
forward_inplace(VkMat&, VkCompute&, Option&): Selects the correct pipeline based on bottom_top_blob.elempack, prepares a std::vector<VkMat> for bindings, and calls cmd.record_pipeline(). The single binding is the in-out buffer. No host-side output allocation is needed.
Shader design: The corresponding .comp shader uses specialization constant IDs 233, 234, 235 for local workgroup size x/y/z, accesses dimensions via the psc(w/h/c) push constant macro, and uses buffer_ld1/buffer_st1 macros that resolve to the correct precision at compile time via NCNN_fp16_storage and related defines.
Sources: src/layer/vulkan/absval_vulkan.cpp1-189 src/layer/vulkan/absval_vulkan.h1-50
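As a host-side reference for what the shader computes: the operation is simply |x| applied to every element in place, regardless of packing variant (packing changes only memory layout, not the per-element math). Illustrative code, not ncnn's CPU layer:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host reference of the absval operation: |x| in place.
void absval_reference(std::vector<float>& v)
{
    for (size_t i = 0; i < v.size(); i++)
        v[i] = std::fabs(v[i]);
}
```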
Common shader macros are conditionally injected by the GLSL compiler based on device capabilities and Option settings.
Precision type aliases:
- afp: Arithmetic type — float or float16_t depending on NCNN_fp16_arithmetic
- sfp: Storage type — float16_t, int8_t, bfloat16_t, or float
- afpvec4, sfpvec4: Corresponding packed vector types

Buffer access macros:

- buffer_ld1(data, i) / buffer_st1(data, i, v): Scalar load/store with precision conversion
- buffer_ld4(data, i) / buffer_st4(data, i, v): 4-element vector load/store

Push constant access:

- psc(name): Expands to the named push constant field (w, h, c, cstep, etc.) — automatically generated from the layer's push constant struct

These macros allow a single shader source file to compile correctly across all precision modes. The glslang preprocessor injects the relevant defines during compile_spirv_module() based on the Option settings and device support flags src/gpu.cpp3929-4008
Sources: src/gpu.cpp3807-4237
ncnn's Vulkan backend supports multiple precision modes configured through Option:
| Precision | Storage Size | Usage | Performance |
|---|---|---|---|
| FP32 | 4 bytes | Default, highest accuracy | Baseline |
| FP16 storage | 2 bytes | Reduced memory, FP32 arithmetic | 2× memory reduction |
| FP16 packed | 2 bytes | Reduced memory, FP16 arithmetic | 2× memory + compute speedup |
| INT8 | 1 byte | Quantized inference | 4× memory reduction |
| BF16 | 2 bytes | Mixed precision training/inference | 2× memory reduction |
Option fields controlling GPU precision src/option.h90-96:
| Field | Meaning |
|---|---|
| use_fp16_packed | Pack 8 FP16 values per vec8 register |
| use_fp16_storage | Store in FP16, compute in FP32 |
| use_fp16_arithmetic | Both store and compute in FP16 |
| use_int8_packed | Pack 8 INT8 values per vec8 register |
| use_int8_storage | Store in INT8 |
| use_int8_arithmetic | Both store and compute in INT8 |
GpuInfo device capability predicates src/gpu.h280-288:
support_fp16_packed(), support_fp16_storage(), support_fp16_arithmetic(), support_int8_packed(), support_int8_storage(), support_int8_arithmetic() — all queried from extension features during create_gpu_instance().
Sources: src/option.h90-96 src/gpu.h280-288
Format conversion happens automatically during upload/download based on GPU type:
Discrete GPU (type == 0):

- Packing/precision conversion runs on the CPU before upload and after download

Integrated GPU (type != 0):

- Upload records convert_packing() on the GPU src/command.cpp421-431
- Download records convert_packing() on the GPU src/command.cpp452-469

This optimization leverages CPU SIMD for discrete GPUs (PCIe bandwidth limited) and GPU parallelism for integrated GPUs (shared memory).
Sources: src/command.cpp358-586
To avoid GPU driver timeouts, VkCompute tracks pending work and submits when a threshold is exceeded:
Threshold Calculation src/net.cpp252-273:
The threshold represents the approximate number of elements processed. High-end GPUs can handle larger batches without timeout.
Sources: src/net.cpp250-298
Proper memory alignment is critical for Vulkan performance:
Buffer Offset Alignment src/allocator.cpp611:
VkPhysicalDeviceLimits::minStorageBufferOffsetAlignmentNon-Coherent Memory src/allocator.cpp379-420:
nonCoherentAtomSizeBuffer-Image Granularity src/allocator.cpp612:
bufferImageGranularitySources: src/allocator.cpp369-595
Modern GPUs benefit from subgroup (warp/wavefront) operations:
Subgroup Size Configuration src/pipeline.cpp105-112:
Workgroup Adjustment src/pipeline.cpp132-203:
The adjust_xyz() function ensures local workgroup size is a multiple of subgroup size for efficient execution.
Subgroup Features src/gpu.h263-271:
Shaders can use subgroup operations when opt.use_subgroup_ops is enabled and device supports them.
Sources: src/pipeline.cpp105-203 src/gpu.h263-271 src/option.h32