This document describes the command recording and pipeline execution system for GPU compute operations in NCNN's Vulkan backend. This system provides the interface for recording GPU work into Vulkan command buffers and executing compute shaders on the GPU.
Scope: This page covers the VkCompute class used for recording GPU operations, the Pipeline class for managing compute pipelines, and the execution flow from command recording to GPU submission. For Vulkan device initialization and selection, see Vulkan Instance and Device Management. For GPU memory allocation and data transfer details, see GPU Memory and Data Transfer. For implementation of specific GPU layers, see Vulkan Layer Implementations.
The VkCompute class is the primary interface for recording GPU compute operations. It encapsulates a Vulkan command buffer and manages the recording, submission, and synchronization of GPU work.
Sources: src/command.h22-88 src/command.cpp13-189 src/pipeline.h18-65
The command buffer follows a specific lifecycle from initialization through recording, submission, and reset.
Initialization (lines 255-316 in command.cpp):
VkCommandPool with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT flagVkCommandBuffer from the poolVkFence for synchronizationSources: src/command.cpp255-316
Data transfer between CPU and GPU is recorded using staging buffers and memory barriers.
Upload Implementation (src/command.cpp358-432):
memcpy to transfer data from CPU Mat to staging bufferallocator->flush() to make CPU writes visible to GPUvkdev->convert_packing() command to convert element packing and optionally cast typesDownload Implementation (src/command.cpp434-586):
VkBufferMemoryBarrier to ensure GPU writes complete before CPU readSources: src/command.cpp358-586
Clone operations duplicate data between Mat, VkMat, and VkImageMat representations.
| Source Type | Destination Type | Implementation |
|---|---|---|
| Mat | VkMat | CPU→Staging→GPU (via record_upload) |
| Mat | VkImageMat | CPU→Staging→GPU buffer→GPU image |
| VkMat | Mat | GPU→Staging→CPU (via record_download) |
| VkImageMat | Mat | GPU image→GPU buffer→Staging→CPU |
| VkMat | VkMat | GPU buffer copy or packing conversion |
| VkImageMat | VkImageMat | GPU image copy |
| VkMat | VkImageMat | Buffer-to-image copy |
| VkImageMat | VkMat | Image-to-buffer copy |
Sources: src/command.cpp588-1038
Pipeline execution involves binding a compute pipeline, setting up descriptors, and dispatching work.
Pipeline Creation (src/pipeline.cpp219-237):
Pipeline::create() accepts either raw SPIR-V or shader type indexPipelineCache for cached pipeline objectsPipelinePrivate structureLocal Workgroup Size (src/pipeline.cpp65-217):
set_optimal_local_size_xyz(): Automatically calculates optimal workgroup dimensionsset_local_size_xyz(): Sets explicit workgroup sizeadjust_xyz(): Ensures total threads are multiple of subgroup sizemax_workgroup_size_x/y/z, max_workgroup_invocationsDescriptor Binding (src/command.cpp1040-1348):
vkCmdPushDescriptorSetWithTemplateKHR for direct bindingvkCmdBindDescriptorSetsVkDescriptorImageInfo or VkDescriptorBufferInfo structures for each bindingDispatch (src/command.cpp1350-1433):
vkCmdDispatchIndirect if dispatcher is GPU buffer (advanced path)vkCmdDispatch with calculated workgroup countspending_dispatch_total for submit thresholdingSources: src/command.cpp1040-1433 src/pipeline.cpp65-237
Memory barriers ensure proper synchronization between GPU operations by controlling when memory becomes visible between pipeline stages.
Barrier Functions (src/command.cpp2187-2296):
| Function | Access Pattern | Pipeline Stages | Purpose |
|---|---|---|---|
barrier_readwrite(VkMat) | SHADER_WRITE → SHADER_READ|WRITE | COMPUTE → COMPUTE | Ensure compute shader writes visible to subsequent shader reads |
barrier_readwrite(VkImageMat) | SHADER_WRITE → SHADER_READ|WRITE | COMPUTE → COMPUTE | Same for image storage |
barrier_readonly(VkImageMat) | current → SHADER_READ | current → COMPUTE | Prepare image for shader sampling |
Automatic Barrier Insertion (src/command.cpp1040-1348):
access_flags and stage_flagsVkBufferMemory::access_flags and VkImageMemory::access_flagsSources: src/command.cpp1040-1348 src/command.cpp2187-2296
The submit-and-wait mechanism executes recorded commands and synchronizes with the CPU.
Submission Threshold (src/net.cpp250-298):
During the network forward pass, NetPrivate::forward_layer automatically calls submit_and_wait() when cmd.pending_dispatch_total() exceeds a threshold. The threshold scales with the device's rough_score() to prevent driver timeouts on slower hardware while reducing unnecessary synchronizations on high-end GPUs:
rough_score() | pending_dispatch_threshold |
|---|---|
| > 75 | 8 MB |
| > 50 | 4 MB |
| > 15 | 1 MB |
| > 10 | 256 KB |
| ≤ 10 | 32 KB |
pending_dispatch_total is accumulated in VkComputePrivate with each dispatch call. The threshold is also crossed immediately when a layer lacks Vulkan support and requires record_download to produce a CPU Mat for the next layer.
Sources: src/net.cpp250-298 src/command.h71
submit_and_wait() Implementation (src/command.cpp1435-1587):
vkEndCommandBuffer()vkWaitForFences()reset() Function (src/command.cpp1589-1605):
submit_and_wait() to prepare for new recordingpending_dispatch_total to 0Sources: src/command.cpp1435-1605 src/net.cpp250-298
When NCNN_BENCHMARK is enabled, VkCompute supports timestamp queries for performance measurement.
Query Pool Management (src/command.cpp1607-1652):
create_query_pool(query_count): Creates VkQueryPool with timestamp query typeTimestamp Recording (src/command.cpp1655-1676):
record_write_timestamp(query): Inserts vkCmdWriteTimestamp commandVK_PIPELINE_STAGE_COMPUTE_SHADER_BITlayer_index * 2 and layer_index * 2 + 1Result Retrieval (src/command.cpp1678-1695):
get_query_pool_results(): Retrieves timestamp values after submissionvkdev->info.timestamp_period()net.cpp:281-292 to print per-layer GPU timingIntegration in NetPrivate::forward_layer (src/net.cpp280-293):
For each Vulkan-supported layer, record_write_timestamp is called with indices layer_index * 2 (before) and layer_index * 2 + 1 (after) the do_forward_layer call. After submit_and_wait(), get_query_pool_results retrieves all accumulated timestamps. Duration in microseconds is computed by multiplying the raw tick delta by vkdev->info.timestamp_period() (nanoseconds per tick) and dividing by 1000. Results are logged per-layer using NCNN_LOGE.
Sources: src/command.cpp1607-1695 src/net.cpp280-293
VkTransfer is a specialized command recorder for weight upload operations during model loading.
Differences from VkCompute:
Sources: src/command.cpp1697-1997
Refresh this wiki
This wiki was recently refreshed. Please wait 2 days to refresh again.