Command Recording and Pipeline Execution

This document describes the command recording and pipeline execution system for GPU compute operations in NCNN's Vulkan backend. This system provides the interface for recording GPU work into Vulkan command buffers and executing compute shaders on the GPU.

Scope: This page covers the VkCompute class used for recording GPU operations, the Pipeline class for managing compute pipelines, and the execution flow from command recording to GPU submission. For Vulkan device initialization and selection, see Vulkan Instance and Device Management. For GPU memory allocation and data transfer details, see GPU Memory and Data Transfer. For implementation of specific GPU layers, see Vulkan Layer Implementations.

VkCompute Class Architecture

The VkCompute class is the primary interface for recording GPU compute operations. It encapsulates a Vulkan command buffer and manages the recording, submission, and synchronization of GPU work.

Sources: src/command.h22-88 src/command.cpp13-189 src/pipeline.h18-65

Command Buffer Lifecycle

The command buffer follows a specific lifecycle from initialization through recording, submission, and reset.

Initialization (src/command.cpp255-316):

Creates VkCommandPool with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT flag
Allocates primary VkCommandBuffer from the pool
Creates VkFence for synchronization
For devices with push descriptor support, begins command buffer immediately

Sources: src/command.cpp255-316

Command Recording Operations

Upload and Download Operations

Data transfer between CPU and GPU is recorded using staging buffers and memory barriers.

Upload Implementation (src/command.cpp358-432):

CPU-side FP16 conversion (discrete GPU only): Converts FP32 data to FP16/BF16 on the CPU when the target is a discrete GPU
Create staging buffer: Allocates mappable staging buffer
Copy to staging: Uses memcpy to transfer data from CPU Mat to staging buffer
Flush staging: Calls allocator->flush() to make CPU writes visible to GPU
Convert packing: Records vkdev->convert_packing() command to convert element packing and optionally cast types
Resolve elempack: Determines optimal element packing (1 or 4) based on element count
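
The elempack choice in the last step can be illustrated with a small sketch. It assumes the rule is simply "use packing 4 when the element count is divisible by 4, otherwise 1"; the function name resolve_elempack is hypothetical and is not taken from NCNN's sources.

```cpp
#include <cassert>

// Hypothetical helper: pick an element packing of 4 when the element count
// divides evenly by 4, otherwise fall back to scalar packing (1).
// This follows the "1 or 4" rule described above, not NCNN's exact code.
int resolve_elempack(int elem_count)
{
    assert(elem_count > 0); // a Mat dimension is expected to be positive
    return (elem_count % 4 == 0) ? 4 : 1;
}
```

Under this assumption a dimension of 8 elements would be packed four at a time, while a dimension of 6 would stay unpacked.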

Download Implementation (src/command.cpp434-586):

Convert packing: Records conversion to staging buffer with optimal packing
Memory barrier: Inserts VkBufferMemoryBarrier to ensure GPU writes complete before CPU read
Delayed memcpy: Schedules post-execution memory copy from staging to destination
CPU-side FP32 conversion (discrete GPU only): Schedules an FP16/BF16 to FP32 conversion on the CPU after the download completes

Sources: src/command.cpp358-586

Clone Operations

Clone operations duplicate data between Mat, VkMat, and VkImageMat representations.

Source Type	Destination Type	Implementation
Mat	VkMat	CPU→Staging→GPU (via record_upload)
Mat	VkImageMat	CPU→Staging→GPU buffer→GPU image
VkMat	Mat	GPU→Staging→CPU (via record_download)
VkImageMat	Mat	GPU image→GPU buffer→Staging→CPU
VkMat	VkMat	GPU buffer copy or packing conversion
VkImageMat	VkImageMat	GPU image copy
VkMat	VkImageMat	Buffer-to-image copy
VkImageMat	VkMat	Image-to-buffer copy

Sources: src/command.cpp588-1038

Pipeline Execution

Pipeline execution involves binding a compute pipeline, setting up descriptors, and dispatching work.

Pipeline Creation (src/pipeline.cpp219-237):

Pipeline::create() accepts either raw SPIR-V or shader type index
Queries PipelineCache for cached pipeline objects
If not cached, creates shader module, layouts, and pipeline
Stores created objects in PipelinePrivate structure

Local Workgroup Size (src/pipeline.cpp65-217):

set_optimal_local_size_xyz(): Automatically calculates optimal workgroup dimensions
set_local_size_xyz(): Sets explicit workgroup size
adjust_xyz(): Ensures total threads are multiple of subgroup size
Respects hardware limits: max_workgroup_size_x/y/z, max_workgroup_invocations
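
How a workgroup size might be fitted to the hardware limits can be sketched as follows. This is an illustrative model only: the struct and function names are hypothetical, the halving strategy is an assumption, and the subgroup-size alignment performed by adjust_xyz() is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical device limits; field names are illustrative only.
struct WorkgroupLimits
{
    int max_size_x;
    int max_size_y;
    int max_size_z;
    int max_invocations;
};

struct LocalSize
{
    int x, y, z;
};

// Clamp the requested extent to the per-axis limits, then halve the largest
// axis until the total thread count fits within max_invocations.
LocalSize choose_local_size(int w, int h, int c, const WorkgroupLimits& lim)
{
    assert(w > 0 && h > 0 && c > 0);
    assert(lim.max_invocations >= 1);

    LocalSize ls;
    ls.x = std::min(w, lim.max_size_x);
    ls.y = std::min(h, lim.max_size_y);
    ls.z = std::min(c, lim.max_size_z);

    while (ls.x * ls.y * ls.z > lim.max_invocations)
    {
        // Find the largest axis; ties resolve to the earlier axis (x, then y).
        int* largest = &ls.x;
        if (ls.y > *largest) largest = &ls.y;
        if (ls.z > *largest) largest = &ls.z;
        *largest = std::max(1, *largest / 2);
    }
    return ls;
}
```

Halving the largest axis first keeps the workgroup roughly balanced across dimensions, which is one plausible heuristic among several.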

Descriptor Binding (src/command.cpp1040-1348):

Push descriptors (preferred): Uses vkCmdPushDescriptorSetWithTemplateKHR for direct binding
Traditional path: Allocates descriptor set from pool and binds with vkCmdBindDescriptorSets
Creates VkDescriptorImageInfo or VkDescriptorBufferInfo structures for each binding
Handles mixed buffer and image bindings

Dispatch (src/command.cpp1350-1433):

Calculates workgroup counts based on dispatcher dimensions and local workgroup size
Uses vkCmdDispatchIndirect if dispatcher is GPU buffer (advanced path)
Otherwise uses vkCmdDispatch with calculated workgroup counts
Accumulates pending_dispatch_total for submit thresholding
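
The workgroup count calculation in the first step is a rounded-up division per axis. The sketch below shows that arithmetic in isolation; the type and function names are hypothetical, and the resulting counts are the kind of values that would be passed as the group counts to vkCmdDispatch.

```cpp
#include <cassert>
#include <cstdint>

struct GroupCount
{
    std::uint32_t x, y, z;
};

// Round up so that a partially filled final workgroup still covers
// the remaining elements.
inline std::uint32_t ceil_div(std::uint32_t n, std::uint32_t d)
{
    assert(d > 0);
    return (n + d - 1) / d;
}

// Hypothetical helper: dispatcher extent (w, h, c) divided by the
// local workgroup size (lx, ly, lz), rounded up on each axis.
GroupCount workgroup_counts(std::uint32_t w, std::uint32_t h, std::uint32_t c,
                            std::uint32_t lx, std::uint32_t ly, std::uint32_t lz)
{
    GroupCount g;
    g.x = ceil_div(w, lx);
    g.y = ceil_div(h, ly);
    g.z = ceil_div(c, lz);
    return g;
}
```

Because the counts are rounded up, some invocations in the last workgroup fall outside the data extent, so a shader dispatched this way needs its own bounds check.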

Sources: src/command.cpp1040-1433 src/pipeline.cpp65-237

Memory Barriers

Memory barriers ensure proper synchronization between GPU operations by controlling when memory becomes visible between pipeline stages.

Barrier Functions (src/command.cpp2187-2296):

Function	Access Pattern	Pipeline Stages	Purpose
`barrier_readwrite(VkMat)`	`SHADER_WRITE` → `SHADER_READ|WRITE`	`COMPUTE` → `COMPUTE`	Ensure compute shader writes visible to subsequent shader reads
`barrier_readwrite(VkImageMat)`	`SHADER_WRITE` → `SHADER_READ|WRITE`	`COMPUTE` → `COMPUTE`	Same for image storage
`barrier_readonly(VkImageMat)`	`current` → `SHADER_READ`	`current` → `COMPUTE`	Prepare image for shader sampling

Automatic Barrier Insertion (src/command.cpp1040-1348):

Before pipeline dispatch, checks each binding's access_flags and stage_flags
Inserts barrier if previous operation was write and current is read/write
Updates Mat/ImageMat access state after each operation
Tracks state in VkBufferMemory::access_flags and VkImageMemory::access_flags
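
The state tracking described above can be modelled without any Vulkan calls. The following sketch is a simplified stand-in: the Access enum and TrackedBuffer struct are hypothetical and replace the real access_flags and stage_flags fields, and the rule implemented is only the one stated in the list (a barrier is needed when the previous access was a write).

```cpp
#include <cassert>

// Simplified access kinds; the real code tracks Vulkan access/stage flags.
enum class Access
{
    None,
    Read,
    Write
};

// Hypothetical stand-in for the per-buffer state kept on the memory object.
struct TrackedBuffer
{
    Access last_access = Access::None;
};

// Returns true when a barrier would have to be recorded before the next
// access, then updates the tracked state to reflect that access.
bool access_and_check_barrier(TrackedBuffer& buf, Access next)
{
    assert(next != Access::None);
    const bool needs_barrier = (buf.last_access == Access::Write);
    buf.last_access = next;
    return needs_barrier;
}
```

In this model two consecutive reads never trigger a barrier, which matches the intent of skipping synchronization when no write is pending.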

Sources: src/command.cpp1040-1348 src/command.cpp2187-2296

Submission and Synchronization

The submit-and-wait mechanism executes recorded commands and synchronizes with the CPU.

Submission Threshold (src/net.cpp250-298):

During the network forward pass, NetPrivate::forward_layer automatically calls submit_and_wait() when cmd.pending_dispatch_total() exceeds a threshold. The threshold scales with the device's rough_score() to prevent driver timeouts on slower hardware while reducing unnecessary synchronizations on high-end GPUs:

`rough_score()`	`pending_dispatch_threshold`
> 75	8 MB
> 50	4 MB
> 15	1 MB
> 10	256 KB
≤ 10	32 KB

pending_dispatch_total is accumulated in VkComputePrivate with each dispatch call. Submission is also triggered immediately, regardless of the accumulated total, when a layer lacks Vulkan support and requires record_download to produce a CPU Mat for the next layer.
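
The table above translates directly into a lookup function. The sketch below is a restatement of that table rather than NCNN's actual code: the function names are hypothetical, and it assumes 1 KB = 1024 bytes and 1 MB = 1024 KB.

```cpp
#include <cassert>
#include <cstddef>

// Map a device's rough_score() to the pending-dispatch threshold in bytes,
// following the table above. Binary units (1 KB = 1024 bytes) are assumed.
std::size_t pending_dispatch_threshold(int rough_score)
{
    const std::size_t KB = 1024;
    const std::size_t MB = 1024 * KB;

    if (rough_score > 75) return 8 * MB;
    if (rough_score > 50) return 4 * MB;
    if (rough_score > 15) return 1 * MB;
    if (rough_score > 10) return 256 * KB;
    return 32 * KB;
}

// Hypothetical check: submit once the accumulated total exceeds the threshold.
bool should_submit(std::size_t pending_dispatch_total, int rough_score)
{
    return pending_dispatch_total > pending_dispatch_threshold(rough_score);
}
```

Note that the comparisons are strict, so a score of exactly 75 falls into the 4 MB row and a score of exactly 10 into the 32 KB row.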

Sources: src/net.cpp250-298 src/command.h71

submit_and_wait() Implementation (src/command.cpp1435-1587):

Execute delayed records: Processes records that couldn't be executed during recording (barriers, descriptor bindings on non-push-descriptor devices)
End command buffer: Calls vkEndCommandBuffer()
Submit to queue: Submits command buffer to compute queue with fence
Wait for fence: Blocks until GPU work completes with vkWaitForFences()
Post-processing: Executes delayed operations:
- Memory copies from staging to destination Mats
- FP16/BF16 to FP32 conversions on CPU
Cleanup: Clears staging buffers and delayed records
Reset: Resets fence and command buffer for reuse
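
The post-processing step relies on work that is scheduled during recording but only run after the fence wait. A minimal model of that pattern is sketched below; DeferredRecorder and its members are hypothetical names, and the GPU submission itself is reduced to a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical recorder that models only the deferred CPU-side work.
class DeferredRecorder
{
public:
    // Schedule a copy from a staging area to its destination. Nothing is
    // copied yet; the copy runs only inside submit_and_wait().
    void record_delayed_copy(const float* staging, float* dst, std::size_t count)
    {
        assert(staging != nullptr && dst != nullptr);
        post_ops_.push_back([=]() {
            std::memcpy(dst, staging, count * sizeof(float));
        });
    }

    void submit_and_wait()
    {
        // A real implementation would end the command buffer, submit it,
        // and block on the fence here before touching staging memory.
        for (auto& op : post_ops_) op();
        post_ops_.clear(); // cleanup so the recorder can be reused
    }

    std::size_t pending() const { return post_ops_.size(); }

private:
    std::vector<std::function<void()> > post_ops_;
};
```

Deferring the copy this way guarantees the CPU only reads staging memory after the GPU has finished writing it.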

reset() Function (src/command.cpp1589-1605):

Called after submit_and_wait() to prepare for new recording
Clears all internal state vectors
Resets pending_dispatch_total to 0
For push descriptor devices, begins a new command buffer immediately

Sources: src/command.cpp1435-1605 src/net.cpp250-298

Benchmarking Support

When NCNN_BENCHMARK is enabled, VkCompute supports timestamp queries for performance measurement.

Query Pool Management (src/command.cpp1607-1652):

create_query_pool(query_count): Creates VkQueryPool with timestamp query type
Each layer gets 2 queries: one at start, one at end of GPU execution
Query pool reset at command buffer begin

Timestamp Recording (src/command.cpp1655-1676):

record_write_timestamp(query): Inserts vkCmdWriteTimestamp command
Recorded at pipeline stage VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
Query index calculated as layer_index * 2 and layer_index * 2 + 1

Result Retrieval (src/command.cpp1678-1695):

get_query_pool_results(): Retrieves timestamp values after submission
Converts timestamps to microseconds using vkdev->info.timestamp_period()
Used in net.cpp:281-292 to print per-layer GPU timing

Integration in NetPrivate::forward_layer (src/net.cpp280-293):

For each Vulkan-supported layer, record_write_timestamp is called with indices layer_index * 2 (before) and layer_index * 2 + 1 (after) the do_forward_layer call. After submit_and_wait(), get_query_pool_results retrieves all accumulated timestamps. Duration in microseconds is computed by multiplying the raw tick delta by vkdev->info.timestamp_period() (nanoseconds per tick) and dividing by 1000. Results are logged per-layer using NCNN_LOGE.
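
The index scheme and the tick-to-microsecond conversion described above amount to a few lines of arithmetic. The helper names in this sketch are hypothetical; only the formulas come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Each layer owns two consecutive query slots: begin and end.
inline std::uint32_t query_begin_index(std::uint32_t layer_index)
{
    return layer_index * 2;
}

inline std::uint32_t query_end_index(std::uint32_t layer_index)
{
    return layer_index * 2 + 1;
}

// Convert a raw tick delta to microseconds. period_ns is the number of
// nanoseconds per tick, as reported by the device's timestamp period.
double ticks_to_microseconds(std::uint64_t start_ticks,
                             std::uint64_t end_ticks,
                             double period_ns)
{
    assert(end_ticks >= start_ticks);
    return static_cast<double>(end_ticks - start_ticks) * period_ns / 1000.0;
}
```

For example, a delta of 2000 ticks on a device with a period of 1 ns per tick corresponds to 2 microseconds.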

Sources: src/command.cpp1607-1695 src/net.cpp280-293

VkTransfer Class

VkTransfer is a specialized command recorder for weight upload operations during model loading.

Differences from VkCompute:

Uses transfer queue instead of compute queue (if available)
Simpler operations: only buffer uploads, no pipeline dispatch
No barriers needed (write-once during loading)
Optimized for bulk data transfer

Sources: src/command.cpp1697-1997

