This page documents the Vulkan-based GPU memory management and data transfer system in NCNN. It covers memory allocators, buffer/image allocation strategies, CPU-GPU data transfer operations, and synchronization mechanisms.
Scope: This page focuses on GPU memory allocation and data movement between CPU and GPU. For information about Vulkan device initialization and queue management, see Vulkan Instance and Device Management. For command recording and pipeline execution, see Command Recording and Pipeline Execution. For CPU memory allocators, see Memory Allocators.
NCNN's Vulkan backend uses a multi-tier memory system designed to optimize performance across different GPU architectures (discrete vs integrated).
Memory Object Types
| Type | Purpose | Backing Structure | Key Fields |
|---|---|---|---|
| `VkBufferMemory` | Linear buffer storage | `VkBuffer` + `VkDeviceMemory` | `buffer`, `offset`, `capacity`, `mapped_ptr`, `access_flags`, `stage_flags` |
| `VkImageMemory` | Image/texture storage | `VkImage` + `VkImageView` | `image`, `imageview`, `format`, `width`, `height`, `depth`, `image_layout` |
Sources: src/allocator.h:212-261, src/mat.h
NCNN provides specialized allocators for different memory usage patterns in GPU inference.
Sources: src/allocator.h:263-397, src/allocator.cpp:349-1675
Purpose: Manages device-local memory for intermediate tensors during inference with block-based pooling.
Architecture:
Key Methods:
- `fastMalloc(size_t size) → VkBufferMemory*` - Allocates a buffer from the pool
- `fastMalloc(int w, int h, int c, size_t elemsize, int elempack) → VkImageMemory*` - Allocates an image from the pool
- `fastFree(VkBufferMemory* ptr)` - Returns a buffer to the pool
- `clear()` - Releases all pooled blocks immediately

Memory Allocation Strategy:

- Reserves device memory in large blocks of `block_size` (16 MB by default)
- Sub-allocations are aligned to the device's `buffer_offset_alignment` and `bind_memory_offset_alignment`
- Freed regions are tracked in budget lists (`buffer_budgets`, `image_memory_budgets`) for recycling

Sources: src/allocator.cpp:607-1007, src/allocator.h:298-319
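The pooling strategy above can be illustrated with a minimal, Vulkan-free sketch. The class and field names (`BlobPoolSketch`, `Region`, `budgets_`) are hypothetical; the point is only the shape of the algorithm: recycle a freed region when one is big enough, otherwise bump-allocate from the newest block, opening a new block when the current one is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// A sub-allocation: which block it lives in, where it starts, how big it is.
struct Region { size_t block_id, offset, capacity; };

// Minimal sketch of block-based pooling (not the real NCNN code).
class BlobPoolSketch {
public:
    explicit BlobPoolSketch(size_t block_size = 16 * 1024 * 1024,
                            size_t alignment = 256)
        : block_size_(block_size), alignment_(alignment) {}

    Region fast_malloc(size_t size) {
        size_t aligned = align_up(size);
        // 1. try to recycle a freed region that is large enough
        for (auto it = budgets_.begin(); it != budgets_.end(); ++it) {
            if (it->capacity >= aligned) {
                Region r = *it;
                budgets_.erase(it);
                return r;
            }
        }
        // 2. carve from the newest block, opening a new one when needed
        if (blocks_.empty() || block_size_ - blocks_.back() < aligned)
            blocks_.push_back(0);
        Region r = { blocks_.size() - 1, blocks_.back(), aligned };
        blocks_.back() += aligned;
        return r;
    }

    // Freeing only returns the region to the budget list; device memory
    // is held until clear() in the real allocator.
    void fast_free(const Region& r) { budgets_.push_back(r); }
    size_t budget_count() const { return budgets_.size(); }

private:
    size_t align_up(size_t v) const {
        return (v + alignment_ - 1) / alignment_ * alignment_;
    }
    size_t block_size_, alignment_;
    std::vector<size_t> blocks_;  // bump offset per block
    std::list<Region> budgets_;   // freed regions available for reuse
};
```

A freed region is reused by a later request that fits, so steady-state inference stops touching `vkAllocateMemory` entirely.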
Purpose: Manages host-visible, device-visible memory for CPU-GPU data transfer with size-based pooling.
Characteristics:
- Allocates `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` memory

Memory Properties Selection:
Size Comparison Logic:
- For a requested size `S`, searches the budget list for a pooled buffer of size `B` where `B >= S` and `B * size_compare_ratio <= S`

Sources: src/allocator.cpp:1289-1562, src/allocator.h:351-376
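The size-matching rule can be written out as a small predicate. This is a sketch of the rule as stated above, not the actual NCNN function: a pooled buffer of capacity `B` may serve a request `S` only if it is large enough and not too wasteful.

```cpp
#include <cassert>
#include <cstddef>

// A pooled buffer of capacity B matches a request S when it is big enough
// (B >= S) and S is at least size_compare_ratio of B, so at most
// (1 - ratio) of the buffer is wasted. Default ratio is 0.75.
bool budget_matches(size_t B, size_t S, float size_compare_ratio = 0.75f) {
    return B >= S && (size_t)(B * size_compare_ratio) <= S;
}
```

With the default ratio, a 1024-byte pooled buffer serves requests between 768 and 1024 bytes; anything smaller forces a fresh allocation rather than wasting more than a quarter of the buffer.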
Purpose: Allocates dedicated memory blocks for model weights with optional host memory preference.
Allocation Strategy:
- Each allocation receives its own dedicated `VkDeviceMemory` block (no sub-allocation)
- A `prefer_host_memory` flag keeps weights in system RAM (resizable BAR optimization)

Host Memory Path (for discrete GPUs with resizable BAR):
Sources: src/allocator.cpp:1009-1287, src/allocator.h:322-348
Data transfer between CPU and GPU is managed by the `VkCompute` class, which records transfer commands into Vulkan command buffers.
Upload Flow Diagram:
Key Steps in `VkCompute::record_upload()`:

1. **CPU-side type conversion (discrete GPU only):** applied when `vkdev->info.type() == 0` and FP16 storage is enabled
2. **Staging buffer allocation:** the buffer comes from `opt.staging_vkallocator`; data is copied with `memcpy()` to `mapped_ptr`, and `allocator->flush()` calls `vkFlushMappedMemoryRanges()`
3. **Memory barrier:** transitions the buffer from `VK_ACCESS_HOST_WRITE_BIT` at `VK_PIPELINE_STAGE_HOST_BIT`
4. **Format conversion and packing:** `vkdev->convert_packing()` converts to the optimal element packing (4-wide vectors); `dst_elempack` is chosen from the element count (pack4 if the count is divisible by 4)
5. **Staging buffer retention:** the staging buffer is kept in the `upload_staging_buffers` vector and released after `submit_and_wait()` completes

Sources: src/command.cpp:358-432, src/command.h:29
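The packing decision in step 4 can be sketched as a one-line rule. This is a simplified illustration of the described behavior (the real code also considers wider packs and shader capabilities); the helper name `pick_dst_elempack` is hypothetical.

```cpp
#include <cassert>

// Simplified packing rule: pack 4 elements per vector lane when the
// element count divides evenly by 4, otherwise stay scalar.
// Real NCNN logic also considers pack8 and device features.
int pick_dst_elempack(int elemcount, bool use_packing_layout) {
    if (!use_packing_layout)
        return 1;                          // packing disabled in Option
    return (elemcount % 4 == 0) ? 4 : 1;   // pack4 when evenly divisible
}
```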
Download Flow Diagram:
Key Steps in `VkCompute::record_download()`:

1. **Packing and type conversion (on GPU):** converts to `dst_elempack` (based on `opt.use_packing_layout`); the staging buffer is allocated from `opt.staging_vkallocator` (or a mappable blob allocator is used)
2. **Memory barrier:** inserted when the buffer's current access flags or pipeline stage do not already permit host reads; the `VkBufferMemoryBarrier` uses `srcAccessMask` = current access and `dstAccessMask` = `VK_ACCESS_HOST_READ_BIT`
3. **Deferred download recording:** the staging buffer is stored in `download_post_buffers`, the `Mat` destination in `download_post_mats_fp16` or `download_post_mats`, and a `TYPE_post_download` operation is queued in `delayed_records`; the actual `memcpy()` happens after `submit_and_wait()`
4. **CPU-side type conversion (discrete GPU only):** a `TYPE_post_cast_float16_to_float32` operation runs on the CPU with `opt.num_threads` threads

Sources: src/command.cpp:434-586, src/command.h:31
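The post-download FP16-to-FP32 cast in step 4 amounts to decoding IEEE 754 half-precision values on the CPU. Below is a scalar sketch of that conversion; NCNN's actual cast layer is vectorized and parallelized over `opt.num_threads`, so treat this only as a reference for the math.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one IEEE 754 binary16 value to float (scalar reference version).
float half_to_float(uint16_t h) {
    uint32_t sign = (h >> 15) & 1;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    float value;
    if (exp == 0) {
        // subnormal or zero: value = man * 2^-24
        value = std::ldexp((float)man, -24);
    } else if (exp == 31) {
        // infinity or NaN
        value = man ? NAN : INFINITY;
    } else {
        // normal: (1.man) * 2^(exp-15) = (man | 0x400) * 2^(exp-25)
        value = std::ldexp((float)(man | 0x400), (int)exp - 25);
    }
    return sign ? -value : value;
}
```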
Buffer Memory Barrier Structure:
```
VkBufferMemoryBarrier {
    srcAccessMask       = current buffer access (e.g. VK_ACCESS_HOST_WRITE_BIT)
    dstAccessMask       = required access (e.g. VK_ACCESS_TRANSFER_READ_BIT)
    srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED
    dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED
    buffer              = VkBuffer handle
    offset              = buffer_offset()
    size                = buffer_capacity()
}
```
Pipeline Stage Synchronization:
- `VK_PIPELINE_STAGE_HOST_BIT` - CPU memory operations
- `VK_PIPELINE_STAGE_TRANSFER_BIT` - `vkCmdCopy*` operations
- `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` - Compute shader execution

Access Flags Tracking:
Each `VkBufferMemory` and `VkImageMemory` tracks:

- `access_flags` - Current access type (read/write)
- `stage_flags` - Pipeline stage that last accessed it
- `image_layout` (images only) - Current image layout

These fields are updated by the command recording functions, enabling automatic barrier insertion.
Sources: src/command.cpp:471-508, src/allocator.h:224-254
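The tracked state makes the barrier decision mechanical. The following sketch (hypothetical helper names, plain integers standing in for Vulkan flag enums) shows the idea: a barrier is recorded whenever the tracked usage differs from the usage the next command requires, and the tracked state is then updated.

```cpp
#include <cassert>
#include <cstdint>

// State tracked per buffer/image, mirroring access_flags / stage_flags.
struct TrackedState {
    uint32_t access_flags;
    uint32_t stage_flags;
};

// A barrier is needed whenever the recorded usage differs from the
// usage required by the next command (sketch, not the NCNN function).
bool needs_barrier(const TrackedState& cur,
                   uint32_t dst_access, uint32_t dst_stage) {
    return cur.access_flags != dst_access || cur.stage_flags != dst_stage;
}

// After recording the barrier, the tracked state takes the new values,
// so redundant barriers for subsequent identical accesses are skipped.
void commit_barrier(TrackedState* cur, uint32_t dst_access, uint32_t dst_stage) {
    cur->access_flags = dst_access;
    cur->stage_flags  = dst_stage;
}
```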
NCNN selects appropriate Vulkan memory types based on usage requirements and GPU architecture.
| Allocator | Required Flags | Preferred Flags | Preferred Not Flags | Typical Memory Type |
|---|---|---|---|---|
| `VkBlobAllocator` (buffer) | `DEVICE_LOCAL_BIT` | - | `HOST_VISIBLE_BIT` | Pure device-local (fastest) |
| `VkBlobAllocator` (image) | `DEVICE_LOCAL_BIT` | - | - | Device-local for images |
| `VkStagingAllocator` | `HOST_VISIBLE_BIT` | `HOST_COHERENT_BIT`, `DEVICE_LOCAL_BIT` | `HOST_CACHED_BIT` | Host-visible + coherent |
| `VkWeightAllocator` (normal) | `DEVICE_LOCAL_BIT` | - | `HOST_VISIBLE_BIT` | Device-local weights |
| `VkWeightAllocator` (host) | `HOST_VISIBLE_BIT`, `DEVICE_LOCAL_BIT` | - | `HOST_CACHED_BIT` | Resizable BAR memory |
Sources: src/gpu.cpp:2717-2797, src/allocator.cpp:664-1383
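The required/preferred/preferred-not columns above describe a filter-then-score selection over the device's memory types. A minimal sketch of that pattern (hypothetical function, plain bitmasks in place of `VkMemoryPropertyFlags`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pick a memory type index: required bits are a hard filter, preferred
// bits add to the score, preferred-not bits subtract (by their absence
// adding). Ties go to the lowest index, like a first-fit scan.
int find_memory_index(const std::vector<uint32_t>& type_flags,
                      uint32_t required,
                      uint32_t preferred,
                      uint32_t preferred_not) {
    int best = -1, best_score = -1;
    for (int i = 0; i < (int)type_flags.size(); i++) {
        uint32_t f = type_flags[i];
        if ((f & required) != required)
            continue;                       // missing a required bit
        int score = 0;
        if (f & preferred) score++;         // has a preferred bit
        if (!(f & preferred_not)) score++;  // avoids an unwanted bit
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

For example, with `DEVICE_LOCAL = 0x1` and `HOST_VISIBLE = 0x2`, a blob-allocator query (`required = DEVICE_LOCAL`, `preferred_not = HOST_VISIBLE`) picks a pure device-local type over a combined one.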
Discrete GPU (`vkdev->info.type() == 0`):

- Uploads go through staging buffers and `vkCmdCopyBuffer` operations
- Staging memory is `HOST_VISIBLE + HOST_COHERENT` but not `DEVICE_LOCAL`

Integrated GPU (`vkdev->info.type() == 1`):

- `DEVICE_LOCAL_BIT` memory is also `HOST_VISIBLE_BIT`

Resizable BAR Detection:
Resizable BAR allows discrete GPUs to expose device memory as directly CPU-accessible, avoiding staging buffers for weight uploads.
Sources: src/gpu.cpp:617-666, src/command.cpp:366-556
For non-coherent memory types, NCNN performs explicit cache management.
Purpose: Makes CPU writes visible to GPU before GPU reads the memory.
Alignment Requirements:
- Flush ranges must be aligned to `VkPhysicalDeviceLimits::nonCoherentAtomSize`
- `round_down()` and `round_up()` ensure proper alignment

Sources: src/allocator.cpp:379-399
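The alignment requirement can be shown concretely. This sketch computes the aligned window that would be handed to `vkFlushMappedMemoryRanges`, assuming (as the Vulkan spec guarantees) that `nonCoherentAtomSize` is a power of two; the helper names mirror the `round_down()`/`round_up()` utilities mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Power-of-two alignment helpers, as used for nonCoherentAtomSize.
size_t round_down(size_t n, size_t atom) { return n & ~(atom - 1); }
size_t round_up(size_t n, size_t atom)   { return (n + atom - 1) & ~(atom - 1); }

// Expand [offset, offset + size) to the smallest atom-aligned window,
// which is what a flush/invalidate call must cover.
void aligned_flush_range(size_t offset, size_t size, size_t atom,
                         size_t* flush_offset, size_t* flush_size) {
    *flush_offset = round_down(offset, atom);
    *flush_size   = round_up(offset + size, atom) - *flush_offset;
}
```

With a 64-byte atom, a write at offset 100 of 50 bytes flushes the aligned window [64, 192): the start rounds down, the end rounds up.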
Purpose: Ensures CPU reads the latest GPU-written data from memory.
Called after GPU writes complete and before CPU reads downloaded data.
Sources: src/allocator.cpp:401-421
Typical Forward Pass Memory Flow:
Memory Lifecycle:
Setup Phase (once per network):
- Create a `VkBlobAllocator` for intermediate tensors
- Create a `VkStagingAllocator` for transfers
- Create a `VkWeightAllocator` for model weights
- Assign them via `Option::blob_vkallocator` and `Option::staging_vkallocator`

Per-Inference:

- Intermediate tensors are allocated from the `VkBlobAllocator` pool
- With `lightmode`, blobs are released after each layer completes
- Staging buffers are released after `submit_and_wait()`

Cleanup:

- Call `VkBlobAllocator::clear()` to release pooled memory

Sources: src/net.cpp:192-356, src/command.cpp:358-586
Lightmode (`Option::lightmode = true`):
Packing Layout (`Option::use_packing_layout = true`):
Staging Allocator Tuning:
- `set_size_compare_ratio(float scr)` - Controls size-matching tolerance (default 0.75)