This page documents the Vulkan-based GPU memory management and data transfer system in NCNN. It covers memory allocators, buffer/image allocation strategies, CPU-GPU data transfer operations, and synchronization mechanisms.
Scope: This page focuses on GPU memory allocation and data movement between CPU and GPU. For information about Vulkan device initialization and queue management, see Vulkan Instance and Device Management. For command recording and pipeline execution, see Command Recording and Pipeline Execution. For CPU memory allocators, see Memory Allocators.
NCNN's Vulkan backend uses a multi-tier memory system designed to optimize performance across different GPU architectures (discrete vs integrated).
Memory Object Types
| Type | Purpose | Backing Structure | Key Fields |
|---|---|---|---|
| `VkBufferMemory` | Linear buffer storage | `VkBuffer` + `VkDeviceMemory` | `buffer`, `offset`, `capacity`, `mapped_ptr`, `access_flags`, `stage_flags` |
| `VkImageMemory` | Image/texture storage | `VkImage` + `VkImageView` | `image`, `imageview`, `format`, `width`, `height`, `depth`, `image_layout` |
Sources: src/allocator.h:212-261, src/mat.h
NCNN provides specialized allocators for different memory usage patterns in GPU inference.
Sources: src/allocator.h:263-397, src/allocator.cpp:349-1675
Purpose: Manages device-local memory for intermediate tensors during inference with block-based pooling.
Architecture:
Key Methods:
- `fastMalloc(size_t size) → VkBufferMemory*` - Allocates a buffer from the pool
- `fastMalloc(int w, int h, int c, size_t elemsize, int elempack) → VkImageMemory*` - Allocates an image from the pool
- `fastFree(VkBufferMemory* ptr)` - Returns a buffer to the pool
- `clear()` - Releases all pooled blocks immediately

Memory Allocation Strategy:

- Reserves device memory in large blocks of `block_size` (16 MB by default)
- Sub-allocations are aligned to the device's `buffer_offset_alignment` and `bind_memory_offset_alignment`
- Freed regions are tracked in budget lists (`buffer_budgets`, `image_memory_budgets`) for recycling

Sources: src/allocator.cpp:607-1007, src/allocator.h:298-319
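The pooling strategy above can be illustrated with a minimal, Vulkan-free sketch. The class and field names (`BlobPoolSketch`, `Region`, `budgets_`) are hypothetical; the point is only the shape of the algorithm: recycle a freed region when one is big enough, otherwise bump-allocate from the newest block, opening a new block when the current one is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// A sub-allocation: which block it lives in, where it starts, how big it is.
struct Region { size_t block_id, offset, capacity; };

// Minimal sketch of block-based pooling (not the real NCNN code).
class BlobPoolSketch {
public:
    explicit BlobPoolSketch(size_t block_size = 16 * 1024 * 1024,
                            size_t alignment = 256)
        : block_size_(block_size), alignment_(alignment) {}

    Region fast_malloc(size_t size) {
        size_t aligned = align_up(size);
        // 1. try to recycle a freed region that is large enough
        for (auto it = budgets_.begin(); it != budgets_.end(); ++it) {
            if (it->capacity >= aligned) {
                Region r = *it;
                budgets_.erase(it);
                return r;
            }
        }
        // 2. carve from the newest block, opening a new one when needed
        if (blocks_.empty() || block_size_ - blocks_.back() < aligned)
            blocks_.push_back(0);
        Region r = { blocks_.size() - 1, blocks_.back(), aligned };
        blocks_.back() += aligned;
        return r;
    }

    // Freeing only returns the region to the budget list; device memory
    // is held until clear() in the real allocator.
    void fast_free(const Region& r) { budgets_.push_back(r); }
    size_t budget_count() const { return budgets_.size(); }

private:
    size_t align_up(size_t v) const {
        return (v + alignment_ - 1) / alignment_ * alignment_;
    }
    size_t block_size_, alignment_;
    std::vector<size_t> blocks_;  // bump offset per block
    std::list<Region> budgets_;   // freed regions available for reuse
};
```

A freed region is reused by a later request that fits, so steady-state inference stops touching `vkAllocateMemory` entirely.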
Purpose: Manages host-visible, device-visible memory for CPU-GPU data transfer with size-based pooling.
Characteristics:
- Allocates `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` memory

Memory Properties Selection:
Size Comparison Logic:
- For a requested size `S`, searches the budget list for a pooled buffer of size `B` where `B >= S` and `B * size_compare_ratio <= S`

Sources: src/allocator.cpp:1289-1562, src/allocator.h:351-376
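The size-matching rule can be written out as a small predicate. This is a sketch of the rule as stated above, not the actual NCNN function: a pooled buffer of capacity `B` may serve a request `S` only if it is large enough and not too wasteful.

```cpp
#include <cassert>
#include <cstddef>

// A pooled buffer of capacity B matches a request S when it is big enough
// (B >= S) and S is at least size_compare_ratio of B, so at most
// (1 - ratio) of the buffer is wasted. Default ratio is 0.75.
bool budget_matches(size_t B, size_t S, float size_compare_ratio = 0.75f) {
    return B >= S && (size_t)(B * size_compare_ratio) <= S;
}
```

With the default ratio, a 1024-byte pooled buffer serves requests between 768 and 1024 bytes; anything smaller forces a fresh allocation rather than wasting more than a quarter of the buffer.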
Purpose: Allocates dedicated memory blocks for model weights with optional host memory preference.
Allocation Strategy:
- Each allocation receives its own dedicated `VkDeviceMemory` block (no sub-allocation)
- A `prefer_host_memory` flag keeps weights in system RAM (resizable BAR optimization)

Host Memory Path (for discrete GPUs with resizable BAR):
Sources: src/allocator.cpp:1009-1287, src/allocator.h:322-348
Data transfer between CPU and GPU is managed by the `VkCompute` class, which records transfer commands into Vulkan command buffers.
Upload Flow Diagram:
Key Steps in `VkCompute::record_upload()`:

1. **CPU-side type conversion (discrete GPU only):** applied when `vkdev->info.type() == 0` and FP16 storage is enabled
2. **Staging buffer allocation:** the buffer comes from `opt.staging_vkallocator`; data is copied with `memcpy()` to `mapped_ptr`, and `allocator->flush()` calls `vkFlushMappedMemoryRanges()`
3. **Memory barrier:** transitions the buffer from `VK_ACCESS_HOST_WRITE_BIT` at `VK_PIPELINE_STAGE_HOST_BIT`
4. **Format conversion and packing:** `vkdev->convert_packing()` converts to the optimal element packing (4-wide vectors); `dst_elempack` is chosen from the element count (pack4 if the count is divisible by 4)
5. **Staging buffer retention:** the staging buffer is kept in the `upload_staging_buffers` vector and released after `submit_and_wait()` completes

Sources: src/command.cpp:358-432, src/command.h:29
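The packing decision in step 4 can be sketched as a one-line rule. This is a simplified illustration of the described behavior (the real code also considers wider packs and shader capabilities); the helper name `pick_dst_elempack` is hypothetical.

```cpp
#include <cassert>

// Simplified packing rule: pack 4 elements per vector lane when the
// element count divides evenly by 4, otherwise stay scalar.
// Real NCNN logic also considers pack8 and device features.
int pick_dst_elempack(int elemcount, bool use_packing_layout) {
    if (!use_packing_layout)
        return 1;                          // packing disabled in Option
    return (elemcount % 4 == 0) ? 4 : 1;   // pack4 when evenly divisible
}
```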
Download Flow Diagram:
Key Steps in `VkCompute::record_download()`:

1. **Packing and type conversion (on GPU):** converts to `dst_elempack` (based on `opt.use_packing_layout`); the staging buffer is allocated from `opt.staging_vkallocator` (or a mappable blob allocator is used)
2. **Memory barrier:** inserted when the buffer's current access flags or pipeline stage do not already permit host reads; the `VkBufferMemoryBarrier` uses `srcAccessMask` = current access and `dstAccessMask` = `VK_ACCESS_HOST_READ_BIT`
3. **Deferred download recording:** the staging buffer is stored in `download_post_buffers`, the `Mat` destination in `download_post_mats_fp16` or `download_post_mats`, and a `TYPE_post_download` operation is queued in `delayed_records`; the actual `memcpy()` happens after `submit_and_wait()`
4. **CPU-side type conversion (discrete GPU only):** a `TYPE_post_cast_float16_to_float32` operation runs on the CPU with `opt.num_threads` threads

Sources: src/command.cpp:434-586, src/command.h:31
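The post-download FP16-to-FP32 cast in step 4 amounts to decoding IEEE 754 half-precision values on the CPU. Below is a scalar sketch of that conversion; NCNN's actual cast layer is vectorized and parallelized over `opt.num_threads`, so treat this only as a reference for the math.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one IEEE 754 binary16 value to float (scalar reference version).
float half_to_float(uint16_t h) {
    uint32_t sign = (h >> 15) & 1;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    float value;
    if (exp == 0) {
        // subnormal or zero: value = man * 2^-24
        value = std::ldexp((float)man, -24);
    } else if (exp == 31) {
        // infinity or NaN
        value = man ? NAN : INFINITY;
    } else {
        // normal: (1.man) * 2^(exp-15) = (man | 0x400) * 2^(exp-25)
        value = std::ldexp((float)(man | 0x400), (int)exp - 25);
    }
    return sign ? -value : value;
}
```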
Buffer Memory Barrier Structure:
```
VkBufferMemoryBarrier {
    srcAccessMask       = current buffer access (e.g. VK_ACCESS_HOST_WRITE_BIT)
    dstAccessMask       = required access (e.g. VK_ACCESS_TRANSFER_READ_BIT)
    srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED
    dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED
    buffer              = VkBuffer handle
    offset              = buffer_offset()
    size                = buffer_capacity()
}
```
Pipeline Stage Synchronization:
- `VK_PIPELINE_STAGE_HOST_BIT` - CPU memory operations
- `VK_PIPELINE_STAGE_TRANSFER_BIT` - `vkCmdCopy*` operations
- `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` - Compute shader execution

Access Flags Tracking:
Each `VkBufferMemory` and `VkImageMemory` tracks:

- `access_flags` - Current access type (read/write)
- `stage_flags` - Pipeline stage that last accessed it
- `image_layout` (images only) - Current image layout

These fields are updated by the command recording functions, enabling automatic barrier insertion.
Sources: src/command.cpp:471-508, src/allocator.h:224-254
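The tracked state makes the barrier decision mechanical. The following sketch (hypothetical helper names, plain integers standing in for Vulkan flag enums) shows the idea: a barrier is recorded whenever the tracked usage differs from the usage the next command requires, and the tracked state is then updated.

```cpp
#include <cassert>
#include <cstdint>

// State tracked per buffer/image, mirroring access_flags / stage_flags.
struct TrackedState {
    uint32_t access_flags;
    uint32_t stage_flags;
};

// A barrier is needed whenever the recorded usage differs from the
// usage required by the next command (sketch, not the NCNN function).
bool needs_barrier(const TrackedState& cur,
                   uint32_t dst_access, uint32_t dst_stage) {
    return cur.access_flags != dst_access || cur.stage_flags != dst_stage;
}

// After recording the barrier, the tracked state takes the new values,
// so redundant barriers for subsequent identical accesses are skipped.
void commit_barrier(TrackedState* cur, uint32_t dst_access, uint32_t dst_stage) {
    cur->access_flags = dst_access;
    cur->stage_flags  = dst_stage;
}
```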
NCNN selects appropriate Vulkan memory types based on usage requirements and GPU architecture.
| Allocator | Required Flags | Preferred Flags | Preferred Not Flags | Typical Memory Type |
|---|---|---|---|---|
| `VkBlobAllocator` (buffer) | `DEVICE_LOCAL_BIT` | - | `HOST_VISIBLE_BIT` | Pure device-local (fastest) |
| `VkBlobAllocator` (image) | `DEVICE_LOCAL_BIT` | - | - | Device-local for images |
| `VkStagingAllocator` | `HOST_VISIBLE_BIT` | `HOST_COHERENT_BIT`, `DEVICE_LOCAL_BIT` | `HOST_CACHED_BIT` | Host-visible + coherent |
| `VkWeightAllocator` (normal) | `DEVICE_LOCAL_BIT` | - | `HOST_VISIBLE_BIT` | Device-local weights |
| `VkWeightAllocator` (host) | `HOST_VISIBLE_BIT`, `DEVICE_LOCAL_BIT` | - | `HOST_CACHED_BIT` | Resizable BAR memory |
Sources: src/gpu.cpp:2717-2797, src/allocator.cpp:664-1383
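The required/preferred/preferred-not columns above describe a filter-then-score selection over the device's memory types. A minimal sketch of that pattern (hypothetical function, plain bitmasks in place of `VkMemoryPropertyFlags`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pick a memory type index: required bits are a hard filter, preferred
// bits add to the score, preferred-not bits subtract (by their absence
// adding). Ties go to the lowest index, like a first-fit scan.
int find_memory_index(const std::vector<uint32_t>& type_flags,
                      uint32_t required,
                      uint32_t preferred,
                      uint32_t preferred_not) {
    int best = -1, best_score = -1;
    for (int i = 0; i < (int)type_flags.size(); i++) {
        uint32_t f = type_flags[i];
        if ((f & required) != required)
            continue;                       // missing a required bit
        int score = 0;
        if (f & preferred) score++;         // has a preferred bit
        if (!(f & preferred_not)) score++;  // avoids an unwanted bit
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

For example, with `DEVICE_LOCAL = 0x1` and `HOST_VISIBLE = 0x2`, a blob-allocator query (`required = DEVICE_LOCAL`, `preferred_not = HOST_VISIBLE`) picks a pure device-local type over a combined one.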
Discrete GPU (`vkdev->info.type() == 0`):

- Uploads go through staging buffers and `vkCmdCopyBuffer` operations
- Staging memory is `HOST_VISIBLE + HOST_COHERENT` but not `DEVICE_LOCAL`

Integrated GPU (`vkdev->info.type() == 1`):

- `DEVICE_LOCAL_BIT` memory is also `HOST_VISIBLE_BIT`

Resizable BAR Detection:
Resizable BAR allows discrete GPUs to expose device memory as directly CPU-accessible, avoiding staging buffers for weight uploads.
Sources: src/gpu.cpp:617-666, src/command.cpp:366-556
For non-coherent memory types, NCNN performs explicit cache management.
Purpose: Makes CPU writes visible to GPU before GPU reads the memory.
Alignment Requirements:
- Flush ranges must be aligned to `VkPhysicalDeviceLimits::nonCoherentAtomSize`
- `round_down()` and `round_up()` ensure proper alignment

Sources: src/allocator.cpp:379-399
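The alignment requirement can be shown concretely. This sketch computes the aligned window that would be handed to `vkFlushMappedMemoryRanges`, assuming (as the Vulkan spec guarantees) that `nonCoherentAtomSize` is a power of two; the helper names mirror the `round_down()`/`round_up()` utilities mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Power-of-two alignment helpers, as used for nonCoherentAtomSize.
size_t round_down(size_t n, size_t atom) { return n & ~(atom - 1); }
size_t round_up(size_t n, size_t atom)   { return (n + atom - 1) & ~(atom - 1); }

// Expand [offset, offset + size) to the smallest atom-aligned window,
// which is what a flush/invalidate call must cover.
void aligned_flush_range(size_t offset, size_t size, size_t atom,
                         size_t* flush_offset, size_t* flush_size) {
    *flush_offset = round_down(offset, atom);
    *flush_size   = round_up(offset + size, atom) - *flush_offset;
}
```

With a 64-byte atom, a write at offset 100 of 50 bytes flushes the aligned window [64, 192): the start rounds down, the end rounds up.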
Purpose: Ensures CPU reads the latest GPU-written data from memory.
Called after GPU writes complete and before CPU reads downloaded data.
Sources: src/allocator.cpp:401-421
Typical Forward Pass Memory Flow:
Memory Lifecycle:
Setup Phase (once per network):
- Create a `VkBlobAllocator` for intermediate tensors
- Create a `VkStagingAllocator` for transfers
- Create a `VkWeightAllocator` for model weights
- Assign them via `Option::blob_vkallocator` and `Option::staging_vkallocator`

Per-Inference:

- Intermediate tensors are allocated from the `VkBlobAllocator` pool
- With `lightmode`, blobs are released after each layer completes
- Staging buffers are released after `submit_and_wait()`

Cleanup:

- Call `VkBlobAllocator::clear()` to release pooled memory

Sources: src/net.cpp:192-356, src/command.cpp:358-586
Lightmode (`Option::lightmode = true`):
Packing Layout (`Option::use_packing_layout = true`):
Staging Allocator Tuning:
- `set_size_compare_ratio(float scr)` - Controls size-matching tolerance (default 0.75)