This document describes the tensor data structures used in ncnn for storing and manipulating multidimensional arrays during neural network inference. The primary data structures are Mat (CPU tensors), VkMat (GPU buffer tensors), and VkImageMat (GPU image tensors).
For information about memory allocators used with these structures, see page 2.4. For details on network loading and blob management during inference, see page 2.1. For GPU compute and how VkMat is used in pipeline dispatch, see page 3.
ncnn provides three main tensor representations optimized for different execution contexts:
| Class | Purpose | Memory Location | Use Case |
|---|---|---|---|
| Mat | CPU tensor | System RAM | CPU layer execution, data preparation |
| VkMat | GPU buffer tensor | GPU device memory | GPU computation via Vulkan compute shaders |
| VkImageMat | GPU image tensor | GPU image memory | GPU computation with image samplers |
Sources: src/mat.h50-336 src/mat.h341-462 src/mat.h464-589
Mat supports four dimensional ranks:
Diagram: Mat Dimensional Structure
The dimension fields determine the tensor shape:
- dims: Dimension rank (1, 2, 3, or 4)
- w: Width (x-axis)
- h: Height (y-axis)
- d: Depth (z-axis, for 4D tensors)
- c: Channels (number of feature channels)

Sources: src/mat.h327-334
Mat uses a Channel-Height-Width (CHW) memory layout where channels are stored sequentially, with each channel containing a contiguous HW plane.
Diagram: Mat Memory Layout with Channel Stride
The cstep field specifies the stride in elements between consecutive channels, accounting for 16-byte alignment:
cstep = alignSize(w × h × d × elemsize, 16) / elemsize
Element Packing: Mat supports packing multiple scalar values into one memory slot for SIMD operations. When elempack > 1, the outermost dimension (c for 3D/4D tensors, h for 2D, w for 1D) is divided by elempack, and elemsize is multiplied by elempack. Each slot then holds elempack consecutive scalar values, matching a SIMD register width.
The elempack comment in src/mat.h documents the layout conventions:
c/1-d-h-w-1 → scalar (elempack=1)
c/4-d-h-w-4 → SSE/NEON pack4 (elempack=4, elemsize=4×scalar_bytes)
c/8-d-h-w-8 → AVX/FP16 pack8 (elempack=8, elemsize=8×scalar_bytes)
| elempack | Architecture | SIMD width (float32) |
|---|---|---|
| 1 | Scalar | 32-bit |
| 4 | SSE/NEON | 128-bit |
| 8 | AVX | 256-bit |
| 8 | ARM FP16 | 128-bit (8×16-bit) |
| 16 | AVX-512 | 512-bit |
Because elemsize encodes the packing (elemsize = scalar_bytes × elempack), the total allocated bytes per channel plane is simply cstep × elemsize, where cstep already accounts for alignment.
Sources: src/mat.h311-336 src/mat.cpp222-254
Diagram: Mat Class Structure
Sources: src/mat.h304-336
Mat objects are created through various constructors:
Diagram: Mat Object Lifecycle
Reference Counting: Mat uses automatic reference counting for memory management. Multiple Mat objects can share the same underlying data:
- refcount: reference counter shared by all Mat objects viewing the same data
- release(): decrements refcount and frees memory when it reaches zero
- clone(): creates a deep copy with independent memory

Sources: src/mat.h52-92 src/mat.cpp19-53 src/mat.cpp222-255
Mat provides multiple access patterns:
Diagram: Mat Data Access Patterns
Key access methods:
- channel(int c): Returns a Mat view of a single channel
- depth(int z): Returns a Mat view of a single depth slice (for 4D)
- row(int y): Returns a pointer to a row
- operator[](size_t i): Direct element access for 1D vectors
- channel_range(), depth_range(), row_range(), range(): Return views of subsets

All range methods return Mat objects that share the underlying data (shallow views).
Sources: src/mat.h187-206
Mat provides extensive support for image pixel data conversion:
Diagram: Mat Pixel Format Conversion Pipeline
Pixel Type Enumeration: The PixelType enum in Mat defines conversion operations:
PIXEL_RGB, PIXEL_BGR, PIXEL_GRAY, PIXEL_RGBA, PIXEL_BGRA
PIXEL_RGB2BGR, PIXEL_RGB2GRAY, PIXEL_BGR2RGB, etc.
Format encoding: type = source_format | (dest_format << 16)
Key pixel methods:
- from_pixels(): Convert packed pixel data to CHW float format
- from_pixels_resize(): Convert and resize in one operation
- from_pixels_roi(): Extract region of interest during conversion
- to_pixels(): Convert CHW float format back to packed pixels
- substract_mean_normalize(): Apply mean subtraction and normalization

Sources: src/mat.h218-299 src/mat_pixel.cpp16-127 src/mat_pixel_resize.cpp190-374
Diagram: Mat Reshape and Copy Operations
The reshape() method attempts to return a view with different dimensions when possible, but may allocate new memory if channel stride alignment requirements differ between the source and target shapes.
Sources: src/mat.h140-147 src/mat.cpp60-220
VkMat represents tensor data in GPU device memory as Vulkan buffer objects, suitable for compute shader operations.
Diagram: VkMat Class Structure and Relationships
| Aspect | Mat | VkMat |
|---|---|---|
| Memory | System RAM (void*) | GPU device memory (VkBufferMemory*) |
| Allocator | Allocator* | VkAllocator* |
| Access | Direct CPU pointer | Via mapped() or buffer handles |
| Refcount location | After data | In VkBufferMemory struct |
| Creation | Allocates CPU memory | Allocates VkBuffer and VkDeviceMemory |
Diagram: VkMat CPU-GPU Data Transfer Flow
Data transfer methods:
- mapped(): Returns a CPU-accessible Mat view of the buffer (requires host-visible memory)
- mapped_ptr(): Returns the raw CPU-mapped pointer
- VkStagingAllocator: Allocates temporary host-visible staging buffers for transfers
- VkCompute: Records the upload/download transfer commands into command buffers

Sources: src/mat.h341-462 src/mat.cpp544-806
VkImageMat stores tensor data as Vulkan image objects, optimized for texture sampling and image operations.
Diagram: VkImageMat Class Structure
| Property | Details |
|---|---|
| Memory Layout | 2D/3D image layout (GPU-optimized tiling) |
| Element Access | Via image samplers in shaders |
| Channel Packing | Stored in image RGBA components |
| Dimensions | 4D tensors stored as 3D images (h×d as height) |
The image format is determined by elemsize and elempack:
- elempack=1, elemsize=4: R32F (single channel float32)
- elempack=4, elemsize=16: RGBA32F (4-channel float32)
- elempack=1, elemsize=2: R16F (half precision)
- elempack=4, elemsize=8: RGBA16F (4-channel half)

For 4D tensors, the depth dimension is folded into the height: image_height = h * d
Sources: src/mat.h464-589 src/mat.cpp847-1118
Mat, VkMat, and VkImageMat support multiple numeric precisions. The elemsize field holds the total bytes per packed slot — it scales with elempack.
Diagram: Mat elemsize Values by Precision and Packing
Sources: src/mat.h311-322
The elembits() method returns the scalar element precision in bits, independent of packing. Because elemsize already encodes the packing factor (elemsize = scalar_bytes × elempack), dividing by elempack recovers the per-scalar width:
elembits() = (elemsize × 8) / elempack
elembits() is used in NetPrivate::convert_layout (see src/net.cpp) to determine which type-conversion path to apply — for example, elembits() == 32 triggers an fp16/bf16 cast, and elembits() == 16 triggers a cast back to fp32. It always returns 32, 16, or 8 regardless of packing.
| Precision | elempack | elemsize | elembits() |
|---|---|---|---|
| float32, scalar | 1 | 4 | 32 |
| float32, pack4 (SSE/NEON) | 4 | 16 | 32 |
| float32, pack8 (AVX) | 8 | 32 | 32 |
| float16/bf16, scalar | 1 | 2 | 16 |
| float16/bf16, pack8 | 8 | 16 | 16 |
| int8, scalar | 1 | 1 | 8 |
| int8, pack8 | 8 | 8 | 8 |
Sources: src/mat.h180-181 src/mat.h311-322
ncnn provides element-wise and tensor-level functions for converting between numeric formats. These are invoked automatically by NetPrivate::convert_layout when a blob's elembits() does not match what the next layer requires.
Diagram: Type Conversion Functions (mat.h scalar helpers and tensor-level casts)
| Conversion | Notes |
|---|---|
| float32_to_float16() | Full IEEE 754 half-precision |
| float32_to_bfloat16() | Truncates mantissa; keeps exponent range |
| float8 (E4M3) | 4-bit exponent, 3-bit mantissa |
| bfloat8 (E5M2) | 5-bit exponent, 2-bit mantissa |
Tensor-level cast_* functions allocate an output Mat and process all elements, with SIMD optimizations per platform.
Sources: src/mat.h726-771 src/mat.h790-796 src/net.cpp358-549
All three tensor classes integrate with ncnn's allocator system:
Diagram: Tensor Memory Management Integration
Key behaviors:
- create() methods check whether the existing allocation already matches the requested shape, element size, and packing before reallocating
- When the allocator is null, Mat falls back to default system allocation via fastMalloc() (Mat only; VkMat and VkImageMat always require a VkAllocator)

Sources: src/mat.h324-325 src/mat.cpp222-500 src/mat.cpp544-806 src/mat.cpp847-1079
| Feature | Mat | VkMat | VkImageMat |
|---|---|---|---|
| Memory | System RAM | GPU buffer memory | GPU image memory |
| Layout | Linear CHW | Linear CHW | Tiled 2D/3D |
| CPU Access | Direct pointer | Via mapped() | Via mapped() |
| GPU Access | N/A | Storage buffer | Sampled image |
| Allocator Type | Allocator* | VkAllocator* | VkAllocator* |
| Primary Use | CPU layers, I/O | GPU compute | GPU compute (texture ops) |
| Channel Packing | elempack channels per element | elempack channels per element | elempack in RGBA components |
| Precision Support | FP32, FP16, INT8, BF16 | FP32, FP16, INT8, BF16 | FP32, FP16 |
Sources: src/mat.h50-589