This page documents the GPU-side layer classes located in src/layer/vulkan/. Each class wraps a CPU layer with a Vulkan compute pipeline that operates on VkMat tensors. The page covers the shared lifecycle pattern, packing layout selection, and the key per-layer implementation details including weight packing strategies, Winograd convolution shaders, and cooperative-matrix GEMM.
For the Vulkan pipeline and command recording infrastructure (Pipeline, VkCompute, record_pipeline), see 3.2. For GPU memory management (VkAllocator, VkTransfer), see 3.3. For the CPU-side counterparts of these layers, see 4.
Every Vulkan layer class follows a three-phase lifecycle. The methods are defined on the base Layer class and overridden in each _vulkan subclass.
Lifecycle Phases and Method Calls
Sources: src/layer/vulkan/convolution_vulkan.cpp58-176 src/layer/vulkan/convolutiondepthwise_vulkan.cpp316-348 src/layer/vulkan/innerproduct_vulkan.cpp26-225
| Phase | Method | Key Actions |
|---|---|---|
| Parameter load | load_param | Reads layer params; may set support_vulkan = false for dynamic weights |
| Pipeline setup | create_pipeline | Packs weights on CPU; allocates Pipeline objects with specialization constants |
| Weight upload | upload_model | Transfers packed weights to device-local VkMat via VkTransfer::record_upload |
| Inference | forward(VkMat, ...) | Calls VkCompute::record_pipeline for each dispatch; no CPU/GPU sync here |
| Cleanup | destroy_pipeline | Deletes Pipeline objects and any sub-layers |
The Vulkan layers use a pack4 layout when channel counts are divisible by 4. Each shader invocation processes 4 channels at once using GLSL vec4. Two independent packing factors are computed: one for the input tensor and one for the output tensor.
elempack = (num_input % 4 == 0) ? 4 : 1
out_elempack = (num_output % 4 == 0) ? 4 : 1
This yields four possible shader variants selected at create_pipeline time:
| elempack | out_elempack | Shader suffix example |
|---|---|---|
| 1 | 1 | convolution |
| 4 | 4 | convolution_pack4 |
| 1 | 4 | convolution_pack1to4 |
| 4 | 1 | convolution_pack4to1 |
The correct LayerShaderType enum value is chosen and passed to Pipeline::create. The pipeline object is stored as a class member (e.g., pipeline_convolution_pack4).
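The selection logic above can be sketched as a small helper. This is an illustrative reconstruction, not the actual ncnn code; the real implementation switches on LayerShaderType enum values rather than returning strings.

```cpp
#include <string>

// Hypothetical helper mirroring the packing-variant selection described
// above: two independent packing factors pick one of four shader variants.
std::string pick_convolution_shader(int num_input, int num_output)
{
    const int elempack = (num_input % 4 == 0) ? 4 : 1;
    const int out_elempack = (num_output % 4 == 0) ? 4 : 1;

    if (elempack == 4 && out_elempack == 4) return "convolution_pack4";
    if (elempack == 1 && out_elempack == 4) return "convolution_pack1to4";
    if (elempack == 4 && out_elempack == 1) return "convolution_pack4to1";
    return "convolution";
}
```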
Sources: src/layer/vulkan/convolution_vulkan.cpp97-111 src/layer/vulkan/convolutiondepthwise_vulkan.cpp74-76 src/layer/vulkan/innerproduct_vulkan.cpp33-34
Packing variant selection in create_pipeline
Sources: src/layer/vulkan/convolution_vulkan.cpp97-111 src/layer/vulkan/deconvolution_vulkan.cpp69-70
Class: Convolution_vulkan in src/layer/vulkan/convolution_vulkan.h
Convolution_vulkan inherits from Convolution and selects one of several compute strategies at create_pipeline time, depending on kernel geometry, channel counts, and Option flags.
Sources: src/layer/vulkan/convolution_vulkan.cpp58-180 src/layer/vulkan/convolution_vulkan.h36-50
For 3×3 stride-1 dilation-1 convolutions with at least 16 input and output channels, two Winograd variants are supported:
| Variant | Tile size | Filter domain size | Pipeline members |
|---|---|---|---|
| Winograd F(2,3) | 2×2 output | 4×4 = 16 transforms | pipeline_convolution_3x3s1d1_winograd23_* |
| Winograd F(4,3) | 4×4 output | 6×6 = 36 transforms | pipeline_convolution_3x3s1d1_winograd43_* |
Each variant is a three-stage pipeline:
Kernel weights are transformed on the CPU during create_pipeline and stored in weight_winograd23_data_packed or weight_winograd43_data_packed, then uploaded via upload_model.
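The tile bookkeeping implied by the table above can be sketched as follows for a 3×3 stride-1 dilation-1 convolution. The helper name and struct are illustrative, not the ncnn API; the tile and transform-domain counts follow the standard Winograd F(m,3) formulas.

```cpp
// For Winograd F(m,3): output is tiled into m x m blocks, and each tile
// is processed in an a x a transform domain where a = m + 3 - 1.
struct WinogradTiles { int tiles_x, tiles_y, transforms; };

WinogradTiles winograd_tiles(int outw, int outh, int m /* 2 or 4 */)
{
    WinogradTiles t;
    t.tiles_x = (outw + m - 1) / m;  // ceil-divide output width into tiles
    t.tiles_y = (outh + m - 1) / m;
    const int r = 3;                 // 3x3 kernel
    const int a = m + r - 1;         // transform edge: 4 for F(2,3), 6 for F(4,3)
    t.transforms = a * a;            // 16 or 36, matching the table above
    return t;
}
```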
Sources: src/layer/vulkan/convolution_vulkan.cpp207-533 src/layer/vulkan/convolution_vulkan.h43-50
When vkdev->info.support_cooperative_matrix() returns true and FP16 storage is active, the Winograd GEMM stage uses the convolution_winograd_gemm_cm shader (src/layer/vulkan/shader/convolution_winograd_gemm_cm.comp) instead of the standard GEMM. This shader uses GL_KHR_cooperative_matrix (or GL_NV_cooperative_matrix as fallback).
The parameters coopmat_M, coopmat_N, coopmat_K and the unroll factors UNROLL_SG_M/N/K, UNROLL_WG_M/N are computed at pipeline creation time by calling vkdev->info.get_optimal_cooperative_matrix_mnk(...). These are passed as Vulkan specialization constants.
Weight data is re-packed in a cooperative-matrix-friendly layout:
- inch/pa × outch/pb × 36 (for F(4,3))
- blocks_n × (coopmat_N×coopmat_K×UNROLL_SG_N×UNROLL_WG_N×kk) per frequency position

Sources: src/layer/vulkan/convolution_vulkan.cpp180-210 src/layer/vulkan/convolution_vulkan.cpp258-409 src/layer/vulkan/shader/convolution_winograd_gemm_cm.comp1-30
| Strategy | CPU Mat layout (weight_data_packed) |
|---|---|
| Direct / GEMM | maxk × (num_input/ep) × (num_output/oep) with elempack ep*oep |
| Winograd standard | (num_input/ep) × (num_output/oep) × 36 (or 16) with elempack ep*oep |
| Winograd coopmat | (coopmat_N*coopmat_K*...*kk) × blocks_n × 36 (or 16) with elempack 1 |
Sources: src/layer/vulkan/convolution_vulkan.cpp382-409 src/layer/vulkan/convolution_vulkan.cpp710-740
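The Direct/GEMM row of the table above implies simple index math into the packed weight Mat. The sketch below is an assumed flattened-offset reconstruction for a dense row-major layout, not the actual ncnn addressing code.

```cpp
// Direct/GEMM layout: maxk x (num_input/ep) x (num_output/oep) with
// elempack ep*oep, so each packed "element" holds an ep*oep block of
// scalars. Returns the scalar offset of a packed element.
long packed_weight_offset(int maxk, int num_input,
                          int ep, int oep,
                          int k, int iq, int oq) // kernel pos, in/out channel blocks
{
    const long w = maxk;                  // innermost dim: kernel positions
    const long h = num_input / ep;        // then input channel blocks
    const long elemsize = (long)ep * oep; // scalars per packed element
    return ((long)oq * h * w + (long)iq * w + k) * elemsize;
}
```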
File: src/layer/vulkan/convolutiondepthwise_vulkan.cpp
ConvolutionDepthWise_vulkan handles both true depthwise convolution (channels == group == num_output) and grouped convolution with smaller per-group elempack_g.
The class creates a Padding_vulkan sub-layer for input padding, stored in the member padding.
True depthwise path:
- Weights are repacked with convert_packing(weight_data_r2, weight_data_packed, elempack, opt).
- pipeline_convolutiondepthwise (pack1) or pipeline_convolutiondepthwise_pack4 (pack4) is created.

Group convolution path:
- Per-group packing factors elempack_g and out_elempack_g are computed.
- Shader variants: pipeline_convolutiondepthwise_group, ..._pack4, ..._pack1to4, ..._pack4to1.
- Weights are packed per group into weight_data_packed_groups.

Sources: src/layer/vulkan/convolutiondepthwise_vulkan.cpp39-284 src/layer/vulkan/convolutiondepthwise_vulkan.cpp316-348
File: src/layer/vulkan/deconvolution_vulkan.cpp
Deconvolution_vulkan implements transposed convolution. It creates two Crop_vulkan sub-layers (crop and output_crop) for post-convolution output trimming.
GEMM + col2im path (when opt.use_sgemm_convolution && num_input >= 8 && maxk*num_output >= 8):
- Weights are packed as (num_input/ep) × (maxk*num_output/oep) — the kernel and output dimensions are fused.
- Two pipelines run: pipeline_deconvolution_gemm (one invocation per input spatial position) and pipeline_deconvolution_col2im (scatter results to output).
- When cooperative matrices are available, the GEMM stage uses the deconvolution_gemm_cm shader.

Direct path: Weights are transposed (kernel order reversed) and packed as maxk × (num_input/ep) × (num_output/oep). A single pipeline_deconvolution handles the full computation.
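The output geometry that the col2im scatter targets follows the standard transposed-convolution size formula; the sketch below assumes that formula and is not taken from the ncnn source.

```cpp
// Transposed-convolution output extent before the Crop sub-layers trim
// any padding: each input position scatters a dilated kernel footprint,
// strided apart, and the pads are subtracted at the end.
int deconv_out_size(int in, int kernel, int stride, int dilation,
                    int pad_left, int pad_right, int output_pad)
{
    const int kernel_extent = dilation * (kernel - 1) + 1;
    return (in - 1) * stride + kernel_extent + output_pad - pad_left - pad_right;
}
```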
Sources: src/layer/vulkan/deconvolution_vulkan.cpp48-501 src/layer/vulkan/deconvolution_vulkan.h1-55
File: src/layer/vulkan/innerproduct_vulkan.cpp
InnerProduct_vulkan creates a Flatten_vulkan sub-layer to convert multi-dimensional input into a 1D vector before the weight multiply.
The class chooses among three approaches depending on input shape and num_input:
1. GEMM path (2D input — batch matrix multiply)
Triggered when shape.dims == 2 && shape.w == num_input. Uses pipeline_innerproduct_gemm with shader variants: innerproduct_gemm, innerproduct_gemm_wp4, innerproduct_gemm_wp1to4, innerproduct_gemm_wp4to1.
2. Two-pass sum8 path (large flat input)
Triggered when num_input / in_elempack >= 32. Splits work across two pipelines:
- pipeline_innerproduct_sum8 — partial dot products grouped into blocks of 8
- pipeline_innerproduct_reduce_sum8 — reduces partial sums and applies bias/activation

3. Direct path (small flat input)
A single pipeline_innerproduct computes the full dot product in one dispatch.
The weight packing layout is always (num_input/ep) × (num_output/oep) with elempack ep*oep.
Sources: src/layer/vulkan/innerproduct_vulkan.cpp26-392
File: src/layer/vulkan/pooling_vulkan.cpp
Pooling_vulkan creates a Padding_vulkan sub-layer for pre-pooling padding and selects among three compute strategies:
| Mode | Pipelines Created |
|---|---|
| Standard (non-global, non-adaptive) | pipeline_pooling, pipeline_pooling_pack4 |
| Adaptive | pipeline_pooling_adaptive, pipeline_pooling_adaptive_pack4 |
| Global pooling | pipeline_pooling_global_reduce_first[_pack4], pipeline_pooling_global_reduce[_pack4], pipeline_pooling_global_reduce_last[_pack4] |
Global pooling uses a three-stage tree reduction to avoid serializing all spatial positions through a single shader invocation.
Specialization constants encode pooling_type, kernel_w/h, stride_w/h, pad_mode, and avgpool_count_include_pad.
Sources: src/layer/vulkan/pooling_vulkan.cpp33-420
File: src/layer/vulkan/padding_vulkan.cpp
Padding_vulkan is a utility layer used as a sub-layer by Convolution_vulkan, ConvolutionDepthWise_vulkan, and Pooling_vulkan. It handles the GPU-side border extension before convolution or pooling.
Packing awareness:
- offset_elempack is computed from the amount to be padded on the leading edge.
- When the input elempack does not match the required offset packing (e.g., padding 2 elements into a pack4 tensor), an unpacking pass is inserted using pipeline_padding_pack4to1.

Pipeline members:
| Member | Condition |
|---|---|
| pipeline_padding | pack1 input/output |
| pipeline_padding_pack4 | pack4 input/output |
| pipeline_padding_pack1to4 | pack1 in → pack4 out |
| pipeline_padding_pack4to1 | pack4 in → pack1 out |
| pipeline_padding_3d / pipeline_padding_3d_pack4 | 4D (depth) input |
Sources: src/layer/vulkan/padding_vulkan.cpp10-80
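The offset-packing check described above can be sketched as a predicate. The modulo-4 rule is an assumption inferred from the description (a leading-edge pad that is a multiple of 4 keeps the pack4 layout aligned); the helper names are hypothetical.

```cpp
// Assumed rule: a pack4 tensor can only be padded in-place when the
// leading-edge pad amount keeps 4-element alignment; otherwise the
// pack4to1 unpacking pass is needed first.
int padding_offset_elempack(int leading_pad)
{
    return (leading_pad % 4 == 0) ? 4 : 1;
}

bool needs_unpack(int elempack, int leading_pad)
{
    return elempack == 4 && padding_offset_elempack(leading_pad) == 1;
}
```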
File: src/layer/vulkan/interp_vulkan.cpp
Interp_vulkan implements nearest/bilinear/bicubic resize on the GPU.
- resize_type == 1 or 2 (nearest/bilinear): A single shader interp / interp_pack4 handles both modes, with resize_type and align_corner passed as specialization constants.
- resize_type == 3 (bicubic): Requires three pipelines:
  - pipeline_interp_bicubic_coeffs_x — pre-computes horizontal cubic coefficients
  - pipeline_interp_bicubic_coeffs_y — pre-computes vertical cubic coefficients
  - pipeline_interp_bicubic / pipeline_interp_bicubic_pack4 — applies the coefficients

Sources: src/layer/vulkan/interp_vulkan.cpp24-174
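The per-coordinate coefficient computation that the coeffs_x/coeffs_y passes precompute can be sketched with the Keys cubic kernel. The A = -0.75 choice matches the common OpenCV-style convention and is an assumption here, not verified against the shader source.

```cpp
// Keys cubic interpolation weights for a fractional offset fx in [0, 1),
// covering the four source taps around the destination coordinate.
void cubic_coeffs(float fx, float coeffs[4])
{
    const float A = -0.75f; // assumed kernel parameter (OpenCV-style)
    coeffs[0] = ((A * (fx + 1) - 5 * A) * (fx + 1) + 8 * A) * (fx + 1) - 4 * A;
    coeffs[1] = ((A + 2) * fx - (A + 3)) * fx * fx + 1;
    coeffs[2] = ((A + 2) * (1 - fx) - (A + 3)) * (1 - fx) * (1 - fx) + 1;
    coeffs[3] = 1.f - coeffs[0] - coeffs[1] - coeffs[2]; // weights sum to 1
}
```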
| Layer class | File | Strategy |
|---|---|---|
| InstanceNorm_vulkan | src/layer/vulkan/instancenorm_vulkan.cpp | Multi-pass: reduce→mean→sub_mean_square→coeffs→norm; pack1 and pack4 variants |
| PReLU_vulkan | src/layer/vulkan/prelu_vulkan.cpp | Single pipeline; num_slope and optional single-slope constant as specialization |
| Scale_vulkan | src/layer/vulkan/scale_vulkan.cpp | Single pipeline; pipeline_scale / pipeline_scale_pack4 |
| Normalize_vulkan | src/layer/vulkan/normalize_vulkan.cpp | Multi-pass reduce and normalize; pack1 and pack4 variants |
| LRN_vulkan | src/layer/vulkan/lrn_vulkan.cpp | Two-pass: square-sum then normalize |
| DeconvolutionDepthWise_vulkan | src/layer/vulkan/deconvolutiondepthwise_vulkan.cpp | True depthwise and group variants; creates Crop sub-layers |
Sources: src/layer/vulkan/instancenorm_vulkan.cpp10-30 src/layer/vulkan/prelu_vulkan.cpp10-17 src/layer/vulkan/scale_vulkan.cpp10-17
The diagram below maps layer names to their class names, source files, and the pipeline member names that identify which shaders are loaded.
Key Vulkan Layer Classes and Their Pipeline Members
Sources: src/layer/vulkan/convolution_vulkan.h11-67 src/layer/vulkan/convolutiondepthwise_vulkan.cpp11-25 src/layer/vulkan/innerproduct_vulkan.cpp11-24 src/layer/vulkan/pooling_vulkan.cpp13-31 src/layer/vulkan/padding_vulkan.cpp10-22 src/layer/vulkan/deconvolution_vulkan.h10-55
Every pipeline creation follows the same pattern of encoding both static parameters and optional shape hints as specialization constants:
specializations[0..N-1] = layer parameters (kernel size, stride, bias_term, activation_type, ...)
specializations[N+0..N+4] = input shape (dims, w, h, c, cstep)
specializations[N+5..N+9] = output shape (dims, w, h, c, cstep)
Shape-based specializations are zero when shapes are not known at create_pipeline time (i.e., bottom_shapes is empty). In that case, the shader uses runtime push_constant values provided via vk_constant_type in forward.
This two-level configuration allows the shader compiler to constant-fold dimension checks when shapes are known statically, while remaining functional when shapes are dynamic.
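The layout above can be sketched as a small builder. The struct and function names are hypothetical; ncnn's actual code fills a vk_specialization_type array, but the ordering shown here follows the description.

```cpp
#include <vector>

// Shape hints default to zero, matching the "shapes unknown at
// create_pipeline time" case described above.
struct ShapeHint { int dims = 0, w = 0, h = 0, c = 0, cstep = 0; };

std::vector<int> build_specializations(const std::vector<int>& layer_params,
                                       const ShapeHint& in, const ShapeHint& out)
{
    std::vector<int> s = layer_params;    // [0..N-1]: static layer parameters
    const ShapeHint* hints[2] = { &in, &out };
    for (const ShapeHint* sh : hints)     // [N..N+4] input, [N+5..N+9] output
    {
        s.push_back(sh->dims);
        s.push_back(sh->w);
        s.push_back(sh->h);
        s.push_back(sh->c);
        s.push_back(sh->cstep);
    }
    return s;
}
```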
Sources: src/layer/vulkan/convolution_vulkan.cpp435-452 src/layer/vulkan/pooling_vulkan.cpp120-165 src/layer/vulkan/interp_vulkan.cpp31-44