This document describes the compiler's SIMD (Single Instruction Multiple Data) and vector operations support, covering how Go SIMD intrinsics are compiled through architecture-specific SSA operations to machine instructions. The system enables vectorized operations on packed data (e.g., 4 floats at once) for improved performance.
For general SSA optimization passes, see 3.4. For architecture-specific backend details, see 3.5.1. For register allocation, see 3.5.
The SIMD compilation pipeline transforms high-level vector operations through multiple lowering stages, each represented in SSA form with increasing hardware specificity.
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:1-50, src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-50, src/cmd/compile/internal/amd64/simdssa.go:1-50
Generic SIMD operations are architecture-independent SSA operations representing vector computations. These operations are defined in the SSA opcode space and later lowered to architecture-specific instructions.
| Category | Example Ops | Description |
|---|---|---|
| Arithmetic | AddFloat32x4, MulInt16x8, SubUint8x64 | Element-wise arithmetic on vectors |
| Logical | AndInt32x4, OrUint64x2, XorInt8x16 | Bitwise operations on vectors |
| Comparison | EqualFloat64x2, GreaterThanInt32x8 | Element-wise comparison producing masks |
| Conversion | ConvertToFloat32Int32x4, ExtendLo4ToInt64Int16x8 | Type conversions and extensions |
| Permute | ShuffleInt32x4, PermuteFloat64x4 | Rearranging vector elements |
| Memory | LoadFloat32x4, StoreInt16x8 | Vector loads and stores |
| Special | BroadcastFloat32x4, CompressUint32x16 | Broadcasting and compression |
Sources: src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-100
The compiler supports multiple vector widths corresponding to SSE (128-bit), AVX2 (256-bit), and AVX-512 (512-bit) registers.
Sources: src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-200
Generic SIMD operations are lowered to architecture-specific SSA operations through rewrite rules. Each architecture implements its own lowering strategy based on available instruction sets.
The AMD64 backend lowers generic operations to x86 vector instructions using pattern-matching rules.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-100, src/cmd/compile/internal/ssa/rewriteAMD64.go:1-100
The compiler selects appropriate instruction variants based on vector width and available CPU features:
| Generic Op | 128-bit (SSE) | 256-bit (AVX2) | 512-bit (AVX-512) |
|---|---|---|---|
| AddFloat32xN | VADDPS128 | VADDPS256 | VADDPS512 |
| AddFloat64xN | VADDPD128 | VADDPD256 | VADDPD512 |
| AddInt32xN | VPADDD128 | VPADDD256 | VPADDD512 |
| AddInt16xN | VPADDW128 | VPADDW256 | VPADDW512 |
| AddInt8xN | VPADDB128 | VPADDB256 | VPADDB512 |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:1-100
The intrinsics system maps Go function calls from the simd/archsimd package to generic SSA operations, enabling direct SIMD usage in Go code.
Intrinsics are registered in simdIntrinsics(), which maps package methods to SSA opcodes.
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:14-50
The intrinsics system uses builder functions to generate SSA values from function calls:
| Builder | Arguments | Example |
|---|---|---|
| opLen1 | Single vector arg | Abs, Sqrt, Negate |
| opLen2 | Two vector args | Add, Mul, And |
| opLen3 | Three vector args | FMA, Select |
| opLen1Imm8 | Vector + 8-bit immediate | Shuffle, Permute |
| opLen2_21 | Two args, reversed order | AndNot, Sub (non-commutative) |
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:15-100
Architecture-specific SIMD SSA operations are translated to machine instructions during the final code generation phase.
The ssaGenSIMDValue() function generates machine instructions for SIMD operations.
Sources: src/cmd/compile/internal/amd64/simdssa.go:12-100
Different instruction patterns require different emission strategies:
| Pattern | SSA Builder | Register Constraints | Example |
|---|---|---|---|
| v11 | simdV11() | 1 input reg → 1 output reg | VPABSB128 (absolute value) |
| v21 | simdV21() | 2 input regs → 1 output reg | VADDPS128 (add) |
| v31 | simdV31() | 3 input regs → 1 output reg | VFMADD231SS (FMA) |
| w2kw | simdW2kw() | 2 inputs + mask → 1 output | VADDPSMasked128 (masked add) |
| v21load | simdV21load() | 1 reg + memory → 1 output | VADDPS128load (add from memory) |
Sources: src/cmd/compile/internal/amd64/simdssa.go:200-500
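The pattern names encode the register shape of the instruction. A toy dispatcher over those shapes (illustrative only, not the compiler's emitter):

```go
package main

import "fmt"

// regShape maps an emission-pattern name to its (inputs, outputs) register
// counts, mirroring the table above. "v21load" is counted as one register
// input because its second operand comes from memory.
func regShape(pattern string) (in, out int) {
	switch pattern {
	case "v11":
		return 1, 1
	case "v21":
		return 2, 1
	case "v31":
		return 3, 1
	case "v21load":
		return 1, 1 // plus a memory operand
	}
	return 0, 0
}

func main() {
	in, out := regShape("v21")
	fmt.Printf("v21: %d inputs → %d output\n", in, out)
}
```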
SIMD operations use dedicated vector register sets that vary by architecture and instruction set level.
Sources: src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:32-96
The register allocator handles SIMD registers with specific constraints.
Sources: src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:120-150, src/cmd/compile/internal/amd64/ssa.go:46-60
Register-to-register moves depend on source and destination register types:
| Move Type | Instruction | Width Handling |
|---|---|---|
| XMM → XMM | MOVUPS (≤16 bytes) | Prefer 2-byte opcode |
| YMM → YMM | VMOVDQU (≤32 bytes) | Use VEX encoding |
| ZMM → ZMM | VMOVDQU64 (64 bytes) | Use EVEX encoding |
| GP → XMM | MOVQ/MOVL | Width-specific |
| XMM → GP | MOVQ/MOVL | Width-specific |
| K → GP | KMOVQ | 64-bit always |
| GP → K | KMOVQ | 64-bit always |
Sources: src/cmd/compile/internal/amd64/ssa.go:116-163
AVX-512 introduces masked operations where a mask register controls which elements are updated, enabling predicated SIMD execution.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:17-25
The compiler uses special operations to convert between vector types and mask registers:
| Operation | Purpose | Example SSA Op |
|---|---|---|
| VPMOVVec8x16ToM | Vector → Mask | Convert comparison result to mask |
| VPMOVMToVec8x16 | Mask → Vector | Convert mask to vector for further ops |
| VPMOVVec32x4ToM | 32-bit vector → Mask | For Float32x4 comparisons |
| VPMOVVec64x8ToM | 64-bit vector → Mask | For Float64x8 comparisons |
Sources: src/cmd/compile/internal/ssa/opGen.go:1193-1216
AVX-512 provides compress/expand instructions for selectively packing and unpacking vector elements.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:189-218
Broadcasting replicates a single element across all positions in a vector:
| Generic Op | AMD64 Op | Description |
|---|---|---|
| Broadcast1To2Float64x2 | VPBROADCASTQ128 | Replicate element 0 to both positions |
| Broadcast1To4Float32x4 | VBROADCASTSS128 | Replicate element 0 to all 4 positions |
| Broadcast1To8Float32x4 | VBROADCASTSS256 | Replicate element 0 to all 8 positions |
| Broadcast1To16Float32x4 | VBROADCASTSS512 | Replicate element 0 to all 16 positions |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:143-172
SIMD type conversions handle element-wise type changes with potential width changes:
| Generic Op | AMD64 Op | Conversion |
|---|---|---|
| ConvertToFloat32Int32x4 | VCVTDQ2PS128 | int32[4] → float32[4] |
| ConvertToFloat64Int32x4 | VCVTDQ2PD256 | int32[4] → float64[4] (width expands) |
| ConvertToInt32Float32x4 | VCVTTPS2DQ128 | float32[4] → int32[4] (truncate) |
| ConvertToInt64Float64x2 | VCVTTPD2QQ128 | float64[2] → int64[2] |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:252-300
Permutation operations rearrange elements within or across vectors.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:219-248
The compiler provides intrinsics for AES encryption/decryption operations:
| Generic Op | AMD64 Op | x86 Instruction |
|---|---|---|
| AESEncryptOneRoundUint8x16 | VAESENC128 | VAESENC |
| AESEncryptLastRoundUint8x16 | VAESENCLAST128 | VAESENCLAST |
| AESDecryptOneRoundUint8x16 | VAESDEC128 | VAESDEC |
| AESDecryptLastRoundUint8x16 | VAESDECLAST128 | VAESDECLAST |
| AESInvMixColumnsUint32x4 | VAESIMC128 | VAESIMC |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:3-16
SIMD memory operations load and store vectors with specific alignment and width requirements.
Sources: src/cmd/compile/internal/amd64/ssa.go:62-114
AVX-512 supports masked loads and stores controlled by mask registers:
| Operation | SSA Op | Description |
|---|---|---|
| Masked Load | VMOVDQU8Masked128 | Load with merge mask (preserve on 0) |
| Masked Store | VPMASK32store128 | Store only masked elements |
| Gather | VPGATHERDDMasked | Gather from indexed addresses |
| Scatter | VPSCATTERDDMasked | Scatter to indexed addresses |
Sources: src/cmd/compile/internal/ssa/opGen.go:1496-1507
SIMD operations participate in standard SSA optimization passes with special considerations.
Dead code elimination removes SIMD operations whose results are unused, while operations with side effects (such as stores) are preserved. Common subexpression elimination can deduplicate identical SIMD operations that have no side effects.
Sources: Related to general SSA passes documented in 3.4
SIMD operations with constant inputs can be folded at compile time where feasible, though this is limited by the complexity of vector operations.
Overall Sources: src/cmd/compile/internal/ssa/opGen.go:1-1000, src/cmd/compile/internal/ssa/rewriteAMD64.go:1-100, src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-400, src/cmd/compile/internal/ssagen/simdintrinsics.go:1-200, src/cmd/compile/internal/amd64/simdssa.go:1-500, src/cmd/compile/internal/amd64/ssa.go:46-163, src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:32-150, src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-200, src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:1-200