This document describes the compiler's SIMD (Single Instruction Multiple Data) and vector operations support, covering how Go SIMD intrinsics are compiled through architecture-specific SSA operations to machine instructions. The system enables vectorized operations on packed data (e.g., 4 floats at once) for improved performance.
For general SSA optimization passes, see 3.4. For architecture-specific backend details, see 3.5.1. For register allocation, see 3.5.
The SIMD compilation pipeline transforms high-level vector operations through multiple lowering stages, each represented in SSA form with increasing hardware specificity.
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:1-50, src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-50, src/cmd/compile/internal/amd64/simdssa.go:1-50
Generic SIMD operations are architecture-independent SSA operations representing vector computations. These operations are defined in the SSA opcode space and later lowered to architecture-specific instructions.
| Category | Example Ops | Description |
|---|---|---|
| Arithmetic | AddFloat32x4, MulInt16x8, SubUint8x64 | Element-wise arithmetic on vectors |
| Logical | AndInt32x4, OrUint64x2, XorInt8x16 | Bitwise operations on vectors |
| Comparison | EqualFloat64x2, GreaterThanInt32x8 | Element-wise comparison producing masks |
| Conversion | ConvertToFloat32Int32x4, ExtendLo4ToInt64Int16x8 | Type conversions and extensions |
| Permute | ShuffleInt32x4, PermuteFloat64x4 | Rearranging vector elements |
| Memory | LoadFloat32x4, StoreInt16x8 | Vector loads and stores |
| Special | BroadcastFloat32x4, CompressUint32x16 | Broadcasting and compression |
Sources: src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-100
The compiler supports multiple vector widths corresponding to SSE (128-bit), AVX2 (256-bit), and AVX-512 (512-bit) registers.
Sources: src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-200
Generic SIMD operations are lowered to architecture-specific SSA operations through rewrite rules. Each architecture implements its own lowering strategy based on available instruction sets.
The AMD64 backend lowers generic operations to x86 vector instructions using pattern-matching rules.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-100, src/cmd/compile/internal/ssa/rewriteAMD64.go:1-100
The compiler selects appropriate instruction variants based on vector width and available CPU features:
| Generic Op | 128-bit (SSE) | 256-bit (AVX2) | 512-bit (AVX-512) |
|---|---|---|---|
| AddFloat32xN | VADDPS128 | VADDPS256 | VADDPS512 |
| AddFloat64xN | VADDPD128 | VADDPD256 | VADDPD512 |
| AddInt32xN | VPADDD128 | VPADDD256 | VPADDD512 |
| AddInt16xN | VPADDW128 | VPADDW256 | VPADDW512 |
| AddInt8xN | VPADDB128 | VPADDB256 | VPADDB512 |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:1-100
The intrinsics system maps Go function calls from the simd/archsimd package to generic SSA operations, enabling direct SIMD usage in Go code.
Intrinsics are registered in simdIntrinsics(), which maps package methods to SSA opcodes.
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:14-50
The intrinsics system uses builder functions to generate SSA values from function calls:
| Builder | Arguments | Example |
|---|---|---|
| opLen1 | Single vector arg | Abs, Sqrt, Negate |
| opLen2 | Two vector args | Add, Mul, And |
| opLen3 | Three vector args | FMA, Select |
| opLen1Imm8 | Vector + 8-bit immediate | Shuffle, Permute |
| opLen2_21 | Two args, reversed order | AndNot, Sub (non-commutative) |
Sources: src/cmd/compile/internal/ssagen/simdintrinsics.go:15-100
Architecture-specific SIMD SSA operations are translated to machine instructions during the final code generation phase.
The ssaGenSIMDValue() function generates machine instructions for SIMD operations.
Sources: src/cmd/compile/internal/amd64/simdssa.go:12-100
Different instruction patterns require different emission strategies:
| Pattern | SSA Builder | Register Constraints | Example |
|---|---|---|---|
| v11 | simdV11() | 1 input reg → 1 output reg | VPABSB128 (absolute value) |
| v21 | simdV21() | 2 input regs → 1 output reg | VADDPS128 (add) |
| v31 | simdV31() | 3 input regs → 1 output reg | VFMADD231SS (FMA) |
| w2kw | simdW2kw() | 2 inputs + mask → 1 output | VADDPSMasked128 (masked add) |
| v21load | simdV21load() | 1 reg + memory → 1 output | VADDPS128load (add from memory) |
Sources: src/cmd/compile/internal/amd64/simdssa.go:200-500
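The pattern names encode the register shape of the instruction. A toy dispatcher over those shapes (illustrative only, not the compiler's emitter):

```go
package main

import "fmt"

// regShape maps an emission-pattern name to its (inputs, outputs) register
// counts, mirroring the table above. "v21load" is counted as one register
// input because its second operand comes from memory.
func regShape(pattern string) (in, out int) {
	switch pattern {
	case "v11":
		return 1, 1
	case "v21":
		return 2, 1
	case "v31":
		return 3, 1
	case "v21load":
		return 1, 1 // plus a memory operand
	}
	return 0, 0
}

func main() {
	in, out := regShape("v21")
	fmt.Printf("v21: %d inputs → %d output\n", in, out)
}
```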
SIMD operations use dedicated vector register sets that vary by architecture and instruction set level.
Sources: src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:32-96
The register allocator handles SIMD registers with specific constraints.
Sources: src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:120-150, src/cmd/compile/internal/amd64/ssa.go:46-60
Register-to-register moves depend on source and destination register types:
| Move Type | Instruction | Width Handling |
|---|---|---|
| XMM → XMM | MOVUPS (≤16 bytes) | Prefer 2-byte opcode |
| YMM → YMM | VMOVDQU (≤32 bytes) | Use VEX encoding |
| ZMM → ZMM | VMOVDQU64 (64 bytes) | Use EVEX encoding |
| GP → XMM | MOVQ/MOVL | Width-specific |
| XMM → GP | MOVQ/MOVL | Width-specific |
| K → GP | KMOVQ | 64-bit always |
| GP → K | KMOVQ | 64-bit always |
Sources: src/cmd/compile/internal/amd64/ssa.go:116-163
AVX-512 introduces masked operations where a mask register controls which elements are updated, enabling predicated SIMD execution.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:17-25
The compiler uses special operations to convert between vector types and mask registers:
| Operation | Purpose | Example SSA Op |
|---|---|---|
| VPMOVVec8x16ToM | Vector → Mask | Convert comparison result to mask |
| VPMOVMToVec8x16 | Mask → Vector | Convert mask to vector for further ops |
| VPMOVVec32x4ToM | 32-bit vector → Mask | For Float32x4 comparisons |
| VPMOVVec64x8ToM | 64-bit vector → Mask | For Float64x8 comparisons |
Sources: src/cmd/compile/internal/ssa/opGen.go:1193-1216
AVX-512 provides compress/expand instructions for selectively packing and unpacking vector elements.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:189-218
Broadcasting replicates a single element across all positions in a vector:
| Generic Op | AMD64 Op | Description |
|---|---|---|
| Broadcast1To2Float64x2 | VPBROADCASTQ128 | Replicate element 0 to both positions |
| Broadcast1To4Float32x4 | VBROADCASTSS128 | Replicate element 0 to all 4 positions |
| Broadcast1To8Float32x4 | VBROADCASTSS256 | Replicate element 0 to all 8 positions |
| Broadcast1To16Float32x4 | VBROADCASTSS512 | Replicate element 0 to all 16 positions |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:143-172
SIMD type conversions handle element-wise type changes with potential width changes:
| Generic Op | AMD64 Op | Conversion |
|---|---|---|
| ConvertToFloat32Int32x4 | VCVTDQ2PS128 | int32[4] → float32[4] |
| ConvertToFloat64Int32x4 | VCVTDQ2PD256 | int32[4] → float64[4] (width expands) |
| ConvertToInt32Float32x4 | VCVTTPS2DQ128 | float32[4] → int32[4] (truncate) |
| ConvertToInt64Float64x2 | VCVTTPD2QQ128 | float64[2] → int64[2] |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:252-300
Permutation operations rearrange elements within or across vectors.
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:219-248
The compiler provides intrinsics for AES encryption/decryption operations:
| Generic Op | AMD64 Op | x86 Instruction |
|---|---|---|
| AESEncryptOneRoundUint8x16 | VAESENC128 | VAESENC |
| AESEncryptLastRoundUint8x16 | VAESENCLAST128 | VAESENCLAST |
| AESDecryptOneRoundUint8x16 | VAESDEC128 | VAESDEC |
| AESDecryptLastRoundUint8x16 | VAESDECLAST128 | VAESDECLAST |
| AESInvMixColumnsUint32x4 | VAESIMC128 | VAESIMC |
Sources: src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:3-16
SIMD memory operations load and store vectors with specific alignment and width requirements.
Sources: src/cmd/compile/internal/amd64/ssa.go:62-114
AVX-512 supports masked loads and stores controlled by mask registers:
| Operation | SSA Op | Description |
|---|---|---|
| Masked Load | VMOVDQU8Masked128 | Load with merge mask (preserve on 0) |
| Masked Store | VPMASK32store128 | Store only masked elements |
| Gather | VPGATHERDDMasked | Gather from indexed addresses |
| Scatter | VPSCATTERDDMasked | Scatter to indexed addresses |
Sources: src/cmd/compile/internal/ssa/opGen.go:1496-1507
SIMD operations participate in standard SSA optimization passes with special considerations.
Dead code elimination removes SIMD operations whose results are unused, while operations with side effects (such as stores) are preserved. Common subexpression elimination can deduplicate identical SIMD operations that have no side effects.
Sources: Related to general SSA passes documented in 3.4
SIMD operations with constant inputs can be folded at compile time where feasible, though this is limited by the complexity of vector operations.
Overall Sources: src/cmd/compile/internal/ssa/opGen.go:1-1000, src/cmd/compile/internal/ssa/rewriteAMD64.go:1-100, src/cmd/compile/internal/ssa/_gen/simdAMD64.rules:1-400, src/cmd/compile/internal/ssagen/simdintrinsics.go:1-200, src/cmd/compile/internal/amd64/simdssa.go:1-500, src/cmd/compile/internal/amd64/ssa.go:46-163, src/cmd/compile/internal/ssa/_gen/AMD64Ops.go:32-150, src/cmd/compile/internal/ssa/_gen/simdgenericOps.go:1-200, src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go:1-200