SIMD
Single Instruction Multiple Data
Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use and support them. With compiled code, you need to write inline assembly or use vector intrinsics.
Vector Intrinsics
Intrinsics appear like regular library functions: you include the relevant header, and then you can use them just like any other library call. For example, assuming the Visual C++ compiler, if you wanted to add four float numbers to another four numbers, you would include the `xmmintrin.h` header, which contains the following declaration for the `_mm_add_ps`
intrinsic:
extern __m128 _mm_add_ps(__m128 _A, __m128 _B);
However, unlike library functions, intrinsics are implemented directly in the compiler. The example SSE intrinsic typically compiles to a single instruction on both x86 and amd64. In the time it takes the CPU to call a library function, it might have completed a dozen of these instructions.
Compiler symbols
For all x86 targets, the current best practice is to include <immintrin.h>,
which pulls in the headers for every instruction set the compiler is targeting.
Symbol | Header | Instruction set |
---|---|---|
`__MMX__` | mmintrin.h | x86 MMX types & intrinsics |
`__SSE__` | xmmintrin.h | x86 SSE1 types & intrinsics |
`__SSE2__` | emmintrin.h | x86 SSE2 types & intrinsics |
`__VEC__` | altivec.h | Freescale Altivec types & intrinsics |
`__ARM_NEON__` | arm_neon.h | ARM Neon types & intrinsics |
Subset of vector operators and intrinsics
Operation | Altivec | Neon | MMX/SSE/SSE2 |
---|---|---|---|
loading vector | vec_ld<br>vec_splat<br>vec_splat_s16<br>vec_splat_s32<br>vec_splat_s8<br>vec_splat_u16<br>vec_splat_u32<br>vec_splat_u8 | vld1q_f32<br>vld1q_s16<br>vsetq_lane_f32<br>vld1_u8<br>vdupq_lane_s16<br>vdupq_n_s16<br>vmovq_n_f32<br>vset_lane_u8 | _mm_set_epi16<br>_mm_set1_epi16<br>_mm_set1_pi16<br>_mm_set_pi16<br>_mm_load_ps<br>_mm_set1_ps<br>_mm_loadh_pi<br>_mm_loadl_pi |
storing vector | vec_st | vst1_u8<br>vst1q_s16<br>vst1q_f32<br>vst1_s16 | _mm_store_ps |
add | vec_madd<br>vec_mladd<br>vec_adds | vaddq_s16<br>vaddq_f32<br>vmlaq_n_f32 | _mm_add_epi16<br>_mm_add_pi16<br>_mm_add_ps |
subtract | vec_sub | vsubq_s16 | |
multiply | vec_madd<br>vec_mladd | vmulq_n_s16<br>vmulq_s16<br>vmulq_f32<br>vmlaq_n_f32 | _mm_mullo_epi16<br>_mm_mullo_pi16<br>_mm_mul_ps |
arithmetic shift | vec_sra<br>vec_srl<br>vec_sr | vshrq_n_s16 | _mm_srai_epi16<br>_mm_srai_pi16 |
byte permutation | vec_perm<br>vec_sel<br>vec_mergeh<br>vec_mergel | vtbl1_u8<br>vtbx1_u8<br>vget_high_s16<br>vget_low_s16<br>vdupq_lane_s16<br>vdupq_n_s16<br>vmovq_n_f32<br>vbsl_u8 | _mm_shuffle_pi16<br>_mm_shuffle_ps |
type conversion | vec_cts<br>vec_ctu<br>vec_unpackh<br>vec_unpackl | vmovl_u8<br>vreinterpretq_s16_u16<br>vcvtq_u32_f32<br>vqmovn_s32<br>vqmovun_s16<br>vqmovn_u16<br>vcvtq_f32_s32<br>vmovl_s16<br>vmovq_n_f32 | _mm_packs_pu16<br>_mm_cvtps_pi16<br>_mm_packus_epi16 |
vector combination | vec_pack<br>vec_packsu | vcombine_u16<br>vcombine_u8<br>vcombine_s16 | |
maximum | | | _mm_max_ps |
minimum | | | _mm_min_ps |
vector logic | | | _mm_andnot_ps<br>_mm_and_ps<br>_mm_or_ps |
rounding | vec_trunc | | |
misc | | | _mm_empty |
References
- Utilizing the other 80% of your system’s performance: Starting with Vectorization
- Improving perf with simd intrinsics in three use cases
- Basics of SIMD programming
- SIMD Keyword OpenMP, R, Parallelism
- SIMD a practical guide
- Organizing SIMD code - stackoverflow
- SSE SSE2 and SSE3 for GNU C++ [closed] - stackoverflow
- SIMD, memory locality, vectorization, and branch point prediction
- Code in ARM assembly: Rounding and arithmetic (scalar)
- Comparing SIMD on x86-64 and arm64