SIMD

Single Instruction, Multiple Data

Vector instruction sets are extensions to the original architecture's instruction set, so you have to invest extra effort to use and support them. In compiled code, you either write inline assembly or use vector intrinsics.

Vector Intrinsics

Intrinsics appear like regular library functions. You include the relevant header, and then you can call them like any other library routine. For example, with the Visual C++ compiler, if you wanted to add four floats to another four floats, you would include the `xmmintrin.h` header, which contains the following declaration for the `_mm_add_ps` intrinsic:

extern __m128 _mm_add_ps(__m128 _A, __m128 _B);

However, unlike library functions, intrinsics are implemented directly in the compiler. The example SSE intrinsic above typically compiles to a single instruction on both x86 and amd64. In the time it takes the CPU to call a library function, it may already have completed a dozen of these instructions.
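
For a sense of what this looks like end to end, here is a minimal sketch (assuming an SSE-capable x86 compiler such as MSVC, GCC, or Clang) that adds four floats to four floats with `_mm_add_ps`:

#include <stdio.h>
#include <xmmintrin.h>  /* SSE types and intrinsics, incl. _mm_add_ps */

int main(void)
{
    /* _mm_set_ps takes its arguments in high-to-low lane order. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    /* Four float additions in one instruction (ADDPS). */
    __m128 sum = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, sum);  /* unaligned store back to plain memory */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}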

Compiler symbols

For all of x86, the current best practice is to include <immintrin.h>, which pulls in the headers for every compatible instruction set. The predefined compiler symbols below tell you which sets are actually available (see the dispatch sketch after this list):

  • __MMX__ – X86 MMX
    • mmintrin.h - X86 MMX
  • __SSE__ – X86 SSE
    • xmmintrin.h - X86 SSE1
  • __SSE2__ – X86 SSE2
    • emmintrin.h - X86 SSE2
  • __VEC__ – AltiVec functions
    • altivec.h - Freescale AltiVec types & intrinsics
  • __ARM_NEON__ – NEON functions
    • arm_neon.h - ARM NEON types & intrinsics
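
As a sketch of how these symbols are used in practice (HAVE_SIMD is a made-up macro for illustration), you can branch on them at preprocessing time and fall back to scalar code when no vector set is available:

#if defined(__SSE2__) || defined(_M_X64)   /* MSVC defines _M_X64, not __SSE2__ */
  #include <immintrin.h>                   /* umbrella header on x86 */
  #define HAVE_SIMD 1
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
  #include <arm_neon.h>
  #define HAVE_SIMD 1
#elif defined(__VEC__)
  #include <altivec.h>
  #define HAVE_SIMD 1
#else
  #define HAVE_SIMD 0                      /* scalar fallback */
#endif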

Subset of vector operators and intrinsics

  • loading vector
    • Altivec: vec_ld, vec_splat, vec_splat_s16, vec_splat_s32, vec_splat_s8, vec_splat_u16, vec_splat_u32, vec_splat_u8
    • Neon: vld1q_f32, vld1q_s16, vsetq_lane_f32, vld1_u8, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vset_lane_u8
    • MMX/SSE/SSE2: _mm_set_epi16, _mm_set1_epi16, _mm_set1_pi16, _mm_set_pi16, _mm_load_ps, _mm_set1_ps, _mm_loadh_pi, _mm_loadl_pi
  • storing vector
    • Altivec: vec_st
    • Neon: vst1_u8, vst1q_s16, vst1q_f32, vst1_s16
    • MMX/SSE/SSE2: _mm_store_ps
  • add
    • Altivec: vec_madd, vec_mladd, vec_adds
    • Neon: vaddq_s16, vaddq_f32, vmlaq_n_f32
    • MMX/SSE/SSE2: _mm_add_epi16, _mm_add_pi16, _mm_add_ps
  • subtract
    • Altivec: vec_sub
    • Neon: vsubq_s16
  • multiply
    • Altivec: vec_madd, vec_mladd
    • Neon: vmulq_n_s16, vmulq_s16, vmulq_f32, vmlaq_n_f32
    • MMX/SSE/SSE2: _mm_mullo_epi16, _mm_mullo_pi16, _mm_mul_ps
  • arithmetic shift
    • Altivec: vec_sra, vec_srl, vec_sr
    • Neon: vshrq_n_s16
    • MMX/SSE/SSE2: _mm_srai_epi16, _mm_srai_pi16
  • byte permutation
    • Altivec: vec_perm, vec_sel, vec_mergeh, vec_mergel
    • Neon: vtbl1_u8, vtbx1_u8, vget_high_s16, vget_low_s16, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vbsl_u8
    • MMX/SSE/SSE2: _mm_shuffle_pi16, _mm_shuffle_ps
  • type conversion
    • Altivec: vec_cts, vec_ctu, vec_unpackh, vec_unpackl
    • Neon: vmovl_u8, vreinterpretq_s16_u16, vcvtq_u32_f32, vqmovn_s32, vqmovun_s16, vqmovn_u16, vcvtq_f32_s32, vmovl_s16, vmovq_n_f32
    • MMX/SSE/SSE2: _mm_packs_pu16, _mm_cvtps_pi16, _mm_packus_epi16
  • vector combination
    • Altivec: vec_pack, vec_packsu
    • Neon: vcombine_u16, vcombine_u8, vcombine_s16
  • maximum
    • MMX/SSE/SSE2: _mm_max_ps
  • minimum
    • MMX/SSE/SSE2: _mm_min_ps
  • vector logic
    • MMX/SSE/SSE2: _mm_andnot_ps, _mm_and_ps, _mm_or_ps
  • rounding
    • Altivec: vec_trunc
  • misc
    • MMX/SSE/SSE2: _mm_empty
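
To make the correspondence concrete, here is a sketch of one "add four floats" helper written three ways with intrinsics from the rows above. vec4 and vec4_add are made-up names, and the AltiVec branch uses the plain vec_add rather than the saturating vec_adds listed in the table:

#if defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 vec4;
  static inline vec4 vec4_add(vec4 a, vec4 b) { return _mm_add_ps(a, b); }
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t vec4;
  static inline vec4 vec4_add(vec4 a, vec4 b) { return vaddq_f32(a, b); }
#elif defined(__VEC__)
  #include <altivec.h>
  typedef vector float vec4;
  static inline vec4 vec4_add(vec4 a, vec4 b) { return vec_add(a, b); }
#endif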
