SIMD
Single Instruction Multiple Data
Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use and support them. With compiled code, you need to write inline assembly or use vector intrinsics.
Vector Intrinsics
Intrinsics appear like regular library functions: you include the relevant header, and then you can use them just like any other library call. For example, assuming the Visual C++ compiler, if you wanted to add four float numbers to another four numbers, you would include the `xmmintrin.h` header, which contains the following declaration for the `_mm_add_ps`
intrinsic:
extern __m128 _mm_add_ps(__m128 _A, __m128 _B);
However, unlike library functions, intrinsics are implemented directly in the compiler. The example SSE intrinsic typically compiles to a single instruction on both x86 and amd64. In the time it takes the CPU to call a library function, it might have completed a dozen of these instructions.
Compiler symbols
For all x86 targets, the current best practice is to include <immintrin.h>,
which pulls in the headers for every instruction set the compiler is targeting.
Symbol | Header | Instruction set |
---|---|---|
`__MMX__` | mmintrin.h | x86 MMX types & intrinsics |
`__SSE__` | xmmintrin.h | x86 SSE1 types & intrinsics |
`__SSE2__` | emmintrin.h | x86 SSE2 types & intrinsics |
`__VEC__` | altivec.h | Freescale Altivec types & intrinsics |
`__ARM_NEON__` | arm_neon.h | ARM Neon types & intrinsics |
Subset of vector operators and intrinsics
Operation | Altivec | Neon | MMX/SSE/SSE2 |
---|---|---|---|
loading vector | vec_ld<br>vec_splat<br>vec_splat_s16<br>vec_splat_s32<br>vec_splat_s8<br>vec_splat_u16<br>vec_splat_u32<br>vec_splat_u8 | vld1q_f32<br>vld1q_s16<br>vsetq_lane_f32<br>vld1_u8<br>vdupq_lane_s16<br>vdupq_n_s16<br>vmovq_n_f32<br>vset_lane_u8 | _mm_set_epi16<br>_mm_set1_epi16<br>_mm_set1_pi16<br>_mm_set_pi16<br>_mm_load_ps<br>_mm_set1_ps<br>_mm_loadh_pi<br>_mm_loadl_pi |
storing vector | vec_st | vst1_u8<br>vst1q_s16<br>vst1q_f32<br>vst1_s16 | _mm_store_ps |
add | vec_madd<br>vec_mladd<br>vec_adds | vaddq_s16<br>vaddq_f32<br>vmlaq_n_f32 | _mm_add_epi16<br>_mm_add_pi16<br>_mm_add_ps |
subtract | vec_sub | vsubq_s16 | |
multiply | vec_madd<br>vec_mladd | vmulq_n_s16<br>vmulq_s16<br>vmulq_f32<br>vmlaq_n_f32 | _mm_mullo_epi16<br>_mm_mullo_pi16<br>_mm_mul_ps |
arithmetic shift | vec_sra<br>vec_srl<br>vec_sr | vshrq_n_s16 | _mm_srai_epi16<br>_mm_srai_pi16 |
byte permutation | vec_perm<br>vec_sel<br>vec_mergeh<br>vec_mergel | vtbl1_u8<br>vtbx1_u8<br>vget_high_s16<br>vget_low_s16<br>vdupq_lane_s16<br>vdupq_n_s16<br>vmovq_n_f32<br>vbsl_u8 | _mm_shuffle_pi16<br>_mm_shuffle_ps |
type conversion | vec_cts<br>vec_ctu<br>vec_unpackh<br>vec_unpackl | vmovl_u8<br>vreinterpretq_s16_u16<br>vcvtq_u32_f32<br>vqmovn_s32<br>vqmovun_s16<br>vqmovn_u16<br>vcvtq_f32_s32<br>vmovl_s16<br>vmovq_n_f32 | _mm_packs_pu16<br>_mm_cvtps_pi16<br>_mm_packus_epi16 |
vector combination | vec_pack<br>vec_packsu | vcombine_u16<br>vcombine_u8<br>vcombine_s16 | |
maximum | | | _mm_max_ps |
minimum | | | _mm_min_ps |
vector logic | | | _mm_andnot_ps<br>_mm_and_ps<br>_mm_or_ps |
rounding | vec_trunc | | |
misc | | | _mm_empty |
References
- Utilizing the other 80% of your system’s performance: Starting with Vectorization
- Improving perf with simd intrinsics in three use cases
- Basics of SIMD programming
- SIMD Keyword OpenMP, R, Parallelism
- SIMD a practical guide
- Organizing SIMD code - stackoverflow
- SSE SSE2 and SSE3 for GNU C++ [closed] - stackoverflow
- SIMD, memory locality, vectorization, and branch point prediction
- Code in ARM assembly: Rounding and arithmetic (scalar)
- Comparing SIMD on x86-64 and arm64