Emscripten supports the WebAssembly SIMD proposal when using the WebAssembly LLVM backend. To enable SIMD, pass the -msimd128 flag at compile time. This will also turn on LLVM’s autovectorization passes, so no source modifications are necessary to benefit from SIMD.
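For instance, a plain scalar loop like the following needs no source changes; when built with -msimd128 and optimizations enabled, LLVM's autovectorizer may turn it into v128 operations (the function itself is just an illustrative sketch):

```c
// Plain scalar C, no SIMD-specific source changes. Built with e.g.
// `emcc -O3 -msimd128 add.c`, LLVM's autovectorizer may emit Wasm SIMD
// (v128) adds for this loop.
void add_arrays(const int *a, const int *b, int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}
```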
At the source level, the GCC/Clang SIMD Vector Extensions can be used; they are lowered to WebAssembly SIMD instructions where possible. In addition, a portable intrinsics header file is provided:
```c
#include <wasm_simd128.h>
```
Separate documentation for the intrinsics header is a work in progress, but its usage is straightforward and its source can be found in wasm_simd128.h. These intrinsics are under active development in parallel with the SIMD proposal and should not be considered any more stable than the proposal itself. Note that most engines also require an extra flag to enable SIMD; for example, Node requires --experimental-wasm-simd.
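As a minimal sketch of what using the header looks like (the intrinsic names below also appear in the mapping tables later on this page):

```c
#include <stdio.h>
#include <wasm_simd128.h>

int main(void) {
  v128_t a   = wasm_f32x4_make(1.0f, 2.0f, 3.0f, 4.0f);
  v128_t b   = wasm_f32x4_splat(10.0f);
  v128_t sum = wasm_f32x4_add(a, b);               // {11, 12, 13, 14}
  printf("%f\n", wasm_f32x4_extract_lane(sum, 0)); // prints 11.000000
  return 0;
}
```

This can be built with something like `emcc -O3 -msimd128 simd_add.c -o simd_add.js`; when running under Node, the --experimental-wasm-simd flag mentioned above may be needed depending on the Node version.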
WebAssembly SIMD is not supported when using the Fastcomp backend.
When porting native SIMD code, note that because of portability concerns the WebAssembly SIMD specification does not expose the full native instruction sets. In particular, the following differences exist:
- Emscripten does not support x86 or any other native inline SIMD assembly or building .s assembly files, so all code should be written to use SIMD intrinsic functions or compiler vector extensions.
- WebAssembly SIMD does not have control over managing floating point rounding modes or handling denormals.
- Cache line prefetch instructions are not available, and calls to these functions will compile, but are treated as no-ops.
- Asymmetric memory fence operations are not available, but will be implemented as fully synchronous memory fences when SharedArrayBuffer is enabled (-s USE_PTHREADS=1) or as no-ops when multithreading is not enabled (default, -s USE_PTHREADS=0).
SIMD-related bug reports are tracked in the Emscripten bug tracker with the label SIMD.
Emscripten supports compiling existing codebases that use x86 SSE by passing the -msse directive to the compiler and including the header <xmmintrin.h>.
Currently the SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and 128-bit AVX instruction sets are supported, each enabled with the corresponding compiler flag (-msse, -msse2, -msse3, -mssse3, -msse4.1, -msse4.2, -mavx).
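As a brief sketch (the helper function is illustrative), SSE1 code like the following can be built with `emcc -O3 -msimd128 -msse`, and its intrinsics map onto Wasm SIMD as described in the tables below:

```c
#include <xmmintrin.h>

// Scale four floats in place using SSE1 intrinsics. Per the SSE1 table
// below, these lower to wasm_v128_load, wasm_f32x4_splat, wasm_f32x4_mul
// and wasm_v128_store respectively.
void scale4(float *v, float s) {
  __m128 x = _mm_loadu_ps(v);
  x = _mm_mul_ps(x, _mm_set1_ps(s));
  _mm_storeu_ps(v, x);
}
```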
The following table highlights the availability and expected performance of different SSE1 intrinsics. Even if you are directly targeting the native Wasm SIMD opcodes via the wasm_simd128.h header, this table can be useful for understanding the performance limitations that the Wasm SIMD specification has when running on x86 hardware.
For detailed information on each SSE intrinsic function, visit the excellent Intel Intrinsics Guide on SSE1.
Certain intrinsics in the table below are marked “virtual”. This means that there does not actually exist a native x86 SSE instruction set opcode to implement them, but native compilers offer the function as a convenience. Different compilers might generate a different instruction sequence for these.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_set_ps | ✅ wasm_f32x4_make |
_mm_setr_ps | ✅ wasm_f32x4_make |
_mm_set_ss | 💡 emulated with wasm_f32x4_make |
_mm_set_ps1 (_mm_set1_ps) | ✅ wasm_f32x4_splat |
_mm_setzero_ps | 💡 emulated with wasm_f32x4_const(0) |
_mm_load_ps | 🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs. |
_mm_loadl_pi | ❌ No Wasm SIMD support. Emulated with scalar loads + shuffle. |
_mm_loadh_pi | ❌ No Wasm SIMD support. Emulated with scalar loads + shuffle. |
_mm_loadr_ps | 💡 Virtual. Simd load + shuffle. |
_mm_loadu_ps | 🟡 wasm_v128_load. VM must guess type. |
_mm_load_ps1 (_mm_load1_ps) | 🟡 Virtual. Simd load + shuffle. |
_mm_load_ss | ❌ emulated with wasm_f32x4_make |
_mm_storel_pi | ❌ scalar stores |
_mm_storeh_pi | ❌ shuffle + scalar stores |
_mm_store_ps | 🟡 wasm_v128_store. VM must guess type. Unaligned store on x86 CPUs. |
_mm_stream_ps | 🟡 wasm_v128_store. VM must guess type. No cache control in Wasm SIMD. |
_mm_prefetch | 💭 No-op. |
_mm_sfence | ⚠️ A full barrier in multithreaded builds. |
_mm_shuffle_ps | 🟡 wasm_v32x4_shuffle. VM must guess type. |
_mm_storer_ps | 💡 Virtual. Shuffle + Simd store. |
_mm_store_ps1 (_mm_store1_ps) | 🟡 Virtual. Emulated with shuffle. Unaligned store on x86 CPUs. |
_mm_store_ss | 💡 emulated with scalar store |
_mm_storeu_ps | 🟡 wasm_v128_store. VM must guess type. |
_mm_storeu_si16 | 💡 emulated with scalar store |
_mm_storeu_si64 | 💡 emulated with scalar store |
_mm_movemask_ps | 💣 No Wasm SIMD support. Emulated in scalar. simd/#131 |
_mm_move_ss | 💡 emulated with a shuffle. VM must guess type. |
_mm_add_ps | ✅ wasm_f32x4_add |
_mm_add_ss | ⚠️ emulated with a shuffle |
_mm_sub_ps | ✅ wasm_f32x4_sub |
_mm_sub_ss | ⚠️ emulated with a shuffle |
_mm_mul_ps | ✅ wasm_f32x4_mul |
_mm_mul_ss | ⚠️ emulated with a shuffle |
_mm_div_ps | ✅ wasm_f32x4_div |
_mm_div_ss | ⚠️ emulated with a shuffle |
_mm_min_ps | TODO: pmin once it works |
_mm_min_ss | ⚠️ emulated with a shuffle |
_mm_max_ps | TODO: pmax once it works |
_mm_max_ss | ⚠️ emulated with a shuffle |
_mm_rcp_ps | ❌ No Wasm SIMD support. Emulated with full precision div. simd/#3 |
_mm_rcp_ss | ❌ No Wasm SIMD support. Emulated with full precision div+shuffle simd/#3 |
_mm_sqrt_ps | ✅ wasm_f32x4_sqrt |
_mm_sqrt_ss | ⚠️ emulated with a shuffle |
_mm_rsqrt_ps | ❌ No Wasm SIMD support. Emulated with full precision div+sqrt. simd/#3 |
_mm_rsqrt_ss | ❌ No Wasm SIMD support. Emulated with full precision div+sqrt+shuffle. simd/#3 |
_mm_unpackhi_ps | 💡 emulated with a shuffle |
_mm_unpacklo_ps | 💡 emulated with a shuffle |
_mm_movehl_ps | 💡 emulated with a shuffle |
_mm_movelh_ps | 💡 emulated with a shuffle |
_MM_TRANSPOSE4_PS | 💡 emulated with a shuffle |
_mm_cmplt_ps | ✅ wasm_f32x4_lt |
_mm_cmplt_ss | ⚠️ emulated with a shuffle |
_mm_cmple_ps | ✅ wasm_f32x4_le |
_mm_cmple_ss | ⚠️ emulated with a shuffle |
_mm_cmpeq_ps | ✅ wasm_f32x4_eq |
_mm_cmpeq_ss | ⚠️ emulated with a shuffle |
_mm_cmpge_ps | ✅ wasm_f32x4_ge |
_mm_cmpge_ss | ⚠️ emulated with a shuffle |
_mm_cmpgt_ps | ✅ wasm_f32x4_gt |
_mm_cmpgt_ss | ⚠️ emulated with a shuffle |
_mm_cmpord_ps | ❌ emulated with 2xcmp+and |
_mm_cmpord_ss | ❌ emulated with 2xcmp+and+shuffle |
_mm_cmpunord_ps | ❌ emulated with 2xcmp+or |
_mm_cmpunord_ss | ❌ emulated with 2xcmp+or+shuffle |
_mm_and_ps | 🟡 wasm_v128_and. VM must guess type. |
_mm_andnot_ps | 🟡 wasm_v128_andnot. VM must guess type. |
_mm_or_ps | 🟡 wasm_v128_or. VM must guess type. |
_mm_xor_ps | 🟡 wasm_v128_xor. VM must guess type. |
_mm_cmpneq_ps | ✅ wasm_f32x4_ne |
_mm_cmpneq_ss | ⚠️ emulated with a shuffle |
_mm_cmpnge_ps | ⚠️ emulated with not+ge |
_mm_cmpnge_ss | ⚠️ emulated with not+ge+shuffle |
_mm_cmpngt_ps | ⚠️ emulated with not+gt |
_mm_cmpngt_ss | ⚠️ emulated with not+gt+shuffle |
_mm_cmpnle_ps | ⚠️ emulated with not+le |
_mm_cmpnle_ss | ⚠️ emulated with not+le+shuffle |
_mm_cmpnlt_ps | ⚠️ emulated with not+lt |
_mm_cmpnlt_ss | ⚠️ emulated with not+lt+shuffle |
_mm_comieq_ss | ❌ scalarized |
_mm_comige_ss | ❌ scalarized |
_mm_comigt_ss | ❌ scalarized |
_mm_comile_ss | ❌ scalarized |
_mm_comilt_ss | ❌ scalarized |
_mm_comineq_ss | ❌ scalarized |
_mm_ucomieq_ss | ❌ scalarized |
_mm_ucomige_ss | ❌ scalarized |
_mm_ucomigt_ss | ❌ scalarized |
_mm_ucomile_ss | ❌ scalarized |
_mm_ucomilt_ss | ❌ scalarized |
_mm_ucomineq_ss | ❌ scalarized |
_mm_cvtsi32_ss (_mm_cvt_si2ss) | ❌ scalarized |
_mm_cvtss_si32 (_mm_cvt_ss2si) | 💣 scalar with complex emulated semantics |
_mm_cvttss_si32 (_mm_cvtt_ss2si) | 💣 scalar with complex emulated semantics |
_mm_cvtsi64_ss | ❌ scalarized |
_mm_cvtss_si64 | 💣 scalar with complex emulated semantics |
_mm_cvttss_si64 | 💣 scalar with complex emulated semantics |
_mm_cvtss_f32 | 💡 scalar get |
_mm_malloc | ✅ Allocates memory with specified alignment. |
_mm_free | ✅ Aliases to free(). |
_MM_GET_EXCEPTION_MASK | ✅ Always returns all exceptions masked (0x1f80). |
_MM_GET_EXCEPTION_STATE | ❌ Exception state is not tracked. Always returns 0. |
_MM_GET_FLUSH_ZERO_MODE | ✅ Always returns _MM_FLUSH_ZERO_OFF. |
_MM_GET_ROUNDING_MODE | ✅ Always returns _MM_ROUND_NEAREST. |
_mm_getcsr | ✅ Always returns _MM_FLUSH_ZERO_OFF | _MM_ROUND_NEAREST | all exceptions masked (0x1f80). |
_MM_SET_EXCEPTION_MASK | ⚫ Not available. Fixed to all exceptions masked. |
_MM_SET_EXCEPTION_STATE | ⚫ Not available. Fixed to zero/clear state. |
_MM_SET_FLUSH_ZERO_MODE | ⚫ Not available. Fixed to _MM_FLUSH_ZERO_OFF. |
_MM_SET_ROUNDING_MODE | ⚫ Not available. Fixed to _MM_ROUND_NEAREST. |
_mm_setcsr | ⚫ Not available. |
_mm_undefined_ps | ✅ Virtual |
SSE1 intrinsics not listed in the table above are not available; any code referencing them will not compile.
The following table highlights the availability and expected performance of different SSE2 intrinsics. Refer to the Intel Intrinsics Guide on SSE2.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_add_epi16 | ✅ wasm_i16x8_add |
_mm_add_epi32 | ✅ wasm_i32x4_add |
_mm_add_epi64 | ✅ wasm_i64x2_add |
_mm_add_epi8 | ✅ wasm_i8x16_add |
_mm_add_pd | ✅ wasm_f64x2_add |
_mm_add_sd | ⚠️ emulated with a shuffle |
_mm_adds_epi16 | ✅ wasm_i16x8_add_saturate |
_mm_adds_epi8 | ✅ wasm_i8x16_add_saturate |
_mm_adds_epu16 | ✅ wasm_u16x8_add_saturate |
_mm_adds_epu8 | ✅ wasm_u8x16_add_saturate |
_mm_and_pd | 🟡 wasm_v128_and. VM must guess type. |
_mm_and_si128 | 🟡 wasm_v128_and. VM must guess type. |
_mm_andnot_pd | 🟡 wasm_v128_andnot. VM must guess type. |
_mm_andnot_si128 | 🟡 wasm_v128_andnot. VM must guess type. |
_mm_avg_epu16 | ✅ wasm_u16x8_avgr |
_mm_avg_epu8 | ✅ wasm_u8x16_avgr |
_mm_castpd_ps | ✅ no-op |
_mm_castpd_si128 | ✅ no-op |
_mm_castps_pd | ✅ no-op |
_mm_castps_si128 | ✅ no-op |
_mm_castsi128_pd | ✅ no-op |
_mm_castsi128_ps | ✅ no-op |
_mm_clflush | 💭 No-op. No cache hinting in Wasm SIMD. |
_mm_cmpeq_epi16 | ✅ wasm_i16x8_eq |
_mm_cmpeq_epi32 | ✅ wasm_i32x4_eq |
_mm_cmpeq_epi8 | ✅ wasm_i8x16_eq |
_mm_cmpeq_pd | ✅ wasm_f64x2_eq |
_mm_cmpeq_sd | ⚠️ emulated with a shuffle |
_mm_cmpge_pd | ✅ wasm_f64x2_ge |
_mm_cmpge_sd | ⚠️ emulated with a shuffle |
_mm_cmpgt_epi16 | ✅ wasm_i16x8_gt |
_mm_cmpgt_epi32 | ✅ wasm_i32x4_gt |
_mm_cmpgt_epi8 | ✅ wasm_i8x16_gt |
_mm_cmpgt_pd | ✅ wasm_f64x2_gt |
_mm_cmpgt_sd | ⚠️ emulated with a shuffle |
_mm_cmple_pd | ✅ wasm_f64x2_le |
_mm_cmple_sd | ⚠️ emulated with a shuffle |
_mm_cmplt_epi16 | ✅ wasm_i16x8_lt |
_mm_cmplt_epi32 | ✅ wasm_i32x4_lt |
_mm_cmplt_epi8 | ✅ wasm_i8x16_lt |
_mm_cmplt_pd | ✅ wasm_f64x2_lt |
_mm_cmplt_sd | ⚠️ emulated with a shuffle |
_mm_cmpneq_pd | ✅ wasm_f64x2_ne |
_mm_cmpneq_sd | ⚠️ emulated with a shuffle |
_mm_cmpnge_pd | ⚠️ emulated with not+ge |
_mm_cmpnge_sd | ⚠️ emulated with not+ge+shuffle |
_mm_cmpngt_pd | ⚠️ emulated with not+gt |
_mm_cmpngt_sd | ⚠️ emulated with not+gt+shuffle |
_mm_cmpnle_pd | ⚠️ emulated with not+le |
_mm_cmpnle_sd | ⚠️ emulated with not+le+shuffle |
_mm_cmpnlt_pd | ⚠️ emulated with not+lt |
_mm_cmpnlt_sd | ⚠️ emulated with not+lt+shuffle |
_mm_cmpord_pd | ❌ emulated with 2xcmp+and |
_mm_cmpord_sd | ❌ emulated with 2xcmp+and+shuffle |
_mm_cmpunord_pd | ❌ emulated with 2xcmp+or |
_mm_cmpunord_sd | ❌ emulated with 2xcmp+or+shuffle |
_mm_comieq_sd | ❌ scalarized |
_mm_comige_sd | ❌ scalarized |
_mm_comigt_sd | ❌ scalarized |
_mm_comile_sd | ❌ scalarized |
_mm_comilt_sd | ❌ scalarized |
_mm_comineq_sd | ❌ scalarized |
_mm_cvtepi32_pd | ❌ scalarized |
_mm_cvtepi32_ps | ✅ wasm_f32x4_convert_i32x4 |
_mm_cvtpd_epi32 | ❌ scalarized |
_mm_cvtpd_ps | ❌ scalarized |
_mm_cvtps_epi32 | ❌ scalarized |
_mm_cvtps_pd | ❌ scalarized |
_mm_cvtsd_f64 | ✅ wasm_f64x2_extract_lane |
_mm_cvtsd_si32 | ❌ scalarized |
_mm_cvtsd_si64 | ❌ scalarized |
_mm_cvtsd_si64x | ❌ scalarized |
_mm_cvtsd_ss | ❌ scalarized |
_mm_cvtsi128_si32 | ✅ wasm_i32x4_extract_lane |
_mm_cvtsi128_si64 (_mm_cvtsi128_si64x) | ✅ wasm_i64x2_extract_lane |
_mm_cvtsi32_sd | ❌ scalarized |
_mm_cvtsi32_si128 | 💡 emulated with wasm_i32x4_make |
_mm_cvtsi64_sd (_mm_cvtsi64x_sd) | ❌ scalarized |
_mm_cvtsi64_si128 (_mm_cvtsi64x_si128) | 💡 emulated with wasm_i64x2_make |
_mm_cvtss_sd | ❌ scalarized |
_mm_cvttpd_epi32 | ❌ scalarized |
_mm_cvttps_epi32 | ❌ scalarized |
_mm_cvttsd_si32 | ❌ scalarized |
_mm_cvttsd_si64 (_mm_cvttsd_si64x) | ❌ scalarized |
_mm_div_pd | ✅ wasm_f64x2_div |
_mm_div_sd | ⚠️ emulated with a shuffle |
_mm_extract_epi16 | ✅ wasm_u16x8_extract_lane |
_mm_insert_epi16 | ✅ wasm_i16x8_replace_lane |
_mm_lfence | ⚠️ A full barrier in multithreaded builds. |
_mm_load_pd | 🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs. |
_mm_load1_pd (_mm_load_pd1) | 🟡 Virtual. wasm_v64x2_load_splat, VM must guess type. |
_mm_load_sd | ❌ emulated with wasm_f64x2_make |
_mm_load_si128 | 🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs. |
_mm_loadh_pd | ❌ No Wasm SIMD support. Emulated with scalar loads + shuffle. |
_mm_loadl_epi64 | ❌ No Wasm SIMD support. Emulated with scalar loads + shuffle. |
_mm_loadl_pd | ❌ No Wasm SIMD support. Emulated with scalar loads + shuffle. |
_mm_loadr_pd | 💡 Virtual. Simd load + shuffle. |
_mm_loadu_pd | 🟡 wasm_v128_load. VM must guess type. |
_mm_loadu_si128 | 🟡 wasm_v128_load. VM must guess type. |
_mm_loadu_si64 | ❌ emulated with const+scalar load+replace lane |
_mm_loadu_si32 | ❌ emulated with const+scalar load+replace lane |
_mm_loadu_si16 | ❌ emulated with const+scalar load+replace lane |
_mm_madd_epi16 | ✅ wasm_dot_s_i32x4_i16x8 |
_mm_maskmoveu_si128 | ❌ scalarized |
_mm_max_epi16 | ✅ wasm_i16x8_max |
_mm_max_epu8 | ✅ wasm_u8x16_max |
_mm_max_pd | TODO: migrate to wasm_f64x2_pmax |
_mm_max_sd | ⚠️ emulated with a shuffle |
_mm_mfence | ⚠️ A full barrier in multithreaded builds. |
_mm_min_epi16 | ✅ wasm_i16x8_min |
_mm_min_epu8 | ✅ wasm_u8x16_min |
_mm_min_pd | TODO: migrate to wasm_f64x2_pmin |
_mm_min_sd | ⚠️ emulated with a shuffle |
_mm_move_epi64 | 💡 emulated with a shuffle. VM must guess type. |
_mm_move_sd | 💡 emulated with a shuffle. VM must guess type. |
_mm_movemask_epi8 | ❌ scalarized |
_mm_movemask_pd | ❌ scalarized |
_mm_mul_epu32 | ❌ scalarized |
_mm_mul_pd | ✅ wasm_f64x2_mul |
_mm_mul_sd | ⚠️ emulated with a shuffle |
_mm_mulhi_epi16 | ⚠️ emulated with a SIMD four widen+two mul+generic shuffle |
_mm_mulhi_epu16 | ⚠️ emulated with a SIMD four widen+two mul+generic shuffle |
_mm_mullo_epi16 | ✅ wasm_i16x8_mul |
_mm_or_pd | 🟡 wasm_v128_or. VM must guess type. |
_mm_or_si128 | 🟡 wasm_v128_or. VM must guess type. |
_mm_packs_epi16 | ✅ wasm_i8x16_narrow_i16x8 |
_mm_packs_epi32 | ✅ wasm_i16x8_narrow_i32x4 |
_mm_packus_epi16 | ✅ wasm_u8x16_narrow_i16x8 |
_mm_pause | 💭 No-op. |
_mm_sad_epu8 | ⚠️ emulated with eleven SIMD instructions+const |
_mm_set_epi16 | ✅ wasm_i16x8_make |
_mm_set_epi32 | ✅ wasm_i32x4_make |
_mm_set_epi64 (_mm_set_epi64x) | ✅ wasm_i64x2_make |
_mm_set_epi8 | ✅ wasm_i8x16_make |
_mm_set_pd | ✅ wasm_f64x2_make |
_mm_set_sd | 💡 emulated with wasm_f64x2_make |
_mm_set1_epi16 | ✅ wasm_i16x8_splat |
_mm_set1_epi32 | ✅ wasm_i32x4_splat |
_mm_set1_epi64 (_mm_set1_epi64x) | ✅ wasm_i64x2_splat |
_mm_set1_epi8 | ✅ wasm_i8x16_splat |
_mm_set1_pd (_mm_set_pd1) | ✅ wasm_f64x2_splat |
_mm_setr_epi16 | ✅ wasm_i16x8_make |
_mm_setr_epi32 | ✅ wasm_i32x4_make |
_mm_setr_epi64 | ✅ wasm_i64x2_make |
_mm_setr_epi8 | ✅ wasm_i8x16_make |
_mm_setr_pd | ✅ wasm_f64x2_make |
_mm_setzero_pd | 💡 emulated with wasm_f64x2_const |
_mm_setzero_si128 | 💡 emulated with wasm_i64x2_const |
_mm_shuffle_epi32 | 💡 emulated with a general shuffle |
_mm_shuffle_pd | 💡 emulated with a general shuffle |
_mm_shufflehi_epi16 | 💡 emulated with a general shuffle |
_mm_shufflelo_epi16 | 💡 emulated with a general shuffle |
_mm_sll_epi16 | ❌ scalarized |
_mm_sll_epi32 | ❌ scalarized |
_mm_sll_epi64 | ❌ scalarized |
_mm_slli_epi16 | 💡 wasm_i16x8_shl ✅ if shift count is immediate constant. |
_mm_slli_epi32 | 💡 wasm_i32x4_shl ✅ if shift count is immediate constant. |
_mm_slli_epi64 | 💡 wasm_i64x2_shl ✅ if shift count is immediate constant. |
_mm_slli_si128 (_mm_bslli_si128) | 💡 emulated with a general shuffle |
_mm_sqrt_pd | ✅ wasm_f64x2_sqrt |
_mm_sqrt_sd | ⚠️ emulated with a shuffle |
_mm_sra_epi16 | ❌ scalarized |
_mm_sra_epi32 | ❌ scalarized |
_mm_srai_epi16 | 💡 wasm_i16x8_shr ✅ if shift count is immediate constant. |
_mm_srai_epi32 | 💡 wasm_i32x4_shr ✅ if shift count is immediate constant. |
_mm_srl_epi16 | ❌ scalarized |
_mm_srl_epi32 | ❌ scalarized |
_mm_srl_epi64 | ❌ scalarized |
_mm_srli_epi16 | 💡 wasm_u16x8_shr ✅ if shift count is immediate constant. |
_mm_srli_epi32 | 💡 wasm_u32x4_shr ✅ if shift count is immediate constant. |
_mm_srli_epi64 | 💡 wasm_u64x2_shr ✅ if shift count is immediate constant. |
_mm_srli_si128 (_mm_bsrli_si128) | 💡 emulated with a general shuffle |
_mm_store_pd | 🟡 wasm_v128_store. VM must guess type. Unaligned store on x86 CPUs. |
_mm_store_sd | 💡 emulated with scalar store |
_mm_store_si128 | 🟡 wasm_v128_store. VM must guess type. Unaligned store on x86 CPUs. |
_mm_store1_pd (_mm_store_pd1) | 🟡 Virtual. Emulated with shuffle. Unaligned store on x86 CPUs. |
_mm_storeh_pd | ❌ shuffle + scalar stores |
_mm_storel_epi64 | ❌ scalar store |
_mm_storel_pd | ❌ scalar store |
_mm_storer_pd | ❌ shuffle + scalar stores |
_mm_storeu_pd | 🟡 wasm_v128_store. VM must guess type. |
_mm_storeu_si128 | 🟡 wasm_v128_store. VM must guess type. |
_mm_storeu_si64 | 💡 emulated with extract lane+scalar store |
_mm_storeu_si32 | 💡 emulated with extract lane+scalar store |
_mm_storeu_si16 | 💡 emulated with extract lane+scalar store |
_mm_stream_pd | 🟡 wasm_v128_store. VM must guess type. No cache control in Wasm SIMD. |
_mm_stream_si128 | 🟡 wasm_v128_store. VM must guess type. No cache control in Wasm SIMD. |
_mm_stream_si32 | 🟡 wasm_v128_store. VM must guess type. No cache control in Wasm SIMD. |
_mm_stream_si64 | 🟡 wasm_v128_store. VM must guess type. No cache control in Wasm SIMD. |
_mm_sub_epi16 | ✅ wasm_i16x8_sub |
_mm_sub_epi32 | ✅ wasm_i32x4_sub |
_mm_sub_epi64 | ✅ wasm_i64x2_sub |
_mm_sub_epi8 | ✅ wasm_i8x16_sub |
_mm_sub_pd | ✅ wasm_f64x2_sub |
_mm_sub_sd | ⚠️ emulated with a shuffle |
_mm_subs_epi16 | ✅ wasm_i16x8_sub_saturate |
_mm_subs_epi8 | ✅ wasm_i8x16_sub_saturate |
_mm_subs_epu16 | ✅ wasm_u16x8_sub_saturate |
_mm_subs_epu8 | ✅ wasm_u8x16_sub_saturate |
_mm_ucomieq_sd | ❌ scalarized |
_mm_ucomige_sd | ❌ scalarized |
_mm_ucomigt_sd | ❌ scalarized |
_mm_ucomile_sd | ❌ scalarized |
_mm_ucomilt_sd | ❌ scalarized |
_mm_ucomineq_sd | ❌ scalarized |
_mm_undefined_pd | ✅ Virtual |
_mm_undefined_si128 | ✅ Virtual |
_mm_unpackhi_epi16 | 💡 emulated with a shuffle |
_mm_unpackhi_epi32 | 💡 emulated with a shuffle |
_mm_unpackhi_epi64 | 💡 emulated with a shuffle |
_mm_unpackhi_epi8 | 💡 emulated with a shuffle |
_mm_unpackhi_pd | 💡 emulated with a shuffle |
_mm_unpacklo_epi16 | 💡 emulated with a shuffle |
_mm_unpacklo_epi32 | 💡 emulated with a shuffle |
_mm_unpacklo_epi64 | 💡 emulated with a shuffle |
_mm_unpacklo_epi8 | 💡 emulated with a shuffle |
_mm_unpacklo_pd | 💡 emulated with a shuffle |
_mm_xor_pd | 🟡 wasm_v128_xor. VM must guess type. |
_mm_xor_si128 | 🟡 wasm_v128_xor. VM must guess type. |
SSE2 intrinsics not listed in the table above are not available; any code referencing them will not compile.
The following table highlights the availability and expected performance of different SSE3 intrinsics. Refer to the Intel Intrinsics Guide on SSE3.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_lddqu_si128 | ✅ wasm_v128_load. |
_mm_addsub_ps | ⚠️ emulated with a SIMD add+mul+const |
_mm_hadd_ps | ⚠️ emulated with a SIMD add+two shuffles |
_mm_hsub_ps | ⚠️ emulated with a SIMD sub+two shuffles |
_mm_movehdup_ps | 💡 emulated with a general shuffle |
_mm_moveldup_ps | 💡 emulated with a general shuffle |
_mm_addsub_pd | ⚠️ emulated with a SIMD add+mul+const |
_mm_hadd_pd | ⚠️ emulated with a SIMD add+two shuffles |
_mm_hsub_pd | ⚠️ emulated with a SIMD sub+two shuffles |
_mm_loaddup_pd | 🟡 Virtual. wasm_v64x2_load_splat, VM must guess type. |
_mm_movedup_pd | 💡 emulated with a general shuffle |
_MM_GET_DENORMALS_ZERO_MODE | ✅ Always returns _MM_DENORMALS_ZERO_ON. I.e. denormals are available. |
_MM_SET_DENORMALS_ZERO_MODE | ⚫ Not available. Fixed to _MM_DENORMALS_ZERO_ON. |
_mm_monitor | ⚫ Not available. |
_mm_mwait | ⚫ Not available. |
The following table highlights the availability and expected performance of different SSSE3 intrinsics. Refer to the Intel Intrinsics Guide on SSSE3.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_abs_epi8 | ✅ wasm_i8x16_abs |
_mm_abs_epi16 | ✅ wasm_i16x8_abs |
_mm_abs_epi32 | ✅ wasm_i32x4_abs |
_mm_alignr_epi8 | ⚠️ emulated with a SIMD or+two shifts |
_mm_hadd_epi16 | ⚠️ emulated with a SIMD add+two shuffles |
_mm_hadd_epi32 | ⚠️ emulated with a SIMD add+two shuffles |
_mm_hadds_epi16 | ⚠️ emulated with a SIMD adds+two shuffles |
_mm_hsub_epi16 | ⚠️ emulated with a SIMD sub+two shuffles |
_mm_hsub_epi32 | ⚠️ emulated with a SIMD sub+two shuffles |
_mm_hsubs_epi16 | ⚠️ emulated with a SIMD subs+two shuffles |
_mm_maddubs_epi16 | ⚠️ emulated with SIMD saturated add+four shifts+two muls+and+const |
_mm_mulhrs_epi16 | ⚠️ emulated with SIMD four widen+two muls+four adds+complex shuffle+const |
_mm_shuffle_epi8 | ⚠️ emulated with a SIMD swizzle+and+const |
_mm_sign_epi8 | ⚠️ emulated with SIMD two cmp+two logical+add |
_mm_sign_epi16 | ⚠️ emulated with SIMD two cmp+two logical+add |
_mm_sign_epi32 | ⚠️ emulated with SIMD two cmp+two logical+add |
SSSE3 intrinsics not listed in the table above are not available; any code referencing them will not compile.
The following table highlights the availability and expected performance of different SSE4.1 intrinsics. Refer to the Intel Intrinsics Guide on SSE4.1.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_blend_epi16 | 💡 emulated with a general shuffle |
_mm_blend_pd | 💡 emulated with a general shuffle |
_mm_blend_ps | 💡 emulated with a general shuffle |
_mm_blendv_epi8 | ⚠️ emulated with a SIMD shr+and+andnot+or |
_mm_blendv_pd | ⚠️ emulated with a SIMD shr+and+andnot+or |
_mm_blendv_ps | ⚠️ emulated with a SIMD shr+and+andnot+or |
_mm_ceil_pd | ❌ scalarized |
_mm_ceil_ps | ❌ scalarized |
_mm_ceil_sd | ❌ scalarized |
_mm_ceil_ss | ❌ scalarized |
_mm_cmpeq_epi64 | ⚠️ emulated with a SIMD cmp+and+shuffle |
_mm_cvtepi16_epi32 | ✅ wasm_i32x4_widen_low_i16x8 |
_mm_cvtepi16_epi64 | ⚠️ emulated with a SIMD widen+const+cmp+shuffle |
_mm_cvtepi32_epi64 | ⚠️ emulated with SIMD const+cmp+shuffle |
_mm_cvtepi8_epi16 | ✅ wasm_i16x8_widen_low_i8x16 |
_mm_cvtepi8_epi32 | ⚠️ emulated with two SIMD widens |
_mm_cvtepi8_epi64 | ⚠️ emulated with two SIMD widens+const+cmp+shuffle |
_mm_cvtepu16_epi32 | ✅ wasm_i32x4_widen_low_u16x8 |
_mm_cvtepu16_epi64 | ⚠️ emulated with SIMD const+two shuffles |
_mm_cvtepu32_epi64 | ⚠️ emulated with SIMD const+shuffle |
_mm_cvtepu8_epi16 | ✅ wasm_i16x8_widen_low_u8x16 |
_mm_cvtepu8_epi32 | ⚠️ emulated with two SIMD widens |
_mm_cvtepu8_epi64 | ⚠️ emulated with SIMD const+three shuffles |
_mm_dp_pd | ⚠️ emulated with SIMD mul+add+setzero+2xblend |
_mm_dp_ps | ⚠️ emulated with SIMD mul+add+setzero+2xblend |
_mm_extract_epi32 | ✅ wasm_i32x4_extract_lane |
_mm_extract_epi64 | ✅ wasm_i64x2_extract_lane |
_mm_extract_epi8 | ✅ wasm_u8x16_extract_lane |
_mm_extract_ps | ✅ wasm_i32x4_extract_lane |
_mm_floor_pd | ❌ scalarized |
_mm_floor_ps | ❌ scalarized |
_mm_floor_sd | ❌ scalarized |
_mm_floor_ss | ❌ scalarized |
_mm_insert_epi32 | ✅ wasm_i32x4_replace_lane |
_mm_insert_epi64 | ✅ wasm_i64x2_replace_lane |
_mm_insert_epi8 | ✅ wasm_i8x16_replace_lane |
_mm_insert_ps | ⚠️ emulated with generic non-SIMD-mapping shuffles |
_mm_max_epi32 | ✅ wasm_i32x4_max |
_mm_max_epi8 | ✅ wasm_i8x16_max |
_mm_max_epu16 | ✅ wasm_u16x8_max |
_mm_max_epu32 | ✅ wasm_u32x4_max |
_mm_min_epi32 | ✅ wasm_i32x4_min |
_mm_min_epi8 | ✅ wasm_i8x16_min |
_mm_min_epu16 | ✅ wasm_u16x8_min |
_mm_min_epu32 | ✅ wasm_u32x4_min |
_mm_minpos_epu16 | 💣 scalarized |
_mm_mpsadbw_epu8 | 💣 scalarized |
_mm_mul_epi32 | ❌ scalarized |
_mm_mullo_epi32 | ✅ wasm_i32x4_mul |
_mm_packus_epi32 | ✅ wasm_u16x8_narrow_i32x4 |
_mm_round_pd | 💣 scalarized |
_mm_round_ps | 💣 scalarized |
_mm_round_sd | 💣 scalarized |
_mm_round_ss | 💣 scalarized |
_mm_stream_load_si128 | 🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs. |
_mm_test_all_ones | ❌ scalarized |
_mm_test_all_zeros | ❌ scalarized |
_mm_test_mix_ones_zeros | ❌ scalarized |
_mm_testc_si128 | ❌ scalarized |
_mm_testnzc_si128 | ❌ scalarized |
_mm_testz_si128 | ❌ scalarized |
The following table highlights the availability and expected performance of different SSE4.2 intrinsics. Refer to the Intel Intrinsics Guide on SSE4.2.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_cmpgt_epi64 | ❌ scalarized |
SSE4.2 intrinsics not listed in the table above are not available; any code referencing them will not compile.
The following table highlights the availability and expected performance of different AVX intrinsics. Refer to the Intel Intrinsics Guide on AVX.
Intrinsic name | WebAssembly SIMD support |
---|---|
_mm_broadcast_ss | ✅ wasm_v32x4_load_splat |
_mm_cmp_pd | ⚠️ emulated with 1-2 SIMD cmp+and/or |
_mm_cmp_ps | ⚠️ emulated with 1-2 SIMD cmp+and/or |
_mm_cmp_sd | ⚠️ emulated with 1-2 SIMD cmp+and/or+move |
_mm_cmp_ss | ⚠️ emulated with 1-2 SIMD cmp+and/or+move |
_mm_maskload_pd | ⚠️ emulated with SIMD load+shift+and |
_mm_maskload_ps | ⚠️ emulated with SIMD load+shift+and |
_mm_maskstore_pd | ❌ scalarized |
_mm_maskstore_ps | ❌ scalarized |
_mm_permute_pd | 💡 emulated with a general shuffle |
_mm_permute_ps | 💡 emulated with a general shuffle |
_mm_permutevar_pd | 💣 scalarized |
_mm_permutevar_ps | 💣 scalarized |
_mm_testc_pd | 💣 emulated with complex SIMD+scalar sequence |
_mm_testc_ps | 💣 emulated with complex SIMD+scalar sequence |
_mm_testnzc_pd | 💣 emulated with complex SIMD+scalar sequence |
_mm_testnzc_ps | 💣 emulated with complex SIMD+scalar sequence |
_mm_testz_pd | 💣 emulated with complex SIMD+scalar sequence |
_mm_testz_ps | 💣 emulated with complex SIMD+scalar sequence |
Only the 128-bit wide instructions from the AVX instruction set are available; the 256-bit wide AVX instructions are not provided.
Emscripten supports compiling existing codebases that use ARM NEON by passing the -mfpu=neon directive to the compiler, and including the header <arm_neon.h>.
In terms of performance, it is important to note that only instructions which operate on 128-bit wide vectors are supported cleanly. This means that nearly any intrinsic that is not a “q” variant (e.g. “vadd” as opposed to “vaddq”) will be scalarized.
The NEON intrinsic implementations are pulled from the SIMDe repository on GitHub. To update Emscripten to the latest SIMDe version, run tools/simde_update.py.
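As a brief sketch (the helper function is illustrative), 128-bit “q” variant intrinsics such as vld1q_f32, vaddq_f32 and vst1q_f32 map onto native Wasm SIMD when built with something like `emcc -O3 -msimd128 -mfpu=neon`:

```c
#include <arm_neon.h>

// Add four floats at a time with 128-bit "q" intrinsics; vld1, vadd and
// vst1 are listed as native in the table below.
void add4(const float *a, const float *b, float *out) {
  float32x4_t va = vld1q_f32(a);
  float32x4_t vb = vld1q_f32(b);
  vst1q_f32(out, vaddq_f32(va, vb));
}
```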
The following table highlights the availability of various 128-bit wide intrinsics.
For detailed information on each intrinsic function, refer to the NEON Intrinsics Reference.
Intrinsic name | Wasm SIMD Support |
---|---|
vaba | ⚫ Not implemented, will trigger compiler error |
vabal | ⚫ Not implemented, will trigger compiler error |
vabd | ⚫ Not implemented, will trigger compiler error |
vabdl | ⚫ Not implemented, will trigger compiler error |
vabs | native |
vadd | native |
vaddl | ⚫ Not implemented, will trigger compiler error |
vaddlv | ⚫ Not implemented, will trigger compiler error |
vaddv | ⚫ Not implemented, will trigger compiler error |
vaddw | ❌ Will be emulated with slow instructions, or scalarized |
vand | native |
vbic | ⚫ Not implemented, will trigger compiler error |
vbsl | native |
vcagt | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vceq | 💡 Depends on a smart enough compiler, but should be near native |
vceqz | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vcge | native |
vcgez | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vcgt | native |
vcgtz | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vcle | native |
vclez | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vcls | ⚫ Not implemented, will trigger compiler error |
vclt | native |
vcltz | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vcnt | ⚫ Not implemented, will trigger compiler error |
vclz | ⚫ Not implemented, will trigger compiler error |
vcombine | ❌ Will be emulated with slow instructions, or scalarized |
vcreate | ❌ Will be emulated with slow instructions, or scalarized |
vdot | ❌ Will be emulated with slow instructions, or scalarized |
vdot_lane | ❌ Will be emulated with slow instructions, or scalarized |
vdup | ⚫ Not implemented, will trigger compiler error |
vdup_n | native |
veor | native |
vext | ❌ Will be emulated with slow instructions, or scalarized |
vget_lane | native |
vhadd | ⚫ Not implemented, will trigger compiler error |
vhsub | ⚫ Not implemented, will trigger compiler error |
vld1 | native |
vld2 | ⚫ Not implemented, will trigger compiler error |
vld3 | 💡 Depends on a smart enough compiler, but should be near native |
vld4 | 💡 Depends on a smart enough compiler, but should be near native |
vmax | native |
vmaxv | ⚫ Not implemented, will trigger compiler error |
vmin | native |
vminv | ⚫ Not implemented, will trigger compiler error |
vmla | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vmlal | ❌ Will be emulated with slow instructions, or scalarized |
vmls | ⚫ Not implemented, will trigger compiler error |
vmlsl | ⚫ Not implemented, will trigger compiler error |
vmovl | native |
vmul | native |
vmul_n | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vmull | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vmull_n | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vmull_high | ❌ Will be emulated with slow instructions, or scalarized |
vmvn | native |
vneg | native |
vorn | ⚫ Not implemented, will trigger compiler error |
vorr | native |
vpadal | ❌ Will be emulated with slow instructions, or scalarized |
vpadd | ❌ Will be emulated with slow instructions, or scalarized |
vpaddl | ❌ Will be emulated with slow instructions, or scalarized |
vpmax | ❌ Will be emulated with slow instructions, or scalarized |
vpmin | ❌ Will be emulated with slow instructions, or scalarized |
vpminnm | ⚫ Not implemented, will trigger compiler error |
vqabs | ⚫ Not implemented, will trigger compiler error |
vqabsb | ⚫ Not implemented, will trigger compiler error |
vqadd | 💡 Depends on a smart enough compiler, but should be near native |
vqaddb | ⚫ Not implemented, will trigger compiler error |
vqdmulh | ❌ Will be emulated with slow instructions, or scalarized |
vqneg | ⚫ Not implemented, will trigger compiler error |
vqnegb | ⚫ Not implemented, will trigger compiler error |
vqrdmulh | ⚫ Not implemented, will trigger compiler error |
vqrshl | ⚫ Not implemented, will trigger compiler error |
vqrshlb | ⚫ Not implemented, will trigger compiler error |
vqshl | ⚫ Not implemented, will trigger compiler error |
vqshlb | ⚫ Not implemented, will trigger compiler error |
vqsub | ⚫ Not implemented, will trigger compiler error |
vqsubb | ⚫ Not implemented, will trigger compiler error |
vqtbl1 | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vqtbl2 | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vqtbl3 | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vqtbl4 | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vqtbx1 | ❌ Will be emulated with slow instructions, or scalarized |
vqtbx2 | ❌ Will be emulated with slow instructions, or scalarized |
vqtbx3 | ❌ Will be emulated with slow instructions, or scalarized |
vqtbx4 | ❌ Will be emulated with slow instructions, or scalarized |
vrbit | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vreinterpret | 💡 Depends on a smart enough compiler, but should be near native |
vrev16 | native |
vrev32 | native |
vrev64 | native |
vrhadd | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vrshl | ❌ Will be emulated with slow instructions, or scalarized |
vrshr_n | ❌ Will be emulated with slow instructions, or scalarized |
vrsra_n | ❌ Will be emulated with slow instructions, or scalarized |
vset_lane | native |
vshl | ❌ Will be emulated with slow instructions, or scalarized |
vshl_n | ❌ Will be emulated with slow instructions, or scalarized |
vshr_n | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vsra_n | ❌ Will be emulated with slow instructions, or scalarized |
vst1 | native |
vst1_lane | 💡 Depends on a smart enough compiler, but should be near native |
vst2 | ⚫ Not implemented, will trigger compiler error |
vst3 | 💡 Depends on a smart enough compiler, but should be near native |
vst4 | 💡 Depends on a smart enough compiler, but should be near native |
vsub | native |
vsubl | ⚠ Does not have direct implementation, but is emulated using fast NEON instructions |
vsubw | ⚫ Not implemented, will trigger compiler error |
vtbl1 | ❌ Will be emulated with slow instructions, or scalarized |
vtbl2 | ❌ Will be emulated with slow instructions, or scalarized |
vtbl3 | ❌ Will be emulated with slow instructions, or scalarized |
vtbl4 | ❌ Will be emulated with slow instructions, or scalarized |
vtbx1 | ❌ Will be emulated with slow instructions, or scalarized |
vtbx2 | ❌ Will be emulated with slow instructions, or scalarized |
vtbx3 | ❌ Will be emulated with slow instructions, or scalarized |
vtbx4 | ❌ Will be emulated with slow instructions, or scalarized |
vtrn | ❌ Will be emulated with slow instructions, or scalarized |
vtrn1 | ❌ Will be emulated with slow instructions, or scalarized |
vtrn2 | ❌ Will be emulated with slow instructions, or scalarized |
vtst | ❌ Will be emulated with slow instructions, or scalarized |
vuqadd | ⚫ Not implemented, will trigger compiler error |
vuqaddb | ⚫ Not implemented, will trigger compiler error |
vuzp | ❌ Will be emulated with slow instructions, or scalarized |
vuzp1 | ❌ Will be emulated with slow instructions, or scalarized |
vuzp2 | ❌ Will be emulated with slow instructions, or scalarized |
vzip | ❌ Will be emulated with slow instructions, or scalarized |
vzip1 | ❌ Will be emulated with slow instructions, or scalarized |
vzip2 | ❌ Will be emulated with slow instructions, or scalarized |