Fast, parallel applications with WebAssembly SIMD

Published · Updated · Tagged with WebAssembly

SIMD stands for Single Instruction, Multiple Data. SIMD instructions are a special class of instructions that exploit data parallelism in applications by simultaneously performing the same operation on multiple data elements. Compute-intensive applications such as audio/video codecs and image processors take advantage of SIMD instructions to accelerate performance. Most modern architectures support some variant of SIMD instructions.

The WebAssembly SIMD proposal defines a portable, performant subset of SIMD operations that are available across most modern architectures. This proposal derived many elements from the SIMD.js proposal, which in turn was originally derived from the Dart SIMD specification. SIMD.js was an API proposed at TC39 with new types and functions for performing SIMD computations, but it was archived in favor of supporting SIMD operations more transparently in WebAssembly. The WebAssembly SIMD proposal was introduced as a way for browsers to take advantage of the data-level parallelism offered by the underlying hardware.

WebAssembly SIMD proposal #

The high-level goal of the WebAssembly SIMD proposal is to introduce vector operations to the WebAssembly Specification, in a way that guarantees portable performance.

The set of SIMD instructions is large and varies across architectures. The set of operations included in the WebAssembly SIMD proposal consists of operations that are well supported on a wide variety of platforms and are proven to be performant. To this end, the current proposal is limited to standardizing fixed-width 128-bit SIMD operations.

The current proposal introduces a new v128 value type, and a number of new operations that operate on this type. The criteria used to determine these operations are:

  • The operations should be well supported across multiple modern architectures.
  • Performance wins should be positive across multiple relevant architectures within an instruction group.
  • The chosen set of operations should minimize performance cliffs if any.

The proposal is now in its finalized state (Phase 4), and both V8 and the toolchain have working implementations.

Enabling SIMD support #

Feature detection #

First of all, note that SIMD is a new feature and isn't yet available in all browsers with WebAssembly support. You can find which browsers support new WebAssembly features on the webassembly.org website.

To ensure that all users can load your application, you'll need to build two different versions (one with SIMD enabled and one without it) and load the corresponding version depending on the feature detection results. To detect SIMD at runtime, you can use the wasm-feature-detect library and load the corresponding module like this:

import { simd } from 'wasm-feature-detect';

(async () => {
  const hasSIMD = await simd();
  const module = await (
    hasSIMD
      ? import('./module-with-simd.js')
      : import('./module-without-simd.js')
  );
  // …now use `module` as you normally would
})();

To learn about building code with SIMD support, check the section below.

SIMD support in browsers #

WebAssembly SIMD support is available by default from Chrome 91. Make sure to use the latest version of the toolchain as detailed below, as well as the latest wasm-feature-detect, to detect engines that support the final version of the spec. If something doesn’t look right, please file a bug.

WebAssembly SIMD is also supported in Firefox 89 and above.

Building with SIMD support #

Building C / C++ to target SIMD #

WebAssembly’s SIMD support depends on using a recent build of Clang with the WebAssembly LLVM backend enabled. Emscripten has support for the WebAssembly SIMD proposal as well. Install and activate the latest distribution of Emscripten using emsdk to use the SIMD features.

./emsdk install latest
./emsdk activate latest

There are a couple of different ways to generate SIMD code when porting your application. In every case, once the latest upstream Emscripten version has been installed, compile with Emscripten and pass the -msimd128 flag to enable SIMD.

emcc -msimd128 -O3 foo.c -o foo.js

Applications that have already been ported to use WebAssembly may benefit from SIMD with no source modifications thanks to LLVM’s autovectorization optimizations.

These optimizations can automatically transform loops that perform arithmetic operations on each iteration into equivalent loops that perform the same arithmetic operations on multiple inputs at a time using SIMD instructions. LLVM’s autovectorizers are enabled by default at optimization levels -O2 and -O3 when the -msimd128 flag is supplied.

For example, consider the following function that multiplies the elements of two input arrays together and stores the results in an output array.

void multiply_arrays(int* out, int* in_a, int* in_b, int size) {
  for (int i = 0; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}

Without passing the -msimd128 flag, the compiler emits this WebAssembly loop:

(loop
  (i32.store
    … get address in `out` …
    (i32.mul
      (i32.load … get address in `in_a` …)
      (i32.load … get address in `in_b` …)
    )
  )
  …
)

But when the -msimd128 flag is used, the autovectorizer turns this into code that includes the following loop:

(loop
  (v128.store align=4
    … get address in `out` …
    (i32x4.mul
      (v128.load align=4 … get address in `in_a` …)
      (v128.load align=4 … get address in `in_b` …)
    )
  )
  …
)

The loop body has the same structure but SIMD instructions are being used to load, multiply, and store four elements at a time inside the loop body.
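Before it can use the vectorized loop, the compiler generally has to allow for the possibility that the output array overlaps one of the inputs. As a minimal sketch, assuming the caller can guarantee that the three arrays never overlap, C99's restrict qualifier passes that guarantee on to the compiler. The name multiply_arrays_noalias is hypothetical and used only for illustration.

```c
#include <stddef.h>

/* Hypothetical variant of multiply_arrays. `restrict` is a promise made by
   the caller that out, in_a and in_b never overlap; breaking that promise
   is undefined behavior. */
void multiply_arrays_noalias(int* restrict out, const int* restrict in_a,
                             const int* restrict in_b, size_t size) {
  for (size_t i = 0; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}
```

The loop body is unchanged, so the function behaves identically with or without -msimd128. Only the information available to the optimizer differs.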

For finer grained control over the SIMD instructions generated by the compiler, include the wasm_simd128.h header file, which defines a set of intrinsics. Intrinsics are special functions that, when called, will be turned by the compiler into the corresponding WebAssembly SIMD instructions, unless it can make further optimizations.

As an example, here is the same function from before manually rewritten to use the SIMD intrinsics.

#include <wasm_simd128.h>

void multiply_arrays(int* out, int* in_a, int* in_b, int size) {
  for (int i = 0; i < size; i += 4) {
    v128_t a = wasm_v128_load(&in_a[i]);
    v128_t b = wasm_v128_load(&in_b[i]);
    v128_t prod = wasm_i32x4_mul(a, b);
    wasm_v128_store(&out[i], prod);
  }
}

This manually rewritten code assumes that the input and output arrays are aligned and do not alias and that size is a multiple of four. The autovectorizer cannot make these assumptions and has to generate extra code to handle the cases where they are not true, so hand-written SIMD code often ends up being smaller than autovectorized SIMD code.
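If size is not guaranteed to be a multiple of four, the hand-written version above would read and write past the end of the arrays. One possible way to lift that restriction, sketched here rather than taken from the original example, is to process full groups of four with intrinsics and finish with a scalar loop. The name multiply_arrays_any_size is hypothetical. The sketch also relies on Clang defining the __wasm_simd128__ macro when -msimd128 is passed, so the same file still builds for non-SIMD targets.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

/* Hypothetical variant of multiply_arrays that accepts any size.
   The arrays are still assumed not to overlap. */
void multiply_arrays_any_size(int32_t* out, const int32_t* in_a,
                              const int32_t* in_b, size_t size) {
  size_t i = 0;
#ifdef __wasm_simd128__
  /* Vector part: multiply four 32-bit lanes per iteration while at least
     four elements remain. */
  for (; i + 4 <= size; i += 4) {
    v128_t a = wasm_v128_load(&in_a[i]);
    v128_t b = wasm_v128_load(&in_b[i]);
    wasm_v128_store(&out[i], wasm_i32x4_mul(a, b));
  }
#endif
  /* Scalar tail: the remaining 0 to 3 elements, or every element when
     SIMD is not enabled for this build. */
  for (; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}
```

The extra loop costs a little code size, which is the same trade-off the autovectorizer makes when it cannot assume anything about size.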

Cross-compiling existing C / C++ projects #

Many existing projects already support SIMD when targeting other platforms, in particular SSE and AVX instructions on x86 / x86-64 platforms and NEON instructions on ARM platforms. There are two ways those are usually implemented.

The first is via assembly files that take care of SIMD operations and are linked together with C / C++ during the build process. The assembly syntax and instructions are highly platform-dependent and not portable, so to make use of SIMD, such projects need to add WebAssembly as an additional supported target and reimplement the corresponding functions using either the WebAssembly text format or the intrinsics described above.

Another common approach is to use SSE / SSE2 / AVX / NEON intrinsics directly from C / C++ code, and here Emscripten can help. Emscripten provides compatible headers for all those instruction sets, along with an emulation layer that compiles them directly to Wasm intrinsics where possible, or to scalarized code otherwise.

To cross-compile such projects, first enable SIMD via project-specific configuration flags, e.g. ./configure --enable-simd, so that the build passes -msse, -msse2, -mavx or -mfpu=neon to the compiler and calls the corresponding intrinsics. Then additionally pass -msimd128 to enable WebAssembly SIMD too, either by using CFLAGS=-msimd128 make … / CXXFLAGS=-msimd128 make … or by modifying the build config directly when targeting Wasm.
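As an illustration of the kind of source this applies to, here is a minimal sketch of a function written with SSE2 intrinsics. The function name add_arrays_sse2 is hypothetical, and the sketch assumes that the compiler defines __SSE2__ whenever -msse2 is in effect, as native x86 compilers do. With Emscripten, such a file would be built with both -msse2 and -msimd128, as described above.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __SSE2__
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical example: element-wise addition of two int32 arrays. */
void add_arrays_sse2(int32_t* out, const int32_t* in_a,
                     const int32_t* in_b, size_t size) {
  size_t i = 0;
#ifdef __SSE2__
  /* Four 32-bit integers per iteration, using unaligned loads and stores. */
  for (; i + 4 <= size; i += 4) {
    __m128i a = _mm_loadu_si128((const __m128i*)&in_a[i]);
    __m128i b = _mm_loadu_si128((const __m128i*)&in_b[i]);
    _mm_storeu_si128((__m128i*)&out[i], _mm_add_epi32(a, b));
  }
#endif
  /* Scalar fallback for the tail, or for builds without SSE2. */
  for (; i < size; i++) {
    out[i] = in_a[i] + in_b[i];
  }
}
```

Code like this needs no source changes for the WebAssembly build. Only the compiler flags differ between the native and Wasm targets.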

Building Rust to target SIMD #

When compiling Rust code to target WebAssembly SIMD, you'll need to enable the same simd128 LLVM feature as in Emscripten above.

If you can control rustc flags directly or via the RUSTFLAGS environment variable, pass -C target-feature=+simd128:

rustc … -C target-feature=+simd128 -o out.wasm

or

RUSTFLAGS="-C target-feature=+simd128" cargo build

As in Clang / Emscripten, LLVM’s autovectorizers are enabled by default for optimized code when the simd128 feature is enabled.

For example, the Rust equivalent of the multiply_arrays example above

pub fn multiply_arrays(out: &mut [i32], in_a: &[i32], in_b: &[i32]) {
    in_a.iter()
        .zip(in_b)
        .zip(out)
        .for_each(|((a, b), dst)| {
            *dst = a * b;
        });
}

would produce similar autovectorized code for the aligned part of the inputs.

To have manual control over the SIMD operations, you can use the nightly toolchain, enable the Rust feature wasm_simd and invoke the intrinsics from the std::arch::wasm32 namespace directly:

#![feature(wasm_simd)]

use std::arch::wasm32::*;

pub unsafe fn multiply_arrays(out: &mut [i32], in_a: &[i32], in_b: &[i32]) {
    in_a.chunks(4)
        .zip(in_b.chunks(4))
        .zip(out.chunks_mut(4))
        .for_each(|((a, b), dst)| {
            let a = v128_load(a.as_ptr() as *const v128);
            let b = v128_load(b.as_ptr() as *const v128);
            let prod = i32x4_mul(a, b);
            v128_store(dst.as_mut_ptr() as *mut v128, prod);
        });
}

Alternatively, use a helper crate like packed_simd that abstracts over SIMD implementations on various platforms.

Compelling use cases #

The WebAssembly SIMD proposal seeks to accelerate compute-intensive applications such as audio/video codecs, image processing applications and cryptographic applications. Currently WebAssembly SIMD is experimentally supported in widely used open source projects like Halide, OpenCV.js, and XNNPACK.

Some interesting demos come from the MediaPipe project by the Google Research team.

As per their description, MediaPipe is a framework for building multimodal (e.g. video, audio, any time series data) applied ML pipelines. And they have a Web version, too!

One of the most visually appealing demos, where it’s easy to observe the difference in performance SIMD makes, is a CPU-only (non-GPU) build of a hand-tracking system. Without SIMD, you can get only around 14-15 FPS (frames per second) on a modern laptop, while with SIMD enabled in Chrome Canary you get a much smoother experience at 38-40 FPS.

Another interesting set of demos that makes use of SIMD for a smooth experience comes from OpenCV, a popular computer vision library that can also be compiled to WebAssembly. They’re available by link, or you can check out the pre-recorded versions below:

Card reading
Invisibility cloak
Emoji replacement

Future work #

The current fixed-width SIMD proposal is in Phase 4, so it's considered complete.

Some explorations of future SIMD extensions have started in the Relaxed SIMD and Flexible Vectors proposals, which, at the time of writing, are in Phase 1.