Boosting WebAssembly Performance with SIMD and Multi-Threading

Key Takeaways

  • WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms. 
  • WebAssembly's SIMD and multi-threading proposals have the potential to bring WebAssembly closer to realizing that vision by leveraging SIMD operations and multi-core architectures.
  • Early implementations of the proposals have been used in compute-intensive tasks (machine-learning, bio-informatics, scientific computing), with significant performance improvement observed.
  • The proposals are being implemented by a growing number of browser and non-browser runtimes and continue their upward path in the standardization process.

Recent efforts have shown how Single Instruction, Multiple Data (SIMD) parallelism and multi-threading may radically improve the performance profile of software ported to WebAssembly to run in browser and non-browser environments. Support for SIMD and multi-threading is, however, uneven across platforms and still experimental.

Ann Yuan and Marat Dukhan, software engineers at Google, detailed in September 2020 how SIMD instructions and multi-threading resulted in a 10x speed improvement on the TensorFlow.js WebAssembly backend. The pair reported performance data for MobileNet V2, a medium-sized model with 3.5 million parameters and roughly 300 million multiply-add operations.

TensorFlow.js is an open-source machine-learning JavaScript library for training and deploying models in the browser and on Node.js. MobileNet V2 is a family of neural network architectures for efficient on-device image classification, detection, and segmentation. The TensorFlow module that leverages MobileNet V2 contains a trained instance of the network. The WebAssembly TensorFlow backend is an alternative to the WebGL backend, bringing fast CPU execution with minimal code changes, and improving performance on lower-end mobile devices that may lack WebGL support or have a slow GPU.

SIMD and multi-threading are supported in TensorFlow.js 2.3.0 and above. Because support for the two features is uneven across runtimes, the backend ships three binary versions, covering the cases in which both, part, or none of the two technologies are available. Readers who want to assess the performance improvement from a user's point of view can compare an optimized version (using the new Wasm backend) with the unoptimized binary, which has SIMD and multi-threading turned off (Chrome users should enable the experimental SIMD flag, #enable-webassembly-simd). On the computer on which this piece was written, the optimized version reached around 50fps, roughly twice the frame rate of the unoptimized version.

SIMD combines well with multi-threading, but the performance gains that each enables are independent. Multi-threading extracts extra performance by spreading work across the cores of a multi-core architecture, whereas SIMD works within a single core, applying one processor instruction to several pieces of data at once. The previously mentioned TensorFlow Wasm benchmarks show that SIMD is responsible for a 1.7x to 4.5x performance improvement over vanilla Wasm, while multi-threading produces an additional 1.8x to 2.9x speedup on top of that.
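As a rough worked check of how the two reported ranges compose, the calculation below multiplies their endpoints. It assumes that the factors multiply (which "on top of that" suggests) and that the low and high ends of each range occur on the same device, which the benchmark figures do not guarantee, so the result is only an indicative envelope:

```latex
\[
\underbrace{1.7 \times 1.8}_{\text{low end}} = 3.06
\qquad\text{and}\qquad
\underbrace{4.5 \times 2.9}_{\text{high end}} = 13.05
\]
```

Under those assumptions, the combined speedup over vanilla Wasm would fall between roughly 3x and 13x, which is consistent with the 10x figure quoted earlier.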

Google Research showcased a few applications involving complex computations (hand-tracking, document scanning, credit card recognition) that the OpenCV computer vision library speeds up with WebAssembly SIMD. Thomas Nattestad, Google V8 product manager, explained at a previous web.dev LIVE event how the Zoom video-conferencing application leverages SIMD to replace the user’s real background with a virtual background. Robert Aboukhalil, a bioinformatics software engineer for the medical genetic testing company Invitae, recently revealed that SIMD is often used in genomics, in tools that align DNA sequences to a genome, including SSW, bowtie2, and minimap2. Aboukhalil experimented with porting SSW to Wasm, and commented on the sizeable performance difference with SIMD enabled when aligning short DNA sequences to the reference genome of the Lambda phage:

The results are striking: without WebAssembly SIMD, the code is hundreds of times slower to run in the browser than it is natively on the command line:

(Source: Medium blog post)

In June 2019, as Wasmer became the first Wasm runtime to fully support WASI and SIMD, Nick Lewycky, LLVM core contributor, compared the performance of native SIMD, Wasmer SIMD, and Wasmer non-SIMD builds of a particle physics simulation:

Here are the run times of our physics simulation:

Time to execute the particles emulation program (lower is better)

As you can see, the speed when running the SIMD in the native executable versus running it with Wasmer… is almost the same!

We mentioned previously that SIMD stands for Single Instruction, Multiple Data. As the name implies, SIMD executes the same operation on multiple pieces of data in one instruction. With 128-bit SIMD registers, for instance, four pairs of 32-bit numbers can be multiplied in one instruction rather than in four separate 32-bit instructions. The parallelism implemented by the processor running the SIMD instruction results in better performance (up to 4x in this example) than running the four instructions sequentially.
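The four-lane idea can be tried outside WebAssembly with a minimal sketch. It assumes a GCC or Clang compiler, since `vector_size` is a compiler extension and not part of the WebAssembly SIMD proposal; the names `i32x4`, `mul4_scalar`, and `mul4_vector` are made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang vector extension. Four 32-bit lanes in 128 bits. */
typedef int32_t i32x4 __attribute__((vector_size(16)));
static_assert(sizeof(i32x4) == 16, "i32x4 must be 128 bits wide");

/* Scalar version: four separate 32-bit multiplications. */
void mul4_scalar(int32_t out[4], const int32_t a[4], const int32_t b[4]) {
  out[0] = a[0] * b[0];
  out[1] = a[1] * b[1];
  out[2] = a[2] * b[2];
  out[3] = a[3] * b[3];
}

/* Vector version: one multiplication expression covering all four lanes.
   The compiler is free to lower it to a single SIMD instruction where the
   target supports one. */
void mul4_vector(int32_t out[4], const int32_t a[4], const int32_t b[4]) {
  i32x4 va, vb;
  memcpy(&va, a, sizeof va);   /* load four lanes */
  memcpy(&vb, b, sizeof vb);
  i32x4 prod = va * vb;        /* lane-wise multiply */
  memcpy(out, &prod, sizeof prod);
}
```

Both functions produce the same four products. Whether the vector version actually compiles to one instruction depends on the target and the optimization flags.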

SIMD is not new. Hewlett-Packard introduced Multimedia Acceleration eXtensions instructions into PA-RISC desktops in 1994 to accelerate MPEG decoding. Intel did the same with its MMX extensions to the x86 architecture in 1996.

SIMD is not limited to a specific data length. The Advanced Vector Extensions (AVX) to the x86 instruction set architecture provide 256-bit registers, while AVX-512 provides 512-bit registers. ARM introduced the Scalable Vector Extension (SVE), a SIMD instruction set architecture in which the size of registers is hardware-dependent and not known at compilation time. The WebAssembly SIMD proposal only addresses 128-bit data. It includes a subset of operations that can be ported to common SIMD instruction set architectures (ISAs). The fixed-width WebAssembly SIMD proposal introduces a new v128 value type. With Emscripten, the element-wise product of two arrays can be stored in a third one as follows:

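#include <wasm_simd128.h>

/* Multiplies in_a and in_b element-wise into out, four 32-bit lanes at a time. */
void multiply_arrays(int* out, int* in_a, int* in_b, int size) {
  int i = 0;
  /* Vector loop: only runs while a full group of four elements remains. */
  for (; i + 4 <= size; i += 4) {
    v128_t a = wasm_v128_load(&in_a[i]);
    v128_t b = wasm_v128_load(&in_b[i]);
    v128_t prod = wasm_i32x4_mul(a, b);
    wasm_v128_store(&out[i], prod);
  }
  /* Scalar tail: handles the last elements when size is not a multiple of 4. */
  for (; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}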

However, toolchains apply automatic vectorization where they can. With LLVM, developers can write standard programming constructs and let the compiler transform them into SIMD instructions. The previous array multiplication could thus be written as follows:

void multiply_arrays(int* out, int* in_a, int* in_b, int size) {
  for (int i = 0; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}
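Whether such a loop is actually vectorized depends on the compiler, the optimization level, and what the compiler can prove about the pointers. One common way to help is sketched below: it adds C99 `restrict` qualifiers, which promise that the three arrays do not overlap. The function name `multiply_arrays_restrict` is made up for this illustration, and the Emscripten flag `-msimd128` mentioned in the comment should be checked against the toolchain version in use:

```c
#include <assert.h>
#include <stddef.h>

/* restrict tells the compiler that out, in_a and in_b never alias, so it
   does not need to guard a vectorized loop with runtime overlap checks.
   Assumption: built with optimizations on (for example -O2) and, when
   targeting WebAssembly with Emscripten, with -msimd128. */
void multiply_arrays_restrict(int *restrict out,
                              const int *restrict in_a,
                              const int *restrict in_b,
                              size_t size) {
  assert(out != NULL && in_a != NULL && in_b != NULL);
  for (size_t i = 0; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}
```

The promise made by `restrict` is the caller's responsibility: passing overlapping arrays to this function is undefined behavior.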

Because SIMD applies a single operation to multiple pieces of data, it is particularly well suited to computations dominated by vector operations. The Apple developer blog documented how a 128-bit SIMD library can be used to perform common computations on vectors, matrices, and quaternions that arise in 3D-graphics programming. Multi-threading allows developers to leverage multi-core architectures to run tasks in parallel rather than simply concurrently. SIMD and multi-threading parallelism have been used for image, audio, and video processing, and in cryptographic applications.
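To make the multi-threading side concrete, here is a minimal sketch that splits the same element-wise multiplication across two POSIX threads, each handling its own slice of the arrays. The names `slice_t`, `multiply_slice`, `multiply_arrays_threaded`, and the fixed thread count are made up for this illustration. Emscripten can compile pthreads code to WebAssembly threads (typically with the `-pthread` flag), but the build settings depend on the toolchain and are not shown here:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { NUM_THREADS = 2 }; /* hypothetical fixed worker count */

/* The half-open range [begin, end) of indices one thread is responsible for. */
typedef struct {
  int *out;
  const int *in_a;
  const int *in_b;
  size_t begin;
  size_t end;
} slice_t;

static void *multiply_slice(void *arg) {
  slice_t *s = arg;
  for (size_t i = s->begin; i < s->end; i++) {
    s->out[i] = s->in_a[i] * s->in_b[i];
  }
  return NULL;
}

/* Returns 0 on success, -1 if a thread could not be started
   (in which case out is only partially filled). */
int multiply_arrays_threaded(int *out, const int *in_a, const int *in_b,
                             size_t size) {
  assert(out != NULL && in_a != NULL && in_b != NULL);
  pthread_t threads[NUM_THREADS];
  slice_t slices[NUM_THREADS];
  size_t chunk = (size + NUM_THREADS - 1) / NUM_THREADS; /* ceiling division */
  int started = 0;

  for (int t = 0; t < NUM_THREADS; t++) {
    size_t begin = (size_t)t * chunk;
    if (begin > size) begin = size;
    size_t end = (begin + chunk > size) ? size : begin + chunk;
    slices[t] = (slice_t){out, in_a, in_b, begin, end};
    if (pthread_create(&threads[t], NULL, multiply_slice, &slices[t]) != 0) {
      break;
    }
    started++;
  }
  /* Wait for every thread that was actually started. */
  for (int t = 0; t < started; t++) {
    pthread_join(threads[t], NULL);
  }
  return started == NUM_THREADS ? 0 : -1;
}
```

Because the slices do not overlap, the threads never write to the same element and no locking is needed. Each slice's inner loop remains a candidate for SIMD, which is how the two techniques combine.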

WebAssembly SIMD support is still experimental in browser environments. No browser enables it by default; developers may, however, enable it through flags in Chromium-based browsers and in Firefox. WebAssembly threads (and atomics) are supported in evergreen Chrome and Firefox. Aboukhalil points out the need to plan for the absence of desired features:

Note, however, that if SIMD is not enabled, your app will simply crash. To address that, you can compile two versions of your app: one with SIMD and one without. Then you can use wasm-feature-detect to load the correct version. Just make sure you don’t use too many new features or you’ll end up with a lot of permutations of your app.
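On the C side, the two builds can come from a single source file. The sketch below relies on the `__wasm_simd128__` macro, which Clang is understood to define when WebAssembly SIMD is enabled at compile time; treat that as an assumption to verify against the toolchain in use. The function names are made up for this illustration. Choosing which binary to load at run time (for example with wasm-feature-detect) happens in JavaScript and is not shown:

```c
#include <assert.h>
#include <stddef.h>

#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

/* Reports which variant this binary is: 1 if compiled with WebAssembly SIMD
   enabled, 0 otherwise. */
int built_with_simd(void) {
#ifdef __wasm_simd128__
  return 1;
#else
  return 0;
#endif
}

void multiply_arrays_portable(int *out, const int *in_a, const int *in_b,
                              size_t size) {
  assert(out != NULL && in_a != NULL && in_b != NULL);
  size_t i = 0;
#ifdef __wasm_simd128__
  /* SIMD build: process four 32-bit lanes per iteration. */
  for (; i + 4 <= size; i += 4) {
    v128_t a = wasm_v128_load(&in_a[i]);
    v128_t b = wasm_v128_load(&in_b[i]);
    wasm_v128_store(&out[i], wasm_i32x4_mul(a, b));
  }
#endif
  /* Non-SIMD build: the whole array. SIMD build: the leftover elements. */
  for (; i < size; i++) {
    out[i] = in_a[i] * in_b[i];
  }
}
```

Compiling this file once with SIMD enabled and once without yields the two versions the quote describes, without maintaining two code paths by hand. The same source also builds natively, where only the scalar loop is compiled.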

In non-browser environments, the Wasmer and Wasmtime runtimes provide the feature through an --enable-simd option. Wasmer also has a --enable-threads option to enable multi-threading. The previously mentioned particle physics simulation can be run with Wasmer as follows:

wasmer run --backend=llvm --enable-simd particle-repel-simd.wasm

The WebAssembly Threads proposal is in stage 2, i.e., the specification draft is complete, with at least one implementation. The WebAssembly SIMD proposal is in stage 3. Deepti Gandluri and Thomas Lively mentioned in a blog article possible future work that takes advantage of the larger-width or variable-width SIMD operations offered by some instruction set architectures:

Fixed-width SIMD gives significant performance gains over scalar, but it doesn’t effectively leverage wider width vector operations that are available in modern hardware. As the current proposal moves forward, some future-facing work here is to determine the feasibility of extending the proposal with longer width operations.

WebAssembly SIMD and multi-threading are key components in realizing the WebAssembly vision of producing programs that run at native speed on a wide range of platforms. WebAssembly is getting closer to achieving that vision as proposals are implemented, test suites and specifications are refined, and critical feedback is gathered from early experiments. Early adopters have already showcased non-trivial applications of the technology with encouraging performance gains. It will be interesting to follow how the WebAssembly standard evolves in 2021, which applications the increased performance unleashes, and how the new advances can be incorporated without excessively increasing the complexity that developers have to handle when targeting WebAssembly.

About the Author

Bruno Couriol holds an MSc in Telecommunications, a BSc in Mathematics, and an MBA from INSEAD. Starting with Accenture, he has spent most of his career as a consultant, helping large companies address their critical strategic, organizational, and technical issues. In the last few years, he has developed a focus on the intersection of business, technology, and entrepreneurship.
