Why link statically against musl?
Have you ever faced compatibility issues when dealing with Linux binary executables? The culprit is often the libc implementation, glibc. Acting as the backbone of nearly all Linux distros, glibc is the library responsible for providing standard C functions. Yet, its version compatibility often poses a challenge. Binaries compiled with a newer version of glibc may not function on systems running an older one, creating a compatibility headache.
So, how can we circumvent this issue?
The approach we’ll focus on today involves the creation of static executables. These executables are self-contained and do not dynamically load libc, thereby offering enhanced portability.
However, we’re not out of the woods yet. Glibc is not well-suited for static linking because it heavily relies on dynamic loading of locale data and DNS service plugins. That’s where musl, a lightweight alternative libc implementation, shines. Musl is highly proficient at creating portable, static Linux executables.
But, when it comes to musl, there’s a certain performance trap: its allocator does not perform well in multi-core environments. Do we have to live with it? Absolutely not! By replacing musl’s malloc implementation with a cutting-edge allocator like mimalloc, we can supercharge its performance in multi-core setups.
In this blog post, I’ll take you on a deep dive into this performance-enhancing strategy using a real-world Rust project as an example. We’ll also take a look at some intriguing benchmark results that demonstrate the significant improvement this change can drive.
Are you ready to turbocharge your Rust static executables? Let’s dive right in!
Creating Rust static executables based on musl
Let’s explore how to compile wasmtime into a static Linux executable that links with musl. For those who aren’t familiar, wasmtime is a standalone runtime for WebAssembly. Although the upstream repo doesn’t distribute static wasmtime executables yet, I’ve found myself needing a version of wasmtime that functions across all distros, including those not based on glibc, like Alpine.
Luckily, wasmtime is written in Rust, a language known for its toolchain being particularly friendly to cross-compiling. Here’s how to compile wasmtime to a static executable in an Ubuntu 23.04 work environment:
$ sudo apt install -y musl-tools # Needed only if the Rust project has C dependencies
$ rustup target add x86_64-unknown-linux-musl # Downloads a static build of musl to be linked later
$ cargo build --target x86_64-unknown-linux-musl --release --bin wasmtime # Time for a short break
With just these commands, you can generate a static wasmtime executable from the checked-out repo. Testing our result shows it is indeed a static executable:
$ ./target/x86_64-unknown-linux-musl/release/wasmtime --version
wasmtime-cli 11.0.0
$ ldd ./target/x86_64-unknown-linux-musl/release/wasmtime
statically linked
However, before we celebrate, there’s a catch. While the executable works, its performance leaves something to be desired. In fact, the performance of this static wasmtime falls short when compared to ordinary wasmtime builds that link against the system-wide glibc. In the next section, we’ll explore why this happens and how to improve the performance.
A static executable based on musl may be ssllooww
In order to compare the performance of different wasmtime builds, we will execute a simple benchmark using a small tool called bench. This tool runs the same command multiple times and outputs statistical data such as mean time and variance.
The benchmark involves compiling and running a large hello.wasm module. This module is around 26MiB in size, large enough to demonstrate the difference in wasmtime run-time performance.
The benchmark is executed with wasmtime run --disable-cache, as wasmtime caches JIT-compiled code by default. However, we want different runs to be independent from each other.
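If you don’t have bench at hand, a rough stand-in is easy to sketch in Rust with std::process::Command: run the same command a few times and average the wall-clock time. The binary path, module name and run count below are placeholders for this post’s setup, and there are no fancy statistics, just the mean:
// timing.rs: a minimal stand-in for `bench` (mean wall-clock time only).
use std::process::Command;
use std::time::{Duration, Instant};

fn main() {
    let runs: u32 = 5;
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        let start = Instant::now();
        // Placeholder paths: adjust to the wasmtime build and module under test.
        let status = Command::new("./target/release/wasmtime")
            .args(["run", "--disable-cache", "hello.wasm"])
            .status()
            .expect("failed to spawn wasmtime");
        assert!(status.success(), "wasmtime exited with an error");
        total += start.elapsed();
    }
    println!("mean over {runs} runs: {:?}", total / runs);
}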
The test is performed on a server with 48 logical cores and 128GiB memory. When wasmtime does JIT compilation and optimization, it automatically uses all CPU cores by default.
Here’s the result for an ordinary wasmtime build that dynamically links against the system-wide glibc:
benchmarking /workspace/wasmtime/target/release/wasmtime run --disable-cache hello.wasm
time 4.311 s (4.106 s .. 4.714 s)
0.999 R² (0.998 R² .. 1.000 R²)
mean 4.452 s (4.341 s .. 4.599 s)
std dev 145.8 ms (49.87 ms .. 198.2 ms)
variance introduced by outliers: 19% (moderately inflated)
Next, here’s the result for the static wasmtime build produced earlier:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache hello.wasm
time 93.07 s (91.75 s .. 93.74 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 92.16 s (91.35 s .. 92.60 s)
std dev 790.0 ms (79.30 ms .. 1.017 s)
variance introduced by outliers: 19% (moderately inflated)
So, on a machine with many cores, a static wasmtime build can take 20x more time to compile and run a wasm module! Clearly, this is not the type of build we want.
malloc contention is the culprit
Let’s observe htop while the static wasmtime build is struggling and consuming all the CPU cores. The CPU bars are filled with red, which is undesirable. This color signifies that CPU time is wasted in kernel mode, when we ideally want the green color that indicates CPU time is properly spent in user mode, doing useful work.
A common cause of the above symptom is thread contention: multiple threads are fighting to acquire the same resource instead of doing useful work. Although it’s better than deadlock or livelock, since the application runs to completion, it’s still a significant waste of CPU.
To confirm this hypothesis, we rerun the benchmark with --disable-parallel-compilation. Here are the numbers for the glibc build:
benchmarking /workspace/wasmtime/target/release/wasmtime run --disable-cache --disable-parallel-compilation hello.wasm
time 41.24 s (40.59 s .. 41.82 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 41.40 s (41.30 s .. 41.55 s)
std dev 146.4 ms (37.24 ms .. 194.9 ms)
variance introduced by outliers: 19% (moderately inflated)
And here are the numbers for the musl build:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache --disable-parallel-compilation hello.wasm
time 41.32 s (40.99 s .. 41.55 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 41.38 s (41.32 s .. 41.41 s)
std dev 47.04 ms (16.59 ms .. 63.66 ms)
variance introduced by outliers: 19% (moderately inflated)
Now the numbers are comparable, and interestingly, the musl build performs much faster than when all CPU cores are used. Therefore, it’s indeed a thread contention issue. Inappropriate use of locks in user code can lead to thread contention. However, this is not the case in wasmtime, as the glibc build shows significant speedup when operating in multi-core mode.
The real source of thread contention is in the malloc implementation of musl. A malloc implementation must be thread-safe, as multiple threads may allocate memory at once or even free memory allocated in other threads. Thus, the thread synchronization logic could become a bottleneck.
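To get an intuition for this failure mode without involving wasmtime at all, here’s a minimal Rust sketch (the thread and iteration counts are arbitrary) that hammers malloc/free from many threads at once. Timing it against different libc builds is a quick way to isolate allocator behaviour from the rest of the application:
// alloc_stress.rs: allocate and free many small vectors from several threads,
// so that almost all the work goes through malloc/free.
use std::thread;

fn main() {
    let threads = 48; // arbitrary; roughly matches the core count of the test machine
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(|| {
                for i in 0..1_000_000usize {
                    // Vary the size a little so the allocator cannot trivially reuse one slot.
                    let v: Vec<u8> = vec![0u8; 64 + (i % 64)];
                    std::hint::black_box(&v); // keep the allocation from being optimized away
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}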
Although I have not deep-dived into the musl malloc codebase to locate the contention, replacing it with a cutting-edge malloc implementation like mimalloc is sufficient to rectify the problem and enhance performance. Mimalloc minimizes contention and enhances multi-threaded performance. In fact, this replacement can make the application run even faster than the default glibc build.
Building mimalloc and linking against it
First, check out the latest release of mimalloc and apply this patch, which accomplishes several things:
- Disables -fPIC when building libmimalloc.a to prioritize performance in our static build. The -fPIC option can create unnecessary overhead.
- Turns off the override of the C++ new/delete operators. The original implementations will still invoke malloc/free in libmimalloc.a, so overriding these is unnecessary and could trigger linker errors when linking with C++.
- Ensures the __libc_ wrappers are also created when compiling for musl; these wrappers are indeed used by musl internally.
Now, you can build libmimalloc.a using the command below:
$ cmake \
-Bout \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=x86_64-linux-musl-gcc \
-DMI_BUILD_SHARED=OFF \
-DMI_BUILD_OBJECT=OFF \
-DMI_BUILD_TESTS=OFF \
.
$ cmake --build out
The next challenge is to link libmimalloc.a into wasmtime, ensuring it overrides musl’s default malloc implementation. In a C/C++ project, you may pass
-Wl,--push-state,--whole-archive,/workspace/mimalloc/out/libmimalloc.a,--pop-state
to the linker arguments. This works well for glibc, since symbols like malloc are defined as weak symbols, so they can easily be overridden by a malloc defined in user code.
However, messing with linker commands doesn’t work so well for musl. It defines malloc as a strong symbol, so overriding malloc may or may not work depending on where the build system places the above line in the list of linker arguments. Bad news for us: it doesn’t work in cargo:
RUSTFLAGS="-C link-arg=-Wl,--push-state,--whole-archive,/workspace/mimalloc/out/libmimalloc.a,--pop-state" cargo build \
--target x86_64-unknown-linux-musl \
--release \
--bin wasmtime
...
/usr/bin/ld: /workspace/mimalloc/out/libmimalloc.a(alloc.c.o): in function `posix_memalign':
alloc.c:(.text+0x1b0): multiple definition of `posix_memalign'; /home/gitpod/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-musl/lib/self-contained/libc.a(posix_memalign.lo):/build/musl-cross-make/build/local/x86_64-linux-musl/obj_musl/../src_musl/src/malloc/posix_memalign.c:5: first defined here
...
Patching musl libc.a in-place
Could we patch the musl sources, replace the existing malloc implementation with mimalloc, and rebuild the entire thing? Yes, that can work, but there is a faster way: we can patch the musl libc.a shipped by rustup in-place!
$ LIBC_PATH=$(find ~/.rustup -name libc.a)
$ {
echo "CREATE libc.a"
echo "ADDLIB $LIBC_PATH"
echo "DELETE aligned_alloc.lo calloc.lo donate.lo free.lo libc_calloc.lo lite_malloc.lo malloc.lo malloc_usable_size.lo memalign.lo posix_memalign.lo realloc.lo reallocarray.lo valloc.lo"
echo "ADDLIB out/libmimalloc.a"
echo "SAVE"
} | ar -M
$ mv libc.a $LIBC_PATH
The above commands use an ar script to strip the original malloc implementation completely and replace it with the object files from libmimalloc.a.
ar scripts are a legacy feature originating in the era of Motorola 68000 assembly, whose toolchain contains a scriptable “librarian” program that creates archives from object files. The language is self-explanatory; you may take a look at the documentation of either GNU ar or llvm-ar for more details.
Despite being a historical artifact, the script interface does provide one feature absent from the command-line interface: the ADDLIB command, which appends all members of an archive to the result instead of adding the archive file itself as a single member. This is exactly what we want in this use case.
Another interesting question is: where did the list of objects to be deleted from the original libc.a come from? Thankfully, the musl codebase is organized in a fine-grained manner, with roughly one C source file implementing one C standard library function. So I simply did a local musl build, checked the defined symbols in each object file, and found these object files. There’s no collateral damage in removing them: libmimalloc.a provides the same allocator with a conforming ABI.
Anyway, now that libc.a has been patched, the original build command that targets musl will be capable of building a performant static executable:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache hello.wasm
time 3.734 s (3.537 s .. 3.878 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 3.825 s (3.763 s .. 3.936 s)
std dev 106.5 ms (2.226 ms .. 129.0 ms)
variance introduced by outliers: 19% (moderately inflated)
See, it’s even faster than the build that links against glibc!
Conclusion
The practice of statically linking against musl commonly enhances the portability of Linux executables. However, for solid multi-core performance, it’s vital to substitute the malloc implementation with a more efficient one. While patching libc.a in-place might seem unconventional, it proves to be the most straightforward solution:
- It eliminates concerns about duplicate symbol errors during linking.
- There is no need to interfere with linker arguments in high-level build systems like cargo.
- The same technique can be used with Alpine’s own /usr/lib/libc.a, allowing mimalloc to function automatically with other languages such as C, C++, or Haskell.
mimalloc also comes in the form of a Rust crate and can be utilized to replace the Rust global allocator. In languages like Rust, C++, Zig, and potentially others, the heap allocator features an interface defined in the same language and can be overridden. This requires additional adjustments in the user’s project but is suitable for scenarios that don’t involve intensive interoperation with C.
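For the crate-based route, the setup is tiny. Here’s a minimal sketch; the crate version in the comment is indicative, not pinned:
// Cargo.toml: add `mimalloc = "0.1"` (version indicative) to [dependencies].
use mimalloc::MiMalloc;

// Route every Rust-level heap allocation through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Allocations made by Rust code now go through mimalloc, but direct
    // malloc() calls from C dependencies still use the libc allocator.
    let data = vec![0u8; 1 << 20];
    println!("allocated {} bytes via mimalloc", data.len());
}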
Nonetheless, there could be unusual situations that assume the Rust/C heaps are identical, and the C side might free pointers that originate from the other side. In such circumstances, it’s advisable to override the global libc allocator and allow mimalloc to manage everything, utilizing the trick introduced in this blog post.
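To make the distinction concrete, here’s a small sketch (using the libc crate) of an allocation path that a Rust global allocator override never sees: raw malloc/free calls, exactly like the ones C dependencies make. Only the libc.a-level replacement covers these:
// Requires the `libc` crate. These calls go straight to the libc allocator,
// bypassing the Rust global allocator entirely; patching libc.a is what
// redirects them to mimalloc.
fn main() {
    unsafe {
        let ptr = libc::malloc(1024) as *mut u8;
        assert!(!ptr.is_null());
        ptr.write_bytes(0, 1024); // touch the memory so the allocation is real
        libc::free(ptr as *mut libc::c_void);
    }
}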
We also have a repo that provides a Dockerfile; it layers upon the official rust:alpine image and performs the patching logic described in this post. You can build and use the image as-is in your CI pipelines to build static C/C++/Rust executables, or adapt the build script for your own needs.
Happy hacking!
About the author
Cheng is a Software Engineer who specializes in the implementation of functional programming languages. He is the project lead and main developer of Tweag's Haskell-to-WebAssembly compiler project codenamed Asterius. He also maintains other Haskell projects and makes contributions to GHC (Glasgow Haskell Compiler). Outside of work, Cheng spends his time exploring Paris and watching anime.