Why link statically against musl?
Have you ever faced compatibility issues when dealing with Linux binary executables? The culprit is often the libc implementation, glibc. Acting as the backbone of nearly all Linux distros, glibc is the library responsible for providing standard C functions. Yet, its version compatibility often poses a challenge. Binaries compiled with a newer version of glibc may not function on systems running an older one, creating a compatibility headache.
So, how can we circumvent this issue?
The approach we’ll focus on today involves the creation of static executables. These executables are self-contained and do not dynamically load libc, thereby offering enhanced portability.
However, we’re not out of the woods yet. Glibc is not well-suited for static linking because it heavily relies on dynamic loading of locale data and DNS service plugins. That’s where musl, a lightweight alternative libc implementation, shines. Musl is highly proficient at creating portable, static Linux executables.
But, when it comes to musl, there’s a certain performance trap: its allocator does not perform well in multi-core environments. Do we have to live with it? Absolutely not! By replacing musl’s malloc implementation with a cutting-edge allocator like mimalloc, we can supercharge its performance in multi-core setups.
In this blog post, I’ll take you on a deep dive into this performance-enhancing strategy using a real-world Rust project as an example. We’ll also take a look at some intriguing benchmark results that demonstrate the significant improvement this change can drive.
Are you ready to turbocharge your Rust static executables? Let’s dive right in!
Creating Rust static executables based on musl
Let’s explore how to compile wasmtime into a static Linux executable that links with musl. For those who aren’t familiar, wasmtime is a standalone runtime for WebAssembly. Although the upstream repo doesn’t distribute static wasmtime executables yet, I’ve found myself needing a version of wasmtime that functions across all distros, including those not based on glibc, like Alpine.
Luckily, wasmtime is written in Rust, a language known for its toolchain being particularly friendly to cross-compiling. Here’s how to compile wasmtime to a static executable in an Ubuntu 23.04 work environment:
$ sudo apt install -y musl-tools # Needed only if the Rust project has C dependencies
$ rustup target add x86_64-unknown-linux-musl # Downloads a static build of musl to be linked later
$ cargo build --target x86_64-unknown-linux-musl --release --bin wasmtime # Time for a short break
With just these commands, you can generate a static wasmtime executable from the checked-out repo. Testing our result shows it is indeed a static executable:
$ ./target/x86_64-unknown-linux-musl/release/wasmtime --version
wasmtime-cli 11.0.0
$ ldd ./target/x86_64-unknown-linux-musl/release/wasmtime
statically linked
However, before we celebrate, there’s a catch. While the executable works, its performance leaves something to be desired. In fact, the performance of this static wasmtime falls short when compared to ordinary wasmtime builds that link against the system-wide glibc. In the next section, we’ll explore why this happens and how to improve the performance.
A static executable based on musl may be ssllooww
In order to compare the performance of different wasmtime builds, we will execute a simple benchmark using a small tool called bench. This tool runs the same command multiple times and outputs statistical data such as mean time and variance.
The benchmark involves compiling and running a large hello.wasm module. This module is around 26MiB in size, large enough to demonstrate the difference in wasmtime run-time performance.
The benchmark is executed with wasmtime run --disable-cache, as wasmtime caches JIT-compiled code by default. However, we want different runs to be independent from each other.
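If you don’t have bench at hand, a rough stand-in is easy to sketch in Rust with std::process::Command: run the same command a few times and average the wall-clock time. The binary path, module name and run count below are placeholders for this post’s setup, and there are no fancy statistics, just the mean:
// timing.rs: a minimal stand-in for `bench` (mean wall-clock time only).
use std::process::Command;
use std::time::{Duration, Instant};

fn main() {
    let runs: u32 = 5;
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        let start = Instant::now();
        // Placeholder paths: adjust to the wasmtime build and module under test.
        let status = Command::new("./target/release/wasmtime")
            .args(["run", "--disable-cache", "hello.wasm"])
            .status()
            .expect("failed to spawn wasmtime");
        assert!(status.success(), "wasmtime exited with an error");
        total += start.elapsed();
    }
    println!("mean over {runs} runs: {:?}", total / runs);
}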
The test is performed on a server with 48 logical cores and 128GiB memory. When wasmtime does JIT compilation and optimization, it automatically uses all CPU cores by default.
Here’s the result for an ordinary wasmtime build that dynamically links against the system-wide glibc:
benchmarking /workspace/wasmtime/target/release/wasmtime run --disable-cache hello.wasm
time 4.311 s (4.106 s .. 4.714 s)
0.999 R² (0.998 R² .. 1.000 R²)
mean 4.452 s (4.341 s .. 4.599 s)
std dev 145.8 ms (49.87 ms .. 198.2 ms)
variance introduced by outliers: 19% (moderately inflated)
Next, here’s the result for the static wasmtime build produced earlier:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache hello.wasm
time 93.07 s (91.75 s .. 93.74 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 92.16 s (91.35 s .. 92.60 s)
std dev 790.0 ms (79.30 ms .. 1.017 s)
variance introduced by outliers: 19% (moderately inflated)
So, on a machine with many cores, a static wasmtime build can take 20x more time to compile and run a wasm module! Clearly, this is not the type of build we want.
malloc contention is the culprit
Let’s observe htop while the static wasmtime build is struggling and consuming all the CPU cores. The CPU bars are filled with red, which is undesirable. This color signifies that CPU time is wasted in kernel mode, when we ideally want the green color that indicates CPU time is properly spent in user mode, doing useful work.
A common cause of the above symptom is thread contention: multiple threads are fighting to acquire the same resource instead of doing useful work. Although it’s better than deadlock or livelock, since the application runs to completion, it’s still a significant waste of CPU.
To confirm this hypothesis, we rerun the benchmark with --disable-parallel-compilation. Here are the numbers for the glibc build:
benchmarking /workspace/wasmtime/target/release/wasmtime run --disable-cache --disable-parallel-compilation hello.wasm
time 41.24 s (40.59 s .. 41.82 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 41.40 s (41.30 s .. 41.55 s)
std dev 146.4 ms (37.24 ms .. 194.9 ms)
variance introduced by outliers: 19% (moderately inflated)
And here are the numbers for the musl build:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache --disable-parallel-compilation hello.wasm
time 41.32 s (40.99 s .. 41.55 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 41.38 s (41.32 s .. 41.41 s)
std dev 47.04 ms (16.59 ms .. 63.66 ms)
variance introduced by outliers: 19% (moderately inflated)
Now the numbers are comparable, and interestingly, the musl build performs much faster than when all CPU cores are used. Therefore, it’s indeed a thread contention issue. Inappropriate use of locks in user code can lead to thread contention. However, this is not the case in wasmtime, as the glibc build shows significant speedup when operating in multi-core mode.
The real source of thread contention is in the malloc implementation of musl. A malloc implementation must be thread-safe, as multiple threads may allocate memory at once or even free memory allocated in other threads. Thus, the thread synchronization logic could become a bottleneck.
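To get an intuition for this failure mode without involving wasmtime at all, here’s a minimal Rust sketch (the thread and iteration counts are arbitrary) that hammers malloc/free from many threads at once. Timing it against different libc builds is a quick way to isolate allocator behaviour from the rest of the application:
// alloc_stress.rs: allocate and free many small vectors from several threads,
// so that almost all the work goes through malloc/free.
use std::thread;

fn main() {
    let threads = 48; // arbitrary; roughly matches the core count of the test machine
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(|| {
                for i in 0..1_000_000usize {
                    // Vary the size a little so the allocator cannot trivially reuse one slot.
                    let v: Vec<u8> = vec![0u8; 64 + (i % 64)];
                    std::hint::black_box(&v); // keep the allocation from being optimized away
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}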
Although I have not deep-dived into the musl malloc codebase to locate the contention, replacing it with a cutting-edge malloc implementation like mimalloc is sufficient to rectify the problem and enhance performance. Mimalloc minimizes contention and enhances multi-threaded performance. In fact, this replacement can make the application run even faster than the default glibc build.
Building mimalloc and linking against it
First, check out the latest release of mimalloc and apply this patch, which accomplishes several things:
- Disables -fPIC when building libmimalloc.a to prioritize performance in our static build. The -fPIC option can create unnecessary overhead.
- Turns off the override of the C++ new/delete operators. The original implementations will still invoke malloc/free in libmimalloc.a, so overriding these is unnecessary and could trigger linker errors when linking with C++.
- Ensures the __libc_ wrappers are also created when compiling for musl; these wrappers are indeed used by musl internally.
Now, you can build libmimalloc.a using the command below:
$ cmake \
-Bout \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=x86_64-linux-musl-gcc \
-DMI_BUILD_SHARED=OFF \
-DMI_BUILD_OBJECT=OFF \
-DMI_BUILD_TESTS=OFF \
.
$ cmake --build out
The next challenge is to link libmimalloc.a into wasmtime, ensuring it overrides musl’s default malloc implementation. In a C/C++ project, you may pass
-Wl,--push-state,--whole-archive,/workspace/mimalloc/out/libmimalloc.a,--pop-state
to the linker arguments. This works well for glibc, since symbols like malloc are defined as weak symbols, so they can easily be overridden by a malloc defined in user code.
However, messing with linker commands doesn’t work so well for musl. It defines malloc as a strong symbol, so overriding malloc may or may not work depending on where the build system places the above line in the list of linker arguments. Bad news for us: it doesn’t work in cargo:
RUSTFLAGS="-C link-arg=-Wl,--push-state,--whole-archive,/workspace/mimalloc/out/libmimalloc.a,--pop-state" cargo build \
--target x86_64-unknown-linux-musl \
--release \
--bin wasmtime
...
/usr/bin/ld: /workspace/mimalloc/out/libmimalloc.a(alloc.c.o): in function `posix_memalign':
alloc.c:(.text+0x1b0): multiple definition of `posix_memalign'; /home/gitpod/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-musl/lib/self-contained/libc.a(posix_memalign.lo):/build/musl-cross-make/build/local/x86_64-linux-musl/obj_musl/../src_musl/src/malloc/posix_memalign.c:5: first defined here
...
Patching musl libc.a in-place
Could we patch the musl sources, replace the existing malloc implementation with mimalloc, and rebuild the entire thing? Yes, that can work, but there is a faster way: we can patch the musl libc.a shipped by rustup in-place!
$ LIBC_PATH=$(find ~/.rustup -name libc.a)
$ {
echo "CREATE libc.a"
echo "ADDLIB $LIBC_PATH"
echo "DELETE aligned_alloc.lo calloc.lo donate.lo free.lo libc_calloc.lo lite_malloc.lo malloc.lo malloc_usable_size.lo memalign.lo posix_memalign.lo realloc.lo reallocarray.lo valloc.lo"
echo "ADDLIB out/libmimalloc.a"
echo "SAVE"
} | ar -M
$ mv libc.a $LIBC_PATH
The above commands use an ar script to strip the original malloc implementation completely and replace it with the object files from libmimalloc.a.
ar scripts are a legacy feature originating in the era of Motorola 68000 assembly, whose toolchain contains a scriptable “librarian” program that creates archives from object files. The language is self-explanatory; you may take a look at the documentation of either GNU ar or llvm-ar for more details.
Despite being a historical artifact, the script interface does provide one feature absent from the command-line interface: the ADDLIB command, which appends all members of an archive to the result instead of adding the archive file itself as a single member. This is exactly what we want in this use case.
Another interesting question is: where did the list of objects to be deleted from the original libc.a come from? Thankfully, the musl codebase is organized in a fine-grained manner, with roughly one C source file implementing one C standard library function. So I simply did a local musl build, checked the defined symbols in each object file, and found these object files. There’s no collateral damage in removing them: libmimalloc.a provides the same allocator with a conforming ABI.
Anyway, now that libc.a has been patched, the original build command that targets musl will be capable of building a performant static executable:
benchmarking /workspace/wasmtime/target/x86_64-unknown-linux-musl/release/wasmtime run --disable-cache hello.wasm
time 3.734 s (3.537 s .. 3.878 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 3.825 s (3.763 s .. 3.936 s)
std dev 106.5 ms (2.226 ms .. 129.0 ms)
variance introduced by outliers: 19% (moderately inflated)
See, it’s even faster than the build that links against glibc!
Conclusion
The practice of statically linking against musl commonly enhances the portability of Linux executables. However, for solid multi-core performance, it’s vital to substitute the malloc implementation with a more efficient one. While patching libc.a in-place might seem unconventional, it proves to be the most straightforward solution:
- It eliminates concerns about duplicate symbol errors during linking.
- There is no need to interfere with linker arguments in high-level build systems like cargo.
- The same technique can be used with Alpine’s own /usr/lib/libc.a, allowing mimalloc to function automatically with other languages such as C, C++, or Haskell.
mimalloc also comes in the form of a Rust crate and can be utilized to replace the Rust global allocator. In languages like Rust, C++, Zig, and potentially others, the heap allocator features an interface defined in the same language and can be overridden. This requires additional adjustments in the user’s project but is suitable for scenarios that don’t involve intensive interoperation with C.
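For the crate-based route, the setup is tiny. Here’s a minimal sketch; the crate version in the comment is indicative, not pinned:
// Cargo.toml: add `mimalloc = "0.1"` (version indicative) to [dependencies].
use mimalloc::MiMalloc;

// Route every Rust-level heap allocation through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Allocations made by Rust code now go through mimalloc, but direct
    // malloc() calls from C dependencies still use the libc allocator.
    let data = vec![0u8; 1 << 20];
    println!("allocated {} bytes via mimalloc", data.len());
}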
Nonetheless, there could be unusual situations that assume the Rust/C heaps are identical, and the C side might free pointers that originate from the other side. In such circumstances, it’s advisable to override the global libc allocator and allow mimalloc to manage everything, utilizing the trick introduced in this blog post.
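To make the distinction concrete, here’s a small sketch (using the libc crate) of an allocation path that a Rust global allocator override never sees: raw malloc/free calls, exactly like the ones C dependencies make. Only the libc.a-level replacement covers these:
// Requires the `libc` crate. These calls go straight to the libc allocator,
// bypassing the Rust global allocator entirely; patching libc.a is what
// redirects them to mimalloc.
fn main() {
    unsafe {
        let ptr = libc::malloc(1024) as *mut u8;
        assert!(!ptr.is_null());
        ptr.write_bytes(0, 1024); // touch the memory so the allocation is real
        libc::free(ptr as *mut libc::c_void);
    }
}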
We also have a repo that provides a Dockerfile; it layers upon the official rust:alpine image and performs the patching logic described in this post. You can build and use the image as-is in your CI pipelines to build static C/C++/Rust executables, or adapt the build script for your own needs.
Happy hacking!
About the author
Cheng is a Software Engineer who specializes in the implementation of functional programming languages. He is the project lead and main developer of Tweag's Haskell-to-WebAssembly compiler project codenamed Asterius. He also maintains other Haskell projects and makes contributions to GHC (Glasgow Haskell Compiler). Outside of work, Cheng spends his time exploring Paris and watching anime.