32-bit ARM builds fail as single process uses >3 GiB memory #1843
Originally created by @MichaIng on GitHub (Feb 6, 2024).
Subject of the issue
When building vaultwarden on 32-bit ARM systems, it fails at the last compilation step, when assembling/linking the final `vaultwarden` binary. I first recognised that our GitHub Actions workflow failed building vaultwarden v1.30.3. This compiles on the public GitHub Actions runners within a QEMU-emulated container, throwing the following errors:

I then tested it natively on an Odroid XU4, which fails with:
I suspect both to be the same underlying issue, but the build inside the container is probably aborted by the host/container engine.
The same works well with `x86_64` and `aarch64` builds, natively and within the same QEMU container setup.

I monitored some stats during the build on the Odroid XU4:
These are the seconds around the failure. RAM size is 2 GiB, and I created an 8 GiB swap space. The last build step of the `vaultwarden` crate/binary utilises a single CPU core (the XU4 has 8 cores, so 1 core maxed is 12.5% CPU usage, the way I obtained it above) with a single process, and RAM + swap usage seems to crack the 3 GiB limit for a single process, which would explain the issue. LPAE allows a larger overall memory size/usage, but the utilisation of a single process is still limited.

I verified this again by monitoring the process resident and virtual memory usage in `htop`: one `rustc` process during the last build step, with 4 threads. And the build fails when the virtual memory usage crosses 3052 MiB, i.e. quite precisely the 32-bit per-process memory limit.

Since we have successful builds/packages with vaultwarden v1.30.1, I tried building v1.30.2, which expectedly fails the same way, as it differs from v1.30.3 by just 2 tiny, surely unrelated commits. v1.30.1 still builds fine, so the culprit is to be found between v1.30.1 and v1.30.2, probably just dependency crates which grew in size.
Not sure whether there is an easy solution/workaround. Of course we could try cross-compiling, but I would actually like to avoid that, as it is difficult to ensure that the correct shared libraries are linked, especially when building for ARMv6 Raspbian systems.
Deployment environment
Install method: source build
Clients used:
Reverse proxy and version:
MySQL/MariaDB or PostgreSQL version:
Other relevant details:
Steps to reproduce
On Debian (any version):
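A rough sketch of the failing build (package selection, release tag and feature flag are illustrative, based on the setup described in the discussion below):

```sh
# Source build with rustup on a 32-bit ARM Debian system (run as root);
# the failure occurs while compiling/linking the final vaultwarden crate.
apt-get install -y build-essential pkg-config libssl-dev git curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
. "$HOME/.cargo/env"
git clone --branch 1.30.3 https://github.com/dani-garcia/vaultwarden.git
cd vaultwarden
cargo build --release --features sqlite
```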
Expected behaviour
Actual behaviour
Troubleshooting data
@BlackDex commented on GitHub (Feb 6, 2024):
Just a quick note: what do you expect us to do about how Rust builds the binaries, or about how library crates are built? It must be something in there if it suddenly happened.
Also, all the building of the binaries does work on GitHub at least.
So it's not broken per se, I think.
@dani-garcia commented on GitHub (Feb 6, 2024):
If the failure point is at the linking step, maybe disabling LTO can help? Does compiling in debug mode finish correctly at least?
@BlackDex commented on GitHub (Feb 6, 2024):
Also, note #4308, which isn't in 1.30.3.
@BlackDex commented on GitHub (Feb 6, 2024):
What happens if you use main?
@MichaIng commented on GitHub (Feb 6, 2024):
I asked this myself. At least I wanted to make you aware of the issue, as this should affect others as well. And probably someone with more Rust build/cargo knowledge has an idea how to work around the issue.
I see you use Docker `buildx`. I guess it does something differently. It would of course be good to have someone replicating this with `cargo` directly, so we know it is not just me, or the Debian systems we use for building.

I am not 100% sure whether it is the linking step, but it is the `rustc --crate-name vaultwarden` process at least, i.e. all dependencies have been compiled already.

LTO is disabled when e.g. using `--profile dev`, right? I'll redo a build, also adding `-v`, and see what happens.

Will try it as well.
@BlackDex commented on GitHub (Feb 6, 2024):
It's always good to point this out of course. I just wondered, since not much changed in Vaultwarden itself except the crates.

`buildx` is not that different from just any other container or basic system, so that shouldn't affect the building of course. It's probably more the libraries or versions of the tools which could make a difference per platform. It might even be a recent OpenSSL update that causes issues right now, since there were some deb updates the past few days.

It should indeed.
👍🏻
@FlakyPi commented on GitHub (Feb 6, 2024):
The same happened to me. Apart from the problem with Handlebars, which I fixed using the 5.1.0 version, I could compile Vaultwarden with `--profile release-micro`.

@MichaIng commented on GitHub (Feb 6, 2024):
Though it affects Debian Bullseye, Bookworm and Trixie all the same, Bullseye with LibSSL1.1 and the others with LibSSL3, and quite different (C) toolchain versions as well. Rust itself is installed via rustup, instead of Debian packages. And vaultwarden v1.30.1 still builds fine, so to me it looks more like crate dependency versions which make the difference. And this is of course nasty to track down.
I should have tested `main` directly. I saw https://github.com/dani-garcia/vaultwarden/pull/4308, but thought it would fail right at the start with dependency conflicts, if this was the issue, like it happened in the issue linked to the PR. But it could of course indirectly fix the raised memory usage as well.

... okay, the `--profile dev -v` build failed as well. This time I see the LLVM memory allocation error in the container build on GitHub as well:

Is there a way to verify that LTO is disabled? I see the `-C lto=fat` option is now missing, but might there be a (faster) default which needs to be disabled explicitly? The docs do not give a hint: https://doc.rust-lang.org/cargo/reference/profiles.html#lto

But I'll try to set it to `off` explicitly in another build.
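For reference, LTO can be forced off without touching Cargo.toml via Cargo's per-profile environment overrides; a minimal sketch (the feature selection is just an example):

```sh
# Override the release profile's LTO setting for this invocation only,
# so the final link of the vaultwarden crate runs without LTO.
CARGO_PROFILE_RELEASE_LTO=off cargo build --release --features sqlite
```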
Building with the `main` branch (but the `release` target) fails the same way.

That is interesting. I saw that this new profile was recently added and thought about trying it, though I do not expect it to produce much smaller binaries in our case, since we run `strip` on the resulting binary anyway. Interesting that this solves it for you (I started a build just now), as it does even heavier optimisations, doesn't it? Probably what helps is that the dependencies are size-optimised as well, taking less memory when they are finally assembled into the `vaultwarden` binary?

@BlackDex commented on GitHub (Feb 6, 2024):
Could you provide a bit more detail on the hosts you build it on?
So which distro/version, how much memory, and how many CPU cores? You mentioned Bookworm, Bullseye and Trixie. If you can provide the details of them, I might be able to reproduce the setup.
And, you also mentioned qemu and different architectures.
If you could describe these setups that would be nice 🙂
@BlackDex commented on GitHub (Feb 6, 2024):
I also wonder what happens if you try a `cargo update`. That will update all crates to the latest available working version. I tested it locally and that works fine.
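For clarity, that amounts to (the feature flag is just an example):

```sh
# Refresh Cargo.lock to the newest semver-compatible crate versions,
# then rebuild.
cargo update
cargo build --release --features sqlite
```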
@MichaIng commented on GitHub (Feb 6, 2024):
I use two different hosts:
the `vaultwarden` binary build, where only a single worker runs. Hence limiting the amount of build workers does not help here.

On the Odroid XU4, I am using this image. It has some scripts for system setup purposes, but is at its core a minimal Debian Bookworm `armhf` image. On GitHub, I use the container images from here, which have the same userland setup, and boot them from the Ubuntu GitHub runner with `systemd-container` + `qemu-user-static` + `binfmt-support` via `systemd-nspawn -bD path/to/loop/mount`, then trigger the vaultwarden build within this container via a systemd unit or autologin script. As we support Debian oldstable, stable and testing, we do builds for all these Debian versions, hence Debian Bullseye/oldstable from 2021, Debian Bookworm/stable from 2023 and Debian Trixie/testing, which will be released in 2025. The container images (as well as all other images) are initially generated via `debootstrap`, hence the tool used by Debian itself for their images as well. Ah, and the same issue happens on the ARMv6 container images, which are not based on Debian, but use the Raspbian repository instead, more or less a Debian clone for the `armv6hf` architecture used by the first Raspberry Pi models, which is not supported by Debian. But otherwise the setup is identical, and since QEMU has no ARMv6hf emulator, it boots with ARMv7 emulation as well.

I can actually try to replicate it on any other Debian-based live image via GitHub Actions, or even Ubuntu (which is 99% the same in relevant regards). But Debian does not seem to offer them for ARM: https://www.debian.org/CD/live/
The regular (installer) images require interactive input, hence are not suitable for GitHub Actions. Another approach is to use a Debian-slim Docker `armv7` image based container, not using `buildx` but doing the `rustup` install and `cargo build` "manually" via a Dockerfile. I am just not 100% sure yet how to invoke QEMU there, so that this can run on an `x86_64` host.

I will try to do a build with a preceding `cargo update`. The `release-micro` target btw also works here. I tested it only on GitHub so far; I will do the same on the XU4 and monitor/compare memory usage.

@BlackDex commented on GitHub (Feb 6, 2024):
If you want docker to be able to run armxxx images locally you need binfmt support on your host.
It is all explained here: https://github.com/dani-garcia/vaultwarden/blob/main/docker/README.md
We use that same principle to create the final containers per architecture. We just pull in the armv6, armv7 or aarch64 container and run `apt-get update` etc., like it is an armxxx system. Technically you can do the same with a docker image.
As an example here below, I have binfmt installed so this works just fine for me.
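A minimal sketch of that kind of invocation, assuming binfmt/QEMU user emulation (e.g. `qemu-user-static` + `binfmt-support`) is already registered on the host:

```sh
# Run an armv7 Debian userland on an x86_64 host; binfmt transparently
# dispatches the ARM binaries inside the container to qemu-user.
docker run --rm -it --platform linux/arm/v7 debian:bookworm-slim uname -m
# expected output: armv7l
```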
That will use QEMU emulation for all binaries within that container.
So you can install Rust, use apt, etc., as if it is that architecture.
The same happens on GitHub for us in the workflows.
897bdf8343/.github/workflows/release.yml (L65-L68)
There we load binfmt support so we can use the same way to just run that architecture.
@MichaIng commented on GitHub (Feb 6, 2024):
Okay, great. Installing the `binfmt-support` package on the Ubuntu host, like we do in our workflows, should then work as well, instead of `docker/setup-qemu-action`. Your command seems to go one step further, emulating a particular CPU instead of just user-mode emulation, like `qemu-system` vs `qemu-user-static`. However, this should not make a difference.

@BlackDex commented on GitHub (Feb 6, 2024):
Looking at your Odroid, it should be `QEMU_CPU=cortex-a7`, now I think of it :).

@BlackDex commented on GitHub (Feb 6, 2024):
and probably also `linux/arm/v7`

@MichaIng commented on GitHub (Feb 6, 2024):
Yes, but it does not matter, as it fails on all 32-bit ARM systems the same way (when using Debian and the same set of commands).
@FlakyPi commented on GitHub (Feb 6, 2024):
Have you tried the cross-compiler `arm-linux-gnueabihf` on an amd64 machine to build the armv7 binary (armhf on Debian)? It should be a lot faster than QEMU and not have any memory limitations.

For the Raspberry Pi armv6 I haven't found a better solution than the qemu-builder for the moment, and I think sooner or later it will be impossible to build Vaultwarden on 32-bit machines, as the crates will keep growing and growing.
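For armv7, that route looks roughly like this (a sketch; package names and the feature flag are illustrative, and native C dependencies such as OpenSSL would additionally need armhf development packages or vendored builds):

```sh
# Install the Debian cross toolchain and the matching Rust target,
# then point Cargo at the cross linker for that target.
sudo apt-get install -y gcc-arm-linux-gnueabihf
rustup target add armv7-unknown-linux-gnueabihf
CARGO_TARGET_ARMV7_UNKNOWN_LINUX_GNUEABIHF_LINKER=arm-linux-gnueabihf-gcc \
    cargo build --release --target armv7-unknown-linux-gnueabihf --features sqlite
```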
@BlackDex commented on GitHub (Feb 6, 2024):
That is how we do it. We cross-compile for the target architecture.
And only the final image is handled via qemu.
@MichaIng commented on GitHub (Feb 6, 2024):
First of all, `cargo update` caused the Handlebars error with `1.30.3`, but worked with `main`, as expected. However, doing this with `main` does not solve the build issue with the `release` target.

Cross-compiling is of course an option. But as said, to rule out surprises and assure that the linked and available shared libraries do 100% match, also on Raspbian systems, I prefer to do builds within the target userland. But indeed, as long as there is no way to somehow reduce `rustc` memory consumption without changing optimisation options, sooner or later there won't be another way, I'm afraid.

Currently running the test with the Debian Bookworm Docker container:
... I did this within a VirtualBox VM running Debian Bookworm. I should have enabled nested virtualization first (requires the "core isolation > memory integrity" security feature to be disabled on Windows 11) to speed things up. It is running, but quite slowly ...
I forgot to add a workaround for https://github.com/rust-lang/cargo/issues/8719. But aside from some warnings that some temporary files could not be removed, the crates index update ran through, so I hope this does not lead to an abort later, before the build step of interest.
Btw, the `release-micro` profile roughly halved the binary size here, so quite a significant difference compared to only removing symbols afterwards via `strip`. @FlakyPi you did not do performance comparisons, did you?

@BlackDex commented on GitHub (Feb 7, 2024):
@MichaIng we do cross-compiling too. Building via QEMU takes a very long time. While not measured, it was certainly more than double the time.
Also, sometimes QEMU can cause strange issues which are hard to debug. But most of the time it works, just much slower.
@MichaIng commented on GitHub (Feb 7, 2024):
Sure, emulation is slow, especially without nested virtualization support. It usually does not play a role when things run on GitHub. And yes, there are SO many issues I needed to work around, in build scripts, in testing scripts, etc. etc. You already saw the `cargo` issue I linked, and the wrapping container setup scripts for testing and builds are full of workarounds like that, including many for individual software titles. However, since we do test our software install options on GitHub as well, there is no way around fiddling with QEMU emulation, as long as there are no native ARM runners available, or we start financing a battery of SBCs as self-hosted runners. For true ARMv6 tests, it is even harder since there are no fast ARMv6-only SBCs, and testing on a real RPi 1 or Zero is slower than emulating on a GitHub runner 😄. And ARMv6 + Raspbian tests are pretty important, since binaries provided by software developers often suddenly lose ARMv6 compatibility, or it is intentionally dropped from release assets. But we still have about 9% ARMv6-only RPi model systems among our user base 🙈.

So applying workarounds for known QEMU issues to the build scripts/workflows as well does not really increase the trouble, while cross-compiling at times adds trouble. Probably not with `vaultwarden`, but in other cases we had issues with mismatching shared libraries, and the container setup code is shared between all build scripts.

... btw, surprising situation with the Docker build:

A little small to see; however, from the host end, the `rustc` process already takes 4.5 GiB virtual memory. But this could be due to QEMU overhead. Build is still running ...

@MichaIng commented on GitHub (Feb 7, 2024):
And there it failed the same way within the Docker container:
If `buildx` works for compiling the binaries, then it does something differently. But if I understand it correctly, you do not build `vaultwarden` in an emulated container, but via cross-compiling, and only build the final Docker images (logically) via emulation? That would of course explain why your builds are not affected by the 3 GiB limit.
But, what is the reason for not cross-compiling?
@MichaIng commented on GitHub (Feb 7, 2024):
Quoting myself:
Basically to assure that the userland on the build system, hence the linked libraries on the build host, exactly matches the one on the target system. And depending on the toolchain, it is also much easier to set up, compared to installing a cross-compiler and multiarch libraries and assuring they are used throughout the toolchain. E.g. Python builds with Rust code have an issue of losing architecture information along the way: 32-bit ARM wheels compiled on a 32-bit userland/OS with a 64-bit kernel (the default since Raspberry Pi 4 and 5, even on a 32-bit userland/OS) are strangely marked as `aarch64`, while they (of course) are in fact 32-bit wheels running on 32-bit ARM systems.

@FlakyPi commented on GitHub (Feb 7, 2024):
Nothing very thorough. On my old RPi 1 everything is as slow as geology, so I didn't notice a very significant drop in performance.
@BlackDex commented on GitHub (Feb 7, 2024):
Another item: why not use the pre-compiled MUSL binaries? Those are distro independent.
@MichaIng commented on GitHub (Feb 7, 2024):
Where do you provide those? Or do you mean to extract them from the Docker images? However, as we have our own build scripts and GitHub workflows already, it feels better to also use them and control the builds, in case of flags, profiles etc. And I guess you e.g. do not provide RISC-V and `armv6l` binaries?

EDIT: I see `linux/arm/v6` containers are there 👍.
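Extracting a prebuilt binary from the published container image would look roughly like this (a sketch; the Alpine/MUSL tag, the platform and the in-image path `/vaultwarden` are assumptions to verify):

```sh
# Create a stopped container from the armv7 MUSL image and copy the
# binary out of it, without ever running the image.
docker create --name vw --platform linux/arm/v7 vaultwarden/server:latest-alpine
docker cp vw:/vaultwarden ./vaultwarden
docker rm vw
```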
@MichaIng could you try the following please?
Replace the following part in Cargo.toml:
With:
And test again?
@BlackDex commented on GitHub (Feb 7, 2024):
Actually, this might be better for your use case: run this before you run the `cargo build`
I actually think that my previous post will help you.
I was looking at the diff between `1.30.1` and `1.30.2`, checking some of the crates we updated for anything regarding similar issues, until I found the changes we did to the release profile.

The main benefit will be the `CARGO_PROFILE_RELEASE_LTO` env, since that is probably what is eating your memory, since you mention it is the latest step of the build process.

I also set `CARGO_PROFILE_RELEASE_CODEGEN_UNITS` to its default `16`, but you could set that to `8` or `6`. The results will probably differ depending on which system you run it on, but on `1.30.1` it was `16`.

I also added `CARGO_PROFILE_RELEASE_STRIP=symbols` there, since I saw you mentioning running `strip` on the resulting binary; this would prevent that step from being needed at all.

I tested this myself on my system via a docker container, and it looked like it didn't come above 4 GiB.
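Put together, those overrides amount to something like the following on the command line (a sketch; the LTO value and the feature flag are examples, adjust as needed):

```sh
# Loosen the release profile without editing Cargo.toml: thin LTO,
# default codegen units and symbol stripping, all via env overrides.
CARGO_PROFILE_RELEASE_LTO=thin \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 \
CARGO_PROFILE_RELEASE_STRIP=symbols \
    cargo build --release --features sqlite
```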
@FlakyPi commented on GitHub (Feb 8, 2024):
That works for me.
@BlackDex commented on GitHub (Feb 8, 2024):
I also added a new release profile `release-low` to this PR: https://github.com/dani-garcia/vaultwarden/pull/4328
That might be useful too once merged.
@FlakyPi commented on GitHub (Feb 9, 2024):
Thank you for that, it's going to be very useful for me.
@BlackDex commented on GitHub (Feb 9, 2024):
@FlakyPi can you verify if this works?
If so, then we can close this issue, since the PR for using `--profile release-low` will be in the next version, and it is already in main.

@FlakyPi commented on GitHub (Feb 9, 2024):
It crashes with `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1`.
It worked with `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16`.

@BlackDex commented on GitHub (Feb 10, 2024):
Hmm, then I'll have to change the profile.
@BlackDex commented on GitHub (Feb 10, 2024):
OK, that is merged. Since that seems to solve the issue, I'm going to close this one.
If not, please reopen.
@MichaIng commented on GitHub (Feb 11, 2024):
Many thanks guys, and sorry for my late reply. Since >1 codegen units and thin LTO both seem to potentially worsen performance, and are both not present in the `release-micro` profile, I wonder which would perform better, and hence whether the new profile has any benefit over `release-micro`:

- `release-micro`: 1 codegen unit and `fat` LTO, but `opt-level = "z"`
- `release-low`: 16 codegen units and `thin` LTO, but `opt-level = 3`

Although, the docs say that `thin` achieves similar performance results to `fat` 🤔: https://doc.rust-lang.org/cargo/reference/profiles.html#lto

I am also confused why more parallelisation (codegen units) uses less memory, while I would usually expect it to consume more memory. Did someone test `fat` LTO with 16 codegen units? Out of interest, I think I'll try all combinations and see which results in which memory usage, so we get a better idea of the effects of each option.

@BlackDex commented on GitHub (Feb 11, 2024):
@MichaIng it's difficult for me to really test it, actually.
I do not have the same hardware. I could try mimicking GitHub.
The main thing is, the profile changed since you noticed compiling went wrong. We changed from `thin` to `fat`. And the main difference is that fat will try a bit harder to optimize links. While in theory both with one codegen unit should not make a difference, I think fat will still use more resources.

Also, 16 codegen units also release memory when they are done, and it's not one process; that might help on low-end systems maybe?
I think thin with 16 is the best bet, since that was the previous default.
@MichaIng commented on GitHub (Mar 1, 2024):
I tested the `release` target with `thin` LTO but 1 codegen unit, and it worked. Max memory usage during the build was 2.15 GiB.

Then I tested with `fat` LTO but 16 codegen units, and it worked as well, with max memory usage at 2.11 GiB.

... currently running without both, and afterwards with both, just to have the full picture.

EDIT: Okay, now I am confused, as the build went through without either of the two settings changed, using max 2.14 GiB memory, hence even a little lower than with `thin` LTO. Probably the reason for the high memory usage has been fixed among the dependencies, or even the Rust toolchain itself 🤔. Currently running the build with `thin` and 16 codegen units.

EDIT2: `thin` + 16 codegen units results in 2.08 GiB max memory usage.

@polyzen commented on GitHub (Mar 7, 2024):
On my Odroid XU4, I get:
`terminate called after throwing an instance of 'std::bad_alloc'` with `--profile release-low`
`LLVM ERROR: out of memory` with `--profile release-micro`

This is on Arch Linux ARM armv7 with cb935a5591/PKGBUILD, but modified for `1.30.5` and with `build()` edited with e.g.

@polyzen commented on GitHub (Mar 7, 2024):
Builds pass with `--profile release-low` when using just `--features sqlite`, `--features mysql`, `--features postgresql`, `--features sqlite,mysql`, `--features sqlite,postgresql`, or `--features mysql,postgresql`.

@BlackDex commented on GitHub (Mar 7, 2024):
@polyzen probably because there is a lot of extra code per database feature and it also needs to link with one extra library.
@MichaIng commented on GitHub (Mar 12, 2024):
I accidentally built with an older version when doing the above tests, which explains why it succeeded with the `release` profile. I let my Odroid XU4 run through a bunch of optimisation option combinations with the latest vaultwarden `1.30.5`, all of them with `--features sqlite` only:

- `release` (LTO=fat, codegen units=1, opt-level=3, strip=debuginfo) → LLVM ERROR: out of memory!
- `release-micro` (LTO=fat, codegen units=1, opt-level=z, strip=symbols)
- `release-low` (LTO=thin, codegen units=16, opt-level=3, strip=symbols)
- `release` + `CARGO_PROFILE_RELEASE_LTO=thin` → LLVM ERROR: out of memory!
- `release` + `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16` → LLVM ERROR: out of memory!
- `release` + `CARGO_PROFILE_RELEASE_STRIP=symbols` → LLVM ERROR: out of memory!
- `release` + `CARGO_PROFILE_RELEASE_OPT_LEVEL=z` → fatal runtime error: Rust cannot catch foreign exceptions!
- `release` + `CARGO_PROFILE_RELEASE_LTO=thin` + `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16`
- `release` + `CARGO_PROFILE_RELEASE_OPT_LEVEL=z` + `CARGO_PROFILE_RELEASE_STRIP=symbols` → fatal runtime error: Rust cannot catch foreign exceptions
- `release` + `CARGO_PROFILE_RELEASE_OPT_LEVEL=z` + `CARGO_PROFILE_RELEASE_STRIP=symbols` + `CARGO_PROFILE_RELEASE_PANIC=abort` (equals `release-micro`)
- `release` + `CARGO_PROFILE_RELEASE_LTO=thin` + `CARGO_PROFILE_RELEASE_STRIP=symbols` + `CARGO_PROFILE_RELEASE_PANIC=abort`
- `release` + `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16` + `CARGO_PROFILE_RELEASE_STRIP=symbols` + `CARGO_PROFILE_RELEASE_PANIC=abort` → LLVM ERROR: out of memory
- `release` + `CARGO_PROFILE_RELEASE_LTO=thin` + `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16` + `CARGO_PROFILE_RELEASE_STRIP=symbols` + `CARGO_PROFILE_RELEASE_PANIC=abort`
- `release` + `CARGO_PROFILE_RELEASE_LTO=thin` + `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16` + `CARGO_PROFILE_RELEASE_OPT_LEVEL=z` + `CARGO_PROFILE_RELEASE_STRIP=symbols` + `CARGO_PROFILE_RELEASE_PANIC=abort`

Max memory was btw obtained in a loop which checked `free -tb` every 0.5 seconds (the total memory usage, hence RAM + swap). The at times <3 GiB readings on runs that crashed (and some other results) indicate that the memory usage ramps up and drops back quickly at a certain stage, so that the peak cannot be caught well. Hence take the numbers with a grain of salt.
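Such a sampling loop can look roughly like this (a sketch, assuming the default `free` output columns):

```sh
# Sample total RAM + swap usage (bytes) every 0.5 s, keep the peak seen,
# and print it; run this alongside the build and stop it with Ctrl+C.
max=0
while sleep 0.5; do
    used=$(free -tb | awk '/^Total:/ {print $3}')
    [ "$used" -gt "$max" ] && max=$used && echo "peak: $max bytes"
done
```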
What I take from this is that the max memory usage with the `release` profile would be significantly above 3 GiB, since a single optimisation change does not prevent the crash, but applying at least two of them can suddenly drop it to 2 GiB. It also shows that the `release-micro` profile, while it currently works, is quite close to 3 GiB, so it likely won't work forever, and hence the `release-low` profile has indeed some value. Stripping all `symbols` (instead of only `debuginfo`) seems to have only a small effect, unless combined with `panic=abort` (the symbols enhance the panic stack trace). Since the removal of symbols has no negative effect on performance, a positive effect on size, and a negative effect for debugging only, I will personally prefer this for our builds.

Does someone know whether the panic stack trace gives any meaningful information when all symbols are removed? Else I suggest adding `panic=abort` to the `release-low` profile as well, or removing `strip=symbols`, which alone has no significant effect on memory usage.

@BlackDex commented on GitHub (Mar 12, 2024):
Thanks for all the testing.
Just a quick question: did you do a `cargo clean` before every test?
Sure 🙂. I guess otherwise the differences would have been smaller.
@BlackDex commented on GitHub (Mar 12, 2024):
Not per se, since you changed build parameters.
@MichaIng commented on GitHub (Mar 12, 2024):
Err, right, it depends on which flags the dependencies were compiled with, respectively, as far as we know, simply which size they have. However, everything was recompiled on every build, so we now have an idea which flag/option has which effect.