Mirror of https://github.com/dani-garcia/vaultwarden.git
Server softlock after organization operations #1619
Originally created by @Bert-Proesmans on GitHub (Jun 29, 2023).
Subject of the issue
The server stops executing operations after manipulations on the organization vault. It's not possible to perform new logins, nor to create, delete, or move records.
Deployment environment
Your environment (Generated via diagnostics page)
Config (Generated via diagnostics page)
Environment settings which are overridden:
Steps to reproduce
Expected behaviour
The selected individual records are present in the organization vault, and/or removed if step 5 has been conditionally executed.
No issues with logging in/out, and no issues with manipulations on the individual and organization vaults.
Actual behaviour
The server softlocks; it's not possible to log in, nor is it possible to perform any operations on the vault data.
The only thing left to do is close the web vault tabs.
The server recovers within 10 minutes, but I'm not sure to what extent. The selected individual records are entirely or partially moved! Records that haven't been moved are still present in the respective users' individual vaults.
Troubleshooting data
There are no warnings or errors reported anywhere, nothing in the browser console, nothing in the container logs.
The only thing I can add is that the server recovers and the vault is in a consistent state; the vault opens and there is no data loss.
@BlackDex commented on GitHub (Jul 4, 2023):
Could you please test this on the :testing tagged images, which are based upon the main branch. That uses a newer web-vault version, and some other updates, etc...
@Bert-Proesmans commented on GitHub (Jul 5, 2023):
I'm gonna set up a test replica and report back by the end of this week.
@BlackDex commented on GitHub (Jul 17, 2023):
@Bert-Proesmans any result you want/can share?
@Bert-Proesmans commented on GitHub (Jul 17, 2023):
Hi, sorry for the wait. I've updated to the latest version just now and the issue persists.
We performed the same steps as described in the original post with 2 people. The same symptoms happened after two users moved records into the company vault at the same time.
Request log
[2023-07-17 15:00:12.642][vaultwarden::api::identity][INFO] User bert.proesmans@e-powerinternational.com logged in successfully. IP: 10.0.0.132
The workaround is closing the web vault tabs performing the record move and waiting for some time; afterwards the server recovered.
Other important requests in the log
[2023-07-17 14:55:32.238][vaultwarden::api::notifications][INFO] Closing WS connection from 172.29.0.3
[2023-07-17 14:55:40.694][vaultwarden::api::notifications][INFO] Closing WS connection from 172.29.0.3
^ Both users closed the browser tab that was performing the record move. The underlying issue might be websocket related, because you'll notice the server moves the records with very little delay after the websockets are closed (they don't compete/deadlock anymore?).
[2023-07-17 14:58:40.367][vaultwarden::api::core::two_factor][INFO] User joachim.bos@e-powerinternational.com did not complete a 2FA login within the configured time limit. IP: 10.0.0.22
[2023-07-17 14:58:41.031][vaultwarden::api::core::two_factor][INFO] User bert.proesmans@e-powerinternational.com did not complete a 2FA login within the configured time limit. IP: 10.0.0.132
^ Both users hit the 2FA timeout, because after entering username and password there is no prompt for the security key until the server recovers.
Company event log
Edit: If required, please tell me how I should collect and deliver the data needed to debug this, and I'll test again tomorrow.
@BlackDex commented on GitHub (Jul 18, 2023):
The WebSockets are separated from the rest of the code in the sense that they only trigger a call; they shouldn't block anything else. The notification is also the last call made, so all database transactions should already be finished.
The WebSocket connections are closed because you close the tabs, which is expected behavior, and those connections run in parallel in a different thread.
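As a rough illustration of that separation (a minimal sketch, not Vaultwarden's actual code; the function, names, and structure are assumptions), the described ordering can be pictured like this:

```rust
// Sketch of the ordering described above: database work first, then a
// WebSocket notification fired on its own thread, so a slow or already-closed
// client connection cannot block the request handler. All names here are
// illustrative; this is not Vaultwarden's implementation.
use std::{thread, time::Duration};

fn move_ciphers_to_org(cipher_ids: Vec<String>) {
    // 1. All database transactions happen inside the request handler and
    //    finish before anything is pushed to connected clients.
    for id in &cipher_ids {
        println!("db: moved cipher {id} into the organization");
    }

    // 2. The notification is the last step and runs on a separate thread,
    //    so closing a browser tab (dropping its WS connection) should not
    //    hold up the move itself.
    thread::spawn(move || {
        for id in cipher_ids {
            println!("ws: notified clients that cipher {id} changed");
        }
    });
}

fn main() {
    move_ciphers_to_org(vec!["a".into(), "b".into()]);
    // Give the detached notification thread a moment to finish in this demo.
    thread::sleep(Duration::from_millis(50));
}
```

The point of the sketch is only the ordering: the response and the database writes do not wait on the notification thread.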
Are there any restrictions on the container/pod in terms of memory or CPUs?
What kind of storage is used for the sqlite database?
Are there any browser console errors/warnings during these actions?
Are there any limitations or restrictions configured at the reverse proxy? Like ModSecurity, WAF or any other security tools?
@BlackDex commented on GitHub (Jul 18, 2023):
@Bert-Proesmans Could you provide us with the number of ciphers/vault items listed for the users in question, and the number for the org? You can extract this information from the /admin panel.
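For reference, roughly the same counts can also be read straight from the SQLite database. The queries below are a sketch and assume the default schema, where the ciphers table carries a user_uuid for personal items and an organization_uuid for organization items; verify the column names against your own database first.

```sql
-- Sketch only: assumes the default Vaultwarden schema; not quoted from the thread.
SELECT u.email, COUNT(*) AS personal_items
  FROM ciphers c
  JOIN users u ON u.uuid = c.user_uuid
 GROUP BY u.email;

SELECT c.organization_uuid, COUNT(*) AS org_items
  FROM ciphers c
 WHERE c.organization_uuid IS NOT NULL
 GROUP BY c.organization_uuid;
```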
@Bert-Proesmans commented on GitHub (Jul 18, 2023):
That was just a guess; I recognize that correlation in time is not always causation.
No container restrictions, the host VM has no CPU/RAM contention. There is still unused disk space left. The persistent store is a volume mount on the local VM docker host.
No console errors/warnings, no special mention in the container (request) logs. The proxy restricts IPs and blocks common exploits at the application level ('Block common exploits' in Nginx Proxy Manager).
ORG: 32 users / 373 entries
bertp: 32 entries
joachimb: 15 entries
Have you tried replicating this scenario on your own machine? Could you reproduce?
@BlackDex commented on GitHub (Jul 18, 2023):
@Bert-Proesmans I have tried to reproduce it, but I'm not able to.
I created an empty org, two new users, imported a list of 3000 vault items (1500 Login and 1500 Secure Notes) for both users.
Both those users now had 3000 items each, so in total 6000 items.
The org had 2 items, one in each collection.
I then clicked select all (which actually selects only the first 500 items), chose the option to move to an organization, and prepared the web-vault interface so the save option was ready to be clicked for both users, where for user1 I selected collection x, and for user2 collection y.
I then pressed the save buttons right after each other, and they both almost simultaneously posted that data to Vaultwarden.
Vaultwarden was processing both requests at the same time, and it took them both around 5.7/5.8 seconds to fully process those requests.
I also had a third client open in which I switched between the groups and vault tabs; that kind of forces the vault to re-sync for that user, and thus request data from the database.
It all seems to work just fine, no issues. The only thing I notice is that it takes a long time to load the vault, because of the huge amount of ciphers I have now haha.
I'm using SQLite too. Of course, my system doesn't have that much memory, and it even uses swap currently, and I was running both the server and the clients on the exact same system, so everything was using a lot of resources.
So, either something strange is happening with your database, or the storage you are using for some reason does not support locking, which could cause strange issues.
Maybe you can check/verify/vacuum your database file, after creating a backup of course.
Please try to run the following on the SQLite database. Make sure you stop Vaultwarden before running these queries, and that you have created a backup!
Run sqlite3 db.sqlite:
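For example, a standard integrity-check, reindex, and vacuum pass (the specific statements below are an assumption of typical SQLite maintenance, not the exact commands from the original comment) looks like:

```sql
-- Assumed maintenance statements; run inside sqlite3 db.sqlite with
-- Vaultwarden stopped and a backup of the database file made first.
PRAGMA integrity_check;    -- verify the database structure is intact
PRAGMA foreign_key_check;  -- list rows that violate foreign key constraints
REINDEX;                   -- rebuild all indexes from the table data
VACUUM;                    -- rebuild the database file and reclaim free pages
```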
If it still causes issues, I have no clue what it could be.
It could still be your underlying storage for the VM, if that is a NFS or CIFS/Samba share that might cause issues.
@Bert-Proesmans commented on GitHub (Jul 18, 2023):
Hmm, your data scope is way larger, so that rules out locking issues in the application layer.
I'll run those commands on the database and see if it improves anything. If there's no effect, I'll play a bit with SQLite; I have zero experience with the library, but I needed an excuse to do some digging. I'll give feedback in a few days.
The weird thing is that all VM resources are local to the host; we're not using any high-latency protocol for compute or storage in this instance.
I want to reiterate that this specific user story is a minor issue (low priority), the server works as expected 99.9% of the time. That's also the reason why I'm not as responsive as I'd like to be.
If my inspection of the sqlite db doesn't improve the symptoms, I propose closing this issue as not reproducible. This could still very well be a platform issue, which is out of scope for this tracker.
@Bert-Proesmans commented on GitHub (Jul 20, 2023):
I've executed all commands on our database and the console didn't report any errors.
We ran through the scenario again and still the same symptoms. Our issue is not solved by database maintenance, sadly.
Thanks for helping out! I'll have to take a deeper look into the machinery running this container.
@BlackDex commented on GitHub (Jul 20, 2023):
Very strange. I would indeed suggest gathering some metrics like CPU, memory, and disk I/O to see if anything strange shows up there, on both the host and the VM.
I'll move this to discussions so that other people can see it, and if you want, you can update us on it :).