🐛 Bug Report: Use a process manager in the container to ensure apps restart if they crash #306

Closed
opened 2025-10-09 16:38:33 +03:00 by OVERLORD · 17 comments

Originally created by @ItalyPaleAle on GitHub.

Reproduction steps

  1. Witness/cause a crash in one of the processes running in the container, either `node`, `pocket-id-backend`, or `caddy`
  2. The container is not restarted

Expected behavior

Either the entire container should crash, or the process should be restarted within the container

Actual Behavior

The process is not restarted, but the container is still reported as up because the entrypoint is still running

Recommended fix: Use a process manager within the container, which ensures that:

  • Processes that crash are restarted
  • At shutdown, all sub-processes are terminated correctly

PS: If looking for suggestions, I like [supervisord](https://supervisord.org/); a rough sketch is below.
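
For illustration, a minimal supervisord setup could look roughly like this; the commands and paths are placeholders, not the actual image layout:

```sh
# Hypothetical sketch only: write a supervisord config that manages the three
# processes and restarts any program that crashes, then run supervisord as PID 1.
cat > /etc/supervisord.conf <<'EOF'
[supervisord]
nodaemon=true

[program:backend]
command=/app/backend/pocket-id-backend
autorestart=true

[program:frontend]
command=node /app/frontend/build
autorestart=true

[program:caddy]
command=caddy run --config /etc/caddy/Caddyfile
autorestart=true
EOF

# As PID 1, supervisord also forwards SIGTERM to its children so they are
# terminated correctly at shutdown.
exec supervisord -c /etc/supervisord.conf
```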

Version and Environment

v0.39.0

Log Output

No response

OVERLORD added the bug label 2025-10-09 16:38:33 +03:00
@stonith404 commented on GitHub:

Yeah, I think it would make sense for the whole container to stop when one of the three services fails.

Regarding spawning the frontend from the backend, I wouldn't do this. This would complicate non-containerized setups unnecessarily. While we could make it optional, it adds coupling between components that should remain separate and offers little advantage over our current entrypoint script approach.

@kmendell commented on GitHub:

Was this in pocket-id, or a different app? Just curious if this would be worth the effort to implement. I see where you are coming from though, just want to get the full picture.

@kmendell commented on GitHub:

Have you actually experienced a crash like this? If so, could you provide some context on how it happened (within reason, I know you probably don't have logs or anything)?

@ItalyPaleAle commented on GitHub:

I actually have - albeit in a dev scenario.

I built a container with a modified backend app. I made a mistake and the app crashed upon startup. However the container kept running and podman still marked it as running because the main process (entrypoint.sh) was still running.

While this was at least partly my fault, there can be situations where one of the apps crashes in the container due to runtime bugs or other issues. I have not experienced that myself (yet?), but it is possible, and it would cause a situation like the one I did experience.

@ItalyPaleAle commented on GitHub:

pocket-id. I'm working on a PR to fix another bug that will be ready shortly.

However, I have experience with situations like these (one container, multiple processes), where I had to make sure that, for example, one process was restarted automatically after a crash while another one brought down the entire container.

I do think we could fix both this problem and #324 together, by making the Go app itself spawn (and keep alive) the Node.js app.

@ItalyPaleAle commented on GitHub:

@stonith404 yes, I can confirm that's the case. Here's a repro (using v0.40):

  1. Run the container with a configuration error that would cause it to crash. For example, you can run it with `--read-only` (read-only root file system) and **without** mounting a volume for `/app/backend/data`. This causes the backend to crash because it can't write the key file. (See the command sketch after this list.)
  2. Do not configure healthchecks.
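
For reference, the run command could look roughly like this (the flags, port mapping, and tag are illustrative, not an exact reproduction of my setup):

```sh
# Hypothetical repro: read-only root filesystem, no volume mounted at
# /app/backend/data, and no healthcheck configured.
podman run -d --name pocket-id-pocket-id --read-only \
  -p 4000:80 \
  ghcr.io/pocket-id/pocket-id:v0.40
```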

The container will be up:

$ podman ps | grep pocket-id
61f2137f6e61  ghcr.io/pocket-id/pocket-id:v0.40           sh ./scripts/dock...  2 minutes ago  Up 2 minutes           80/tcp  pocket-id-pocket-id

However, you can see in the logs that the backend isn't running:

$ podman logs -f pocket-id-pocket-id
Starting frontend...
Starting backend...
Caddy is disabled. Skipping...
2025/03/14 15:07:52 Error copying file: mkdir data: read-only file system
Listening on http://0.0.0.0:3000

(this happens with or without caddy)

You can invoke the frontend (or Caddy, if it's running) and you'll see an error like this, indicating that the frontend is up, so at least one service is still running:

![Error page returned by the frontend](https://github.com/user-attachments/assets/d435ae50-aa82-4584-9389-5c953c3140e5)

(localhost:4000 is because of port forwarding)

@kmendell commented on GitHub:

We answered why we have one container in this older issue: https://github.com/pocket-id/pocket-id/issues/148#issuecomment-2605789073. But to summarize: we want to have one image to simplify the setup process, as one of the core aspects of Pocket-ID is to be simple, and not as complex (in setup and usage) as other OIDC providers.

@stonith404 commented on GitHub:

Thanks @kmendell

@Pitasi It would be ideal if we had an all-in-one image, separate images, and support for Kubernetes and Podman, but this would require us to maintain all those installation methods even though we don't use them ourselves.

Because of that, I would like to outsource those methods and then just link to them in the docs.

@Pitasi commented on GitHub:

My 2 cents: taking a step back, why have a single container running three services instead of two or three containers?
It's common for services to provide a sample docker compose for easy self-hosting, e.g. https://github.com/plausible/community-edition/blob/v2.1.5/compose.yml.

Having granular containers lets you use docker as the supervisor without having to care about all of this.

@stonith404 commented on GitHub:

@ItalyPaleAle Are you sure that the container doesn't stop if one of the three services crashes?

`wait -n` at the end of the entrypoint should wait until one process finishes and then return the status code of the finished process.
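
For context, the pattern in question looks roughly like this; it's a sketch, not the actual entrypoint script, and `wait -n` needs a shell that supports it (e.g. bash):

```sh
# Sketch of the entrypoint pattern: start the services in the background
# (commands/paths are placeholders), then block until the FIRST one exits.
node /app/frontend/build &
/app/backend/pocket-id-backend &
caddy run --config /etc/caddy/Caddyfile &

# wait -n returns as soon as any background job exits; exiting with that
# status should stop the container.
wait -n
exit $?
```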

@kmendell commented on GitHub:

@ItalyPaleAle To stonith's point: if something crashes, I think the entire container should stop. I haven't looked much into this yet, but we will want to implement this in the simplest way possible.

@ItalyPaleAle commented on GitHub:

@kmendell Whether you run a process manager within the container or let the entire container crash and have the orchestrator restart it, either approach is fine.

However, neither of the above is happening today, as per the repro above.
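
If we go with the second option (let the container exit), the orchestrator side is just a restart policy, e.g. something like this (the tag is illustrative):

```sh
# Illustrative only: once the entrypoint exits because a service died, the
# runtime restarts the whole container per the restart policy.
docker run -d --restart unless-stopped ghcr.io/pocket-id/pocket-id:v0.40
```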

@ItalyPaleAle commented on GitHub:

@kmendell read-only FS was just one example of how to repro this. You can repro in any other way that would make the backend crash, for example an incorrect Postgres connection string.

That said, read-only is for the root file system only. Mounted volumes can be read-write if the host allows it (they are mounted as RW on the host) and the user has permissions (and SELinux isn't in the way). You can read more about it here: https://medium.com/datamindedbe/improve-the-security-of-pods-on-kubernetes-3e4a81534674 (this was written for K8s, but it's supported in Docker too if using `--read-only`). (Off topic, but using a read-only root FS is quite useful for security, and many security scanners will flag containers that don't do that.)
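
Concretely, something like this keeps the root file system read-only while the mounted data directory stays writable (the volume name and tag are just for illustration):

```sh
# Illustrative: --read-only applies only to the image's root filesystem;
# the named volume mounted at /app/backend/data remains writable.
docker run -d --read-only \
  -v pocket-id-data:/app/backend/data \
  ghcr.io/pocket-id/pocket-id:v0.40
```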

@stonith404 commented on GitHub:

@ItalyPaleAle It seems that only Podman doesn't stop the container if some service crashes.

If you run `docker run --read-only pocket-id/pocket-id` the container will stop. So Docker stops the container if some service crashes.

Do you have any clue why Podman could handle this differently?

@kmendell commented on GitHub:

@ItalyPaleAle I've never used a read-only file system though; how does this really work? Even if you mount a volume, wouldn't it still be read-only?

@ItalyPaleAle commented on GitHub:

@stonith404 I can repro with docker too:

$ docker run --read-only ghcr.io/pocket-id/pocket-id:v0.43.1
Creating user and group...
addgroup: /etc/group: Read-only file system
adduser: unknown group pocket-id-group
mkdir: can't create directory '/app/backend/data': Read-only file system
find: /app/backend/data: No such file or directory
Starting frontend...
Starting backend...
Starting Caddy...
2025/03/23 20:34:20 Error copying file: mkdir data: read-only file system
{"level":"info","ts":1742762060.3332598,"msg":"using config from file","file":"/etc/caddy/Caddyfile"}
{"level":"info","ts":1742762060.33735,"msg":"adapted config to JSON","adapter":"caddyfile"}
{"level":"info","ts":1742762060.3431003,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
{"level":"warn","ts":1742762060.3439674,"logger":"http.auto_https","msg":"server is listening only on the HTTP port, so no automatic HTTPS will be applied to this server","server_name":"srv0","http_port":80}
{"level":"info","ts":1742762060.3461628,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0002b7700"}
{"level":"info","ts":1742762060.34703,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
{"level":"error","ts":1742762060.3472958,"msg":"unable to create folder for config autosave","dir":"/.config/caddy","error":"mkdir /.config: read-only file system"}
{"level":"info","ts":1742762060.3474617,"msg":"serving initial configuration"}
Successfully started Caddy (pid=37) - Caddy is running in the background
{"level":"warn","ts":1742762060.348481,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"mkdir /.local: read-only file system"}
{"level":"error","ts":1742762060.3492997,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open /.local/share/caddy/locks/storage_clean.lock: no such file or directory"}
{"level":"info","ts":1742762060.3494847,"logger":"tls","msg":"finished cleaning storage units"}
Listening on http://0.0.0.0:3000

The container isn't crashing.

This is with Docker 28.0.2 on Ubuntu 22.04

@stonith404 commented on GitHub:

Thanks, I was able to reproduce this on my Ubuntu server too. This should now be fixed in the latest version, let me know if you still have any issues.
