[BUG] Machine Learning container fails to start after upgrading to v1.77.0 from v1.76.1 #1306

Closed
opened 2026-02-05 01:13:04 +03:00 by OVERLORD · 6 comments
Owner

Originally created by @thariq-shanavas on GitHub (Sep 6, 2023).

The bug

After upgrading Immich, the machine learning container fails to start.
Output of sudo docker logs -f immich_machine_learning:

[09/06/23 10:35:59] INFO Booting worker with pid: 4585
[09/06/23 10:36:29] CRITICAL WORKER TIMEOUT (pid:4585)
[09/06/23 10:36:31] ERROR Worker (pid:4585) was sent SIGKILL! Perhaps out of memory?

The container tries to restart, then fails with the same timeout error. I suspect a bug introduced by https://github.com/immich-app/immich/pull/3934

I'm running on a system with 2 GB RAM (with 1 GB ZRAM and 1 GB swap), so I've enabled only face recognition among the machine learning features. The processor is an Intel Atom Z8350. It works great in v1.76.1.

In my .env file, I have pinned the version to v1.76.1 until this is resolved. Thank you all so much for this amazing software! I'll be happy to post any other logs as needed.

The OS that Immich Server is running on

Debian 12

Version of Immich Server

v1.77.0

Version of Immich Mobile App

NA

Platform with the issue

  • [x] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.yml
    #   service: hwaccel
    command: [ "start.sh", "microservices" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
      # remove this to get debug messages
      - GLOG_minloglevel=1
    volumes:
      - tsdata:/data
    restart: always
  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL
      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always

volumes:
  pgdata:
  model-cache:
  tsdata:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION= [Redacted]

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
# IMMICH_VERSION=release
IMMICH_VERSION=v1.76.1
# Connection secrets for postgres and typesense. You should change these to random passwords
TYPESENSE_API_KEY= [Redacted]
DB_PASSWORD= [Redacted]

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis
#IMMICH_MACHINE_LEARNING_ENABLED=false
#TYPESENSE_ENABLED=false

Reproduction steps

1. Start the container with docker-compose up -d

Additional information

No response


@raisinbear commented on GitHub (Sep 6, 2023):

Same here. I got this message at least twice at initial startup, but it seems to have resolved itself after some retries. Face recognition also seems to be working; the other options are disabled, just as for OP.


@thariq-shanavas commented on GitHub (Sep 6, 2023):

I noticed it a couple hours after the update, and it had not resolved itself. It probably restarted hundreds of times in that time frame.
A reboot did not fix it either.


@hachre commented on GitHub (Sep 6, 2023):

I had the same issue (also noticed after hours), but in my case running docker compose down --remove-orphans followed by docker compose up -d solved it.


@alextran1502 commented on GitHub (Sep 6, 2023):

Cc @mertalev


@mertalev commented on GitHub (Sep 7, 2023):

Looks like gunicorn gives workers 30s to start and terminates them if they don't start within this time. It might take longer than this for a worker to start on very slow CPUs. Setting --timeout to a higher number should fix it, maybe 120?
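
As an illustration of the suggestion above: gunicorn accepts the same setting from a Python config file (passed with -c) as from the --timeout command-line flag. The sketch below is a hypothetical config file, not something taken from the Immich repository; how the machine learning container actually launches gunicorn is not shown in this thread, so wiring it in is an assumption. The value 120 is only the number floated above.

```python
# gunicorn.conf.py: hypothetical config file, not taken from the Immich repository.
# Load it with: gunicorn -c gunicorn.conf.py <your app module>

# gunicorn kills and restarts a worker that stays silent for longer than
# `timeout` seconds. The default is 30, which matches the 30-second gap between
# "Booting worker" and "WORKER TIMEOUT" in the log at the top of this issue.
timeout = 120
```

On a slow CPU this gives the worker four times longer to come up before gunicorn gives up on it; a larger value may be needed if startup also includes a model download.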


@koffienl commented on GitHub (Sep 8, 2023):

Not sure if this is the same issue, but just did a clean install (v1.77.0) on a clean docker container with the stack file from the site.
The machine-learning container won't finish the download and is stuck in a loop downloading over and over again.

There's plenty of CPU and mem for the container, but it's cutting off the download after 29 seconds.

92%|█████████▏| 259560/281857 [00:29<00:02, 8630.17KB/s]
[09/08/23 15:15:13] CRITICAL WORKER TIMEOUT (pid:121)
/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[09/08/23 15:15:14] ERROR    Worker (pid:121) was sent code 134!
[09/08/23 15:15:14] INFO     Booting worker with pid: 135
[09/08/23 15:15:21] INFO     Created in-memory cache with unloading disabled.
[09/08/23 15:15:21] INFO     Initialized request thread pool with 12 threads.
[09/08/23 15:15:21] INFO     Downloading facial-recognition model 'buffalo_l'.This may take a while.
[09/08/23 15:15:21] WARNING  Failed to load facial-recognition model 'buffalo_l'.Clearing cache and retrying.
[09/08/23 15:15:21] INFO     Cleared cache directory for model 'buffalo_l'.
[09/08/23 15:15:21] INFO     Downloading facial-recognition model 'buffalo_l'.This may take a while.
Downloading /cache/facial-recognition/buffalo_l/buffalo_l.zip from https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip...
18%|█▊        | 50581/281857 [00:05<00:26, 8850.62KB/s]

OK, my bad: I thought this fix was already published, but it wasn't.
I edited the start.sh file to add the timeout and it started to work.
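
A rough back-of-the-envelope check, using only the figures from the progress bar in the log above, suggests why the download could never finish inside the 30-second worker limit mentioned earlier in the thread. This is a sketch that assumes a constant download rate and that the timeout clock effectively covers the whole download; the function name is made up for illustration.

```python
# Figures copied from the progress bar in the log above.
MODEL_SIZE_KB = 281_857
OBSERVED_RATE_KB_PER_S = 8850.62

# Worker startup limit described earlier in the thread (gunicorn's 30 s).
WORKER_TIMEOUT_S = 30


def download_seconds(size_kb: float, rate_kb_per_s: float) -> float:
    """Time needed to fetch the whole archive at a constant rate."""
    return size_kb / rate_kb_per_s


needed = download_seconds(MODEL_SIZE_KB, OBSERVED_RATE_KB_PER_S)
verdict = "exceeds" if needed > WORKER_TIMEOUT_S else "fits within"
print(f"Full download needs about {needed:.1f} s, which {verdict} "
      f"the {WORKER_TIMEOUT_S} s limit")
```

Under those assumptions the archive needs slightly more than 30 seconds even on a fast connection, which would be consistent with the worker being killed at the 29-second mark and the download restarting from scratch each time.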

Reference: immich-app/immich#1306