[BUG] Machine Learning does not work and causes docker container to increase #1622

Closed
opened 2026-02-05 02:42:19 +03:00 by OVERLORD · 6 comments

Originally created by @kleinMaggus on GitHub (Nov 15, 2023).

The bug

After updating from 1.83 to 1.84 (the latest release is affected as well), I can no longer run any machine learning. Machine learning does not work at all, and it causes Docker's data disk usage to grow until the disk is 100% full.

The OS that Immich Server is running on

Linux odroid 4.9.337-33 #1 SMP PREEMPT Thu Jul 20 18:05:01 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Version of Immich Server

1.86

Version of Immich Mobile App

1.86

Platform with the issue

  • [X] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

version: '3.8'

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.yml
    #   service: hwaccel
    command: [ "start.sh", "microservices" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - /opt/immich/model-cache:/cache
    env_file:
      - .env
    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
      # remove this to get debug messages
      - GLOG_minloglevel=1
    volumes:
      - /opt/immich/typesense:/data
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:3995fe6ea6a619313e31046bd3c8643f9e70f8f2b294ff82659d409b47d06abb
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:874f566dd512d79cf74f59754833e869ae76ece96716d153b0fa3e64aec88d92
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - /opt/immich/postgres:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
    ports:
      - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always

Your .env content

IMMICH_VERSION=v1.86.0

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=/mnt/Immich

# Connection secrets for postgres and typesense. You should change these to random passwords
TYPESENSE_API_KEY=<>
DB_PASSWORD=<>

MACHINE_LEARNING_REQUEST_THREADS=2

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis

Reproduction steps

I have a fresh installation (1.86) with 13 images uploaded via the CLI. I disabled object tagging and CLIP encoding; only face recognition is enabled. 8 of the 13 images contain people, but no people were detected.

Running `df` before upload:
`/dev/mmcblk0p2               59797828 14665116   45116324  25% /`

After upload:
`/dev/mmcblk0p2               59797828 18588104   41193336  32% /`
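To put a number on the growth, here is a minimal Python sketch that subtracts the two "Used" values from the `df` lines above. It assumes `df` was run with its default 1K-block units, which the report does not state; the constant and function names are made up for this illustration.

```python
# "Used" column copied from the two `df` lines in the report.
# Assumption: df used its default unit of 1 KiB blocks.
USED_BEFORE_KIB = 14_665_116
USED_AFTER_KIB = 18_588_104


def growth_gib(before_kib: int, after_kib: int) -> float:
    """Return the increase in used space, converted from KiB to GiB."""
    return (after_kib - before_kib) / (1024 ** 2)


delta_kib = USED_AFTER_KIB - USED_BEFORE_KIB
delta_gib = growth_gib(USED_BEFORE_KIB, USED_AFTER_KIB)
print(f"used space grew by {delta_kib} KiB (about {delta_gib:.2f} GiB)")
```

Under that assumption the root filesystem grew by several GiB, far more than 13 images plus the model files shown being downloaded in the log would explain on their own.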

Additional information

docker logs immich_machine_learning
[11/15/23 11:37:08] INFO     Starting gunicorn 21.2.0
[11/15/23 11:37:08] INFO     Listening at: http://0.0.0.0:3003 (10)
[11/15/23 11:37:08] INFO     Using worker: uvicorn.workers.UvicornWorker
[11/15/23 11:37:08] INFO     Booting worker with pid: 16
[11/15/23 11:37:36] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:37:36] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:41:53] INFO     Downloading facial recognition model
                             'buffalo_l'.This may take a while.
Downloading (…)d6855/.gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 1.90MB/s]
Downloading (…)dcaf5d6855/README.md: 100%|██████████| 582/582 [00:00<00:00, 1.26MB/s]
Downloading model.onnx: 100%|██████████| 16.9M/16.9M [00:01<00:00, 16.6MB/s]s]
Downloading model.onnx: 100%|██████████| 174M/174M [00:06<00:00, 27.8MB/s]s]
Fetching 4 files: 100%|██████████| 4/4 [00:06<00:00,  1.72s/it]2, 13.2MB/s]
[11/15/23 11:42:00] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:42:30] ERROR    Worker (pid:16) was sent code 132!
[11/15/23 11:42:30] INFO     Booting worker with pid: 45
/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[11/15/23 11:42:52] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:42:52] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:42:52] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:43:18] ERROR    Worker (pid:45) was sent code 132!
[11/15/23 11:43:18] INFO     Booting worker with pid: 66
[11/15/23 11:43:39] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:43:39] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:43:39] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:44:18] ERROR    Worker (pid:66) was sent code 132!
[11/15/23 11:44:18] INFO     Booting worker with pid: 87
[11/15/23 11:44:42] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:44:42] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:44:42] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:45:24] ERROR    Worker (pid:87) was sent code 132!
[11/15/23 11:45:25] INFO     Booting worker with pid: 108
[11/15/23 11:45:45] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:45:45] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:45:45] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:46:07] ERROR    Worker (pid:108) was sent code 132!
[11/15/23 11:46:07] INFO     Booting worker with pid: 129
[11/15/23 11:46:28] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:46:28] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:46:28] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:46:59] ERROR    Worker (pid:129) was sent code 132!
[11/15/23 11:46:59] INFO     Booting worker with pid: 150
[11/15/23 11:47:19] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:47:19] INFO     Initialized request thread pool with 2 threads.
[11/15/23 11:47:19] INFO     Loading facial recognition model 'buffalo_l'
[11/15/23 11:47:47] ERROR    Worker (pid:150) was sent code 132!
[11/15/23 11:47:47] INFO     Booting worker with pid: 170
[11/15/23 11:48:07] INFO     Created in-memory cache with unloading disabled.
[11/15/23 11:48:07] INFO     Initialized request thread pool with 2 threads.
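The repeated "was sent code 132" is worth decoding. The sketch below is a guess at how to read it, not something confirmed in this thread: it assumes the number is either a raw Linux wait status (low 7 bits are the signal number, bit 0x80 is the core-dump flag) or a shell-style 128+signal exit code. For 132 both readings give the same signal number. The function name `decode_worker_code` is hypothetical.

```python
import signal


def decode_worker_code(code: int) -> dict:
    """Decode a number like the 132 in 'Worker (pid:16) was sent code 132!'.

    Assumption: `code` is a raw Linux wait status or a 128+signal exit code.
    In both encodings the low 7 bits hold the signal number.
    """
    signum = code & 0x7F
    # In a raw wait status, bit 0x80 means a core dump was produced.
    # In the 128+signal reading, the same bit is just the 128 offset.
    high_bit_set = bool(code & 0x80)
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return {"signal": signum, "name": name, "high_bit_set": high_bit_set}


info = decode_worker_code(132)
print(info)
```

Signal 4 is SIGILL (illegal instruction), which typically means the process executed a CPU instruction the processor does not support. If the high bit really is the core-dump flag, a core file written on every worker crash could also account for the growing disk usage, but nothing in the report verifies that.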

Log in immich_microservices

[Nest] 6  - 11/15/2023, 11:42:26 AM   ERROR [JobService] Unable to run job handler (recognizeFaces/recognize-faces): TypeError: fetch failed
[Nest] 6  - 11/15/2023, 11:42:26 AM   ERROR [JobService] TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at async MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:16:21)
    at async PersonService.handleRecognizeFaces (/usr/src/app/dist/domain/person/person.service.js:183:23)
    at async /usr/src/app/dist/domain/job/job.service.js:108:37
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:350:28)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:537:24)
[Nest] 6  - 11/15/2023, 11:42:26 AM   ERROR [JobService] Object:
{
  "id": "28dcc33a-512a-4033-97f0-6f3a227c9117",
  "source": "upload"
}

@alextran1502 commented on GitHub (Nov 15, 2023):

Can you try clicking "All" on the Recognize Faces job again?


@mertalev commented on GitHub (Nov 16, 2023):

1.84 updated a lot of dependencies for the ML service. If it works in 1.83, then I suspect an upstream regression for aarch64.


@kleinMaggus commented on GitHub (Nov 16, 2023):

Unfortunately, a rerun does not work either. My current workaround is to disable machine learning and use Remote Machine Learning (https://immich.app/docs/guides/machine-learning) every time I am on my personal computer.


@mertalev commented on GitHub (Nov 16, 2023):

The latest release of onnxruntime mentions a bug fix possibly related to this:

"Mobile bug fixes for crash on some older 64-bit ARM devices and AOT inlining issue on iOS with C# bindings"


@cliffxzx commented on GitHub (Nov 20, 2023):

+1, I faced the same issue. My CPU is an AMD Phenom II X6 1045T, which is a pretty old CPU.


@matveybuk commented on GitHub (Jan 25, 2024):

I was running Llama models inside Docker and found a solution that helped me avoid the warning `UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects`. I simply increased the shared memory for the Docker service running the machine learning model by raising shm_size in my compose file.


Reference: immich-app/immich#1622