[BUG] Tagging objects fails on Raspberry Pi 5 (worker gets killed) #1669

Closed
opened 2026-02-05 02:58:57 +03:00 by OVERLORD · 7 comments
Owner

Originally created by @Zinurist on GitHub (Nov 21, 2023).

The bug

The `Tag Objects` job fails to run. It goes through the queue slowly, but no tags appear. The machine learning workers seem to get killed for some reason.

I'm running Immich on a Raspberry Pi 5 (8 GB). `Tag Objects Concurrency` is set to 1. `Recognize Faces` works, and `Encode CLIP` presumably does as well (not sure what it does exactly).

A similar error was mentioned in https://github.com/immich-app/immich/issues/5064, but for a different ML job.

I had this problem on 1.87.0 too, but I don't know about before that.

The OS that Immich Server is running on

Debian 12

Version of Immich Server

v1.88.2

Version of Immich Mobile App

1.87.0

Platform with the issue

  • Server

Your docker-compose.yml content

version: "3.8"
services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.yml
    #   service: hwaccel
    command: [ "start.sh", "microservices" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - /mnt/data/immich/model-cache:/cache
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
      # remove this to get debug messages
      - GLOG_minloglevel=1
    volumes:
      - /mnt/data/immich/tsdata:/data
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - /mnt/data/immich/pgdata:/var/lib/postgresql/data
    restart: always

Your .env content

UPLOAD_LOCATION=/mnt/data/immich/data
IMMICH_VERSION="v1.88.2"
TYPESENSE_API_KEY=...
DB_PASSWORD=...
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=immich_redis

Reproduction steps

1. On the Jobs page, I start the `Tag Objects` job.
2. The job starts processing objects one at a time.
3. No tags appear, and the logs show the worker being killed.

Additional information

immich_microservices logs this for each job run:

[Nest] 7  - 11/21/2023, 5:04:13 PM   ERROR [JobService] Unable to run job handler (objectTagging/classify-image): TypeError: fetch failed
[Nest] 7  - 11/21/2023, 5:04:13 PM   ERROR [JobService] TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at async MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:16:21)
    at async SmartInfoService.handleClassifyImage (/usr/src/app/dist/domain/smart-info/smart-info.service.js:55:22)
    at async /usr/src/app/dist/domain/job/job.service.js:108:37
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:385:28)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:572:24)
[Nest] 7  - 11/21/2023, 5:04:13 PM   ERROR [JobService] Object:
{
  "id": "b128444b-7259-4a1a-a746-6553e9ad2d79"
}
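`TypeError: fetch failed` is Node's undici error for a transport-level failure: the HTTP request to the ML container was refused or reset mid-request (here, because the ML worker segfaulted), not an application-level error from the ML service. The same failure mode can be reproduced with any client hitting a port whose listener has died (a generic sketch, not an Immich-specific command; 3003 is the ML service's internal port, visible in the gunicorn command line further down in this thread):

```shell
# A request to a port with no live listener fails at the transport layer,
# which is what surfaces in Node as "TypeError: fetch failed".
curl -sS --max-time 5 http://127.0.0.1:3003/ping \
  || echo "transport-level failure (curl exit code $?)"
```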

immich_machine_learning logs this:

[11/21/23 17:04:41] INFO     Booting worker with pid: 42                        
[11/21/23 17:04:53] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[11/21/23 17:04:53] INFO     Initialized request thread pool with 4 threads.    
[11/21/23 17:04:53] INFO     Loading image classification model                 
                             'microsoft/resnet-50'                              
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
[11/21/23 17:04:53] INFO     ONNX model not found in cache directory for        
                             'microsoft/resnet-50'.Exporting optimized model for
                             future use.                                        
Framework not specified. Using pt to export to ONNX.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
/opt/venv/lib/python3.11/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
  warnings.warn(
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using framework PyTorch: 2.1.0
/opt/venv/lib/python3.11/site-packages/transformers/models/resnet/modeling_resnet.py:95: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if num_channels != self.num_channels:
[11/21/23 17:05:27] ERROR    Worker (pid:42) was sent code 139!                 
[11/21/23 17:05:27] INFO     Booting worker with pid: 55
...

The above repeats for every job run. Does that mean the `Exporting optimized model for future use` step is also failing?

The time until a worker is killed seems to be about 45 seconds. I see no significant increase in memory (<3 GB) or CPU usage during that time. Memory climbs at the start until the last line before the error (`if num_channels != self.num_channels:`) is printed, then drops, and the worker is killed a while later.
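For reference, gunicorn's `was sent code 139` is the worker's exit status: POSIX shells and process supervisors encode death-by-signal N as 128 + N, and SIGSEGV is signal 11, so 139 means the worker segfaulted. The encoding can be checked in any POSIX shell:

```shell
# A child killed by SIGSEGV (signal 11) is reported with exit status 128 + 11.
sh -c 'kill -SEGV $$'
echo "exit status: $?"   # 139
```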


@Kezii commented on GitHub (Nov 23, 2023):

I'm having the same issue with the same logs on a Rockchip rk3566 (aarch64).

It looks like Python is crashing continuously, and the core dumps are hogging system resources.

> sudo coredumpctl list --reverse -n 20
TIME                          PID UID GID SIG     COREFILE EXE                         SIZE
Wed 2023-11-22 22:30:13 CET 75687   0   0 SIGSEGV present  /usr/local/bin/python3.11 629.0M
Wed 2023-11-22 22:25:57 CET 75573   0   0 SIGSEGV present  /usr/local/bin/python3.11 414.1M
Wed 2023-11-22 22:19:02 CET 75317   0   0 SIGSEGV present  /usr/local/bin/python3.11 647.3M
Wed 2023-11-22 22:14:15 CET 75237   0   0 SIGSEGV present  /usr/local/bin/python3.11 409.5M
Wed 2023-11-22 22:09:44 CET 75081   0   0 SIGSEGV present  /usr/local/bin/python3.11 630.6M
Wed 2023-11-22 22:04:35 CET 75026   0   0 SIGSEGV present  /usr/local/bin/python3.11 408.6M
Wed 2023-11-22 22:02:07 CET 74941   0   0 SIGSEGV present  /usr/local/bin/python3.11 408.0M
Wed 2023-11-22 21:43:10 CET 74459   0   0 SIGSEGV present  /usr/local/bin/python3.11 415.2M
> sudo coredumpctl info
           PID: 75795 (gunicorn)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Wed 2023-11-22 22:30:14 CET (6min ago)
  Command Line: /opt/venv/bin/python /opt/venv/bin/gunicorn app.main:app -k uvicorn.workers.UvicornWorker -w 1 -b 0.0.0.0:3003 -t 120 --log-config-json log_conf.json
    Executable: /usr/local/bin/python3.11
 Control Group: /system.slice/docker-[...].scope
          Unit: docker-[...].scope
         Slice: system.slice
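Since each crash leaves a 400-600 MB core file, the dumps themselves eat disk quickly while this bug persists. Already-stored dumps can be deleted from `/var/lib/systemd/coredump`, and systemd-coredump can be told not to store new ones (a standard systemd-coredump option, nothing Immich-specific):

```ini
# /etc/systemd/coredump.conf -- drop new core dumps instead of storing them.
# The file is read per crash by the socket-activated handler; no restart needed.
[Coredump]
Storage=none
```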

@sannidhyaroy commented on GitHub (Nov 26, 2023):

I am having the same issue. I'm running Immich using the Docker recommended method on a Raspberry Pi 4 (8 GB RAM).


@mteij commented on GitHub (Nov 27, 2023):

^ Same issue on a Raspberry Pi 4 (8 GB, Ubuntu, newest version of Immich).


@TomasValenta commented on GitHub (Dec 2, 2023):

I have a similar/same issue on a Raspberry Pi 4 (8 GB); logs and config are attached in this thread: https://github.com/imagegenius/docker-immich/issues/244


@TomasValenta commented on GitHub (Dec 2, 2023):

Additional info: I have migrated now back to version 1.86.0 and ML works fine in this version for me.


@jginternational commented on GitHub (Dec 4, 2023):

Downgrading to 1.86.0 did not work for me.

> Additional info: I have migrated now back to version 1.86.0 and ML works fine in this version.

Also having the same problem on a Raspberry Pi 4 (8 GB, arm64).


@196693 commented on GitHub (Dec 7, 2023):

Similar problem with AMD 200GE x86


Reference: immich-app/immich#1669