After 1.100.0 Version Update, ML Workers get killed #2714

Closed
opened 2026-02-05 06:51:57 +03:00 by OVERLORD · 6 comments
Owner

Originally created by @Adriankor on GitHub (Mar 31, 2024).

The bug

The ML worker is repeatedly killed, so none of the ML container's functions (face recognition, etc.) work for the Immich server.

[03/30/24 21:16:11] INFO Booting worker with pid: 151
[03/30/24 21:16:32] INFO Started server process [151]
[03/30/24 21:16:32] INFO Waiting for application startup.
[03/30/24 21:16:32] INFO Created in-memory cache with unloading after 300s of inactivity.
[03/30/24 21:16:32] INFO Initialized request thread pool with 8 threads.
[03/30/24 21:16:32] INFO Application startup complete.
[03/30/24 21:16:32] INFO Setting 'buffalo_l' execution providers to ['OpenVINOExecutionProvider', 'CPUExecutionProvider'], in descending order of preference
[03/30/24 21:16:32] INFO Setting 'ViT-B-32__openai' execution providers to ['OpenVINOExecutionProvider', 'CPUExecutionProvider'], in descending order of preference
[03/30/24 21:16:32] INFO Loading facial recognition model 'buffalo_l' to memory
[03/30/24 21:16:42] INFO Loading clip model 'ViT-B-32__openai' to memory
[03/30/24 21:18:14] INFO Starting gunicorn 21.2.0
[03/30/24 21:18:14] INFO Listening at: http://[::]:3003 (9)
[03/30/24 21:18:14] INFO Using worker: app.config.CustomUvicornWorker
[03/30/24 21:18:14] INFO Booting worker with pid: 17
[03/30/24 21:18:37] INFO Started server process [17]
[03/30/24 21:18:37] INFO Waiting for application startup.
[03/30/24 21:18:37] INFO Created in-memory cache with unloading after 300s of inactivity.
[03/30/24 21:18:37] INFO Initialized request thread pool with 8 threads.
[03/30/24 21:18:37] INFO Application startup complete.
[03/30/24 21:18:37] INFO Setting 'ViT-B-32__openai' execution providers to ['OpenVINOExecutionProvider', 'CPUExecutionProvider'], in descending order of preference
[03/30/24 21:18:37] INFO Loading clip model 'ViT-B-32__openai' to memory
[03/30/24 21:20:37] CRITICAL WORKER TIMEOUT (pid:17)
[03/30/24 21:20:38] ERROR Worker (pid:17) was sent SIGKILL! Perhaps out of memory?
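
A note on reading the timestamps in this log: the gap between the last "Loading clip model" line and the WORKER TIMEOUT line can be computed directly. The sketch below is only an illustration; the log lines are copied from the report, while the helper names (`parse_timestamp`, `seconds_between`) are made up for this example. A gap of exactly two minutes would be consistent with gunicorn killing the worker on a 120-second timeout while the model was still loading, rather than the kernel killing it for memory. That reading is an inference from the timestamps, not something the log confirms.

```python
from datetime import datetime

# Two lines copied from the ML container log in the report.
LOAD_LINE = "[03/30/24 21:18:37] INFO Loading clip model 'ViT-B-32__openai' to memory"
TIMEOUT_LINE = "[03/30/24 21:20:37] CRITICAL WORKER TIMEOUT (pid:17)"


def parse_timestamp(line: str) -> datetime:
    """Extract the bracketed timestamp at the start of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")


def seconds_between(start_line: str, end_line: str) -> float:
    """Elapsed seconds between the timestamps of two log lines."""
    delta = parse_timestamp(end_line) - parse_timestamp(start_line)
    return delta.total_seconds()


load_to_timeout = seconds_between(LOAD_LINE, TIMEOUT_LINE)
print(load_to_timeout)  # 120.0
```

If the gap were irregular instead of a round number, an out-of-memory kill would be the more likely suspect.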

The OS that Immich Server is running on

Ubuntu 22.04.4

Version of Immich Server

1.100.0

Version of Immich Mobile App

1.100.0

Platform with the issue

  • [x] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

version: '3.8'

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: ['start.sh', 'immich']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - stack.env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/hardware-transcoding
    #   file: hwaccel.transcoding.yml
    #   service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding

    #hardware accel short for intel cpu 
    devices:
      - /dev/dri:/dev/dri
    command: ['start.sh', 'microservices']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - stack.env
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    device_cgroup_rules:
      - "c 189:* rmw"
    devices:
      - /dev/dri:/dev/dri
    volumes:
      #openvino cpu hardware accel
      - /dev/bus/usb:/dev/bus/usb
      - model-cache:/cache
    env_file:
      - stack.env
    restart: always

  redis:
    container_name: immich_redis
    image: registry.hub.docker.com/library/redis:6.2-alpine@sha256:51d6c56749a4243096327e3fb964a48ed92254357108449cb6e23999c37773c5
    restart: always

  database:
    container_name: immich_postgres
    image: registry.hub.docker.com/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

volumes:
  pgdata:
  model-cache:

Your .env content

UPLOAD_LOCATION=/home/adrian/docker/appdata/immich
IMMICH_VERSION=release
DB_PASSWORD=postgres
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=immich_redis

Reproduction steps

1. Repeatedly redeployed the stack with Portainer
2. Deleted the containers' volumes
3. Deployed again
4. Started the backup from the iOS app again

Additional information

I changed some BIOS settings for CPU C-states and ASPM support to reduce power consumption, but nothing else on the server is affected except Immich.
I also enabled the GNA device in the BIOS.

Author
Owner

@alextran1502 commented on GitHub (Mar 31, 2024):

Can you try downgrading to v1.99.0 and verify whether it works?

Author
Owner

@Adriankor commented on GitHub (Mar 31, 2024):

> Can you try downgrading to v1.99.0 and verify whether it works?

Should I downgrade every container or just the ML one?

Author
Owner

@alextran1502 commented on GitHub (Mar 31, 2024):

Try just the ML one
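
For anyone unsure how to downgrade only the ML container with the compose file posted above: every Immich image tag there is built from `${IMMICH_VERSION:-release}`, so changing `IMMICH_VERSION` in the env file moves all three Immich services at once, whereas hardcoding the tag on the `immich-machine-learning` service moves only that one. The sketch below mimics that `${NAME:-default}` substitution to show the difference. It is a rough illustration, not Compose's actual implementation; `interpolate` and `resolve` are names invented here, and the `v1.99.0-openvino` tag is assumed to follow the same pattern as the `release-openvino` tag in the report.

```python
import re

# Matches the `${NAME:-default}` form used in the compose file above.
_VAR_WITH_DEFAULT = re.compile(r"\$\{(\w+):-([^}]*)\}")


def interpolate(template: str, env: dict) -> str:
    """Substitute `${NAME:-default}`: use env[NAME] unless it is unset or empty."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        return env.get(name) or default
    return _VAR_WITH_DEFAULT.sub(replace, template)


# Image templates copied from the compose file in the report.
IMAGES = {
    "immich-server": "ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}",
    "immich-microservices": "ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}",
    "immich-machine-learning": "ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino",
}


def resolve(images: dict, env: dict) -> dict:
    """Resolve every service's image template against the given environment."""
    return {service: interpolate(t, env) for service, t in images.items()}


# Option A: change IMMICH_VERSION in the env file; every Immich service moves.
all_downgraded = resolve(IMAGES, {"IMMICH_VERSION": "v1.99.0"})

# Option B: hardcode the tag on the ML service only and leave the env at `release`.
ml_only = dict(IMAGES)
ml_only["immich-machine-learning"] = "ghcr.io/immich-app/immich-machine-learning:v1.99.0-openvino"
ml_only_downgraded = resolve(ml_only, {"IMMICH_VERSION": "release"})

for service, image in ml_only_downgraded.items():
    print(f"{service}: {image}")
```

With option B, the server and microservices images stay on `release` while only the ML image is pinned.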

Author
Owner

@Adriankor commented on GitHub (Mar 31, 2024):

I didn't know how to change just the ML container, so I changed the env variable IMMICH_VERSION to v1.99.0. Smart Search works now, but the ML container is throwing this new error:

[03/31/24 09:35:31] INFO Initialized request thread pool with 8 threads.
[03/31/24 09:35:31] INFO Application startup complete.
[03/31/24 09:36:40] INFO Setting 'ViT-B-32__openai' execution providers to ['OpenVINOExecutionProvider', 'CPUExecutionProvider'], in descending order of preference
[03/31/24 09:36:40] INFO Loading clip model 'ViT-B-32__openai' to memory
[03/31/24 09:37:13] INFO Setting 'buffalo_l' execution providers to ['OpenVINOExecutionProvider', 'CPUExecutionProvider'], in descending order of preference
[03/31/24 09:37:13] INFO Loading facial recognition model 'buffalo_l' to memory
[03/31/24 09:37:14] ERROR Exception in ASGI application

                         ╭─────── Traceback (most recent call last) ───────╮
                         │ /usr/src/app/main.py:118 in predict             │
                         │                                                 │
                         │   115 │                                         │
                         │   116 │   model = await load(await model_cache. │
                         │       ttl=settings.model_ttl, **kwargs))        │
                         │   117 │   model.configure(**kwargs)             │
                         │ ❱ 118 │   outputs = await run(model.predict, in │
                         │   119 │   return ORJSONResponse(outputs)        │
                         │   120                                           │
                         │   121                                           │
                         │                                                 │
                         │ /usr/src/app/main.py:125 in run                 │
                         │                                                 │
                         │   122 async def run(func: Callable[..., Any], i │
                         │   123 │   if thread_pool is None:               │
                         │   124 │   │   return func(inputs)               │
                         │ ❱ 125 │   return await asyncio.get_running_loop │
                         │   126                                           │
                         │   127                                           │
                         │   128 async def load(model: InferenceModel) ->  │
                         │                                                 │
                         │ /usr/lib/python3.10/concurrent/futures/thread.p │
                         │ y:58 in run                                     │
                         │                                                 │
                         │ /usr/src/app/models/base.py:59 in predict       │
                         │                                                 │
                         │    56 │   │   self.load()                       │
                         │    57 │   │   if model_kwargs:                  │
                         │    58 │   │   │   self.configure(**model_kwargs │
                         │ ❱  59 │   │   return self._predict(inputs)      │
                         │    60 │                                         │
                         │    61 │   @abstractmethod                       │
                         │    62 │   def _predict(self, inputs: Any) -> An │
                         │                                                 │
                         │ /usr/src/app/models/facial_recognition.py:49 in │
                         │ _predict                                        │
                         │                                                 │
                         │   46 │   │   else:                              │
                         │   47 │   │   │   decoded_image = image          │
                         │   48 │   │   assert is_ndarray(decoded_image, n │
                         │ ❱ 49 │   │   bboxes, kpss = self.det_model.dete │
                         │   50 │   │   if bboxes.size == 0:               │
                         │   51 │   │   │   return []                      │
                         │   52 │   │   assert is_ndarray(kpss, np.float32 │
                         │                                                 │
                         │ /opt/venv/lib/python3.10/site-packages/insightf │
                         │ ace/model_zoo/retinaface.py:224 in detect       │
                         │                                                 │
                         │   221 │   │   det_img = np.zeros( (input_size[1 │
                         │   222 │   │   det_img[:new_height, :new_width,  │
                         │   223 │   │                                     │
                         │ ❱ 224 │   │   scores_list, bboxes_list, kpss_li │
                         │   225 │   │                                     │
                         │   226 │   │   scores = np.vstack(scores_list)   │
                         │   227 │   │   scores_ravel = scores.ravel()     │
                         │                                                 │
                         │ /opt/venv/lib/python3.10/site-packages/insightf │
                         │ ace/model_zoo/retinaface.py:152 in forward      │
                         │                                                 │
                         │   149 │   │   kpss_list = []                    │
                         │   150 │   │   input_size = tuple(img.shape[0:2] │
                         │   151 │   │   blob = cv2.dnn.blobFromImage(img, │
                         │       (self.input_mean, self.input_mean, self.i │
                         │ ❱ 152 │   │   net_outs = self.session.run(self. │
                         │   153 │   │                                     │
                         │   154 │   │   input_height = blob.shape[2]      │
                         │   155 │   │   input_width = blob.shape[3]       │
                         │                                                 │
                         │ /opt/venv/lib/python3.10/site-packages/onnxrunt │
                         │ ime/capi/onnxruntime_inference_collection.py:22 │
                         │ 0 in run                                        │
                         │                                                 │
                         │    217 │   │   if not output_names:             │
                         │    218 │   │   │   output_names = [output.name  │
                         │    219 │   │   try:                             │
                         │ ❱  220 │   │   │   return self._sess.run(output │
                         │    221 │   │   except C.EPFail as err:          │
                         │    222 │   │   │   if self._enable_fallback:    │
                         │    223 │   │   │   │   print(f"EP Error: {err!s │
                         ╰─────────────────────────────────────────────────╯
                         RuntimeException: [ONNXRuntimeError] : 6 :         
                         RUNTIME_EXCEPTION : Encountered unknown exception  
                         in Run()                                           
Author
Owner

@Adriankor commented on GitHub (Mar 31, 2024):

Also, here are the logs from microservices:

at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)

[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "2a6edd42-a31e-4b70-af4a-6af1d6e611bd"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "c0a1005e-fa45-472f-8146-3916bda2964c"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "3d19a16c-e547-4670-8653-6e76ffabc669"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "96fd3646-b5aa-42a3-8bb2-6dfba8d713fb"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "d653d30d-a4d5-4008-a308-1f63cac37087"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "3d81774b-45a5-4586-a448-89fb519680cd"
}
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003
at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object:
{
"id": "fd8f8537-a4ca-45a7-b3d4-214524cdfe80"
}
[Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request for facial recognition failed with status 500: Internal Server Error
[Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Error: Machine learning request for facial recognition failed with status 500: Internal Server Error
at MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:23:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23)
at async /usr/src/app/dist/domain/job/job.service.js:137:36
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Object:
{
"id": "e274ea9f-110b-4be9-8fd8-d6d70bc5958d"
}
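
The microservices log above mixes three different failure messages for the same face-detection job, and it can help to tally them separately when triaging. The sketch below does that. The substrings are copied from the log, but the category labels and the `classify` and `tally` helpers are hypothetical names for this example, and the interpretation in the labels is a plausible reading, not a confirmed diagnosis. Counting the "Unable to run job handler" lines in the excerpt gives three "other side closed", three ECONNREFUSED, and one status 500.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical labels mapped to substrings copied from the microservices log.
FAILURE_SIGNATURES = {
    "connection dropped mid-request": "SocketError: other side closed",
    "nothing listening on the port": "connect ECONNREFUSED",
    "ML service answered with an error": "failed with status 500",
}


def classify(line: str) -> Optional[str]:
    """Return the label of the first signature found in the line, if any."""
    for label, needle in FAILURE_SIGNATURES.items():
        if needle in line:
            return label
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count failures per label, once per failed job."""
    counts = Counter()
    for line in lines:
        # Each failure is logged twice; count only the "Unable to run" line.
        if "Unable to run job handler" not in line:
            continue
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


# A few lines copied from the log above (one duplicate "Error:" line included).
SAMPLE = [
    '[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed',
    '[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed',
    '[Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003',
    '[Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request for facial recognition failed with status 500: Internal Server Error',
]

counts = tally(SAMPLE)
for label, count in counts.items():
    print(f"{label}: {count}")
```

Read this way, the sequence would fit an ML worker that dies during a request, is briefly unreachable while it restarts, and then comes back up only to fail inside inference, which matches the ONNX Runtime exception in the ML container log.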

process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21) at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23) at async /usr/src/app/dist/domain/job/job.service.js:137:36 at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28) at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24) [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object: { "id": "d653d30d-a4d5-4008-a308-1f63cac37087" } [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003 [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003 at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19 at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21) at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23) at async /usr/src/app/dist/domain/job/job.service.js:137:36 at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28) at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24) [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object: { "id": "3d81774b-45a5-4586-a448-89fb519680cd" } [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request to 
"http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003 [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with Error: connect ECONNREFUSED 172.18.0.2:3003 at /usr/src/app/dist/infra/repositories/machine-learning.repository.js:19:19 at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:21) at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23) at async /usr/src/app/dist/domain/job/job.service.js:137:36 at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28) at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24) [Nest] 7 - 03/31/2024, 9:42:06 AM ERROR [JobService] Object: { "id": "fd8f8537-a4ca-45a7-b3d4-214524cdfe80" } [Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Unable to run job handler (faceDetection/face-detection): Error: Machine learning request for facial recognition failed with status 500: Internal Server Error [Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Error: Machine learning request for facial recognition failed with status 500: Internal Server Error at MachineLearningRepository.predict (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:23:19) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async PersonService.handleDetectFaces (/usr/src/app/dist/domain/person/person.service.js:248:23) at async /usr/src/app/dist/domain/job/job.service.js:137:36 at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28) at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24) [Nest] 7 - 03/31/2024, 9:42:10 AM ERROR [JobService] Object: { "id": 
"e274ea9f-110b-4be9-8fd8-d6d70bc5958d" }

@mertalev commented on GitHub (Apr 1, 2024):

The facial recognition error with OpenVINO is being tracked in #8226. The timeout in the OP is most likely caused by the model taking too long to compile the first time it loads. You can set `MACHINE_LEARNING_WORKER_TIMEOUT=300` to give the worker more time before it gets killed.
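
For anyone applying the suggestion above, here is a minimal sketch of where the variable could go. It assumes a Docker Compose deployment in which the machine learning service is named `immich-machine-learning` (the hostname that appears in the microservices logs); the service name and file layout in your setup may differ. The value is taken to be in seconds, which matches the log in the OP, where the worker is killed exactly two minutes after startup.

```yaml
# docker-compose.yml (fragment; the service name is an assumption, adjust to your setup)
services:
  immich-machine-learning:
    environment:
      # Give the worker up to 300 s before it is killed, so the first-time
      # model compilation has time to finish (value suggested in this thread).
      - MACHINE_LEARNING_WORKER_TIMEOUT=300
```

If the service already loads its variables from an env file, adding the line `MACHINE_LEARNING_WORKER_TIMEOUT=300` to that file should have the same effect. Either way, the container has to be recreated for the change to be picked up.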


Reference: immich-app/immich#2714