The openvino variant of the machine learning container responds differently to probes #2971

Closed
opened 2026-02-05 07:15:27 +03:00 by OVERLORD · 14 comments
Owner

Originally created by @djjudas21 on GitHub (Apr 19, 2024).

Originally assigned to: @mertalev on GitHub.

The bug

I have been using Immich with Kubernetes without problems, via the Helm chart. This week I have been working on enabling GPU support.

I noticed that a deployment of the image immich-app/immich-machine-learning:v1.101.0 works completely fine, but when I switch to immich-app/immich-machine-learning:v1.101.0-openvino with no other changes, that container gets into CrashLoopBackOff after the model is loaded, because the liveness probes fail and the container gets repeatedly restarted.

This issue is specific to Kubernetes, but it clearly demonstrates that there is some kind of difference in behaviour between the standard and openvino variants of the image, so I feel this issue belongs here.

The OS that Immich Server is running on

Kubernetes

Version of Immich Server

v1.101.0

Version of Immich Mobile App

1.101.0 build.147

Platform with the issue

  • [x] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

machine-learning:
  enabled: true
  image:
    tag: v1.101.0-openvino
  env:
    TRANSFORMERS_CACHE: /cache
  resources:
    requests:
      memory: 1500Mi
      cpu: 1
      gpu.intel.com/i915: 1
    limits:
      gpu.intel.com/i915: 1

Your .env content

N/A

Reproduction steps

1. Make a deployment of the openvino machine-learning image
2. Everything is fine
3. Upload an image to trigger the ML container to do something
4. ML container loads a model, stops responding to probes
5. Kubernetes restarts the container
6. GOTO 3

Relevant log output

[jonathan@poseidon immich]$ kubectl logs -f immich-machine-learning-696fc76fff-hg5bx
[04/19/24 13:45:03] INFO     Starting gunicorn 21.2.0                           
[04/19/24 13:45:03] INFO     Listening at: http://[::]:3003 (9)                 
[04/19/24 13:45:03] INFO     Using worker: app.config.CustomUvicornWorker       
[04/19/24 13:45:03] INFO     Booting worker with pid: 13                        
[04/19/24 13:45:06] INFO     Started server process [13]                        
[04/19/24 13:45:06] INFO     Waiting for application startup.                   
[04/19/24 13:45:06] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[04/19/24 13:45:06] INFO     Initialized request thread pool with 4 threads.    
[04/19/24 13:45:06] INFO     Application startup complete.                      
## Everything is fine up until now. Here I upload a new image to Immich and
## as soon as the model is loaded, the container stops responding to probes, so
## Kubernetes kills it.
[04/19/24 13:48:10] INFO     Setting 'ViT-B-32__openai' execution providers to  
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference                                         
[04/19/24 13:48:10] INFO     Setting 'buffalo_l' execution providers to         
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference                                         
[04/19/24 13:48:10] INFO     Loading clip model 'ViT-B-32__openai' to memory    
## Container is killed shortly after here.



Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   Scheduled               3m41s               default-scheduler        Successfully assigned immich/immich-machine-learning-696fc76fff-hg5bx to kube04
  Normal   SuccessfulAttachVolume  3m40s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-0ccd4181-398b-4d3c-ad97-bc876159b7c4"
  Warning  Unhealthy               3m35s               kubelet                  Readiness probe failed: Get "http://10.1.102.109:3003/ping": dial tcp 10.1.102.109:3003: connect: connection refused
  Warning  Unhealthy               5s (x3 over 25s)    kubelet                  Liveness probe failed: Get "http://10.1.102.109:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing                 5s                  kubelet                  Container immich-machine-learning failed liveness probe, will be restarted
  Warning  Unhealthy               4s                  kubelet                  Readiness probe failed: Get "http://10.1.102.109:3003/ping": read tcp 192.168.0.55:41504->10.1.102.109:3003: read: connection reset by peer
  Normal   Pulled                  4s (x2 over 3m35s)  kubelet                  Container image "ghcr.io/immich-app/immich-machine-learning:v1.101.0-openvino" already present on machine
  Normal   Created                 4s (x2 over 3m35s)  kubelet                  Created container immich-machine-learning
  Normal   Started                 4s (x2 over 3m35s)  kubelet                  Started container immich-machine-learning
  Warning  Unhealthy               1s (x7 over 3m33s)  kubelet                  Readiness probe failed: Get "http://10.1.102.109:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Additional information

No response


@bo0tzz commented on GitHub (Apr 19, 2024):

I think the probes for the ML deployment don't quite work right anyways, and if a model takes too long to load (or if it's being downloaded at startup through the preload env var) I've had my non-openvino instances occasionally get killed as well. It could be that the openvino image just takes slightly longer to load things, thus exposing that issue?


@djjudas21 commented on GitHub (Apr 19, 2024):

I'll have a play with the probes and see if I can get it more stable. The default values look pretty aggressive IMO. Timeout of 1s, period of 10s. If a probe fails, it gets retried immediately - so the liveness probe could actually kill the container after just 3 seconds, which is insane.

    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /ping
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
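
For reference, a more forgiving configuration might look something like this (illustrative values only, not the chart's defaults):

    livenessProbe:
      failureThreshold: 6     # tolerate up to a minute of failed probes
      httpGet:
        path: /ping
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5       # allow slow responses while a model compiles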

@djjudas21 commented on GitHub (Apr 19, 2024):

Eww, they're hard-coded... https://github.com/immich-app/immich-charts/blob/main/charts/immich/templates/machine-learning.yaml#L28


@djjudas21 commented on GitHub (Apr 19, 2024):

OK so I manually configured the probes to have longer timeouts and a greater tolerance of failed probes. It didn't get killed by Kubernetes, but it crashed in a nasty way. I'm well out of my depth with this.

[jonathan@poseidon immich]$ kubectl logs -f immich-machine-learning-78f87fcc4b-5s84r
[04/19/24 15:18:49] INFO     Starting gunicorn 21.2.0                           
[04/19/24 15:18:49] INFO     Listening at: http://[::]:3003 (10)                
[04/19/24 15:18:49] INFO     Using worker: app.config.CustomUvicornWorker       
[04/19/24 15:18:49] INFO     Booting worker with pid: 14                        
[04/19/24 15:18:52] INFO     Started server process [14]                        
[04/19/24 15:18:52] INFO     Waiting for application startup.                   
[04/19/24 15:18:52] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[04/19/24 15:18:52] INFO     Initialized request thread pool with 4 threads.    
[04/19/24 15:18:52] INFO     Application startup complete.                      
[04/19/24 15:18:54] INFO     Setting 'ViT-B-32__openai' execution providers to  
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference                                         
[04/19/24 15:18:54] INFO     Loading clip model 'ViT-B-32__openai' to memory    
[04/19/24 15:18:54] INFO     Setting 'buffalo_l' execution providers to         
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference                                         
[04/19/24 15:20:19] INFO     Loading facial recognition model 'buffalo_l' to    
                             memory                                             
[04/19/24 15:21:11] ERROR    Exception in ASGI application                      
                                                                                
                             ╭─────── Traceback (most recent call last) ───────╮
                             │ /usr/src/app/main.py:118 in predict             │
                             │                                                 │
                             │   115 │                                         │
                             │   116 │   model = await load(await model_cache. │
                             │       ttl=settings.model_ttl, **kwargs))        │
                             │   117 │   model.configure(**kwargs)             │
                             │ ❱ 118 │   outputs = await run(model.predict, in │
                             │   119 │   return ORJSONResponse(outputs)        │
                             │   120                                           │
                             │   121                                           │
                             │                                                 │
                             │ /usr/src/app/main.py:125 in run                 │
                             │                                                 │
                             │   122 async def run(func: Callable[..., Any], i │
                             │   123 │   if thread_pool is None:               │
                             │   124 │   │   return func(inputs)               │
                             │ ❱ 125 │   return await asyncio.get_running_loop │
                             │   126                                           │
                             │   127                                           │
                             │   128 async def load(model: InferenceModel) ->  │
                             │                                                 │
                             │ /usr/lib/python3.10/concurrent/futures/thread.p │
                             │ y:58 in run                                     │
                             │                                                 │
                             │ /usr/src/app/models/base.py:59 in predict       │
                             │                                                 │
                             │    56 │   │   self.load()                       │
                             │    57 │   │   if model_kwargs:                  │
                             │    58 │   │   │   self.configure(**model_kwargs │
                             │ ❱  59 │   │   return self._predict(inputs)      │
                             │    60 │                                         │
                             │    61 │   @abstractmethod                       │
                             │    62 │   def _predict(self, inputs: Any) -> An │
                             │                                                 │
                             │ /usr/src/app/models/facial_recognition.py:49 in │
                             │ _predict                                        │
                             │                                                 │
                             │   46 │   │   else:                              │
                             │   47 │   │   │   decoded_image = image          │
                             │   48 │   │   assert is_ndarray(decoded_image, n │
                             │ ❱ 49 │   │   bboxes, kpss = self.det_model.dete │
                             │   50 │   │   if bboxes.size == 0:               │
                             │   51 │   │   │   return []                      │
                             │   52 │   │   assert is_ndarray(kpss, np.float32 │
                             │                                                 │
                             │ /opt/venv/lib/python3.10/site-packages/insightf │
                             │ ace/model_zoo/retinaface.py:224 in detect       │
                             │                                                 │
                             │   221 │   │   det_img = np.zeros( (input_size[1 │
                             │   222 │   │   det_img[:new_height, :new_width,  │
                             │   223 │   │                                     │
                             │ ❱ 224 │   │   scores_list, bboxes_list, kpss_li │
                             │   225 │   │                                     │
                             │   226 │   │   scores = np.vstack(scores_list)   │
                             │   227 │   │   scores_ravel = scores.ravel()     │
                             │                                                 │
                             │ /opt/venv/lib/python3.10/site-packages/insightf │
                             │ ace/model_zoo/retinaface.py:152 in forward      │
                             │                                                 │
                             │   149 │   │   kpss_list = []                    │
                             │   150 │   │   input_size = tuple(img.shape[0:2] │
                             │   151 │   │   blob = cv2.dnn.blobFromImage(img, │
                             │       (self.input_mean, self.input_mean, self.i │
                             │ ❱ 152 │   │   net_outs = self.session.run(self. │
                             │   153 │   │                                     │
                             │   154 │   │   input_height = blob.shape[2]      │
                             │   155 │   │   input_width = blob.shape[3]       │
                             │                                                 │
                             │ /opt/venv/lib/python3.10/site-packages/onnxrunt │
                             │ ime/capi/onnxruntime_inference_collection.py:22 │
                             │ 0 in run                                        │
                             │                                                 │
                             │    217 │   │   if not output_names:             │
                             │    218 │   │   │   output_names = [output.name  │
                             │    219 │   │   try:                             │
                             │ ❱  220 │   │   │   return self._sess.run(output │
                             │    221 │   │   except C.EPFail as err:          │
                             │    222 │   │   │   if self._enable_fallback:    │
                             │    223 │   │   │   │   print(f"EP Error: {err!s │
                             ╰─────────────────────────────────────────────────╯
                             RuntimeException: [ONNXRuntimeError] : 6 :         
                             RUNTIME_EXCEPTION : Encountered unknown exception  
                             in Run()                                           

@mertalev commented on GitHub (Apr 19, 2024):

For the first issue, models take some time to load the first time when using OpenVINO, since they're compiled to that format. But everything is supposed to happen in a separate thread pool without blocking the main thread, so I'm not sure what's causing this to block.

For the second issue, OpenVINO is buggy and doesn't currently work with facial recognition, see #8226.


@djjudas21 commented on GitHub (Apr 19, 2024):

Thanks for clarifying about the OpenVINO bug, I'm following that one now.

The startup time for the machine learning container isn't straightforward though. If it immediately loaded the models when the container started, that would be fine because you can define a startupProbe in Kubernetes to protect the container during this period. But instead, the container starts up quickly and the model is not loaded until a request for ML comes in from Immich, which may happen a long time after the container has started, so there is no way to work around this in Kubernetes, other than to make the probes very generous.

I'm no software engineer, but it does sound odd that the main thread gets blocked.

In my case, I have no choice but to disable GPU acceleration for ML if OpenVINO doesn't support it, but I'll leave this issue open because it sounds like there is some investigation to be done with the thread pools, as @bo0tzz said that also affects the non-openvino container.

Thanks for your help, @mertalev. Have a nice weekend!


@mertalev commented on GitHub (Apr 19, 2024):

There are envs to preload certain models if it helps: setting MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l will eagerly load those models at startup without waiting for a request.
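
Applied to the Helm values snippet earlier in the thread, the preload settings would look something like this (a sketch; the variable names come from the comment above, the structure from the reporter's values):

    machine-learning:
      env:
        TRANSFORMERS_CACHE: /cache
        MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
        MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l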


@mertalev commented on GitHub (Jun 19, 2024):

I think this is because of the Python GIL. When it compiles to OpenVINO, it holds onto the GIL and prevents other threads from executing. That means it can't respond to probes during that time. Not sure what I can do about that short of putting it in a subprocess or something.
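
The subprocess idea can be sketched as follows (hypothetical names throughout; this is not Immich's actual code). A process pool gives the compilation step its own interpreter, and therefore its own GIL, so the server process's event loop can keep answering /ping while the model compiles:

```python
# Sketch only: offload a GIL-holding model load to a subprocess so the
# asyncio event loop (which serves /ping) stays responsive.
import asyncio
from concurrent.futures import ProcessPoolExecutor


def compile_model(model_name: str) -> str:
    # Placeholder for the GIL-holding OpenVINO compilation step.
    # Running in a child process, it cannot starve the server's threads.
    return f"{model_name}-compiled"


async def load_model(model_name: str) -> str:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=1) as pool:
        # The await yields control back to the event loop while the
        # child process does the heavy work.
        return await loop.run_in_executor(pool, compile_model, model_name)


async def main() -> None:
    result = await load_model("ViT-B-32__openai")
    print(result)  # prints "ViT-B-32__openai-compiled"


if __name__ == "__main__":
    asyncio.run(main())
```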


@djjudas21 commented on GitHub (Aug 6, 2024):

I can confirm this is still a problem. With #8226 now resolved, I had another crack at enabling hardware acceleration for my ML. I'm using the immich-machine-learning:main-openvino image. It starts up properly and appears stable at idle, but once the first ML job is started, the container stops responding and gets killed.

[jonathan@latitude immich]$ kubectl logs -f immich-machine-learning-66cbb9cbb8-n4wqt
[08/06/24 16:23:27] INFO     Starting gunicorn 22.0.0                           
@djjudas21 commented on GitHub (Aug 6, 2024):

I can confirm this is still a problem. With #8226 now resolved, I had another crack at enabling hardware acceleration for my ML. I'm using the `immich-machine-learning:main-openvino` image. It starts up properly and appears stable at idle, but once the first ML job is started, the container stops responding and gets killed.

[jonathan@latitude immich]$ kubectl logs -f immich-machine-learning-66cbb9cbb8-n4wqt
[08/06/24 16:23:27] INFO     Starting gunicorn 22.0.0
[08/06/24 16:23:27] INFO     Listening at: http://[::]:3003 (9)
[08/06/24 16:23:27] INFO     Using worker: app.config.CustomUvicornWorker
[08/06/24 16:23:27] INFO     Booting worker with pid: 10
[08/06/24 16:23:32] INFO     Started server process [10]
[08/06/24 16:23:32] INFO     Waiting for application startup.
[08/06/24 16:23:32] INFO     Created in-memory cache with unloading after 300s
                             of inactivity.
[08/06/24 16:23:32] INFO     Initialized request thread pool with 4 threads.
[08/06/24 16:23:32] INFO     Application startup complete.
[08/06/24 16:58:04] INFO     Loading visual model 'ViT-B-32__openai' to memory
[08/06/24 16:58:04] INFO     Setting execution providers to
                             ['OpenVINOExecutionProvider',
                             'CPUExecutionProvider'], in descending order of
                             preference

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Normal   Created    31s (x2 over 35m)  kubelet  Created container immich-machine-learning
  Warning  Unhealthy  31s (x3 over 51s)  kubelet  Liveness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    31s                kubelet  Container immich-machine-learning failed liveness probe, will be restarted
  Warning  Unhealthy  31s                kubelet  Readiness probe failed: Get "http://10.1.199.105:3003/ping": read tcp 192.168.0.57:59722->10.1.199.105:3003: read: connection reset by peer
  Normal   Pulled     31s                kubelet  Container image "ghcr.io/immich-app/immich-machine-learning:main-openvino" already present on machine
  Normal   Started    30s (x2 over 35m)  kubelet  Started container immich-machine-learning
  Warning  Unhealthy  30s (x3 over 35m)  kubelet  Readiness probe failed: Get "http://10.1.199.105:3003/ping": dial tcp 10.1.199.105:3003: connect: connection refused
  Warning  Unhealthy  28s (x5 over 35m)  kubelet  Readiness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@djjudas21 commented on GitHub (Aug 6, 2024):

Starting up with `MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai` and `MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l` does indeed load the models up front, but that just causes the container to fail its probes earlier.

@mertalev commented on GitHub (Aug 6, 2024):

The models only take that long the first time they're compiled. You can preload them and set a really high timeout on the probes to give the compilation time to finish.
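
If the chart exposed it, the idiomatic Kubernetes way to budget for this one-off compilation would be a `startupProbe`: the kubelet allows up to `failureThreshold × periodSeconds` before liveness checks take over. A sketch of such a probe spec, assuming the `/ping` endpoint and port 3003 seen in the logs above; the threshold numbers are illustrative, not tested:

```yaml
# Hypothetical probe spec for the machine-learning container.
startupProbe:
  httpGet:
    path: /ping
    port: 3003
  periodSeconds: 10
  failureThreshold: 60   # tolerates up to 10 minutes of first-time model compilation
livenessProbe:
  httpGet:
    path: /ping
    port: 3003
  periodSeconds: 10
  timeoutSeconds: 5
```

Once the startup probe has succeeded once, the normal liveness probe applies, so steady-state failure detection stays fast.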

@djjudas21 commented on GitHub (Aug 6, 2024):

Good idea. Unfortunately that brings me to a second issue with the Helm chart, where [the `startupProbe` is hardcoded to false](https://github.com/immich-app/immich-charts/blob/main/charts/immich/templates/machine-learning.yaml#L17), but I have added `initialDelaySeconds` to the liveness and readiness probes.

Here's my full, working ML block for a Kubernetes deployment via Helm chart, because examples/documentation seem a bit lacking for this 🙂

machine-learning:
  enabled: true
  image:
    # Specify OpenVINO image variant
    tag: main-openvino
  env:
    TRANSFORMERS_CACHE: /cache
    # Load ML models up front
    MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l
  resources:
    requests:
      memory: 1500Mi
      cpu: 1
      gpu.intel.com/i915: 1
    limits:
      gpu.intel.com/i915: 1
  # Override probes to allow slow ML startup
  probes:
    liveness:
      spec:
        initialDelaySeconds: 120
    readiness:
      spec:
        initialDelaySeconds: 120
  persistence:
    cache:
      enabled: true
      size: 10Gi
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      accessMode: ReadWriteMany
      storageClass: truenas

@hranicka commented on GitHub (Apr 23, 2025):

Having the same issue. Probes are not responding and the pods are being shut down. Using the official Helm chart and image v1.131.3-openvino.

As a workaround, after a deployment, I had to patch the deployment:

kubectl patch deployment immich-machine-learning -n immich \
  --type json \
  -p='[
    {"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"},
    {"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"},
    {"op":"remove","path":"/spec/template/spec/containers/0/startupProbe"}
  ]'

Setting this in values.yaml did not help:

  image:
    repository: ghcr.io/immich-app/immich-machine-learning
    tag: v1.131.3-openvino

  probes:
    liveness:
      enabled: false
    readiness:
      enabled: false
    startup:
      enabled: false
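
A gentler alternative to the JSON patch above, which deletes the probes outright, would be to lengthen them instead so slow model compilation doesn't trigger restarts. A sketch as a strategic-merge patch file (the container name matches the events output earlier in this thread; the timeout values are illustrative assumptions, not tested against this chart):

```yaml
# probe-patch.yaml — hypothetical; apply with:
#   kubectl patch deployment immich-machine-learning -n immich --patch-file probe-patch.yaml
spec:
  template:
    spec:
      containers:
        - name: immich-machine-learning
          livenessProbe:
            timeoutSeconds: 30
            failureThreshold: 10
          readinessProbe:
            timeoutSeconds: 30
            failureThreshold: 10
```

This keeps failure detection in place while giving the container roughly `failureThreshold × periodSeconds` of grace before it is killed.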

@jrasm91 commented on GitHub (Sep 19, 2025):

Sounds like this issue is essentially resolved, and that potentially a separate issue should be opened in https://github.com/immich-app/immich-charts for probe configuration or default values for the machine learning pod.

Reference: immich-app/immich#2971