Machine-learning crashes when loading model on startup #2586

Closed
opened 2026-02-05 06:13:46 +03:00 by OVERLORD · 5 comments

Originally created by @bo0tzz on GitHub (Mar 13, 2024).

```
[03/13/24 08:42:29] INFO     Starting gunicorn 21.2.0
[03/13/24 08:42:29] INFO     Listening at: http://0.0.0.0:3003 (9)
[03/13/24 08:42:29] INFO     Using worker: app.config.CustomUvicornWorker
[03/13/24 08:42:29] INFO     Booting worker with pid: 13
[03/13/24 08:42:29] DEBUG    Could not load ANN shared libraries, using ONNX: libmali.so: cannot open shared object file: No such file or directory
[03/13/24 08:42:33] INFO     Started server process [13]
[03/13/24 08:42:33] INFO     Waiting for application startup.
[03/13/24 08:42:33] INFO     Created in-memory cache with unloading after 300s of inactivity.
[03/13/24 08:42:33] INFO     Initialized request thread pool with 4 threads.
[03/13/24 08:42:33] INFO     Preloading models: clip='ViT-H-14-378-quickgelu__dfn5b' facial_recognition=None
[03/13/24 08:42:33] DEBUG    Available ORT providers: {'CPUExecutionProvider', 'AzureExecutionProvider'}
[03/13/24 08:42:33] INFO     Setting 'ViT-H-14-378-quickgelu__dfn5b' execution providers to ['CPUExecutionProvider'], in descending order of preference
[03/13/24 08:42:33] DEBUG    Setting execution provider options to [{'arena_extend_strategy': 'kSameAsRequested'}]
[03/13/24 08:42:33] DEBUG    Setting execution_mode to ORT_SEQUENTIAL
[03/13/24 08:42:33] DEBUG    Setting inter_op_num_threads to 1
[03/13/24 08:42:33] DEBUG    Setting intra_op_num_threads to 2
[03/13/24 08:42:33] DEBUG    Setting preferred runtime to onnx
[03/13/24 08:42:33] DEBUG    Checking for inactivity...
[03/13/24 08:42:33] INFO     Loading clip model 'ViT-H-14-378-quickgelu__dfn5b' to memory
[03/13/24 08:42:33] DEBUG    Loading clip text model 'ViT-H-14-378-quickgelu__dfn5b'
[03/13/24 08:42:34] DEBUG    Loaded clip text model 'ViT-H-14-378-quickgelu__dfn5b'
[03/13/24 08:42:34] DEBUG    Loading clip vision model 'ViT-H-14-378-quickgelu__dfn5b'
[03/13/24 08:42:35] ERROR    Traceback (most recent call last):
                               File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 734, in lifespan
                                 async with self.lifespan_context(app) as maybe_state:
                               File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
                                 return await anext(self.gen)
                               File "/usr/src/app/main.py", line 55, in lifespan
                                 await preload_models(settings.preload)
                               File "/usr/src/app/main.py", line 69, in preload_models
                                 await load(await model_cache.get(preload_models.clip, ModelType.CLIP))
                               File "/usr/src/app/main.py", line 137, in load
                                 await run(_load, model)
                               File "/usr/src/app/main.py", line 125, in run
                                 return await asyncio.get_running_loop().run_in_executor(thread_pool, func, inputs)
                               File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
                                 result = self.fn(*self.args, **self.kwargs)
                               File "/usr/src/app/main.py", line 134, in _load
                                 model.load()
                               File "/usr/src/app/models/base.py", line 53, in load
                                 self._load()
                               File "/usr/src/app/models/clip.py", line 146, in _load
                                 super()._load()
                               File "/usr/src/app/models/clip.py", line 41, in _load
                                 self.vision_model = self._make_session(self.visual_path)
                               File "/usr/src/app/models/base.py", line 121, in _make_session
                                 session = ort.InferenceSession(
                               File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
                                 self._create_inference_session(providers, provider_options, disabled_optimizers)
                               File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 483, in _create_inference_session
                                 sess.initialize_session(providers, provider_options, disabled_optimizers)
                             onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Deserialize tensor onnx::MatMul_6214 failed.GetFileLength for /cache/clip/ViT-H-14-378-quickgelu__dfn5b/visual/Constant_7383_attr__value failed:Invalid fd was supplied: -1

[03/13/24 08:42:35] ERROR    Application startup failed. Exiting.
[03/13/24 08:42:35] INFO     Worker exiting (pid: 13)
[03/13/24 08:42:35] ERROR    Worker (pid:13) exited with code 3
[03/13/24 08:42:35] ERROR    Shutting down: Master
[03/13/24 08:42:35] ERROR    Reason: Worker failed to boot.
```

This might be because the model download was interrupted midway through? The download also seems to be taking an unreasonably long time for me, and I'm not getting any progress indication.
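If the download was indeed cut off, a truncated or missing external-data file would explain the `GetFileLength ... Invalid fd was supplied: -1` failure above. As a hypothetical diagnostic (not part of Immich), a short script can list file sizes under the model's cache directory so a zero-byte tensor file stands out; the path below is taken from the traceback and may need adjusting for your cache mount:

```python
from pathlib import Path

# Hypothetical diagnostic, not part of Immich: list file sizes under the model's
# cache directory so a truncated or zero-byte external-data tensor (such as the
# visual/Constant_7383_attr__value file named in the traceback) stands out.
cache_dir = Path("/cache/clip/ViT-H-14-378-quickgelu__dfn5b")  # path from the log above

for f in sorted(cache_dir.rglob("*")):
    if f.is_file():
        size = f.stat().st_size
        marker = "  <-- suspicious" if size == 0 else ""
        print(f"{size:>12}  {f.relative_to(cache_dir)}{marker}")
```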

Env vars used:

```env
MACHINE_LEARNING_PRELOAD__CLIP: "ViT-H-14-378-quickgelu__dfn5b"
MACHINE_LEARNING_WORKER_TIMEOUT: 3600
TRANSFORMERS_CACHE: /cache
```
OVERLORD added the 🧠machine-learning label 2026-02-05 06:13:46 +03:00

@lexcao1729 commented on GitHub (Mar 14, 2024):

I have the same problem.


@bo0tzz commented on GitHub (Mar 14, 2024):

I ended up getting things to work by clearing out the model cache and significantly increasing the timeout to give ample time to download the new model. However, bad state like this should still not cause a crash if it does happen.
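For reference, a minimal sketch of that workaround, assuming the cache layout shown in the traceback above (stop the machine-learning container first, and adjust the path for your cache mount):

```python
import shutil
from pathlib import Path

# Sketch of the workaround described above: delete the (likely corrupted) cached
# model directory so the service re-downloads it on the next startup.
# The path is taken from the traceback; it is not an official Immich tool.
model_dir = Path("/cache/clip/ViT-H-14-378-quickgelu__dfn5b")
if model_dir.exists():
    shutil.rmtree(model_dir)
    print(f"Removed {model_dir}; the model will be re-downloaded on next startup")
```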


@mertalev commented on GitHub (Mar 19, 2024):

Can you confirm if it still times out with the default limit if you increase request threads? I'm wondering if it's because there aren't enough threads to go around with only 4.


@lexcao1729 commented on GitHub (Mar 19, 2024):

> I ended up getting things to work by clearing out the model cache and significantly increasing the timeout to give ample time to download the new model. However, bad state like this should still not cause a crash if it does happen.

This works, thank you.


@jrasm91 commented on GitHub (Sep 7, 2024):

Pretty sure this has been fixed.

Reference: immich-app/immich#2586