[BUG] Unable to run after initial batch upload - High IOWait #378

Closed
opened 2026-02-04 20:08:29 +03:00 by OVERLORD · 14 comments

Originally created by @joeShuff on GitHub (Oct 24, 2022).

Describe the bug
I am planning to migrate over from Google Photos, so I got the container spun up before my partner and I received our Google Takeout zips, and it was all fine. I tested app backup and bulk backup via the CLI, and all was good.

Once I received our Takeout zips I unpacked them, reorganized them, and got them ready to bulk upload via the CLI tool. It took a long time, which I expected, but it seemed to finish fine.

After I uploaded everything, the server started to become unresponsive. I managed to get a glimpse of netdata while it was struggling and could see the IOWait graph was huge and not going down. Once I stopped the Immich stack, it dropped back down to next to nothing (the usual level).

While it was lagging I managed to find out a few things:

  • The high IOWait was on my boot drive where the Docker volumes live, NOT the drive my media is stored on and mounted via the `UPLOAD_LOCATION` env variable.
  • The likely task causing the issue was simply displayed as "Node" in the `iotop` command, and every time I saw it, it had a different task ID, so maybe some recurring task? (Something like the sketch below can help narrow that down.)
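
A rough way to pin down which process is generating the IO (a sketch; `iotop` flags are per its man page, and `pidstat` comes from the sysstat package):

```
# Show only processes currently doing IO, per process rather than per thread,
# with accumulated totals so short-lived tasks still show up
sudo iotop -o -P -a

# Per-process disk read/write statistics, sampled every 5 seconds
pidstat -d 5
```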

docker-compose

```
version: "3.8"

services:
  immich-server:
    image: altran1502/immich-server:release
    entrypoint: ["/bin/sh", "./start-server.sh"]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - redis
      - database
    restart: always

  immich-microservices:
    image: altran1502/immich-server:release
    entrypoint: ["/bin/sh", "./start-microservices.sh"]
    deploy:
      resources:
        limits:
          cpus: 1.5
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    image: altran1502/immich-machine-learning:release
    entrypoint: ["/bin/sh", "./entrypoint.sh"]
    deploy:
      resources:
        limits:
          cpus: 1
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - database
    restart: always

  immich-web:
    image: altran1502/immich-web:release
    entrypoint: ["/bin/sh", "./entrypoint.sh"]
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      PG_DATA: /var/lib/postgresql/data
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: altran1502/immich-proxy:release
    ports:
      - 2283:8080
    logging:
      driver: none
    depends_on:
      - immich-server
    restart: always

volumes:
  pgdata:
```

redacted .env

```
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=immich_redis
UPLOAD_LOCATION=/mnt/hdd/media/Photos
LOG_LEVEL=simple
JWT_SECRET=<secret>
```

Task List

  • [x] I have thoroughly read the README setup and installation instructions.
  • [x] I have included my `docker-compose` file.
  • [x] I have included my redacted `.env` file.
  • [x] I have included information on my machine and environment.

System

  • Phone OS [iOS, Android]: `Android 11`
  • Server Version: I don't want to start the container to find out, but it's the latest as of now, so I'd guess `1.32.1`
  • Mobile App Version: `1.32.0`

@alextran1502 commented on GitHub (Oct 24, 2022):

After you upload all your assets, the server will handle all the heavy-lifting tasks of video encoding, thumbnail generation, and machine learning. Depending on the number of assets you sent to the server, this can take some time.

@joeShuff commented on GitHub (Oct 24, 2022):

Do you have any info on which container does what? I tried limiting some containers' CPU usage because Immich is not the only thing I have running on this machine, so having to share out the resources is definitely something I would need to do.

@alextran1502 commented on GitHub (Oct 24, 2022):

Yes, here is the info

Server

  • Handle I/O

Microservice

  • JPEG thumbnail generation
  • WEBP thumbnail generation
  • H265 Encoding
  • Reverse Geocoding
  • Metadata/EXIF extraction

Machine Learning

  • Object detection
  • Image Classification

I think you will want to limit the CPU on the `microservice` and `machine-learning` containers
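
On the fly, that can also be done without editing the compose file, roughly like this (a sketch; the container names here are assumptions, since compose-generated names depend on the project directory, so check `docker ps` first):

```
# Cap the CPU available to the worker containers at runtime
# (names are hypothetical; confirm with `docker ps`)
docker update --cpus 1.5 immich_immich-microservices_1
docker update --cpus 1.0 immich_immich-machine-learning_1
```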

@bo0tzz commented on GitHub (Oct 24, 2022):

> The high IOWait was on my boot drive where the Docker volumes live, NOT the drive my media is stored on and mounted via the `UPLOAD_LOCATION` env variable.

@alextran1502 Do you know whether uploads go straight to the library folder, or are they stored in a temporary location first?

@alextran1502 commented on GitHub (Oct 24, 2022):

> > The high IOWait was on my boot drive where the Docker volumes live, NOT the drive my media is stored on and mounted via the `UPLOAD_LOCATION` env variable.
> >
> > @alextran1502 Do you know whether uploads go straight to the library folder, or are they stored in a temporary location first?

From my understanding it goes straight to the library folder, though I am not sure whether Nginx interferes here somehow.

@joeShuff commented on GitHub (Oct 24, 2022):

> I think you will want to limit the CPU on the `microservice` and `machine-learning` containers

These are the two I've already limited, to no effect unfortunately. The IOWait processing power doesn't seem to count as the containers' CPU; I have a Docker monitoring container, and when the server was struggling, none of the containers were showing excessive usage. I might have to schedule a time to turn Immich on and let it process.

Is the H265 encoding in place to optimise storage space? If so, would you consider the possibility of disabling it via an environment variable? I'm just thinking that processing power is more important to me now than storage space. The abundance of storage space is why I would like to self-host a photos solution.

@alextran1502 commented on GitHub (Oct 24, 2022):

> > I think you will want to limit the CPU on the `microservice` and `machine-learning` containers
>
> These are the two I've already limited, to no effect unfortunately. The IOWait processing power doesn't seem to count as the containers' CPU; I have a Docker monitoring container, and when the server was struggling, none of the containers were showing excessive usage. I might have to schedule a time to turn Immich on and let it process.
>
> Is the H265 encoding in place to optimise storage space? If so, would you consider the possibility of disabling it via an environment variable? I'm just thinking that processing power is more important to me now than storage space. The abundance of storage space is why I would like to self-host a photos solution.

The H265 encoding is for showing videos on the web; it mostly happens with iOS MOV video files. If it is running, the workload will show up mainly as CPU usage in the `microservice` container. I am not sure why you are running into a high IO load here. Are you running your server on an SSD or an HDD?
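
For context, that kind of transcode looks roughly like the following (a generic sketch, not necessarily Immich's exact invocation or flags):

```
# Re-encode a MOV file to H265 for web playback (illustrative flags only)
ffmpeg -i input.mov -c:v libx265 -crf 28 -c:a copy output.mp4
```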

How many assets did you upload from the Google Takeout?

Does the high IO load happen after you restart the stack as well?

@joeShuff commented on GitHub (Oct 25, 2022):

> Are you running your server on an SSD or an HDD?

The server's boot drive is an NVMe SSD, which is the one I was seeing the high IO load on, but the storage is on a 12TB hard disk connected via USB3 and auto-mounted. Is it possible that the high IO is the cache of the transcoding, or does all transcoding happen on the drive the target media is on?

> How many assets did you upload from the Google Takeout?

It was somewhere in the area of 50k-60k, the majority images.

> Does the high IO load happen after you restart the stack as well?

I did a system reboot a couple of times thinking it was something just misbehaving, as I've seen that before, but I've not tried just starting the stack from a calm system.

Are all of the containers dependent on one another, or would there be some way of starting them one by one and monitoring the system stats?

@bo0tzz commented on GitHub (Oct 25, 2022):

When uploading assets, they first go to the `immich-server` container. That places them in the library folder and then adds a task to the redis queue for processing. All of that processing is handled by `microservices`, which also calls out to `machine-learning` (if it is available). So with just postgres, redis, and `microservices` started, that's where I would expect to see the load.

You could also try clearing the state from redis (deleting and recreating the container should work). That would empty the queue, after which there should be no more load from Immich. You can restart the EXIF processing job from the Administration -> Jobs settings in Immich.

As far as I can tell, transcoding shouldn't be using any temporary files - it should go straight from the original file to the output.
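
Recreating the redis container could look roughly like this (a sketch using the service name from the compose file above; since the service has no volume, recreating it discards the queued jobs):

```
# Stop and force-remove the redis container, then recreate it empty
docker-compose rm -sf redis
docker-compose up -d redis
```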

@joeShuff commented on GitHub (Oct 25, 2022):

I'll give that a go later, thanks for the information.

@joeShuff commented on GitHub (Oct 26, 2022):

So I started the containers again last night, one by one. I started with postgres, redis, and microservices; it was stable for about 15 minutes, then I started the server and web too, and it was still stable. So I took the time to sync my partner's phone to the server, and all was fine. I saw high CPU usage from microservices, which would've been the importing, but it was under the `user` usage so that's fine; the Docker CPU limiting did its job.

Then I started the machine learning container and saw the number of outstanding images go down by about 2k every second, which seemed like a lot, but then it slowed down to about 1 every 2 seconds and the IOWait usage shot up to around 60%. I turned off the machine learning container and it dropped back down, so I've left it off for now.

Is this anything I should assist with and investigate further, or are you happy that it's an intensive process and is behaving as you'd expect?
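
For anyone retracing this, the staged startup can be reproduced with compose by naming services explicitly (a sketch; service names follow the compose file earlier in this issue):

```
# Start the backing services and the worker first
docker-compose up -d database redis immich-microservices

# ...watch IOWait/CPU for a while, then bring up the rest
docker-compose up -d immich-server immich-web

# Finally the machine-learning container, monitoring as it churns
docker-compose up -d immich-machine-learning
docker stats
```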

@alextran1502 commented on GitHub (Oct 29, 2022):

> saw the number of outstanding images go down by about 2k every second, which seemed like a lot

What do you mean by *outstanding images*?

@joeShuff commented on GitHub (Oct 30, 2022):

> What do you mean by *outstanding images*?

Sorry, I should've used the term shown on the front end to avoid confusion. I mean the number in the "Waiting" column of the "Jobs" section in the Administration part of the front end.

@alextran1502 commented on GitHub (Dec 27, 2022):

> So I started the containers again last night, one by one. I started with postgres, redis, and microservices; it was stable for about 15 minutes, then I started the server and web too, and it was still stable. So I took the time to sync my partner's phone to the server, and all was fine. I saw high CPU usage from microservices, which would've been the importing, but it was under the `user` usage so that's fine; the Docker CPU limiting did its job.
>
> Then I started the machine learning container and saw the number of outstanding images go down by about 2k every second, which seemed like a lot, but then it slowed down to about 1 every 2 seconds and the IOWait usage shot up to around 60%. I turned off the machine learning container and it dropped back down, so I've left it off for now.
>
> Is this anything I should assist with and investigate further, or are you happy that it's an intensive process and is behaving as you'd expect?

After re-reading the issue, I think that if you limit the CPU usage of the machine-learning container, it will help with the IO wait as well, since performing inference tasks for machine learning is very CPU intensive.

I will close this issue for now, as it is the expected behavior of the application. Please feel free to reopen it if you have additional problems or issues.

Reference: immich-app/immich#378