[Feature]: Free text image search using CLIP features #559

Closed
opened 2026-02-04 21:21:14 +03:00 by OVERLORD · 6 comments

Originally created by @TheStealthReporter on GitHub (Jan 8, 2023).

Feature detail

I've seen that work is currently being done in Immich to implement image search. If this search system is based on "fixed" tags/labels, it might be worth looking into [CLIP](https://github.com/openai/CLIP) embeddings. I tried the CLIP embedding approach on my photo collection and it was __vastly__ superior at retrieving images compared to any class-output-based neural network (like one trained on the 1000 ImageNet classes) that I tried.

How it works

The idea behind the embeddings is that two different neural networks transform their input into a common "semantic" space, where related concepts are positioned close together:

  • text -> CLIP embedding space
  • images -> CLIP embedding space

The CLIP embeddings for the photos can be precomputed once. The "text -> CLIP embedding" model has to be run every time the user enters a search query. Through a standard nearest-neighbor search inside the CLIP space we can then retrieve the photos most related to a given search query.
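For readers unfamiliar with the approach, here is a minimal sketch of both encoders and the similarity step, using the sentence-transformers CLIP wrapper that the script further down also uses (the image path and the query are placeholders):

```
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Image side: computed once per photo and stored.
img_emb = model.encode(Image.open('photo.jpg'))

# Text side: computed for every search query.
text_emb = model.encode('a person wearing a hat next to a dog')

# Cosine similarity in the shared CLIP space; higher means more related.
print(util.cos_sim(img_emb, text_emb))
```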

This idea has been discussed for [PhotoPrism](https://github.com/photoprism/photoprism/issues/1287) before. The code I used on my photo collection was derived from the example given there (a minimal example is also provided [here](https://huggingface.co/sentence-transformers/clip-ViT-B-32)).

Advantage compared to class-based image search

The advantage of this approach is that you can also successfully search for more complicated queries like "three people" or "a person wearing a hat next to a dog". With queries like these, I was able to find any specific photo within the top five nearest-neighbor results in a database of 5000 images, usually with the first query that came to mind.

If you were already aware of this approach feel free to close this issue (I haven't seen it discussed on this repo before though) - I'm just hoping to spread awareness about it.

Platform

Server


@jrasm91 commented on GitHub (Jan 8, 2023):

So you use an existing model, encode each image (convert it to CLIP space) and save the result as a binary file, then for queries you encode the query, load the binary file, and do a nearest-neighbor search? Am I understanding that correctly?

Do you know how long it takes to encode 5000-ish pictures? Or how big the binary file is in relation to image count? This looks really interesting and potentially a better approach than the image classification we're doing now.

I assume we would index new files as they're uploaded. Is it possible to remove an image from the index as well?
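(For illustration only: if the embeddings are stored keyed by file path, adding and removing images amounts to inserting and deleting entries; nothing has to be retrained. The layout below is a hypothetical sketch, not part of the proposal.)

```
import numpy as np

# Hypothetical index layout: file path -> 512-dim CLIP embedding
index: dict[str, np.ndarray] = {}

def add_image(path: str, embedding: np.ndarray) -> None:
    index[path] = embedding

def remove_image(path: str) -> None:
    index.pop(path, None)  # removing only drops the entry; no retraining needed
```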


@TheStealthReporter commented on GitHub (Jan 8, 2023):

Yes, the pre-trained model clip-ViT-B-32 (~600 MB, if I remember correctly) is what I used in my experiments. Each embedding is a 512-dimensional vector. The "database" file for my 5000 photos has a final size of 11 MB.
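(As a rough sanity check: 5000 images × 512 dimensions × 4 bytes per float32 ≈ 10 MB, which lines up with the reported 11 MB once pickle overhead and the file names are included.)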

Running the "image -> CLIP space" model on my Ryzen 7 5800X CPU (single-threaded only) took about 30-45 minutes for the 5000 photos, so a bit less than 1 s per image.

I've tried a bit to get multi-threading working but didn't manage it on my first attempts. I'm not sure how complicated it is to apply the model to multiple images concurrently/multi-threaded in Python (without loading the model separately in each thread)...
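(A batched, single-process alternative is sketched here, assuming sentence-transformers is used as in the script below: encode() accepts a list of images and batches it internally, which may already improve throughput without any explicit threading. Untested at scale; chunk and batch sizes are placeholders.)

```
from PIL import Image, ImageOps
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

def encode_in_chunks(paths, chunk_size=64):
    # Encode chunk by chunk so not all images have to be held in memory at once.
    embeddings = []
    for start in range(0, len(paths), chunk_size):
        imgs = [ImageOps.exif_transpose(Image.open(p))
                for p in paths[start:start + chunk_size]]
        # encode() batches the list internally; batch_size is a tuning knob.
        embeddings.extend(model.encode(imgs, batch_size=32))
    return embeddings
```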

I'd advise using a spatial acceleration structure for (approximate) nearest-neighbor search. For 5000 photos it's fine to iterate over all of them, but for larger image databases we'd probably want sub-linear (e.g. logarithmic) lookups. In a [pull request for PhotoPrism](https://github.com/photoprism/photoprism/pull/2005), the [qdrant](https://hub.docker.com/r/qdrant/qdrant/) database was proposed; I've also stumbled upon the [FAISS](https://github.com/facebookresearch/faiss) library for this purpose. I don't know these libraries myself - this is just what others have used - so an investigation into which nearest-neighbor databases exist might be necessary.
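(A minimal FAISS sketch for illustration, assuming the embeddings are L2-normalized so that inner product equals cosine similarity; variable names follow the script below.)

```
import faiss
import numpy as np

d = 512  # clip-ViT-B-32 embedding dimension

emb = np.asarray(img_emb, dtype='float32')
faiss.normalize_L2(emb)              # in-place; makes inner product equal cosine similarity

index = faiss.IndexFlatIP(d)         # exact search; an approximate index could replace it at scale
index.add(emb)

q = np.asarray(text_emb, dtype='float32')   # shape (1, 512), from model.encode([query])
faiss.normalize_L2(q)
scores, ids = index.search(q, 5)     # the five nearest photos for the query
```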

Loading the model takes a few seconds, so the script should not be used as-is but rather treated as a starting point for experimentation.

Here is my code, which also visualizes the results:

```
from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps
from matplotlib import pyplot as plt
import glob
import pickle
import sys

import numpy as np
import torch

yourimageglob = '/home/user/Pictures/Camera/*.jpg'


if __name__ == "__main__":
    # First, we load the respective CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    # model = SentenceTransformer('clip-ViT-L-14')

    emb_file = 'pretrained_embeddings.pkl'

    try:
        # Reuse cached embeddings if they were computed on a previous run
        with open(emb_file, 'rb') as fIn:
            img_names, img_emb = pickle.load(fIn)
    except FileNotFoundError:
        # Otherwise compute the embeddings once and cache them
        img_names = list(glob.glob(yourimageglob))[:5000]

        def compute_embedding(i, img_name):
            print("analyze {}/{} {}".format(i + 1, len(img_names), img_name))
            img = Image.open(img_name)
            img = ImageOps.exif_transpose(img)  # respect the EXIF orientation
            emb = model.encode(img, device='cpu')
            img.close()
            return emb

        img_emb = [compute_embedding(i, name) for i, name in enumerate(img_names)]
        img_emb = torch.tensor(np.array(img_emb))

        # Cache the (file names, embeddings) pair for the next run
        with open(emb_file, 'wb') as fOut:
            pickle.dump((img_names, img_emb), fOut)

    # Encode the free-text query into the same CLIP space
    query = sys.argv[1]
    text_emb = model.encode([query])

    # Brute-force cosine similarity against every image embedding
    scores = [util.cos_sim(img_em, text_emb).tolist()[0][0] * 100 for img_em in img_emb]

    # Sort by similarity (best match first) and print all results
    scored_imgs = sorted(zip(img_names, scores), key=lambda v: v[1], reverse=True)
    for img_name, score in reversed(scored_imgs):
        print("{:.2f} {}".format(score, img_name))

    # Show the nine best matches in a 3x3 grid, titled with their scores
    fig = plt.figure(figsize=(10, 7))
    for i, (img_name, score) in enumerate(scored_imgs[:9]):
        fig.add_subplot(3, 3, i + 1)
        img = ImageOps.exif_transpose(Image.open(img_name))
        plt.imshow(img)
        plt.axis('off')
        plt.title("{:.2f}".format(score))

    plt.show()
```

@bo0tzz commented on GitHub (Jan 8, 2023):

Very cool stuff! For the nearest-neighbour search, do you know whether it'd be possible to use Postgres for that somehow? That way we wouldn't need to add another stateful container :)


@TheStealthReporter commented on GitHub (Jan 8, 2023):

I'm not familiar with Postgres, but as far as I can tell it doesn't support this natively. A quick Google search for "postgresql high dimensional nearest neighbor search extension" turns up the Postgres extension [PASE](https://github.com/alipay/PASE) ([paper](https://dl.acm.org/doi/abs/10.1145/3318464.3386131)). How easy it would be to include compared to qdrant, I don't know.


@yowmamasita commented on GitHub (Feb 8, 2023):

Saw this posted on HN https://mazzzystar.github.io/2022/12/29/Run-CLIP-on-iPhone-to-Search-Photos/


@jrasm91 commented on GitHub (Feb 8, 2023):

It's funny - we were just talking about this internally 👍

Reference: immich-app/immich#559