[Feature]: Free text image search using CLIP features #559

Closed
opened 2026-02-04 21:21:14 +03:00 by OVERLORD · 6 comments

Originally created by @TheStealthReporter on GitHub (Jan 8, 2023).

Feature detail

I've seen that work is currently being done in Immich to implement image search. If this search system is based on "fixed" tags/labels, it might be worth looking into [CLIP](https://github.com/openai/CLIP) embeddings. I tried the CLIP embedding approach on my photo collection and it was __vastly__ superior at retrieving images compared to any class-output-based neural network (like one trained on the 1000 ImageNet classes) that I tried.

How it works

The idea behind the embeddings is that two different neural networks transform their input into a common "semantic" space, where related concepts are positioned close together:

  • text -> CLIP embedding space
  • images -> CLIP embedding space

The CLIP embeddings for the photos can be precomputed once. The "text -> CLIP embedding" model has to be run every time the user enters a search query. Through a standard nearest-neighbor search inside the CLIP space we can then retrieve the photos most related to a given search query.
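For readers unfamiliar with the approach, here is a minimal sketch of both encoders and the similarity step, using the sentence-transformers CLIP wrapper that the script further down also uses (the image path and the query are placeholders):

```
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Image side: computed once per photo and stored.
img_emb = model.encode(Image.open('photo.jpg'))

# Text side: computed for every search query.
text_emb = model.encode('a person wearing a hat next to a dog')

# Cosine similarity in the shared CLIP space; higher means more related.
print(util.cos_sim(img_emb, text_emb))
```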

This idea has been discussed for [PhotoPrism](https://github.com/photoprism/photoprism/issues/1287) before. The code I used on my photo collection was derived from the example given there (a minimal example is also provided [here](https://huggingface.co/sentence-transformers/clip-ViT-B-32)).

Advantage compared to class-based image search

The advantage of this approach is that you can also successfully search for more complicated queries like "three people" or "a person wearing a hat next to a dog". With queries like these, I was able to find any specific photo within the top five nearest-neighbor results in a database of 5000 images, usually with the first query that came to mind.

If you were already aware of this approach feel free to close this issue (I haven't seen it discussed on this repo before though) - I'm just hoping to spread awareness about it.

Platform

Server


@jrasm91 commented on GitHub (Jan 8, 2023):

So you use an existing model, encode each image (convert it to CLIP space) and save the result as a binary file, then for queries you encode the query, load the binary file, and do a nearest-neighbor search? Am I understanding that correctly?

Do you know how long it takes to encode 5000-ish pictures? Or how big the binary file is in relation to image count? This looks really interesting and potentially a better approach than the image classification we're doing now.

I assume we would index new files as they're uploaded. Is it possible to remove an image from the index as well?
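(For illustration only: if the embeddings are stored keyed by file path, adding and removing images amounts to inserting and deleting entries; nothing has to be retrained. The layout below is a hypothetical sketch, not part of the proposal.)

```
import numpy as np

# Hypothetical index layout: file path -> 512-dim CLIP embedding
index: dict[str, np.ndarray] = {}

def add_image(path: str, embedding: np.ndarray) -> None:
    index[path] = embedding

def remove_image(path: str) -> None:
    index.pop(path, None)  # removing only drops the entry; no retraining needed
```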


@TheStealthReporter commented on GitHub (Jan 8, 2023):

Yes, the pre-trained model clip-ViT-B-32 (~600 MB, if I remember correctly) is what I used in my experiments. Each embedding is a 512-dimensional vector. The "database" file for my 5000 photos has a final size of 11 MB.
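(As a rough sanity check: 5000 images × 512 dimensions × 4 bytes per float32 ≈ 10 MB, which lines up with the reported 11 MB once pickle overhead and the file names are included.)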

Running the "image -> CLIP space" model on my Ryzen 7 5800X CPU (single-threaded only) took about 30-45 minutes for the 5000 photos, so a bit less than 1 s per image.

I've tried a bit to get multi-threading working but didn't manage it on my first attempts. I'm not sure how complicated it is to apply the model to multiple images concurrently/multi-threaded in Python (without loading the model separately in each thread)...
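(A batched, single-process alternative is sketched here, assuming sentence-transformers is used as in the script below: encode() accepts a list of images and batches it internally, which may already improve throughput without any explicit threading. Untested at scale; chunk and batch sizes are placeholders.)

```
from PIL import Image, ImageOps
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

def encode_in_chunks(paths, chunk_size=64):
    # Encode chunk by chunk so not all images have to be held in memory at once.
    embeddings = []
    for start in range(0, len(paths), chunk_size):
        imgs = [ImageOps.exif_transpose(Image.open(p))
                for p in paths[start:start + chunk_size]]
        # encode() batches the list internally; batch_size is a tuning knob.
        embeddings.extend(model.encode(imgs, batch_size=32))
    return embeddings
```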

I'd advise using a spatial acceleration structure for (approximate) nearest-neighbor search. For 5000 photos it's fine to iterate over all of them, but for larger image databases we'd probably want sub-linear (e.g. logarithmic) lookups. In a [pull request for PhotoPrism](https://github.com/photoprism/photoprism/pull/2005), the [qdrant](https://hub.docker.com/r/qdrant/qdrant/) database was proposed; I've also stumbled upon the [FAISS](https://github.com/facebookresearch/faiss) library for this purpose. I don't know these libraries myself - this is just what others have used - so an investigation into which nearest-neighbor databases exist might be necessary.
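(A minimal FAISS sketch for illustration, assuming the embeddings are L2-normalized so that inner product equals cosine similarity; variable names follow the script below.)

```
import faiss
import numpy as np

d = 512  # clip-ViT-B-32 embedding dimension

emb = np.asarray(img_emb, dtype='float32')
faiss.normalize_L2(emb)              # in-place; makes inner product equal cosine similarity

index = faiss.IndexFlatIP(d)         # exact search; an approximate index could replace it at scale
index.add(emb)

q = np.asarray(text_emb, dtype='float32')   # shape (1, 512), from model.encode([query])
faiss.normalize_L2(q)
scores, ids = index.search(q, 5)     # the five nearest photos for the query
```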

Loading the model takes a few seconds, so the script should not be used as-is but rather treated as a starting point for experimentation.

Here is my code, which also visualizes the results:

```
from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps
from matplotlib import pyplot as plt
import glob
import pickle
import sys

import numpy as np
import torch

yourimageglob = '/home/user/Pictures/Camera/*.jpg'


if __name__ == "__main__":
    # First, we load the respective CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    # model = SentenceTransformer('clip-ViT-L-14')

    emb_file = 'pretrained_embeddings.pkl'

    try:
        # Reuse cached embeddings if they were computed on a previous run
        with open(emb_file, 'rb') as fIn:
            img_names, img_emb = pickle.load(fIn)
    except FileNotFoundError:
        # Otherwise compute the embeddings once and cache them
        img_names = list(glob.glob(yourimageglob))[:5000]

        def compute_embedding(i, img_name):
            print("analyze {}/{} {}".format(i + 1, len(img_names), img_name))
            img = Image.open(img_name)
            img = ImageOps.exif_transpose(img)  # respect the EXIF orientation
            emb = model.encode(img, device='cpu')
            img.close()
            return emb

        img_emb = [compute_embedding(i, name) for i, name in enumerate(img_names)]
        img_emb = torch.tensor(np.array(img_emb))

        # Cache the (file names, embeddings) pair for the next run
        with open(emb_file, 'wb') as fOut:
            pickle.dump((img_names, img_emb), fOut)

    # Encode the free-text query into the same CLIP space
    query = sys.argv[1]
    text_emb = model.encode([query])

    # Brute-force cosine similarity against every image embedding
    scores = [util.cos_sim(img_em, text_emb).tolist()[0][0] * 100 for img_em in img_emb]

    # Sort by similarity (best match first) and print all results
    scored_imgs = sorted(zip(img_names, scores), key=lambda v: v[1], reverse=True)
    for img_name, score in reversed(scored_imgs):
        print("{:.2f} {}".format(score, img_name))

    # Show the nine best matches in a 3x3 grid, titled with their scores
    fig = plt.figure(figsize=(10, 7))
    for i, (img_name, score) in enumerate(scored_imgs[:9]):
        fig.add_subplot(3, 3, i + 1)
        img = ImageOps.exif_transpose(Image.open(img_name))
        plt.imshow(img)
        plt.axis('off')
        plt.title("{:.2f}".format(score))

    plt.show()
```

@bo0tzz commented on GitHub (Jan 8, 2023):

Very cool stuff! For the nearest-neighbour search, do you know whether it'd be possible to use Postgres for that somehow? That way we wouldn't need to add another stateful container :)


@TheStealthReporter commented on GitHub (Jan 8, 2023):

I'm not familiar with Postgres, but as far as I can tell it doesn't support this natively. A quick Google search for "postgresql high dimensional nearest neighbor search extension" turns up the Postgres extension [PASE](https://github.com/alipay/PASE) ([paper](https://dl.acm.org/doi/abs/10.1145/3318464.3386131)). How easy it would be to include compared to qdrant, I don't know.


@yowmamasita commented on GitHub (Feb 8, 2023):

Saw this posted on HN https://mazzzystar.github.io/2022/12/29/Run-CLIP-on-iPhone-to-Search-Photos/


@jrasm91 commented on GitHub (Feb 8, 2023):

It's funny - we were just talking about this internally 👍

Reference: immich-app/immich#559