
Text-to-Image Search with Milvus

Text-to-image search is an advanced technology that allows users to search for images using natural language text descriptions. It leverages a pretrained multimodal model to convert both text and images into embeddings in a shared semantic space, enabling similarity-based comparisons.

In this tutorial, we will explore how to implement text-based image retrieval using OpenAI’s CLIP (Contrastive Language-Image Pretraining) model and Milvus. We will generate image embeddings with CLIP, store them in Milvus, and perform efficient similarity searches.

Prerequisites

Before you start, make sure you have all the required packages and example data ready.

Install dependencies

  • pymilvus>=2.4.2 for interacting with the Milvus database
  • clip for working with the CLIP model
  • pillow for image processing and visualization
$ pip install --upgrade pymilvus pillow
$ pip install git+https://github.com/openai/CLIP.git

If you’re using Google Colab, you may need to restart the runtime: navigate to the “Runtime” menu at the top of the interface and select “Restart session” from the dropdown menu.

Download example data

We will use a subset of the ImageNet dataset (100 classes, 10 images for each class) as example images. The following command will download the example data and extract it to the local folder ./images_folder:

$ wget https://github.com/towhee-io/examples/releases/download/data/reverse_image_search.zip
$ unzip -q reverse_image_search.zip -d images_folder

Set up Milvus

Before proceeding, set up your Milvus server and connect using your URI (and optionally, a token):

  • Milvus Lite (Recommended for Convenience): Set the URI to a local file, such as ./milvus.db. This automatically leverages Milvus Lite to store all data in a single file.

  • Docker or Kubernetes (For Large-Scale Data): For handling larger datasets, deploy a more performant Milvus server using Docker or Kubernetes. In this case, use the server URI, such as http://localhost:19530, to connect.

  • Zilliz Cloud (Managed Service): If you’re using Zilliz Cloud, Milvus’s fully managed cloud service, set the Public Endpoint as the URI and the API Key as the token.

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus.db")
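
For reference, only the client arguments change for the other deployment options. A minimal sketch with placeholder endpoint and token values (substitute your own):

# Docker or Kubernetes deployment (placeholder URI)
# milvus_client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud (placeholder endpoint and API key)
# milvus_client = MilvusClient(
#     uri="https://<your-public-endpoint>",
#     token="<your-api-key>",
# )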

Getting Started

Now that you have the necessary dependencies and data, it’s time to set up feature extractors and start working with Milvus. This section will walk you through the key steps of building a text-to-image search system. Finally, we’ll demonstrate how to retrieve and visualize images based on text queries.

Define feature extractors

We will use a pretrained CLIP model to generate image and text embeddings. In this section, we load the pretrained ViT-B/32 variant of CLIP and define helper functions for encoding images and text:

  • encode_image(image_path): Processes an image and encodes it into a feature vector
  • encode_text(text): Encodes a text query into a feature vector

Both functions normalize the output features to unit length, which keeps comparisons consistent: for unit vectors, the inner product between two embeddings equals their cosine similarity.

import clip
import torch
from PIL import Image


# Load CLIP model
model_name = "ViT-B/32"
model, preprocess = clip.load(model_name)
model.eval()


# Define a function to encode images
def encode_image(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():  # inference only; no gradients needed
        image_features = model.encode_image(image)
    image_features /= image_features.norm(
        dim=-1, keepdim=True
    )  # Normalize the image features
    return image_features.squeeze().tolist()


# Define a function to encode text
def encode_text(text):
    text_tokens = clip.tokenize(text)
    with torch.no_grad():  # inference only; no gradients needed
        text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(
        dim=-1, keepdim=True
    )  # Normalize the text features
    return text_features.squeeze().tolist()
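
Because the embeddings are unit-normalized, the similarity between a text embedding and an image embedding is simply their dot product. A quick sanity check, using an arbitrary image picked from the example data (the query text here is only illustrative):

from glob import glob

# Grab any image from the downloaded example data
sample_image = glob("./images_folder/train/**/*.JPEG", recursive=True)[0]

img_vec = encode_image(sample_image)
txt_vec = encode_text("a photo of an animal")

# For unit-length vectors, the dot product equals the cosine similarity
similarity = sum(a * b for a, b in zip(img_vec, txt_vec))
print(f"Cosine similarity: {similarity:.4f}")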

Data Ingestion

To enable semantic image search, we first need to generate embeddings for all images and store them in a vector database for efficient indexing and retrieval. This section provides a step-by-step guide to ingesting image data into Milvus.

1. Create Milvus Collection

Before storing image embeddings, you need to create a Milvus collection. The following code demonstrates how to create a collection in quick-setup mode with the default COSINE metric type. The collection includes the following fields:

  • id: A primary field with auto ID enabled.

  • vector: A field for storing floating-point vector embeddings.

If you need a custom schema, refer to the Milvus documentation for detailed instructions.

collection_name = "image_collection"

# Drop the collection if it already exists
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

# Create a new collection in quickstart mode
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=512,  # this should match the dimension of the image embedding
    auto_id=True,  # auto generate id and store in the id field
    enable_dynamic_field=True,  # enable dynamic field for scalar fields
)
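
If you want explicit control over the schema and index instead of quick setup, pymilvus also lets you define them yourself. A minimal sketch of a roughly equivalent explicit setup, intended to replace the quick-setup call above rather than run alongside it (field names and parameters mirror the collection created above):

from pymilvus import DataType

# Define the schema explicitly: an auto-generated primary key and a 512-d vector field
schema = MilvusClient.create_schema(auto_id=True, enable_dynamic_field=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=512)

# Build an index on the vector field with the COSINE metric
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE")

# Use this in place of the quick-setup create_collection call above
# milvus_client.create_collection(
#     collection_name=collection_name,
#     schema=schema,
#     index_params=index_params,
# )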

2. Insert Data into Milvus

In this step, we use the encode_image function defined above to generate embeddings for all JPEG images in the example data directory. These embeddings are then inserted into the Milvus collection, along with their corresponding file paths. Each entry in the collection consists of:

  • Embedding vector: The numerical representation of the image. Stored in the field vector.
  • File path: The location of the image file for reference. Stored in the field filepath as a dynamic field.
import os
from glob import glob


image_dir = "./images_folder/train"
raw_data = []

for image_path in glob(os.path.join(image_dir, "**/*.JPEG"), recursive=True):
    image_embedding = encode_image(image_path)
    image_dict = {"vector": image_embedding, "filepath": image_path}
    raw_data.append(image_dict)
insert_result = milvus_client.insert(collection_name=collection_name, data=raw_data)

print("Inserted", insert_result["insert_count"], "images into Milvus.")
Inserted 1000 images into Milvus.
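
The example dataset is small enough to encode and insert in a single call. For larger datasets, it is usually safer to insert in batches so that no single request grows too large; a sketch of that pattern (the batch size of 100 is an arbitrary choice):

batch_size = 100  # arbitrary; tune for your data and deployment
for start in range(0, len(raw_data), batch_size):
    batch = raw_data[start : start + batch_size]
    milvus_client.insert(collection_name=collection_name, data=batch)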

Now, let’s run a search using an example text query. This will retrieve the most relevant images based on their semantic similarity to the given text description.

query_text = "a white dog"
query_embedding = encode_text(query_text)

search_results = milvus_client.search(
    collection_name=collection_name,
    data=[query_embedding],
    limit=10,  # return top 10 results
    output_fields=["filepath"],  # return the filepath field
)
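
Each element of search_results is a list of hits; every hit carries its similarity score in distance and the requested output fields under entity. A quick way to inspect the raw results before visualizing them:

for result in search_results:
    for hit in result:
        # With the COSINE metric, a larger distance means a closer match
        print(f"{hit['distance']:.4f}  {hit['entity']['filepath']}")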

Visualize results:

from IPython.display import display


width = 150 * 5
height = 150 * 2
concatenated_image = Image.new("RGB", (width, height))

result_images = []
for result in search_results:
    for hit in result:
        filename = hit["entity"]["filepath"]
        img = Image.open(filename)
        img = img.resize((150, 150))
        result_images.append(img)

for idx, img in enumerate(result_images):
    x = idx % 5
    y = idx // 5
    concatenated_image.paste(img, (x * 150, y * 150))
print(f"Query text: {query_text}")
print("\nSearch results:")
display(concatenated_image)
Query text: a white dog

Search results:

[Output image: a 5 × 2 grid of the ten retrieved images]
