Efficient Data Loading into Milvus with VectorETL

In this tutorial, we’ll explore how to efficiently load data into Milvus using VectorETL, a lightweight ETL framework designed for vector databases. VectorETL simplifies the process of extracting data from various sources, transforming it into vector embeddings using AI models, and storing it in Milvus for fast and scalable retrieval. By the end of this tutorial, you’ll have a working ETL pipeline that allows you to integrate and manage vector search systems with ease. Let’s dive in!

Preparation

Dependency and Environment

$ pip install --upgrade vector-etl pymilvus

If you are using Google Colab, you may need to restart the runtime so that the newly installed dependencies take effect (open the “Runtime” menu at the top of the screen and select “Restart session” from the dropdown menu).

VectorETL supports multiple data sources, including Amazon S3, Google Cloud Storage, and local files. The full list of supported sources is given in the VectorETL documentation. In this tutorial, we’ll use Amazon S3 as the example data source.

We will load documents from Amazon S3, so you need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables to securely access your S3 bucket. We will also use OpenAI’s text-embedding-ada-002 model to generate embeddings for the data, so you should set your API key as the OPENAI_API_KEY environment variable as well.

import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["AWS_ACCESS_KEY_ID"] = "your-aws-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-aws-secret-access-key"
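Before running the pipeline, it can help to confirm that all three variables are actually set, since a missing key otherwise tends to surface later as an authentication error. The following is a minimal sketch using only the standard library; the helper name missing_env_vars is hypothetical and is not part of VectorETL.

```python
import os

# The three variables this tutorial relies on.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Usage: fail early with a readable message.
# missing = missing_env_vars()
# if missing:
#     raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

In a real deployment you would normally set these variables outside the notebook or script rather than hard-coding the values.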

Workflow

Defining the Data Source (Amazon S3)

In this case, we are extracting documents from an Amazon S3 bucket. VectorETL allows us to specify the bucket name, the path to the files, and the type of data we are working with.

source = {
    "source_data_type": "Amazon S3",
    "bucket_name": "my-bucket",
    "key": "path/to/files/",
    "file_type": ".csv",
    "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
    "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
}

Configuring the Embedding Model (OpenAI)

Once we have our data source set up, we need to define the embedding model that will transform our textual data into vector embeddings. In this example, we use OpenAI’s text-embedding-ada-002.

embedding = {
    "embedding_model": "OpenAI",
    "api_key": os.environ["OPENAI_API_KEY"],
    "model_name": "text-embedding-ada-002",
}
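The embedding model also fixes the vector dimension that the Milvus collection must be created with (1536 for text-embedding-ada-002). As a small sketch, a lookup table keeps the two settings in sync. The helper vector_dim_for is a hypothetical name, and the table lists only the default output sizes of a few OpenAI models.

```python
# Default output dimensions of some OpenAI embedding models.
OPENAI_EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def vector_dim_for(model_name):
    """Return the embedding dimension for a known model, or raise ValueError."""
    try:
        return OPENAI_EMBEDDING_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


# Usage: vector_dim_for("text-embedding-ada-002") gives the value to use for
# "vector_dim" in the Milvus target configuration.
```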

Setting Up Milvus as the Target Database

We need to store the generated embeddings in Milvus. Here, we define our Milvus connection parameters using Milvus Lite.

target = {
    "target_database": "Milvus",
    "host": "./milvus.db",  # os.environ["ZILLIZ_CLOUD_PUBLIC_ENDPOINT"] if using Zilliz Cloud
    "api_key": "",  # os.environ["ZILLIZ_CLOUD_TOKEN"] if using Zilliz Cloud
    "collection_name": "my_collection",
    "vector_dim": 1536,  # 1536 for text-embedding-ada-002
}

For the host and api_key:

Setting host to a local file, e.g. ./milvus.db, and leaving api_key empty is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. http://localhost:19530, as your host and leave api_key empty.
If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, set host and api_key to the Public Endpoint and API key in Zilliz Cloud.
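The three options above differ only in the host and api_key values, so they can be sketched as one small helper. This is an illustrative sketch, not part of VectorETL: the function name milvus_connection, the mode labels, and the MILVUS_URI variable are assumptions, while ZILLIZ_CLOUD_PUBLIC_ENDPOINT and ZILLIZ_CLOUD_TOKEN follow the comments in the target configuration.

```python
import os


def milvus_connection(mode, environ=None):
    """Return the host/api_key pair for one of the three deployment options."""
    env = os.environ if environ is None else environ
    if mode == "lite":
        # Milvus Lite: all data is stored in a local file.
        return {"host": "./milvus.db", "api_key": ""}
    if mode == "server":
        # Self-hosted Milvus on Docker or Kubernetes (MILVUS_URI is hypothetical).
        return {"host": env.get("MILVUS_URI", "http://localhost:19530"), "api_key": ""}
    if mode == "zilliz":
        # Zilliz Cloud: Public Endpoint and API key.
        return {
            "host": env["ZILLIZ_CLOUD_PUBLIC_ENDPOINT"],
            "api_key": env["ZILLIZ_CLOUD_TOKEN"],
        }
    raise ValueError(f"Unknown mode: {mode!r}")


# Usage: target.update(milvus_connection("lite"))
```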

Specifying Columns for Embedding

Now, we need to specify which columns from our CSV files should be converted into embeddings. This ensures that only the relevant text fields are processed, which saves both embedding cost and storage.

embed_columns = ["col_1", "col_2", "col_3"]
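A misspelled column name is an easy mistake here, so it may be worth checking the chosen names against the header of a sample CSV file first. The sketch below uses only the standard library; missing_columns is a hypothetical helper, and it assumes the first row of the file is the header.

```python
import csv
import io


def missing_columns(csv_text, columns):
    """Return the requested columns that do not appear in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [col for col in columns if col not in present]


# Usage with a small in-memory sample:
# sample = "col_1,col_2\nhello,world\n"
# missing_columns(sample, ["col_1", "col_2", "col_3"])  # reports col_3 as missing
```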

Creating and Executing the VectorETL Pipeline

With all configurations in place, we now initialize the ETL pipeline, set up the data flow, and execute it.

from vector_etl import create_flow

flow = create_flow()
flow.set_source(source)
flow.set_embedding(embedding)
flow.set_target(target)
flow.set_embed_columns(embed_columns)

# Execute the flow
flow.execute()
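After the flow finishes, you may want to confirm that the collection exists and contains data. The following is a minimal sketch using pymilvus’s MilvusClient rather than VectorETL itself; collection_row_count is a hypothetical helper, and it assumes the same Milvus Lite file and collection name used in the target configuration.

```python
def collection_row_count(uri, collection_name, token=""):
    """Return the row count Milvus reports, or None if the collection is missing."""
    # Imported lazily so that defining the helper does not require a connection.
    from pymilvus import MilvusClient

    client = MilvusClient(uri=uri, token=token)
    if not client.has_collection(collection_name):
        return None
    stats = client.get_collection_stats(collection_name)
    return int(stats["row_count"])


# Usage, after flow.execute() has completed:
# print(collection_row_count("./milvus.db", "my_collection"))
```

The reported count may lag briefly behind the actual number of inserted rows, so treat it as a sanity check rather than an exact figure.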

In this tutorial, we built an end-to-end ETL pipeline that moves documents from Amazon S3 to Milvus using VectorETL. Because VectorETL supports a range of data sources, you can choose whichever one fits your application. Its modular design also makes it easy to extend this pipeline to other data sources and embedding models, making it a powerful tool for AI and data engineering workflows.

