Integrate Milvus with WhyHow
This guide demonstrates how to use whyhow.ai and Milvus Lite to conduct Rule-based Retrieval.
Overview
WhyHow is a platform which provides developers the building blocks they need to organize, contextualize, and reliably retrieve unstructured data to perform complex RAG. The Rule-based Retrieval package is a Python package developed by WhyHow that enables people to create and manage Retrieval Augmented Generation (RAG) applications with advanced filtering capabilities.
Installation
Before you start, install the required Python packages.
pip install --upgrade pymilvus whyhow_rbr
Next, initialize the Milvus client. This guide uses Milvus Lite, which keeps the data in a local file, to implement Rule-based Retrieval.
# ClientMilvus is the client provided by whyhow_rbr (import path assumed here)
from whyhow_rbr import ClientMilvus
# Path of the local Milvus Lite database file (any file name works)
path = "./milvus_demo.db"
# Initialize ClientMilvus
milvus_client = ClientMilvus(path)
You can also initialize the Milvus client with Milvus Cloud credentials:
# ClientMilvus is the client provided by whyhow_rbr (import path assumed here)
from whyhow_rbr import ClientMilvus
# Milvus Cloud credentials
YOUR_MILVUS_CLOUD_END_POINT = "YOUR_MILVUS_CLOUD_END_POINT"
YOUR_MILVUS_CLOUD_TOKEN = "YOUR_MILVUS_CLOUD_TOKEN"
# Initialize ClientMilvus
milvus_client = ClientMilvus(
    milvus_uri=YOUR_MILVUS_CLOUD_END_POINT,
    milvus_token=YOUR_MILVUS_CLOUD_TOKEN,
)
Create Collection
Define the necessary variables
# Define collection name
COLLECTION_NAME = "YOUR_COLLECTION_NAME"  # choose your own collection name
# Define vector dimension size
DIMENSION = 1536  # determined by the embedding model you use
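Because the dimension has to match the embedding model, a quick length check before inserting can catch mismatches early. The following is a minimal sketch: the lookup table and the check_dimension helper are illustrative names, not part of whyhow_rbr or pymilvus, and the sizes listed are the default output sizes of these OpenAI embedding models.

```python
# Default output sizes of some OpenAI embedding models.
# This table and the helper below are illustrative, not part of any library.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(vector, model_name):
    """Return the expected dimension, or raise if the vector does not match it."""
    expected = EMBEDDING_DIMENSIONS[model_name]
    if len(vector) != expected:
        raise ValueError(
            f"{model_name} produces {expected}-d vectors, got {len(vector)}"
        )
    return expected


DIMENSION = check_dimension([0.0] * 1536, "text-embedding-3-small")
print(DIMENSION)
```

If you switch to a model with a different output size, DIMENSION in the schema has to change with it.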
Add schema
Before inserting any data into the Milvus Lite database, we first need to define the data fields, which together form the schema. By creating a CollectionSchema object and adding data fields through add_field(), we can control the data types and their characteristics. This step is mandatory before inserting any data into Milvus.
from pymilvus import DataType  # field data types used below
schema = milvus_client.create_schema(auto_id=True) # Enable id matching
schema = milvus_client.add_field(schema=schema, field_name="id", datatype=DataType.INT64, is_primary=True)
schema = milvus_client.add_field(schema=schema, field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)
Create index
Each schema should have an index so that queries run much more efficiently. To create one, we first prepare an index_params object and then add the index settings to this IndexParams object.
# Index the embedding field
index_params = milvus_client.prepare_index_params()
index_params = milvus_client.add_index(
    index_params=index_params,  # pass in the index_params object
    field_name="embedding",
    index_type="AUTOINDEX",  # use AUTOINDEX instead of a more complex indexing method
    metric_type="COSINE",  # L2, COSINE, or IP
)
This method is a thin wrapper around the official Milvus implementation (official docs).
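To make the three metric_type options concrete, here is a small sketch of the textbook definitions of L2 distance, inner product (IP), and cosine similarity in plain Python. It illustrates what the metrics measure; it is not how Milvus computes them internally, and the function names are made up for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product: larger means more similar, and vector length matters."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors: only the direction matters."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # points the same way as the query, but longer
orthogonal = [0.0, 2.0]      # unrelated direction

print(l2_distance(query, same_direction))        # 2.0
print(inner_product(query, same_direction))      # 3.0
print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Cosine similarity ignores vector length, which is one common reason to prefer it for text embeddings.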
Create collection
After defining the data fields and their index, we create the collection so that the data can be accessed quickly and precisely. Note that enable_dynamic_field is set to true, so you can upload data with fields that are not declared in the schema. The cost is that querying such data might be less efficient.
# Create Collection
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params
)
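As a rough mental model of the dynamic field, each row can be thought of as split into the fields declared in the schema and everything else. The sketch below is only a conceptual picture in plain Python, not Milvus code: SCHEMA_FIELDS and split_row are hypothetical names, and the extra keys are taken from the metadata attributes this guide mentions.

```python
# Fields declared in the schema above; everything else counts as dynamic.
SCHEMA_FIELDS = {"id", "embedding"}


def split_row(row):
    """Separate declared fields from the extra keys a dynamic field would hold."""
    declared = {k: v for k, v in row.items() if k in SCHEMA_FIELDS}
    dynamic = {k: v for k, v in row.items() if k not in SCHEMA_FIELDS}
    return declared, dynamic


row = {
    "id": 1,
    "embedding": [0.1, 0.2],
    "filename": "harry-potter.pdf",  # not declared in the schema
    "page_number": 8,                # not declared in the schema
}
declared, dynamic = split_row(row)
print(declared)
print(dynamic)
```

Undeclared keys have no fixed type in the schema, which is one way to picture why queries on them can be slower than queries on declared fields.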
Upload documents
After creating a collection, we are ready to populate it with documents. In whyhow_rbr, this is done using the upload_documents method of ClientMilvus. It performs the following steps under the hood:
- Preprocessing: Reading and splitting the provided PDF files into chunks
- Embedding: Embedding all the chunks using an OpenAI model
- Inserting: Uploading both the embeddings and the metadata to Milvus Lite
# PDF files to upload
pdfs = ["harry-potter.pdf", "game-of-thrones.pdf"]  # replace with the paths to your PDFs
# Upload the PDF documents
milvus_client.upload_documents(
    collection_name=COLLECTION_NAME,
    documents=pdfs
)
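The three steps listed above (preprocessing, embedding, inserting) can be pictured with a toy pipeline. This is a sketch under stated assumptions, not the whyhow_rbr implementation: split_into_chunks, toy_embed, and build_records are hypothetical helpers, the pages are fake strings rather than parsed PDFs, and toy_embed stands in for a real embedding model.

```python
def split_into_chunks(text, chunk_size=40, overlap=10):
    """Split text into overlapping character windows (a toy splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(chunk):
    """Hypothetical stand-in for an embedding model: returns a 2-d vector."""
    vowels = sum(ch in "aeiou" for ch in chunk.lower())
    return [float(len(chunk)), float(vowels)]


def build_records(pages, filename):
    """Turn the pages of one document into rows ready for insertion."""
    records = []
    for page_number, page_text in enumerate(pages):  # 0-indexed pages
        for chunk in split_into_chunks(page_text):
            records.append({
                "text": chunk,
                "filename": filename,
                "page_number": page_number,
                "embedding": toy_embed(chunk),
            })
    return records


pages = ["a" * 100, "b" * 20]  # two fake pages instead of a parsed PDF
records = build_records(pages, "harry-potter.pdf")
print(len(records))
print(records[-1])
```

Keeping filename and page_number next to each chunk is what makes the metadata-based rules later in this guide possible.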
Question answering
Now we can finally move to retrieval augmented generation.
# Search data and implement RAG!
res = milvus_client.search(
    question='What food does Harry Potter like to eat?',
    collection_name=COLLECTION_NAME,
    anns_field='embedding',
    output_fields='text'
)
print(res['answer'])
print(res['matches'])
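The search call above returns both an answer and the matched chunks. Conceptually, retrieval augmented generation ranks stored chunks by similarity to the question and hands the best ones to a language model as context. The sketch below shows that idea with brute-force ranking in plain Python; the records, the 2-d vectors, and the helpers top_k and build_prompt are invented for illustration, and the final call to a language model is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=2):
    """Brute-force retrieval: rank every record by similarity to the query."""
    ranked = sorted(
        records, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question, matches):
    """Assemble the retrieved chunks into context for a language model."""
    context = "\n".join(f"- {m['text']}" for m in matches)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up chunks with made-up 2-d embeddings.
records = [
    {"text": "Harry ate treacle tart.", "embedding": [0.9, 0.1]},
    {"text": "Winter is coming.", "embedding": [0.1, 0.9]},
    {"text": "Harry likes pumpkin juice.", "embedding": [0.8, 0.3]},
]
query_vec = [1.0, 0.0]  # pretend embedding of the question

matches = top_k(query_vec, records, k=2)
print(build_prompt("What food does Harry Potter like to eat?", matches))
```

A vector index replaces the brute-force sort with an approximate search, so the same ranking scales to large collections.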
Rules
In the previous example, every single document in our index was considered. However, sometimes it might be beneficial to only retrieve documents satisfying some predefined conditions (e.g. filename=harry-potter.pdf). In whyhow_rbr with Milvus Lite, this can be done by adjusting the search parameters.
A rule can control the following metadata attributes:
- filename: name of the file
- page_numbers: list of integers corresponding to page numbers (0 indexing)
- id: unique identifier of a chunk (this is the most “extreme” filter)
- Other rules based on Boolean Expressions
# Rules: search the harry-potter book on page 8
PARTITION_NAME = 'harry-potter'  # partition that holds this book
page_number = 'page_number == 8'
# First create a partition to store the book, then search within this partition:
milvus_client.create_partition(
    collection_name=COLLECTION_NAME,
    partition_name=PARTITION_NAME  # separate partitions based on your PDF types
)
# Search with rules
res = milvus_client.search(
    question='Tell me about the greedy method',
    collection_name=COLLECTION_NAME,
    partition_names=PARTITION_NAME,
    filter=page_number,  # any rule written as a Boolean expression
    anns_field='embedding',
    output_fields='text'
)
print(res['answer'])
print(res['matches'])
In this example, we first create a partition that stores the harry-potter PDFs; searching within this partition returns the most directly relevant information. We also apply the page number as a filter to specify the exact page we wish to search. Remember that the filter parameter must be a valid Boolean expression.
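Filter strings like 'page_number == 8' can be assembled programmatically from the rule attributes. The helper below is a hypothetical sketch, not part of whyhow_rbr: build_filter is an invented name, and it assumes the metadata is stored under the field names filename and page_number used in this guide.

```python
def build_filter(filename=None, page_numbers=None, extra=None):
    """Combine rule attributes into one Boolean expression string."""
    clauses = []
    if filename is not None:
        clauses.append(f'filename == "{filename}"')
    if page_numbers:
        if len(page_numbers) == 1:
            clauses.append(f"page_number == {page_numbers[0]}")
        else:
            clauses.append(f"page_number in {sorted(page_numbers)}")
    if extra:
        # Any additional Boolean expression, wrapped to keep precedence clear.
        clauses.append(f"({extra})")
    return " and ".join(clauses)


print(build_filter(page_numbers=[8]))
print(build_filter(filename="harry-potter.pdf", page_numbers=[9, 8]))
```

The resulting string can be passed as the filter argument of the search call shown above.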
Clean up
Finally, once you have worked through all the steps, you can clean up the database by calling drop_collection().
# Clean up
milvus_client.drop_collection(
    collection_name=COLLECTION_NAME
)