Home
Blog
Getting Started with Hybrid Semantic / Full-Text Search with Milvus 2.5

Getting Started with Hybrid Semantic / Full-Text Search with Milvus 2.5

Engineering

December 17, 2024

Stefan Webb

In this article, we will show you how to quickly get up and running with the new full-text search feature and combine it with the conventional semantic search based on vector embeddings.

Requirement

First, ensure you have installed Milvus 2.5:

pip install -U pymilvus[model]

and have a running instance of Milvus Standalone (e.g. on your local machine) using the installation instructions in the Milvus docs.

Building the Data Schema and Search Indices

We import the required classes and functions:

from pymilvus import MilvusClient, DataType, Function, FunctionType, model

You may have noticed two new entries for Milvus 2.5, Function and FunctionType, which we will explain shortly.

Next we open the database with Milvus Standalone, that is, locally, and create the data schema. The schema comprises an integer primary key, a text string, a dense vector of dimension 384, and a sparse vector (of unlimited dimensionality). Note that Milvus Lite does not currently support full-text search, only Milvus Standalone and Milvus Distributed.

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="dense", datatype=DataType.FLOAT_VECTOR, dim=768),
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}], 'enable_dynamic_field': False}

You may have noticed the enable_analyzer=True parameter. This tells Milvus 2.5 to enable the lexical parser on this field and build a list of tokens and token frequencies, which are required for full-text search. The sparse field will hold a vector representation of the documentation as a bag-of-words produced from the parsing text.

But how do we connect the text and sparse fields, and tell Milvus how sparse should be calculated from text? This is where we need to invoke the Function object and add it to the schema:

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}], 'enable_dynamic_field': False, 'functions': [{'name': 'text_bm25_emb', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}

The abstraction of the Function object is more general than that of applying full-text search. In the future, it may be used for other cases where one field needs to be a function of another field. In our case, we specify that sparse is a function of text via the function FunctionType.BM25. BM25 refers to a common metric in information retrieval used for calculating a query’s similarity to a document (relative to a collection of documents).

We use the default embedding model in Milvus, which is paraphrase-albert-small-v2:

embedding_fn = model.DefaultEmbeddingFunction()

The next step is to add our search indices. We have one for the dense vector and a separate one for the sparse vector. The index type is SPARSE_INVERTED_INDEX with BM25 since full-text search requires a different search method than those for standard dense vectors.

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="dense",
    index_type="AUTOINDEX", 
    metric_type="COSINE"
)

index_params.add_index(
    field_name="sparse",
    index_type="SPARSE_INVERTED_INDEX", 
    metric_type="BM25"
)

Finally, we create our collection:

client.drop_collection('demo')
client.list_collections()

[]

client.create_collection(
    collection_name='demo', 
    schema=schema, 
    index_params=index_params
)

client.list_collections()

['demo']

And with that, we have an empty database set up to accept text documents and perform semantic and full-text searches!

Inserting Data and Performing Full-Text Search

Inserting data is no different than previous versions of Milvus:

docs = [
    'information retrieval is a field of study.',
    'information retrieval focuses on finding relevant information in large datasets.',
    'data mining and information retrieval overlap in research.'
]

embeddings = embedding_fn(docs)

client.insert('demo', [
    {'text': doc, 'dense': vec} for doc, vec in zip(docs, embeddings)
])

{'insert_count': 3, 'ids': [454387371651630485, 454387371651630486, 454387371651630487], 'cost': 0}

Let’s first illustrate a full-text search before we move on to hybrid search:

search_params = {
    'params': {'drop_ratio_search': 0.2},
}

results = client.search(
    collection_name='demo', 
    data=['whats the focus of information retrieval?'],
    output_fields=['text'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)

The search parameter drop_ratio_search refers to the proportion of lower-scoring documents to drop during the search algorithm.

Let’s view the results:

for hit in results[0]:
    print(hit)

{'id': 454387371651630485, 'distance': 1.3352930545806885, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.29726022481918335, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.2715056240558624, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}

Performing Hybrid Semantic and Full-Text Search

Let’s now combine what we’ve learned to perform a hybrid search that combines separate semantic and full-text searches with a reranker:

from pymilvus import AnnSearchRequest, RRFRanker
query = 'whats the focus of information retrieval?'
query_dense_vector = embedding_fn([query])

search_param_1 = {
    "data": query_dense_vector,
    "anns_field": "dense",
    "param": {
        "metric_type": "COSINE",
    },
    "limit": 3
}
request_1 = AnnSearchRequest(**search_param_1)

search_param_2 = {
    "data": [query],
    "anns_field": "sparse",
    "param": {
        "metric_type": "BM25",
        "params": {"drop_ratio_build": 0.0}
    },
    "limit": 3
}
request_2 = AnnSearchRequest(**search_param_2)

reqs = [request_1, request_2]

ranker = RRFRanker()

res = client.hybrid_search(
    collection_name="demo",
    output_fields=['text'],
    reqs=reqs,
    ranker=ranker,
    limit=3
)
for hit in res[0]:
    print(hit)

{'id': 454387371651630485, 'distance': 0.032786883413791656, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.032258063554763794, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.0317460335791111, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}

As you may have noticed, this is no different than a hybrid search with two separate semantic fields (available since Milvus 2.4). The results are identical to full-text search in this simple example, but for larger databases and keyword specific searches hybrid search typically has higher recall.

Summary

You’re now equipped with all the knowledge needed to perform full-text and hybrid semantic/full-text search with Milvus 2.5. See the following articles for more discussion on how full-text search works and why it is complementary to semantic search: