milvus-logo
LFAI
Home
  • User Guide

Full Text Search

Full text search is a feature that retrieves documents containing specific terms or phrases in text datasets, then ranking the results based on relevance. This feature overcomes semantic search limitations, which might overlook precise terms, ensuring you receive the most accurate and contextually relevant results. Additionally, it simplifies vector searches by accepting raw text input, automatically converting your text data into sparse embeddings without the need to manually generate vector embeddings.

Using the BM25 algorithm for relevance scoring, this feature is particularly valuable in retrieval-augmented generation (RAG) scenarios, where it prioritizes documents that closely match specific search terms.

By integrating full text search with semantic-based dense vector search, you can enhance the accuracy and relevance of search results. For more information, refer to Hybrid Search.

Overview

Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:

  1. Text input: You insert raw text documents or provide query text without any need for manual embedding.

  2. Text analysis: Milvus uses an analyzer to tokenize input text into individual, searchable terms. For more information on analyzers, refer to Analyzer Overview.

  3. Function processing: The built-in function receives tokenized terms and converts them into sparse vector representations.

  4. Collection store: Milvus stores these sparse embeddings in a collection for efficient retrieval.

  5. BM25 scoring: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on relevance to the query text.

Full text search Full text search

To use full text search, follow these main steps:

  1. Create a collection: Set up a collection with necessary fields and define a function to convert raw text into sparse embeddings.

  2. Insert data: Ingest your raw text documents to the collection.

  3. Perform searches: Use query texts to search through your collection and retrieve relevant results.

To enable full text search, create a collection with a specific schema. This schema must include three necessary fields:

  • The primary field that uniquely identifies each entity in a collection.

  • A VARCHAR field that stores raw text documents, with the enable_analyzer attribute set to True. This allows Milvus to tokenize text into specific terms for function processing.

  • A SPARSE_FLOAT_VECTOR field reserved to store sparse embeddings that Milvus will automatically generate for the VARCHAR field.

Define the collection schema

First, create the schema and add the necessary fields:

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

import io.milvus.v2.common.DataType;
import io.milvus.v2.service.collection.request.AddFieldReq;
import io.milvus.v2.service.collection.request.CreateCollectionReq;

CreateCollectionReq.CollectionSchema schema = CreateCollectionReq.CollectionSchema.builder()
        .build();
schema.addField(AddFieldReq.builder()
        .fieldName("id")
        .dataType(DataType.Int64)
        .isPrimaryKey(true)
        .autoID(true)
        .build());
schema.addField(AddFieldReq.builder()
        .fieldName("text")
        .dataType(DataType.VarChar)
        .maxLength(1000)
        .enableAnalyzer(true)
        .build());
schema.addField(AddFieldReq.builder()
        .fieldName("sparse")
        .dataType(DataType.SparseFloatVector)
        .build());
import { MilvusClient, DataType } from "@zilliz/milvus2-sdk-node";

const address = "http://localhost:19530";
const token = "root:Milvus";
const client = new MilvusClient({address, token});
const schema = [
  {
    name: "id",
    data_type: DataType.Int64,
    is_primary_key: true,
  },
  {
    name: "text",
    data_type: "VarChar",
    enable_analyzer: true,
    enable_match: true,
    max_length: 1000,
  },
  {
    name: "sparse",
    data_type: DataType.SparseFloatVector,
  },
];


console.log(res.results)
export schema='{
        "autoId": true,
        "enabledDynamicField": false,
        "fields": [
            {
                "fieldName": "id",
                "dataType": "Int64",
                "isPrimary": true
            },
            {
                "fieldName": "text",
                "dataType": "VarChar",
                "elementTypeParams": {
                    "max_length": 1000,
                    "enable_analyzer": true
                }
            },
            {
                "fieldName": "sparse",
                "dataType": "SparseFloatVector"
            }
        ]
    }'

In this configuration,

  • id: serves as the primary key and is automatically generated with auto_id=True.

  • text: stores your raw text data for full text search operations. The data type must be VARCHAR, as VARCHAR is Milvus’ string data type for text storage. Set enable_analyzer=True to allow Milvus to tokenize the text. By default, Milvus uses the standard analyzer for text analysis. To configure a different analyzer, refer to Overview.

  • sparse: a vector field reserved to store internally generated sparse embeddings for full text search operations. The data type must be SPARSE_FLOAT_VECTOR.

Now, define a function that will convert your text into sparse vector representations and then add it to the schema:

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

import io.milvus.common.clientenum.FunctionType;
import io.milvus.v2.service.collection.request.CreateCollectionReq.Function;

import java.util.*;

schema.addFunction(Function.builder()
        .functionType(FunctionType.BM25)
        .name("text_bm25_emb")
        .inputFieldNames(Collections.singletonList("text"))
        .outputFieldNames(Collections.singletonList("vector"))
        .build());
const functions = [
    {
      name: 'text_bm25_emb',
      description: 'bm25 function',
      type: FunctionType.BM25,
      input_field_names: ['text'],
      output_field_names: ['vector'],
      params: {},
    },
];
export schema='{
        "autoId": true,
        "enabledDynamicField": false,
        "fields": [
            {
                "fieldName": "id",
                "dataType": "Int64",
                "isPrimary": true
            },
            {
                "fieldName": "text",
                "dataType": "VarChar",
                "elementTypeParams": {
                    "max_length": 1000,
                    "enable_analyzer": true
                }
            },
            {
                "fieldName": "sparse",
                "dataType": "SparseFloatVector"
            }
        ],
        "functions": [
            {
                "name": "text_bm25_emb",
                "type": "BM25",
                "inputFieldNames": ["text"],
                "outputFieldNames": ["sparse"],
                "params": {}
            }
        ]
    }'

Parameter

Description

name

The name of the function. This function converts your raw text from the text field into searchable vectors that will be stored in the sparse field.

input_field_names

The name of the VARCHAR field requiring text-to-sparse-vector conversion. For FunctionType.BM25, this parameter accepts only one field name.

output_field_names

The name of the field where the internally generated sparse vectors will be stored. For FunctionType.BM25, this parameter accepts only one field name.

function_type

The type of the function to use. Set the value to FunctionType.BM25.

For collections with multiple VARCHAR fields requiring text-to-sparse-vector conversion, add separate functions to the collection schema, ensuring each function has a unique name and output_field_names value.

Configure the index

After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use AUTOINDEX as the index_type, an option that allows Milvus to choose and configure the most suitable index type based on the structure of your data.

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX", 
    metric_type="BM25"
)

import io.milvus.v2.common.IndexParam;

List<IndexParam> indexes = new ArrayList<>();
indexes.add(IndexParam.builder()
        .fieldName("sparse")
        .indexType(IndexParam.IndexType.SPARSE_INVERTED_INDEX)
        .metricType(IndexParam.MetricType.BM25)
        .build());
const index_params = [
  {
    field_name: "sparse",
    metric_type: "BM25",
    index_type: "AUTOINDEX",
  },
];
export indexParams='[
        {
            "fieldName": "sparse",
            "metricType": "BM25",
            "indexType": "AUTOINDEX"
        }
    ]'

Parameter

Description

field_name

The name of the vector field to index. For full text search, this should be the field that stores the generated sparse vectors. In this example, set the value to sparse.

index_type

The type of the index to create. AUTOINDEX allows Milvus to automatically optimize index settings. If you need more control over your index settings, you can choose from various index types available for sparse vectors in Milvus. For more information, refer to Indexes supported in Milvus.

metric_type

The value for this parameter must be set to BM25 specifically for full text search functionality.

Create the collection

Now create the collection using the schema and index parameters defined.

client.create_collection(
    collection_name='demo', 
    schema=schema, 
    index_params=index_params
)

import io.milvus.v2.service.collection.request.CreateCollectionReq;

CreateCollectionReq requestCreate = CreateCollectionReq.builder()
        .collectionName("demo")
        .collectionSchema(schema)
        .indexParams(indexes)
        .build();
client.createCollection(requestCreate);
await client.create_collection(
    collection_name: 'demo', 
    schema: schema, 
    index_params: index_params,
    functions: functions
);
export CLUSTER_ENDPOINT="http://localhost:19530"
export TOKEN="root:Milvus"

curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/collections/create" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d "{
    \"collectionName\": \"demo\",
    \"schema\": $schema,
    \"indexParams\": $indexParams
}"

Insert text data

After setting up your collection and index, you’re ready to insert text data. In this process, you need only to provide the raw text. The built-in function we defined earlier automatically generates the corresponding sparse vector for each text entry.

client.insert('demo', [
    {'text': 'information retrieval is a field of study.'},
    {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
    {'text': 'data mining and information retrieval overlap in research.'},
])

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import io.milvus.v2.service.vector.request.InsertReq;

Gson gson = new Gson();
List<JsonObject> rows = Arrays.asList(
        gson.fromJson("{\"text\": \"information retrieval is a field of study.\"}", JsonObject.class),
        gson.fromJson("{\"text\": \"information retrieval focuses on finding relevant information in large datasets.\"}", JsonObject.class),
        gson.fromJson("{\"text\": \"data mining and information retrieval overlap in research.\"}", JsonObject.class)
);

client.insert(InsertReq.builder()
        .collectionName("demo")
        .data(rows)
        .build());
await client.insert({
collection_name: 'demo', 
data: [
    {'text': 'information retrieval is a field of study.'},
    {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
    {'text': 'data mining and information retrieval overlap in research.'},
]);
curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/entities/insert" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
    "data": [
        {"text": "information retrieval is a field of study."},
        {"text": "information retrieval focuses on finding relevant information in large datasets."},
        {"text": "data mining and information retrieval overlap in research."}       
    ],
    "collectionName": "demo"
}'

Once you’ve inserted data into your collection, you can perform full text searches using raw text queries. Milvus automatically converts your query into a sparse vector and ranks the matched search results using the BM25 algorithm, and then returns the topK (limit) results.

search_params = {
    'params': {'drop_ratio_search': 0.2},
}

client.search(
    collection_name='demo', 
    data=['whats the focus of information retrieval?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)

import io.milvus.v2.service.vector.request.SearchReq;
import io.milvus.v2.service.vector.request.data.EmbeddedText;
import io.milvus.v2.service.vector.response.SearchResp;

Map<String,Object> searchParams = new HashMap<>();
searchParams.put("drop_ratio_search", 0.2);
SearchResp searchResp = client.search(SearchReq.builder()
        .collectionName("demo")
        .data(Collections.singletonList(new EmbeddedText("whats the focus of information retrieval?")))
        .annsField("sparse")
        .topK(3)
        .searchParams(searchParams)
        .outputFields(Collections.singletonList("text"))
        .build());
await client.search(
    collection_name: 'demo', 
    data: ['whats the focus of information retrieval?'],
    anns_field: 'sparse',
    limit: 3,
    params: {'drop_ratio_search': 0.2},
)
curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/entities/search" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
--data-raw '{
    "collectionName": "demo",
    "data": [
        "whats the focus of information retrieval?"
    ],
    "annsField": "sparse",
    "limit": 3,
    "outputFields": [
        "text"
    ],
    "searchParams":{
        "params":{
            "drop_ratio_search":0.2
        }
    }
}'

Parameter

Description

search_params

A dictionary containing search parameters.

params.drop_ratio_search

Proportion of low-frequency terms to ignore during search. For details, refer to Sparse Vector.

data

The raw query text.

anns_field

The name of the field that contains internally generated sparse vectors.

limit

Maximum number of top matches to return.

Try Managed Milvus for Free

Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.

Get Started
Feedback

Was this page helpful?