milvus-logo

Sparse Vector

Sparse vectors represent words or phrases using vector embeddings where most elements are zero, with only one non-zero element indicating the presence of a specific word. Sparse vector models, such as SPLADEv2, outperform dense models in out-of-domain knowledge search, keyword-awareness, and interpretability. They are particularly useful in information retrieval, natural language processing, and recommendation systems, where combining sparse vectors for recall with a large model for ranking can significantly improve retrieval results.

In Milvus, the use of sparse vectors follows a similar workflow to that of dense vectors. It involves creating a collection with a sparse vector column, inserting data, creating an index, and conducting similarity searches and scalar queries.

In this tutorial, you will learn how to:

  • Prepare sparse vector embeddings;
  • Create a collection with a sparse vector field;
  • Insert entities with sparse vector embeddings;
  • Index the collection and perform ANN search on sparse vectors.

To see sparse vectors in action, refer to hello_sparse.py.

notes

  • Currently, the support for sparse vectors is a beta feature in 2.4.0, with plans to make it generally available in 3.0.0.
  • The code snippets on this page use the PyMilvus ORM module to interact with Milvus. Code snippets with the new MilvusClient SDK will be available soon.

Prepare sparse vector embeddings

To use sparse vectors in Milvus, prepare vector embeddings in one of the supported formats:

  • Sparse Matrices: Utilize the scipy.sparse class family to represent your sparse embeddings. This method is efficient for handling large-scale, high-dimensional data.

  • List of Dictionaries: Represent each sparse embedding as a dictionary, structured as {dimension_index: value, ...}, where each key-value pair represents the dimension index and its corresponding value.

    Example:

    {2: 0.33, 98: 0.72, ...}
    
  • List of Iterables of Tuples: Similar to the list of dictionaries, but use an iterable of tuples, (dimension_index, value)], to specify only the non-zero dimensions and their values.

    Example:

    [(2, 0.33), (98, 0.72), ...]
    

The following example prepares sparse embeddings by generating a random sparse matrix for 10,000 entities, each with 10,000 dimensions and a sparsity density of 0.005.

# Prepare entities with sparse vector representation
rng = np.random.default_rng()

num_entities, dim = 10000, 10000

# Generate random sparse rows with an average of 25 non-zero elements per row
entities = [
    {
        "scalar_field": rng.random(),
        # To represent a single sparse vector row, you can use:
        # - Any of the scipy.sparse sparse matrices class family with shape[0] == 1
        # - Dict[int, float]
        # - Iterable[Tuple[int, float]]
        "sparse_vector": {
            d: rng.random() for d in random.sample(range(dim), random.randint(20, 30))
        },
    }
    for _ in range(num_entities)
]

# print the first entity to check the representation
print(entities[0])

# Output:
# {
#     'scalar_field': 0.520821523849214,
#     'sparse_vector': {
#         5263: 0.2639375518635271,
#         3573: 0.34701499565746674,
#         9637: 0.30856525997853057,
#         4399: 0.19771651149001523,
#         6959: 0.31025067641541815,
#         1729: 0.8265339135915016,
#         1220: 0.15303302147479103,
#         7335: 0.9436728846033107,
#         6167: 0.19929870545596562,
#         5891: 0.8214617920371853,
#         2245: 0.7852255053773395,
#         2886: 0.8787982039149889,
#         8966: 0.9000606703940665,
#         4910: 0.3001170013981104,
#         17: 0.00875671667413136,
#         3279: 0.7003425473001098,
#         2622: 0.7571360018373428,
#         4962: 0.3901879090102064,
#         4698: 0.22589525720196246,
#         3290: 0.5510228492587324,
#         6185: 0.4508413201390492
#     }
# }

notes

The vector dimensions must be of Python int or numpy.integer type, and the values must be of Python float or numpy.floating type.

To generate embeddings, you can also use the model package built in the PyMilvus library, which offers a range of embedding functions. For details, refer to Embeddings.

Create a collection with a sparse vector field

To create a collection with a sparse vector field, set the datatype of the sparse vector field to DataType.SPARSE_FLOAT_VECTOR. Unlike dense vectors, there is no need to specify a dimension for sparse vectors.

import numpy as np
import random
from pymilvus import MilvusClient, DataType

# Create a MilvusClient instance
client = MilvusClient(uri="http://localhost:19530")

# Create a collection with a sparse vector field
schema = client.create_schema(
    auto_id=True,
    enable_dynamic_fields=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="scalar_field", datatype=DataType.DOUBLE)
# For sparse vector, no need to specify dimension
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR) # set `datatype` to `SPARSE_FLOAT_VECTOR`

client.create_collection(collection_name="test_sparse_vector", schema=schema)

For details on common collection parameters, refer to create_collection() .

Insert entities with sparse vector embeddings

To insert entities with sparse vector embeddings, simply pass the list of entities to the insert() method.

# Insert entities
client.insert(collection_name="test_sparse_vector", data=entities)

Index the collection

Before performing similarity searches, create an index for the collection.

# Index the collection

# Prepare index params
index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX", # the type of index to be created. set to `SPARSE_INVERTED_INDEX` or `SPARSE_WAND`.
    metric_type="IP", # the metric type to be used for the index. Currently, only `IP` (Inner Product) is supported.
    params={"drop_ratio_build": 0.2}, # the ratio of small vector values to be dropped during indexing.
)

# Create index
client.create_index(collection_name="test_sparse_vector", index_params=index_params)

For index building on sparse vectors, take note of the following:

  • index_type: The type of index to be built. Possible options for sparse vectors:

    • SPARSE_INVERTED_INDEX: An inverted index that maps each dimension to its non-zero vectors, facilitating direct access to relevant data during searches. Ideal for datasets with sparse but high-dimensional data.

    • SPARSE_WAND: Utilizes the Weak-AND (WAND) algorithm to quickly bypass unlikely candidates, focusing evaluation on those with higher ranking potential. Treats dimensions as terms and vectors as documents, speeding up searches in large, sparse datasets.

  • metric_type: Only IP (Inner Product) distance metric is supported for sparse vectors.

  • params.drop_ratio_build: The index parameter used specifically for sparse vectors. It controls the proportion of small vector values that are excluded during the indexing process. This parameter enables fine-tuning of the trade-off between efficiency and accuracy by disregarding small values when constructing the index. For instance, if drop_ratio_build = 0.3, during the index construction, all values from all sparse vectors are gathered and sorted. The smallest 30% of these values are not included in the index, thereby reducing the computational workload during search.

For more information, refer to In-memory Index.

After the collection is indexed and loaded into memory, use the search() method to retrieve the relevant documents based on the query.

# Load the collection into memory
client.load_collection(collection_name="test_sparse_vector")

# Perform ANN search on sparse vectors

# for demo purpose we search for the last inserted vector
query_vector = entities[-1]["sparse_vector"]

search_params = {
    "metric_type": "IP",
    "params": {"drop_ratio_search": 0.2}, # the ratio of small vector values to be dropped during search.
}

search_res = client.search(
    collection_name="test_sparse_vector",
    data=[query_vector],
    limit=3,
    output_fields=["pk", "scalar_field"],
    search_params=search_params,
)

for hits in search_res:
    for hit in hits:
        print(f"hit: {hit}")
        
# Output:
# hit: {'id': '448458373272710786', 'distance': 7.220192909240723, 'entity': {'pk': '448458373272710786', 'scalar_field': 0.46767865218233806}}
# hit: {'id': '448458373272708317', 'distance': 1.2287548780441284, 'entity': {'pk': '448458373272708317', 'scalar_field': 0.7315987515699472}}
# hit: {'id': '448458373272702005', 'distance': 0.9848432540893555, 'entity': {'pk': '448458373272702005', 'scalar_field': 0.9871869181562156}}

When configuring search parameters, take note of the following:

  • params.drop_ratio_search: The search parameter used specifically for sparse vectors. This option allows fine-tuning of the search process by specifying the ratio of the smallest values in the query vector to ignore. It helps balance search precision and performance. The smaller the value set for drop_ratio_search, the less these small values contribute to the final score. By ignoring some small values, search performance can be improved with minimal impact on accuracy.

Perform scalar queries

In addition to ANN search, Milvus also supports scalar queries on sparse vectors. These queries allow you to retrieve documents based on a scalar value associated with the sparse vector.

Filter entities with scalar_field greater than 3:

# Perform a query by specifying filter expr
filter_query_res = client.query(
    collection_name="test_sparse_vector",
    filter="scalar_field > 0.999",
)

print(filter_query_res[:2])

# Output:
# [{'pk': '448458373272701862', 'scalar_field': 0.9994093623822689, 'sparse_vector': {173: 0.35266244411468506, 400: 0.49995484948158264, 480: 0.8757831454277039, 661: 0.9931875467300415, 1040: 0.0965644046664238, 1728: 0.7478245496749878, 2365: 0.4351981580257416, 2923: 0.5505295395851135, 3181: 0.7396837472915649, 3848: 0.4428485333919525, 4701: 0.39119353890419006, 5199: 0.790219783782959, 5798: 0.9623121619224548, 6213: 0.453134149312973, 6341: 0.745091438293457, 6775: 0.27766478061676025, 6875: 0.017947908490896225, 8093: 0.11834774166345596, 8617: 0.2289179265499115, 8991: 0.36600416898727417, 9346: 0.5502803921699524}}, {'pk': '448458373272702421', 'scalar_field': 0.9990218525410719, 'sparse_vector': {448: 0.587817907333374, 1866: 0.0994109958410263, 2438: 0.8672442436218262, 2533: 0.8063794374465942, 2595: 0.02122959867119789, 2828: 0.33827054500579834, 2871: 0.1984412521123886, 2938: 0.09674275666475296, 3154: 0.21552987396717072, 3662: 0.5236313343048096, 3711: 0.6463911533355713, 4029: 0.4041993021965027, 7143: 0.7370485663414001, 7589: 0.37588241696357727, 7776: 0.436136394739151, 7962: 0.06377989053726196, 8385: 0.5808192491531372, 8592: 0.8865005970001221, 8648: 0.05727503448724747, 9071: 0.9450633525848389, 9161: 0.146037295460701, 9358: 0.1903032660484314, 9679: 0.3146636486053467, 9974: 0.8561339378356934, 9991: 0.15841573476791382}}]

Filter entities by primary key:

# primary keys of entities that satisfy the filter
pks = [ret["pk"] for ret in filter_query_res]

# Perform a query by primary key
pk_query_res = client.query(
    collection_name="test_sparse_vector", filter=f"pk == '{pks[0]}'"
)

print(pk_query_res)

# Output:
# [{'scalar_field': 0.9994093623822689, 'sparse_vector': {173: 0.35266244411468506, 400: 0.49995484948158264, 480: 0.8757831454277039, 661: 0.9931875467300415, 1040: 0.0965644046664238, 1728: 0.7478245496749878, 2365: 0.4351981580257416, 2923: 0.5505295395851135, 3181: 0.7396837472915649, 3848: 0.4428485333919525, 4701: 0.39119353890419006, 5199: 0.790219783782959, 5798: 0.9623121619224548, 6213: 0.453134149312973, 6341: 0.745091438293457, 6775: 0.27766478061676025, 6875: 0.017947908490896225, 8093: 0.11834774166345596, 8617: 0.2289179265499115, 8991: 0.36600416898727417, 9346: 0.5502803921699524}, 'pk': '448458373272701862'}]

Limits

When using sparse vectors in Milvus, consider the following limits:

  • Currently, only the IP distance metric is supported for sparse vectors.

  • For sparse vector fields, only the SPARSE_INVERTED_INDEX and SPARSE_WAND index types are supported.

  • Currently, range search, grouping search, and search iterator are not supported for sparse vectors.

FAQ

  • What distance metric is supported for sparse vectors?

    Sparse vectors only support the Inner Product (IP) distance metric due to the high dimensionality of sparse vectors, which makes L2 distance and cosine distance impractical.

  • Can you explain the difference between SPARSE_INVERTED_INDEX and SPARSE_WAND, and how do I choose between them?

    SPARSE_INVERTED_INDEX is a traditional inverted index, while SPARSE_WAND uses the Weak-AND algorithm to reduce the number of full IP distance evaluations during search. SPARSE_WAND is typically faster, but its performance can decline with increasing vector density. To choose between them, conduct experiments and benchmarks based on your specific dataset and use case.

  • How should I choose the drop_ratio_build and drop_ratio_search parameters?

    The choice of drop_ratio_build and drop_ratio_search depends on the characteristics of your data and your requirements for search latency/throughput and accuracy.

  • What data types are supported for sparse embeddings?

    The dimension part must be an unsigned 32-bit integer, and the value part can be any 32-bit float.

  • Can the dimension of a sparse embedding be any discrete value within the uint32 space?

    Yes, the dimension of a sparse embedding can be any value from 0 to 4.2 billion (maximum of uint32 - 1).

  • Are searches on growing segments conducted through an index or by brute force?

    Searches on growing segments are conducted through an index of the same type as the sealed segment index. For new growing segments before the index is built, a brute force search is used.

  • Is it possible to have both sparse and dense vectors in a single collection?

    Yes, with multiple vector type support, you can create collections with both sparse and dense vector columns and perform hybrid searches on them.

  • What are the requirements for sparse embeddings to be inserted or searched?

    Sparse embeddings must have at least one non-zero value, and vector indices must be non-negative.

On this page