
Index with GPU

This guide outlines the steps to build an index with GPU support in Milvus, which can significantly improve search performance in high-throughput and high-recall scenarios. For details on the types of GPU indexes supported by Milvus, refer to GPU Index.

Configure Milvus settings for GPU memory control

Milvus uses a global graphics memory pool to allocate GPU memory.

Two parameters in the Milvus config file control the pool: initMemSize and maxMemSize. The pool starts at initMemSize and, when usage exceeds the current size, is automatically expanded up to maxMemSize.

The default initMemSize is 1/2 of the available GPU memory when Milvus starts, and the default maxMemSize is equal to all available GPU memory.

Up to and including Milvus 2.4.1, Milvus used a unified GPU memory pool. For those versions, it was recommended to set both values to 0:

  initMemSize: 0 # Sets the initial memory pool size.
  maxMemSize: 0 # Sets the maximum memory usage limit. When memory usage exceeds initMemSize, Milvus attempts to expand the memory pool.

From Milvus 2.4.1 onwards, the GPU memory pool is used only for temporary GPU data during searches. Therefore, it is recommended to set initMemSize to 2048 and maxMemSize to 4096:

  initMemSize: 2048 # Sets the initial memory pool size.
  maxMemSize: 4096 # Sets the maximum memory usage limit. When memory usage exceeds initMemSize, Milvus attempts to expand the memory pool.
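For reference, in milvus.yaml these two settings typically sit under the gpu section. The placement below is an assumption to verify against your deployment's configuration file, since the layout can differ across Milvus versions:

```yaml
# Assumed milvus.yaml layout; confirm the section name for your Milvus version.
gpu:
  initMemSize: 2048 # Initial GPU memory pool size
  maxMemSize: 4096  # Maximum size the pool may grow to
```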

Build an index

The following examples demonstrate how to build GPU indexes of different types.

Prepare index parameters

When setting up GPU index parameters, define index_type, metric_type, and params:

  • index_type (string): The type of index used to accelerate vector search. Valid options include GPU_CAGRA, GPU_IVF_FLAT, GPU_IVF_PQ, and GPU_BRUTE_FORCE.

  • metric_type (string): The type of metrics used to measure the similarity of vectors. Valid options are IP and L2.

  • params (dict): The index-building parameters specific to the chosen index type. The valid options depend on the index type.

Here are example configurations for different index types:

  • GPU_CAGRA index

    index_params = {
        "metric_type": "L2",
        "index_type": "GPU_CAGRA",
        "params": {
            "intermediate_graph_degree": 64,
            "graph_degree": 32
        }
    }

    Possible options for params include:

    • intermediate_graph_degree (int): Affects recall and build time by determining the graph's degree before pruning. Recommended values are 32 or 64.

    • graph_degree (int): Affects search performance and recall by setting the graph's degree after pruning. Typically, it is half of the intermediate_graph_degree. A larger difference between these two degrees results in a longer build time. Its value must be smaller than the value of intermediate_graph_degree.

    • build_algo (string): Selects the graph generation algorithm before pruning. Possible options:

      • IVF_PQ: Offers higher quality but slower build time.

      • NN_DESCENT: Provides a quicker build with potentially lower recall.

    • cache_dataset_on_device (string, "true" | "false"): Decides whether to cache the original dataset in GPU memory. Setting this to "true" enhances recall by refining search results, while setting it to "false" conserves GPU memory.

  • GPU_IVF_FLAT or GPU_IVF_PQ index

    index_params = {
        "metric_type": "L2",
        "index_type": "GPU_IVF_FLAT", # Or GPU_IVF_PQ
        "params": {
            "nlist": 1024
        }
    }

    The params options are identical to those used in IVF_FLAT and IVF_PQ.


  • GPU_BRUTE_FORCE index

    index_params = {
        "index_type": "GPU_BRUTE_FORCE",
        "metric_type": "L2",
        "params": {}
    }

    No additional params configurations are required.
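The CAGRA build parameters above carry an ordering constraint: graph_degree must be smaller than intermediate_graph_degree. The sketch below enforces that rule when assembling the index_params dict; make_cagra_params is a hypothetical helper for illustration, not part of the Milvus API.

```python
def make_cagra_params(intermediate_graph_degree=64, graph_degree=32,
                      build_algo="IVF_PQ", cache_dataset_on_device="false"):
    """Assemble a GPU_CAGRA index_params dict, enforcing the documented
    rule that graph_degree must be smaller than intermediate_graph_degree."""
    if graph_degree >= intermediate_graph_degree:
        raise ValueError(
            "graph_degree must be smaller than intermediate_graph_degree")
    if build_algo not in ("IVF_PQ", "NN_DESCENT"):
        raise ValueError("build_algo must be 'IVF_PQ' or 'NN_DESCENT'")
    return {
        "metric_type": "L2",
        "index_type": "GPU_CAGRA",
        "params": {
            "intermediate_graph_degree": intermediate_graph_degree,
            "graph_degree": graph_degree,
            "build_algo": build_algo,
            "cache_dataset_on_device": cache_dataset_on_device,
        },
    }
```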

Build index

After configuring the index parameters in index_params, call the create_index() method to build the index.

from pymilvus import Collection

# Get an existing collection
collection = Collection("YOUR_COLLECTION_NAME")

collection.create_index(
    field_name="vector", # Name of the vector field on which an index is built
    index_params=index_params
)

Once you have built your GPU index, the next step is to prepare the search parameters before conducting a search.

Prepare search parameters

Below are example configurations for different index types:


  • GPU_BRUTE_FORCE index

    search_params = {
        "metric_type": "L2",
        "params": {}
    }

    No additional params configurations are required.

  • GPU_CAGRA index

    search_params = {
        "metric_type": "L2",
        "params": {
            "itopk_size": 128,
            "search_width": 4,
            "min_iterations": 0,
            "max_iterations": 0,
            "team_size": 0
        }
    }

    Key search parameters include:

    • itopk_size: Determines the size of intermediate results kept during the search. A larger value may improve recall at the expense of search performance. It should be at least equal to the final top-k (limit) value and is typically a power of 2 (e.g., 16, 32, 64, 128).

    • search_width: Specifies the number of entry points into the CAGRA graph during the search. Increasing this value can enhance recall but may impact search performance.

    • min_iterations / max_iterations: These parameters control the search iteration process. By default, they are set to 0, and CAGRA automatically determines the number of iterations based on itopk_size and search_width. Adjusting these values manually can help balance performance and accuracy.

    • team_size: Specifies the number of CUDA threads used for calculating metric distance on the GPU. Common values are a power of 2 up to 32 (e.g. 2, 4, 8, 16, 32). It has a minor impact on search performance. The default value is 0, where Milvus automatically selects the team_size based on the vector dimension.

  • GPU_IVF_FLAT or GPU_IVF_PQ index

    search_params = {
        "metric_type": "L2",
        "params": {"nprobe": 10}
    }

    Search parameters for these two index types are similar to those used in IVF_FLAT and IVF_PQ. For more information, refer to Conduct a Vector Similarity Search.
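Because itopk_size should be at least the final top-k (limit) and is typically a power of 2, it can be derived from limit mechanically. pick_itopk_size below is a hypothetical helper for illustration, not a Milvus API:

```python
def pick_itopk_size(limit):
    """Return the smallest power of two that is >= limit, a reasonable
    starting point for CAGRA's itopk_size search parameter."""
    size = 1
    while size < limit:
        size *= 2
    return size

# For a top-100 search, start from itopk_size = 128 and raise it
# further if recall is too low.
search_params = {
    "metric_type": "L2",
    "params": {"itopk_size": pick_itopk_size(100)},
}
```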

Use the search() method to perform a vector similarity search on the GPU index.

# Load data into memory
collection.load()

results = collection.search(
    data=[[query_vector]], # Your query vector
    anns_field="vector", # Name of the vector field
    param=search_params, # Search parameters prepared above
    limit=100 # Number of results to return
)


Limits

When using GPU indexes, be aware of the following constraints:

  • For GPU_IVF_FLAT, the maximum value for limit is 256.

  • For GPU_IVF_PQ and GPU_CAGRA, the maximum value for limit is 1024.

  • GPU_BRUTE_FORCE imposes no hard cap on limit, but keeping it at or below 4096 is recommended to avoid potential performance issues.

  • Currently, GPU indexes do not support COSINE distance. If COSINE distance is required, data should be normalized first, and then inner product (IP) distance can be used as a substitute.

  • OOM protection during loading is not fully supported for GPU indexes; loading too much data may cause the QueryNode to crash.

  • GPU indexes do not support search functions like range search and grouping search.
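To work around the missing COSINE support, L2-normalize vectors before insertion and querying and build the index with metric_type set to IP; for unit-length vectors, inner product equals cosine similarity. A minimal pure-Python sketch (normalize is a hypothetical helper):

```python
import math

def normalize(vectors):
    """L2-normalize each vector so that inner product (IP) ranks results
    identically to cosine similarity on the original vectors."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / norm for x in vec])
    return normalized
```

Apply normalize to both the inserted vectors and the query vector, then search with "metric_type": "IP" in the index and search parameters.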


FAQ

  • When is it appropriate to utilize a GPU index?

    A GPU index is particularly beneficial in situations that demand high throughput or high recall. For instance, when dealing with large batches, the throughput of GPU indexing can surpass that of CPU indexing by as much as 100 times. In scenarios with smaller batches, GPU indexes still significantly outshine CPU indexes in terms of performance. Furthermore, if there's a requirement for rapid data insertion, incorporating a GPU can substantially speed up the process of building indexes.

  • In which scenarios are GPU indexes like CAGRA, GPU_IVF_PQ, GPU_IVF_FLAT, and GPU_BRUTE_FORCE most suitable?

    CAGRA indexes are ideal for scenarios that demand enhanced performance, albeit at the cost of consuming more memory. For environments where memory conservation is a priority, the GPU_IVF_PQ index can help minimize storage requirements, though this comes with a higher loss in precision. The GPU_IVF_FLAT index serves as a balanced option, offering a compromise between performance and memory usage. Lastly, the GPU_BRUTE_FORCE index is designed for exhaustive search operations, guaranteeing a recall rate of 1 by performing traversal searches.

