In practice, what steps are involved in constructing an index (like training quantizers or building graph connections), and how do these steps scale with the size of the dataset?

Constructing an index for large-scale datasets typically involves three core steps: preprocessing the data, training quantization models (if applicable), and building the index structure (e.g., graph-based or tree-based). First, preprocessing includes normalizing the data or reducing its dimensionality to ensure compatibility with the indexing method. For example, in vector similarity search, vectors are often normalized to unit length so that cosine similarity can be computed as a simple inner product. Next, quantization, such as training a Product Quantizer (PQ), compresses high-dimensional vectors into compact codes. This involves splitting vectors into subvectors, clustering each subspace with k-means, and mapping each original subvector to its nearest cluster centroid. Finally, building the index structure, such as a Hierarchical Navigable Small World (HNSW) graph, involves iteratively connecting data points to neighbors across hierarchical layers to balance search speed and accuracy.
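
To make the quantization step concrete, here is a minimal sketch of PQ training using NumPy and scikit-learn's KMeans. The function name `train_pq` and parameters such as `n_subvectors` and `n_centroids` are illustrative choices for this sketch, not part of any particular library's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, n_subvectors=8, n_centroids=256):
    """Split each vector into subvectors, cluster each subspace with k-means,
    and return the per-subspace codebooks plus the compact codes."""
    n, d = vectors.shape
    sub_dim = d // n_subvectors
    codebooks = []
    codes = np.empty((n, n_subvectors), dtype=np.uint8)  # 256 centroids fit in one byte
    for i in range(n_subvectors):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=n_centroids, n_init=1).fit(sub)
        codebooks.append(km.cluster_centers_)   # shape: (n_centroids, sub_dim)
        codes[:, i] = km.predict(sub)           # nearest centroid id for each subvector
    return codebooks, codes

# Example: compress 10,000 128-dim float vectors into 8-byte codes.
data = np.random.rand(10_000, 128).astype(np.float32)
codebooks, codes = train_pq(data)
print(codes.shape)  # (10000, 8) -- 8 bytes per vector instead of 512
```

The dominant cost is the k-means step, which is repeated once per subspace and whose runtime grows with both the number of training vectors and the number of centroids.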

Scaling these steps depends on the indexing method and dataset size. Training quantizers like PQ requires clustering subvectors, which scales roughly linearly with dataset size (O(n) per k-means pass) but becomes computationally intensive for very large n because of the repeated distance calculations across centroids and iterations. For example, training PQ on 1 million vectors might take minutes, but scaling to 1 billion could require distributed computing or training on a representative sample. Graph-based index construction, such as HNSW, involves inserting nodes and creating connections, and typically scales near-linearly (O(n log n)) because each insertion performs a logarithmic-cost greedy search through the hierarchical layers. However, memory usage grows with the number of connections per node, making it challenging for datasets exceeding available RAM. Hybrid approaches, like combining IVF (Inverted File Index) with PQ, partition the data into coarse clusters first, reducing the scope of both quantization and graph-building steps and helping manage scalability.
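
As a rough illustration of the graph-building step and its knobs, the sketch below builds an HNSW index with hnswlib; the specific values for M and ef_construction are arbitrary examples of the build-time/memory trade-off, not recommendations.

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space='cosine', dim=dim)
# M: max connections per node (memory grows with M);
# ef_construction: candidate list size during insertion
# (larger = slower build, better graph quality).
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))  # inserts nodes layer by layer

index.set_ef(64)  # search-time candidate list size (recall vs. latency)
labels, distances = index.knn_query(data[:5], k=10)
```

Because every inserted node stores up to M links per layer, total memory grows with both n and M, which is why purely in-memory HNSW becomes difficult once the dataset outgrows RAM.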

Practical implementations often involve trade-offs. For instance, FAISS (a library for efficient similarity search) uses IVF-PQ to split data into clusters and compress vectors, reducing memory and search time. Building the IVF structure scales with the number of clusters (e.g., 10,000 clusters for 1B vectors), while PQ training remains manageable per subvector. HNSW, used in libraries like hnswlib, prioritizes search speed over build time, making it suitable for datasets that fit in memory but less ideal for disk-based systems. Developers must balance factors like build time, memory, and query accuracy: quantization sacrifices some accuracy for speed and compression, while graph methods prioritize accuracy at higher memory costs. Tools like Annoy (tree-based) or ScaNN (learned quantization) offer alternative scaling strategies, but the core principles of partitioning, compressing, and structuring data remain consistent across methods.
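
The FAISS IVF-PQ pipeline described above can be sketched as follows; the parameter values (nlist, m, nbits, nprobe) are illustrative and would be tuned to the dataset and hardware in practice.

```python
import numpy as np
import faiss

d, n = 128, 100_000
xb = np.random.rand(n, d).astype(np.float32)

nlist, m, nbits = 1024, 16, 8           # coarse clusters, PQ subvectors, bits per code
quantizer = faiss.IndexFlatL2(d)        # coarse quantizer used for IVF assignment
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)   # fits the IVF centroids and the PQ codebooks
index.add(xb)     # assigns each vector to a cluster and stores its compressed code

index.nprobe = 16                       # clusters scanned per query (speed/recall trade-off)
D, I = index.search(xb[:5], k=10)
```

In practice, index.train() is often run on a representative sample while index.add() streams in the full dataset, which is what keeps the training step manageable at billion scale.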
