

How does product quantization (PQ) reduce the memory footprint of a vector index, and what impact does this compression have on search recall and precision?

Product quantization (PQ) reduces memory footprint by compressing high-dimensional vectors into compact codes. Instead of storing full-precision floating-point values (e.g., 32-bit floats), PQ divides each vector into smaller subvectors and quantizes each one independently against a separate codebook. For example, a 128-dimensional vector might be split into 8 subvectors of 16 dimensions each. Each subvector is then replaced by the index of its closest centroid (a representative vector) from a pre-trained codebook. If each codebook has 256 centroids (8-bit indices), the original 128-dimensional vector (512 bytes in float32) shrinks to 8 bytes (8 subvectors × 1 byte per index), a 64x reduction (~98%). This makes large-scale vector indices feasible on limited hardware.
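To make the arithmetic concrete, here is a minimal sketch in Python. It uses NumPy and scikit-learn's KMeans purely for illustration (these are assumptions, not any particular vector database's API) to train per-subvector codebooks and encode a 128-dimensional vector into 8 bytes:

```python
import numpy as np
from sklearn.cluster import KMeans

D, M, K = 128, 8, 256      # vector dims, subvectors, centroids per codebook
D_SUB = D // M             # 16 dims per subvector

rng = np.random.default_rng(0)
train = rng.standard_normal((10_000, D)).astype(np.float32)

# Train one codebook (K centroids) per subvector slice.
codebooks = []
for m in range(M):
    km = KMeans(n_clusters=K, n_init=1, random_state=0)
    km.fit(train[:, m * D_SUB:(m + 1) * D_SUB])
    codebooks.append(km.cluster_centers_.astype(np.float32))

def encode(x):
    """Replace each subvector with the index of its nearest centroid."""
    code = np.empty(M, dtype=np.uint8)   # 8 bits per subvector -> 8 bytes total
    for m in range(M):
        sub = x[m * D_SUB:(m + 1) * D_SUB]
        code[m] = np.argmin(np.linalg.norm(codebooks[m] - sub, axis=1))
    return code

x = train[0]
print(f"{x.nbytes} bytes -> {encode(x).nbytes} bytes")   # 512 bytes -> 8 bytes
```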

The compression impacts search quality by introducing approximation error. Because PQ replaces original vectors with quantized approximations, distance calculations (e.g., Euclidean distance or cosine similarity) become approximate. When comparing a query vector to compressed vectors, distances are computed from precomputed lookup tables between the query's subvectors and the codebook centroids, a scheme known as asymmetric distance computation (ADC) because the query itself stays uncompressed. This approximation can reduce recall (the ability to find all relevant items) and precision (the accuracy of the top results), since the compressed representations do not perfectly preserve the original vector relationships. In practice the impact is often manageable: quantization error is spread across independent subvectors, and tuning the number of subvectors or the codebook size can mitigate the loss. For example, 16 subvectors with 256 centroids each will typically yield better accuracy than 8 subvectors with 512 centroids, but at a higher memory cost (16 bytes versus 8 × 9 bits = 9 bytes per vector).
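The sketch below illustrates the lookup-table idea. It reuses codebooks, M, K, and D_SUB from the encoding sketch above, and again is only an illustration of ADC, not a production implementation: the query's distance table is built once, after which scoring each compressed vector costs just M table lookups.

```python
def build_lut(query):
    """Precompute squared distances from each query subvector to all K centroids."""
    lut = np.empty((M, K), dtype=np.float32)
    for m in range(M):
        diffs = codebooks[m] - query[m * D_SUB:(m + 1) * D_SUB]   # (K, D_SUB)
        lut[m] = (diffs * diffs).sum(axis=1)
    return lut

def adc_search(query, codes, k=5):
    """Approximate squared L2 distances via M table lookups per stored code."""
    lut = build_lut(query)
    dists = lut[np.arange(M), codes].sum(axis=1)   # codes: (N, M) uint8
    return np.argsort(dists)[:k]

codes = np.stack([encode(v) for v in train[:1000]])   # compress 1,000 vectors
print(adc_search(train[0], codes))   # vector 0 will typically rank first
```

Note that the returned distances are approximations of the true distances, which is exactly where the recall and precision loss described above comes from.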

In practice, PQ is typically paired with a coarse partitioning method such as an inverted file index (IVF) to balance speed, memory, and accuracy: the system first narrows the search to a few IVF lists, then ranks the candidates in those lists using their PQ codes. Developers can then tune PQ parameters to their recall-precision requirements. A smaller codebook (e.g., 64 centroids per subvector) saves more memory but risks higher approximation error, while a larger codebook (e.g., 1,024 centroids per subvector) improves accuracy at a higher memory cost. Testing on real datasets is critical: a 5% recall drop might be acceptable in exchange for a 10x memory reduction in a recommendation system, whereas a medical image search might prioritize precision and require finer quantization.
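A minimal IVF+PQ example using the FAISS library shows how the coarse index and PQ compression fit together (assuming the faiss-cpu package is installed; the dataset and parameter values here are arbitrary):

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d, nlist, m_sub, nbits = 128, 100, 8, 8   # 8 subvectors x 8 bits = 8-byte codes
rng = np.random.default_rng(0)
xb = rng.standard_normal((100_000, d)).astype(np.float32)
xq = (xb[:5] + 0.01 * rng.standard_normal((5, d))).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer for IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m_sub, nbits)
index.train(xb)        # learns the IVF partition and the PQ codebooks
index.add(xb)

index.nprobe = 10      # IVF lists scanned per query: higher -> better recall, slower
dists, ids = index.search(xq, 5)
print(ids[:, 0])       # should mostly recover the perturbed source vectors 0..4
```

Raising nprobe, m_sub, or nbits trades memory and latency for recall, which is exactly the tuning knob described above.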
