Vector quantization, such as Product Quantization (PQ), reduces storage requirements by compressing high-dimensional vectors into compact codes. Instead of storing full-precision floating-point values (e.g., 32-bit floats), PQ divides a vector into subvectors, assigns each to a representative centroid (from a pre-trained codebook), and stores only the indices of these centroids. For example, a 128-dimensional vector split into 8 subvectors (16 dimensions each) with a codebook of 256 centroids per subvector requires 8 bits per subvector index (since 2^8 = 256). This reduces storage from 128 * 32 bits = 4096 bits for a float32 vector to 8 * 8 bits = 64 bits, a 64x reduction. The compressed codes are stored in the index, drastically cutting memory usage while retaining approximate similarity relationships.
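To make the mechanism concrete, here is a minimal sketch that trains per-subvector codebooks with scikit-learn's KMeans and encodes vectors into 8-bit centroid indices. The 128/8/256 parameters mirror the example above; the random data and the helper name `pq_encode` are placeholders for illustration, not part of any particular library.

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 128, 8, 256          # vector dim, number of subvectors, centroids per codebook
sub_d = d // m                 # 16 dimensions per subvector

rng = np.random.default_rng(0)
train = rng.standard_normal((5_000, d)).astype(np.float32)  # placeholder training vectors

# Train one codebook per subvector (k centroids each).
codebooks = [
    KMeans(n_clusters=k, n_init=1, random_state=0)
    .fit(train[:, i * sub_d:(i + 1) * sub_d])
    for i in range(m)
]

def pq_encode(vecs):
    """Replace each 16-dim subvector with the index of its nearest centroid."""
    codes = np.empty((len(vecs), m), dtype=np.uint8)   # 8 bits per subvector index
    for i, km in enumerate(codebooks):
        codes[:, i] = km.predict(vecs[:, i * sub_d:(i + 1) * sub_d])
    return codes

codes = pq_encode(train[:100])
print(codes.shape, codes.dtype)  # (100, 8) uint8 -> 64 bits per vector instead of 4096
```

Each row of `codes` is the 64-bit compressed representation that would be stored in the index in place of the original 4096-bit vector.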
The impact on search accuracy depends on the balance between compression and approximation error. Quantization introduces error because each original vector is replaced with its nearest-centroid approximation. Coarser quantization (e.g., fewer centroids per codebook, or fewer, larger subvectors) increases compression but reduces accuracy, since the approximated vectors lose fine-grained detail. For instance, if a subvector codebook has too few centroids, dissimilar subvectors may be mapped to the same centroid, producing spurious matches during search. PQ mitigates this by preserving local structure within each subvector. In practice, modern systems often achieve near-original accuracy with careful tuning (e.g., larger per-subvector codebooks) and hybrid approaches such as multi-stage search, where quantized codes filter candidates and full-precision vectors refine the results.
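A minimal sketch of that two-stage idea, assuming FAISS (faiss-cpu) is installed and using random placeholder data: PQ codes produce a coarse candidate list, and the original full-precision vectors re-rank it.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 128
rng = np.random.default_rng(0)
xb = rng.standard_normal((100_000, d)).astype(np.float32)   # database vectors
xq = rng.standard_normal((10, d)).astype(np.float32)        # query vectors

# Stage 1: coarse search over 64-bit PQ codes (8 subvectors x 8 bits each).
pq = faiss.IndexPQ(d, 8, 8)
pq.train(xb)
pq.add(xb)
_, candidates = pq.search(xq, 100)        # 100 approximate candidates per query

# Stage 2: re-rank candidates with exact distances on full-precision vectors.
for q, cand in zip(xq, candidates):
    exact = np.linalg.norm(xb[cand] - q, axis=1)
    top10 = cand[np.argsort(exact)[:10]]
    print(top10)
```

The candidate list size (100 here) is a tunable knob: larger lists recover more of the recall lost to quantization at the cost of more exact distance computations.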
Developers must weigh storage savings against accuracy trade-offs. For example, in a billion-scale dataset, PQ can reduce index size from terabytes to gigabytes, enabling in-memory search on affordable hardware. However, retrieval quality might drop by 5-10% compared to exact search, depending on the dataset and parameters. Techniques like residual quantization or joint training with neural networks can further improve accuracy. Additionally, hybrid indexes (e.g., combining PQ with graph-based methods like HNSW) leverage quantization for scalability while maintaining high recall. Ultimately, quantization is a practical tool for balancing efficiency and accuracy, especially when paired with re-ranking steps or error-aware search algorithms.
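For a rough sense of scale, the back-of-the-envelope calculation below uses hypothetical figures (one billion 768-dimensional float32 embeddings compressed to 64-byte PQ codes); actual sizes depend on dimensionality and code length.

```python
# Back-of-the-envelope index size for 1 billion 768-dim embeddings (hypothetical figures).
n, d = 1_000_000_000, 768
raw_bytes = n * d * 4          # float32 vectors: ~3.1 TB
pq_bytes = n * 64              # 64-byte PQ codes (e.g., 64 subvectors x 8 bits): ~64 GB
print(f"raw: {raw_bytes / 1e12:.2f} TB, PQ: {pq_bytes / 1e9:.0f} GB")
```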
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.