Chengming Li, R & D Engineer of Zilliz, graduated from SouthEast University with a Master degree in Computer Science. His current focus is on ANNS problems on high-dimensional data, including graph-based and quantization-based solutions.
“DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node” is a paper published on NeurIPS in 2019. The paper introduces a state-of-the-art method to perform index building and search on the billion-scale dataset using a single machine with only 64GB of RAM and a large enough SSD. Moreover, it satisfies the three requirements of ANNS (Approximate Nearest Neighbor Search) on the large-scale dataset: high recall, low latency, and high density (number of nodes in a single machine). This method builds a graph-based index on a billion-scale dataset SIFT-1B using a single machine with 64GB of RAM and a 16-core CPU, reaching 5000 QPS (queries per second) at over 95 % recall@1, and the average latency lower than 3ms.
Suhas Jayaram Subramanya: Former employee of Microsoft India Research Institute, doctoral student of CMU. The main research interests are high-performance computing and machine learning algorithms for large-scale data.
Devvrit: Graduate Research Assistant at The University of Texas at Austin. His research interests are theoretical computer science, machine learning, and deep learning.
Rohan Kadekodi: A doctoral student at the University of Texas. His research direction is system and storage, mainly including persistent storage, file system, and kV storage.
Ravishankar Krishaswamy: Microsoft Indian research institute principal researcher. Doctor of CMU. The research direction is the approximation algorithm based on graph and clustering.
Harsha Vardhan Simhadri: Microsoft Indian research institute principal researcher. Doctor of CMU. In the past, he studied parallel algorithms and runtime systems. Now his main work is to develop new algorithms and write programming models.
Most of the mainstream ANNS algorithms make some trade-offs among index building performance, search performance, and recall. Graph-based algorithms such as HNSW and NSG are state-of-art methods in terms of search performance and recall at present. Since the memory-resident graph-based indexing method occupies too much memory, it is relatively difficult to index and search a large-scale dataset using a single machine with limited memory resources.
Many applications require quick responses of Euclidean distance-based ANNS on the billion-scale dataset. Below are two major solutions:
Both solutions mentioned above are limited by the memory restriction of a single machine. This paper proposes the design of an SSD-resident indexing mechanism to solve this problem. The challenge of SSD-resident indexing is to reduce the number of random disk access and the number of requests for disk access.
This paper presents an SSD-resident ANNS scheme called DiskANN, which can effectively support search on large-scale datasets. This scheme is based on a graph-based algorithm presented in this paper: Vamana. Contributions of this paper include:
This algorithm is similar to the idea of NSG (for those who don't understand NSG, please refer to Reference , and if you do not want to read papers, you can refer to Reference ). Their main difference lies in the trimming strategy. To be precise, a switch alpha has been added to the NSG's trimming strategy. The main idea of the NSG trimming strategy is that the choice of neighbors of the target point is as diverse as possible. If the new neighbor is closer to a neighbor of the target point than the target point, we do not need to add this point to the neighbor point set. In other words, for each neighbor of the target point, there can be no other neighbor points within the surrounding radius dist (target point, neighbor point). This trimming strategy effectively controls the out-degree of graph, and is relatively radical. It reduces the memory footprint of the index, improves the search speed, but also reduces the search accuracy. Vamana's trimming strategy is to freely control the scale of trimming through the parameter alpha. The working principle is to multiply the dist (a neighbor point, candidate point) in the trimming condition with a parameter alpha (not less than 1). Only when the dist (target point, a certain candidate point) is greater than the enlarged reference distance is the trimming strategy adopted, increasing the tolerance of mutual exclusion between neighbors of the target point.
Vamana's indexing process is relatively simple:
This paper compares the three graph indexes, i.e. Vamana, NSG, and HNSW. In terms of indexing and query performance, Vamana and NSG are relatively close, and both outmatch HNSW slightly. Refer to the Experiment section below for the data.
To visualize the building process of Vamana index, the paper provides a graph, in which 200 two-dimensional points are used to simulate two rounds of iteration. The first row uses alpha = 1 to trim the edges. It can be seen that the trimming strategy is relatively radical, and a large number of edges are trimmed. After increasing the value alpha and loosening the trimming conditions, a lot of edges are obviously added back. In the final graph, quite some long edges are added. It can effectively reduce the search radius.
A personal computer with only 64GB of memory would not even hold a billion pieces of raw data, let alone the index built on them. There are two challenges ahead: 1. How to index such a large-scale data set with limited memory resources? 2. How to calculate the distance when searching if the original data cannot be loaded in memory?
The paper proposed the following solutions:
The index layout of DiskANN is similar to those of the general graph indexes. The neighbor set of each point and the original vector data are stored together. This makes better use of the locality of the data.
As mentioned earlier, if the index data is stored on the SSD, the number of disk accesses and the disk read and write requests must be reduced as much as possible to ensure low search delay. Therefore DiskANN proposes two optimization strategies:
The experiment consists of three groups:
Data sets: SIFT1M (128 dimensions), GIST1M (960 dimensions), DEEP1M (96 dimensions) and a 1M data set randomly sampled from DEEP1B.
Index parameters (all data sets use the same set of parameters):
HNSW：M = 128, efc = 512.
Vamana: R = 70, L = 75, alpha = 1.2.
NSG: R = 60, L = 70, C= 500.
The search parameters are not provided in the paper, which may be consistent with the indexing parameters. For the parameter selection, the parameters of NSG mentioned in the article are based on the parameters listed in the GitHub repository of NSG to select the group with better performance. Vamana and NSG are relatively close, so the parameters are also set close. However, the reason of HNSW parameters selection is not given. We believe that the parameter M of HNSW is set relatively large. It might lead to a less convincing comparison between graph-based indexes if their out-degrees are not set at the same level.
Under the above indexing parameters, the indexing time of Vamana, HNSW, and NSG are 129s, 219s, and 480s respectively. The NSG indexing time includes the time to construct the initial neighbor graph with EFANN .
It can be seen from Figure 3 that Vamana has an excellent performance on the three data sets, similar to NSG and slightly better than HNSW.
Comparison of search radius:
From Figure 2.c, we can see that Vamana has the shortest average search path under the same recall rate compared to those of NSG and HNSW.
Data set: SIFT1B
The one-time built index parameters: L = 50, R = 128, alpha = 1.2. After running for 2 days on a 1800G DDR3 machine, the peak memory is about 1100 G, and the average out-degree is 113.9.
Indexing procedure based on the merging:
This index generated a 384GB index with an average out-of-degree of 92.1. This index ran for 5 days on a 64GB DDR4 machine.
The comparison results are as follows (Figure 2a):
IVFOADC+G+P is an algorithm proposed in Reference .
This paper only compares DiskANN with IVFOADC+G+P, since the reference  has proved that IVFOADC+G+P is better than FAISS. In addition, FAISS requires GPU resources, which are not supported by all platforms.
IVF-OADC+G+P seems to be a combination of HNSW and IVF-PQ. It determines clusters using HNSW, and performs search by adding some pruning strategies to the target cluster.
The result is in Figure 2a. The 16 and 32 in the figure are the codebook size. The dataset is SIFT1B, quantified by OPQ.
The source code of DiskANN is open-sourced on https://github.com/microsoft/DiskANN
In January 2021, the source code of the disk solution was open-sourced.
The following mainly introduces the indexing process and the search process.
There are 8 parameters for building index:
data_type: options include float/int8/uint8.
data_file.bin: The original data binary file. The first two integers in the file respectively represent the total number n of the dataset vector and the vector dimension dim. The last n dim sizeof(data_type) bytes are continuous vector data.
index_prefix_path: The path prefix of the output file. After the index is built, several index-related files will be generated. This parameter is the common prefix of the directory where they are stored.
R: The maximum out-degree of the global index.
L: The parameter L of Vamana index, the upper bound of the candidate set size.
B: The memory threshold when querying. It controls the PQ codebook size, in GB.
M: The memory threshold when building an index. It determines the size of the fragment, in GB.
T: The number of threads.
Indexing process (entry function: aux_utils.cpp::build_disk_index):
create_disk_layout: The global index generated in step 6 only has only a compact adjacency table. This step is to align the index. The adjacency table and the original data are stored together. When searching, load the adjacency table and read the original vector together for accurate distance calculation. There is also the concept of SECTOR, with the default size is 4096. Each SECTOR only contains 4096 / node_size pieces of vector information. node_size = single vector size + single node adjacency table size.
Finally, do a global uniform sampling of 150000 / n, save it, and use it for warmup when searching.
There are 10 search parameters:
Although this is a relatively lengthy piece of work, it is overall excellent. The paper and code ideas are clear: divide a number of overlapping buckets through k-means, and then divide the buckets to build a map index, and finally merge the indexes, which is a relatively new idea. As for the memory-based graph index Vamana, it is essentially a randomly initialized version of NSG that can control the trimming granularity. When querying, it makes full use of cache + pipeline, covers up part of the io time, and improves QPS. However, according to the paper, even if the machine condition is not extraordinary, the training time takes up to 5 days, and the usability is relatively low. Optimizations to the training are definitely necessary in the future. From the code perspective, the quality is relatively high and can be directly used in the production environment.