milvus-logo

February 14, 2022by Yudong Cai

How Bitset Enables the Versatility of Vector Similarity Search

  • engineering

Bitset Cover Image

By Yudong Cai and Angela Ni.

Various new essential features of a vector database are provided together with the release of Milvus 2.0. Among the new features, Time Travel, attribute filtering, and delete operations are correlated as these three features are achieved by one common mechanism - bitset.

Therefore, this article aims to clarify the concept of bitset in Milvus and explain how it works to support delete operations, Time Travel, and attribute filtering with three examples.

What is bitset?

A bitset is an array of bit numbers ("0" and "1") that can be used to represent certain data information. With bitsets, you can store certain types of data compactly and efficiently as opposed to store them in Ints, floats, or chars. Bitsets work on boolean logic, according to which the value of an output is either valid or invalid, usually denoted by "1" and "0" respectively. "1" stands for valid, and "0" for invalid. Since bitsets are highly efficient and can save storage, they can also be used to achieve many features such as attribute filtering, delete operations, Time Travel, and more.

Starting from version 0.7.0, the concept of bitset has been introduced in Milvus to enable the delete function. More specifically, bitset is used to mark if each row in the segment is deleted. Deleted entities are marked with "1" in the corresponding bitset, and as a result, the deleted entities will not be computed during a search or query.

In the Milvus 2.0 version, the application of bitset is extended to enable more features, like attribute filtering and Time Travel. The general principle in a bitset remains the same. That is, if an entity is marked with "1" in the corresponding bitset, the entity will be ignored during a search or query. Bitsets are used to enable 3 features in Milvus:

  • Attribute filtering
  • Data deletion
  • Query with Time Travel

How does bitset work in Milvus?

The examples below are used to illustrate how bitset works in Milvus.

Prerequisites

Suppose there is a segment with eight entities and a series of data manipulation language (DML) events happens in the order shown in the figure below.

  • Four of the entities, whose primary_keys are [1, 2, 3, 4] respectively, are inserted when the timestamp ts equals 100.
  • The remaining four entities, whose primary_keys are [5, 6, 7, 8], are inserted when the timestamp ts equals 200.
  • Entities whose primary_keys are [7, 8] are deleted when the timestamp ts equals 300.
  • Only entities, whose primary_keys are [1, 3, 5, 7], satisfy the conditions of attribute filtering.

DML events

Case one

Suppose the value a user sets for time_travel is 150. In other words, the user conducts a query on the data stored in Milvus when ts = 150. The bitset generation process is illustrated by Figure 1.

During the initial filtering stage, the result of the filter_bitset should be [1, 0, 1, 0, 1, 0, 1, 0] as entities [1, 3, 5, 7] are valid filtering results and marked as "1" in the bitset. However, entities [4, 5, 6, 7] were not even inserted to the vector database when ts equals 150. Therefore, these four entities should be marked as "0" regardless of the filtering condition. Now the bitset result should be [1, 0, 1, 0, 0, 0, 0, 0]. Since in Milvus, the general principle of bitset computing is that entities marked with "1" in the bitset are ignored during a search or query, the bitset result after Time Travel and attribute filtering needs to be flipped in order to be combined with the deletion bitmap. The flipped result of filter_bitset should be [0, 1, 0, 1, 1, 1, 1, 1].

As for the deletion bitset del_bitset, the initial value should be [0, 0, 0, 0, 0, 0, 1, 1]. However, entities 7 and 8 are not deleted until ts is 300. Therefore, when ts is 150, entities 7 and 8 are still valid. As a result, the del_bitset value after Time Travel should be [0, 0, 0, 0, 0, 0, 0, 0].

Now we have two bitsets after Time Travel and attribute filtering: filter_bitset [0, 1, 0, 1, 1, 1, 1, 1] and del_bitset [0, 0, 0, 0, 0, 0, 0, 0]. Combine these two bitsets with the "OR" binary logic operator. The ultimate value of result_bitset is [0, 1, 0, 1, 1, 1, 1, 1]. That is to say, only entities 1 and 3 will be computed in the following search or query stage.

Figure 1

Case two

Suppose the value the user sets for time_travel is 250. In other words, the user conducts a query on the data stored in Milvus when ts = 250. The bitset generation process is illustrated by Figure 2.

Like in case one, the resultant filter_bitset of the initial attribute filtering stage should be [1, 0, 1, 0, 1, 0, 1, 0].

All entities [1, 2, 3, 4, 5, 6, 7, 8] are inserted to the vector database when ts= 250. Therefore, the previous result of filter_bitset remains the same. Again, we need to flip the result of the filter_bitset, and we will get [0, 1, 0, 1, 0, 1, 0, 1].

As for the deletion bitset del_bitset, the initial value should be [0, 0, 0, 0, 0, 0, 1, 1]. However, entities 7 and 8 were not deleted until ts is 300. Therefore, when ts is 250, entities 7 and 8 are still valid. As a result, the del_bitset value after Time Travel should be [0, 0, 0, 0, 0, 0, 0, 0].

Now we have two bitsets after Time Travel and attribute filtering: filter_bitset [0, 1, 0, 1, 0, 1, 0, 1] and del_bitset [0, 0, 0, 0, 0, 0, 0, 0]. Combine these two bitsets with the "OR" binary logic operator. The ultimate value of result_bitset is [0, 1, 0, 1, 0, 1, 0, 1]. That is to say, only entities [1, 3, 5, 7] will be computed in the following search or query stage.

Figure 2

Case three

Suppose the value the user sets for time_travel is 350. In other words, the user conducts a query on the data stored in Milvus when ts = 350. The bitset generation process is illustrated by Figure 3.

Same as case one and two, the resultant filter_bitset of the initial attribute filtering stage is [0, 1, 0, 1, 0, 1, 0, 1].

All entities [1, 2, 3, 4, 5, 6, 7, 8] are inserted to the vector database when ts= 350. Therefore, the final flipped result of the filter_bitset is [0, 1, 0, 1, 0, 1, 0, 1], the same as in case two.

As for the deletion bitset del_bitset, since entities 7 and 8 are already deleted when ts=350, therefore, the result of del_bitset should be [0, 0, 0, 0, 0, 0, 1, 1].

Now we have two bitsets after Time Travel and attribute filtering: filter_bitset [0, 1, 0, 1, 0, 1, 0, 1] and del_bitset [0, 0, 0, 0, 0, 0, 1, 1]. Combine these two bitsets with the "OR" binary logic operator. The ultimate value of result_bitset is [0, 1, 0, 1, 0, 1, 1, 1]. That is to say, only entities [1, 3, 5] will be computed in the following search or query stage.

Figure 3

What's next?

In the 2.0 new feature series blog, we aim to explain the design of the new features. Read more in this blog series!

Keep Reading