Similarity Metrics

In Milvus, similarity metrics are used to measure similarities among vectors. Choosing an appropriate distance metric can significantly improve classification and clustering performance.

The following table shows how these widely used similarity metrics fit with various input data forms and Milvus indexes.

Floating point embeddings

Similarity Metrics	Index Types
Euclidean distance (L2), Inner product (IP)	FLAT, IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, ANNOY, DISKANN

Binary embeddings

Distance Metrics	Index Types
Jaccard, Tanimoto, Hamming	BIN_FLAT, BIN_IVF_FLAT
Superstructure, Substructure	BIN_FLAT

Euclidean distance (L2)

Essentially, Euclidean distance measures the length of a segment that connects 2 points.

The formula for Euclidean distance is as follows:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2}$$

where a = (a_1, a_2, …, a_n) and b = (b_1, b_2, …, b_n) are two points in n-dimensional Euclidean space.

It’s the most commonly used distance metric and is very useful when the data are continuous.

Milvus only calculates the value before applying the square root when Euclidean distance is chosen as the distance metric, so the distances it reports are squared Euclidean distances.
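The difference between the full Euclidean distance and the value before the square root can be sketched in plain Python. The function names here are illustrative and are not part of any Milvus API.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root applied)."""
    return sum((y - x) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Full Euclidean distance: the square root of the squared value."""
    return math.sqrt(squared_l2(a, b))

a = [0.0, 0.0]
b = [3.0, 4.0]
print(squared_l2(a, b))  # 25.0
print(euclidean(a, b))   # 5.0
```

Because the square root is monotonic, ranking neighbors by the squared value gives the same order as ranking them by the full distance.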

Inner product (IP)

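The IP distance between two embeddings is defined as follows:

$$p(A, B) = A \cdot B = \sum_{i=1}^{n} a_i b_i = \|A\| \|B\| \cos\theta$$

where A = (a_1, a_2, …, a_n) and B = (b_1, b_2, …, b_n) are embeddings, ||A|| and ||B|| are the norms of A and B, and θ is the angle between them.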

IP is more useful if you want to compare non-normalized data or when you care about magnitude and angle.

If you use IP to calculate embedding similarities, you must normalize your embeddings. After normalization, the inner product equals cosine similarity.

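Suppose X' is normalized from embedding X = (x_1, x_2, …, x_n):

$$X' = (x'_1, x'_2, \ldots, x'_n), \quad x'_i = \frac{x_i}{\|X\|}$$

The correlation between two embeddings X and Y, where Y' is normalized from Y in the same way, is as follows:

$$X' \cdot Y' = \frac{X \cdot Y}{\|X\| \|Y\|}$$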
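As a minimal sketch of the normalization step, the following plain-Python example scales two embeddings to unit norm and checks that their inner product matches the cosine similarity of the original embeddings. The helper names are illustrative, and zero vectors are not handled.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v so that its L2 norm equals 1 (v must not be the zero vector)."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

x = [3.0, 4.0]
y = [4.0, 3.0]
ip_normalized = inner_product(normalize(x), normalize(y))
print(math.isclose(ip_normalized, cosine_similarity(x, y)))  # True
```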

Jaccard distance

Jaccard similarity coefficient measures the similarity between two sample sets and is defined as the cardinality of the intersection of the defined sets divided by the cardinality of the union of them. It can only be applied to finite sample sets.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Jaccard distance measures the dissimilarity between data sets and is obtained by subtracting the Jaccard similarity coefficient from 1. For binary variables, the Jaccard similarity coefficient is equivalent to the Tanimoto coefficient.

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
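For binary embeddings, the intersection and union reduce to bitwise AND and OR. The sketch below stores each bit set as a Python integer; the function names and the convention for two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(a: int, b: int) -> float:
    """|A ∩ B| / |A ∪ B| for bit sets stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here for two empty sets
    return bin(a & b).count("1") / union

def jaccard_distance(a: int, b: int) -> float:
    """One minus the Jaccard similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)

a = 0b1101
b = 0b1001
print(jaccard_similarity(a, b))  # 0.6666666666666666
print(jaccard_distance(a, b))
```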

Tanimoto distance

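For binary variables, the Tanimoto coefficient is equivalent to the Jaccard similarity coefficient:

$$T(A, B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B} = \frac{\sum_{i} (a_i \wedge b_i)}{\sum_{i} (a_i \vee b_i)}$$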

In Milvus, the Tanimoto coefficient applies only to binary variables, for which it ranges from 0 to +1 (where +1 is the highest similarity).

For binary variables, the Tanimoto distance is defined from the Tanimoto coefficient T(A, B) as:

$$d_T(A, B) = -\log_2 T(A, B)$$

The value ranges from 0 to +infinity.
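The relationship between the coefficient and the distance can be sketched as follows. This example assumes the Tanimoto distance is the negative base-2 logarithm of the Tanimoto coefficient, which matches the stated range of 0 to +infinity; the function names are illustrative.

```python
import math

def tanimoto_coefficient(a: int, b: int) -> float:
    """Shared set bits divided by the total number of distinct set bits."""
    shared = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1") - shared
    return shared / total if total else 1.0  # convention for two empty sets

def tanimoto_distance(a: int, b: int) -> float:
    """Assumed form: -log2(T), written as log2(1 / T)."""
    t = tanimoto_coefficient(a, b)
    return math.inf if t == 0 else math.log2(1.0 / t)

print(tanimoto_coefficient(0b1100, 0b0100))  # 0.5
print(tanimoto_distance(0b1100, 0b0100))     # 1.0
print(tanimoto_distance(0b1111, 0b1111))     # 0.0
```

Identical bit sets give a distance of 0, and bit sets with no shared bits give an infinite distance.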

Hamming distance

Hamming distance measures the difference between binary data strings. The distance between two strings of equal length is the number of bit positions at which the bits are different.

For example, suppose there are two strings, 1101 1001 and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance is d(11011001, 10011101) = 2.
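The XOR-and-count procedure from the example above can be sketched in a few lines of Python. The function name is illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    diff = int(a, 2) ^ int(b, 2)  # 1 bits mark the differing positions
    return bin(diff).count("1")

print(hamming_distance("11011001", "10011101"))  # 2
```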

Superstructure

The Superstructure metric measures the similarity between a chemical structure and its superstructure. A value of 0 means that the chemical structure in the database is a superstructure of the target chemical structure.

Superstructure similarity can be measured by:

$$1 - \frac{N_{AB}}{N_A}$$

Where

B is the superstructure of A
N_A specifies the number of bits in the fingerprint of molecule A.
N_B specifies the number of bits in the fingerprint of molecule B.
N_AB specifies the number of shared bits in the fingerprints of molecules A and B.

Substructure

The Substructure metric measures the similarity between a chemical structure and its substructure. A value of 0 means that the chemical structure in the database is a substructure of the target chemical structure.

Substructure similarity can be measured by:

$$1 - \frac{N_{AB}}{N_B}$$

Where

B is the substructure of A
N_A specifies the number of bits in the fingerprint of molecule A.
N_B specifies the number of bits in the fingerprint of molecule B.
N_AB specifies the number of shared bits in the fingerprints of molecules A and B.
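The two measures can be illustrated on fingerprints stored as integers. This is a sketch under stated assumptions, not Milvus's implementation: it assumes N_A, N_B, and N_AB count set bits, and that the measures take the forms 1 - N_AB / N_A (superstructure) and 1 - N_AB / N_B (substructure), which are consistent with a value of 0 when B is the superstructure or substructure of A. The function names are hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def superstructure_value(a: int, b: int) -> float:
    """Assumed form 1 - N_AB / N_A: 0 when every set bit of A is also set in B."""
    return 1.0 - popcount(a & b) / popcount(a)  # A must have at least one set bit

def substructure_value(a: int, b: int) -> float:
    """Assumed form 1 - N_AB / N_B: 0 when every set bit of B is also set in A."""
    return 1.0 - popcount(a & b) / popcount(b)  # B must have at least one set bit

small = 0b0110
large = 0b1110
print(superstructure_value(small, large))  # 0.0
print(substructure_value(large, small))    # 0.0
print(superstructure_value(large, small))
```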

FAQ

Why is the top1 result of a vector search not the search vector itself, if the metric type is inner product?

This occurs if you have not normalized the vectors when using inner product as the distance metric.
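A small plain-Python sketch shows the effect. The vectors and names are made up for illustration: with raw vectors, a longer stored vector scores higher than an exact copy of the query, and after normalization the copy ranks first.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def top1(query, vectors):
    """Name of the stored vector with the largest inner product."""
    return max(vectors, key=lambda name: inner_product(query, vectors[name]))

query = [1.0, 0.0]
stored = {"same_as_query": [1.0, 0.0], "longer": [3.0, 1.0]}

print(top1(query, stored))  # longer

normalized = {name: normalize(v) for name, v in stored.items()}
print(top1(normalize(query), normalized))  # same_as_query
```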

What is normalization? Why is normalization needed?

Normalization refers to the process of converting an embedding (vector) so that its norm equals 1. If you use inner product to calculate embedding similarities, you must normalize your embeddings. After normalization, the inner product equals cosine similarity.

See Wikipedia for more information.

Why do I get different results using Euclidean distance (L2) and inner product (IP) as the distance metric?

Check whether the vectors are normalized. If they are not, normalize them first. In theory, L2 and IP produce different similarity rankings when the vectors are not normalized.
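The following sketch, using made-up vectors and illustrative helper names, shows the two metrics disagreeing on raw vectors and agreeing after normalization. For unit vectors the squared L2 distance equals 2 - 2 × IP, so a smaller distance always corresponds to a larger inner product.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((y - x) ** 2 for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

query, near, far = [1.0, 0.0], [1.0, 0.5], [4.0, 3.0]

# Raw vectors: L2 prefers `near`, but IP prefers `far`.
print(squared_l2(query, near) < squared_l2(query, far))        # True
print(inner_product(query, near) > inner_product(query, far))  # False

# Normalized vectors: both metrics prefer `near`.
q, n, f = normalize(query), normalize(near), normalize(far)
print(squared_l2(q, n) < squared_l2(q, f))        # True
print(inner_product(q, n) > inner_product(q, f))  # True
print(math.isclose(squared_l2(q, n), 2 - 2 * inner_product(q, n), abs_tol=1e-12))  # True
```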

What’s next

Learn more about the supported index types in Milvus.
