Embedding Overview
Embedding is a machine learning concept for mapping data into a high-dimensional space, where data with similar semantics are placed close together. Typically a deep neural network from the BERT or other Transformer families, an embedding model can effectively represent the semantics of text, images, and other data types as a series of numbers known as a vector. A key feature of these models is that the mathematical distance between vectors in the high-dimensional space indicates the semantic similarity of the original text or images. This property unlocks many information retrieval applications, such as web search engines like Google and Bing, product search and recommendations on e-commerce sites, and the recently popular Retrieval-Augmented Generation (RAG) paradigm in generative AI.
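For intuition, here is a toy sketch of that distance property. The vectors below are made up and only 4-dimensional, whereas real models output hundreds to thousands of dimensions:
import numpy as np
# Illustrative "embeddings" (made-up values, not from a real model).
cat = np.array([0.20, 0.90, 0.10, 0.40])
kitten = np.array([0.25, 0.85, 0.12, 0.38])
car = np.array([0.90, 0.10, 0.80, 0.05])
def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means similar direction/semantics.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # lower: semantically distant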
There are two main categories of embeddings, each producing a different type of vector:
Dense embedding: Most embedding models represent information as a floating-point vector of hundreds to thousands of dimensions. The outputs are called “dense” vectors because most dimensions have non-zero values. For instance, the popular open-source embedding model BAAI/bge-base-en-v1.5 outputs vectors of 768 floating-point numbers (a 768-dimensional float vector).
Sparse embedding: In contrast, the output vectors of sparse embeddings have mostly zero-valued dimensions, hence the name “sparse” vectors. These vectors often have much higher dimensionality (tens of thousands or more), determined by the size of the token vocabulary. Sparse vectors can be generated by deep neural networks or by statistical analysis of text corpora. Due to their interpretability and observed better out-of-domain generalization, sparse embeddings are increasingly adopted by developers as a complement to dense embeddings. The sketch below contrasts the two representations.
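A minimal illustration of the difference, with made-up values and indices:
# Dense: every dimension carries a value (8 dims here for brevity;
# real dense models output hundreds to thousands of dimensions).
dense = [0.12, -0.03, 0.58, 0.44, -0.21, 0.07, 0.91, -0.15]
# Sparse: almost all of the (say) 30,000 dimensions are zero, so only
# the non-zero entries are stored as {dimension_index: weight} pairs.
sparse = {14: 0.47, 1027: 0.12, 29880: 0.83}
print(len(dense), "dense values;", len(sparse), "non-zero sparse entries")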
Milvus is a vector database designed for vector data management, storage, and retrieval. Because it integrates mainstream embedding and reranking models, you can easily transform original text into searchable vectors or rerank search results with powerful models to achieve more accurate results for RAG. This integration simplifies text transformation and eliminates the need for additional embedding or reranking components, thereby streamlining RAG development and validation.
To see embedding creation in action, refer to Using PyMilvus’s Model To Generate Text Embeddings.
| Embedding Function | Type | API or Open-sourced |
|---|---|---|
| openai | Dense | API |
| sentence-transformer | Dense | Open-sourced |
| bm25 | Sparse | Open-sourced |
| Splade | Sparse | Open-sourced |
| bge-m3 | Hybrid | Open-sourced |
| voyageai | Dense | API |
| jina | Dense | API |
| cohere | Dense | API |
| Instructor | Dense | Open-sourced |
| Mistral AI | Dense | API |
| Nomic | Dense | API |
| mGTE | Hybrid | Open-sourced |
Example 1: Use default embedding function to generate dense vectors
To use embedding functions with Milvus, first install the PyMilvus client library with the model subpackage, which wraps all the utilities for embedding generation.
pip install "pymilvus[model]"
The model subpackage supports various embedding models, from OpenAI, Sentence Transformers, BGE M3, and BM25 to SPLADE pretrained models. For simplicity, this example uses the DefaultEmbeddingFunction, which is the all-MiniLM-L6-v2 Sentence Transformers model. The model is about 70 MB and will be downloaded on first use:
from pymilvus import model
# This will download "all-MiniLM-L6-v2", a lightweight model.
ef = model.DefaultEmbeddingFunction()
# Data from which embeddings are to be generated
docs = [
"Artificial intelligence was founded as an academic discipline in 1956.",
"Alan Turing was the first person to conduct substantial research in AI.",
"Born in Maida Vale, London, Turing was raised in southern England.",
]
embeddings = ef.encode_documents(docs)
# Print embeddings
print("Embeddings:", embeddings)
# Print dimension and shape of embeddings
print("Dim:", ef.dim, embeddings[0].shape)
The expected output is similar to the following:
Embeddings: [array([-3.09392996e-02, -1.80662833e-02, 1.34775648e-02, 2.77156215e-02,
-4.86349640e-03, -3.12581174e-02, -3.55921760e-02, 5.76934684e-03,
2.80773244e-03, 1.35783911e-01, 3.59678417e-02, 6.17732145e-02,
...
-4.61330153e-02, -4.85207550e-02, 3.13997865e-02, 7.82178566e-02,
-4.75336798e-02, 5.21207601e-02, 9.04406682e-02, -5.36676683e-02],
dtype=float32)]
Dim: 384 (384,)
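At query time, embed the search text with the same model. A minimal sketch (the query string is illustrative); encode_queries applies any query-specific preprocessing the model defines:
queries = ["When was artificial intelligence founded?"]
query_embeddings = ef.encode_queries(queries)
print("Query dim:", ef.dim, query_embeddings[0].shape)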
Example 2: Generate dense and sparse vectors in one call with BGE M3 model
In this example, we use the BGE-M3 hybrid model to embed text into both dense and sparse vectors and use them to retrieve relevant documents. The overall steps are as follows:
Embed the text as dense and sparse vectors using the BGE-M3 model;
Set up a Milvus collection to store the dense and sparse vectors;
Insert the data into Milvus;
Search and inspect the result.
First, import the necessary dependencies.
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
from pymilvus import (
utility,
FieldSchema, CollectionSchema, DataType,
Collection, AnnSearchRequest, RRFRanker, connections,
)
Use BGE M3 to encode docs and queries for embedding retrieval.
# 1. prepare a small corpus to search
docs = [
"Artificial intelligence was founded as an academic discipline in 1956.",
"Alan Turing was the first person to conduct substantial research in AI.",
"Born in Maida Vale, London, Turing was raised in southern England.",
]
query = "Who started AI research?"
# BGE-M3 can embed texts as both dense and sparse vectors.
# It is included in the optional `model` module in pymilvus; to install it,
# simply run "pip install pymilvus[model]".
bge_m3_ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
docs_embeddings = bge_m3_ef(docs)
query_embeddings = bge_m3_ef([query])
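The snippet above covers step 1. Below is one way to complete steps 2–4 from the list, sketched under the assumption of a Milvus server reachable at localhost:19530; the collection name hybrid_demo and the field names are illustrative:
# Connect to Milvus (adjust host/port for your deployment).
connections.connect("default", host="localhost", port="19530")
# 2. set up a collection with one sparse and one dense vector field.
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True,
                auto_id=True, max_length=100),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR,
                dim=bge_m3_ef.dim["dense"]),
]
schema = CollectionSchema(fields)
if utility.has_collection("hybrid_demo"):
    utility.drop_collection("hybrid_demo")
col = Collection("hybrid_demo", schema)
col.create_index("sparse_vector",
                 {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"})
col.create_index("dense_vector",
                 {"index_type": "FLAT", "metric_type": "IP"})
col.load()
# 3. insert the texts together with both embeddings (pk is auto-generated).
col.insert([docs, docs_embeddings["sparse"], docs_embeddings["dense"]])
col.flush()
# 4. issue one ANN request per vector field and fuse the results with
# Reciprocal Rank Fusion (RRF).
sparse_req = AnnSearchRequest(query_embeddings["sparse"],
                              "sparse_vector", {"metric_type": "IP"}, limit=3)
dense_req = AnnSearchRequest(query_embeddings["dense"],
                             "dense_vector", {"metric_type": "IP"}, limit=3)
res = col.hybrid_search([sparse_req, dense_req], rerank=RRFRanker(),
                        limit=3, output_fields=["text"])
for hit in res[0]:
    print(hit.entity.get("text"), hit.distance)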
Example 3: Generate sparse vectors using BM25 model
BM25 is a well-known method that uses word occurrence frequencies to determine the relevance between queries and documents. In this example, we will show how to use BM25EmbeddingFunction to generate sparse embeddings for both queries and documents.
First, import the BM25EmbeddingFunction class.
from pymilvus.model.sparse import BM25EmbeddingFunction
In BM25, it is important to compute statistics over your documents to obtain the IDF (Inverse Document Frequency), which captures the term distribution of your corpus. The IDF is a measure of how much information a word provides, that is, whether it is common or rare across all documents.
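For reference, the classic Okapi BM25 scoring function, where k_1 and b are free parameters (commonly around 1.2–2.0 and 0.75):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length, N is the corpus size, and n(q_i) is the number of documents containing q_i. Fitting on your corpus is what supplies N, n(q_i), and avgdl.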
# 1. prepare a small corpus to search
docs = [
"Artificial intelligence was founded as an academic discipline in 1956.",
"Alan Turing was the first person to conduct substantial research in AI.",
"Born in Maida Vale, London, Turing was raised in southern England.",
]
query = "Where was Turing born?"
bm25_ef = BM25EmbeddingFunction()
# 2. fit the corpus to get BM25 model parameters on your documents.
bm25_ef.fit(docs)
# 3. store the fitted parameters to disk to expedite future processing.
bm25_ef.save("bm25_params.json")
# 4. load the saved params
new_bm25_ef = BM25EmbeddingFunction()
new_bm25_ef.load("bm25_params.json")
docs_embeddings = new_bm25_ef.encode_documents(docs)
query_embeddings = new_bm25_ef.encode_queries([query])
print("Dim:", new_bm25_ef.dim, list(docs_embeddings)[0].shape)
The expected output is similar to the following:
Dim: 21 (1, 21)
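Here the sparse dimension of 21 simply reflects the vocabulary learned from this tiny three-sentence corpus during fit; fitting on a larger corpus yields correspondingly higher-dimensional sparse vectors.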