
Introducing PyMilvus Integration with Embedding Models

  • Engineering
June 05, 2024
Stephen Batifol

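Milvus is an open-source vector database designed specifically for AI applications. Whether you are working on machine learning, deep learning, or any other AI-related project, Milvus offers a robust and efficient way to handle large-scale vector data.

Now, with the model module integration in PyMilvus, the Python SDK for Milvus, adding embedding and reranking models is even easier. This integration simplifies transforming your data into searchable vectors or reranking results for more accurate outcomes, for example in Retrieval Augmented Generation (RAG).

In this blog, we review dense embedding models, sparse embedding models, and rerankers, and demonstrate how to use them in practice with Milvus Lite, a lightweight version of Milvus that runs locally in your Python applications.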

Dense vs. Sparse Embeddings

Before we walk through how to use our integrations, let's look at the two main categories of vector embeddings: dense embeddings and sparse embeddings.

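  • Dense embeddings are high-dimensional vectors in which most or all elements are non-zero, making them well suited to encoding text semantics or fuzzy meaning.

  • Sparse embeddings are high-dimensional vectors with many zero elements, making them better suited to encoding exact or adjacent concepts.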
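
The difference between the two categories can be sketched in plain Python. This is only an illustration with made-up numbers, not model output: the dense vectors hold a value at every position, while the sparse vectors are stored as `{index: value}` maps over a hypothetical 30,000-term vocabulary. The helper names `dense_dot` and `sparse_dot` are assumptions introduced here.

```python
# Toy illustration (not real model output): dense vs. sparse storage.

# A dense vector stores a value at every position.
dense_a = [0.12, -0.40, 0.33, 0.08]
dense_b = [0.10, -0.35, 0.30, 0.05]

# A sparse vector stores only the non-zero positions, here as {index: value}.
# The indices stand for hypothetical vocabulary ids in a 30,000-term space.
sparse_a = {17: 1.2, 4521: 0.8, 29999: 0.5}
sparse_b = {17: 0.9, 120: 0.4, 29999: 1.1}


def dense_dot(u, v):
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(u, v))


def sparse_dot(u, v):
    """Inner product of two sparse vectors; only shared indices contribute."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller map
    return sum(value * v[idx] for idx, value in u.items() if idx in v)


print("dense dot: ", round(dense_dot(dense_a, dense_b), 4))
print("sparse dot:", round(sparse_dot(sparse_a, sparse_b), 4))
```

Only indices 17 and 29999 appear in both sparse vectors, so only those two terms contribute to the sparse score. This is why sparse embeddings tend to reward exact term overlap.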

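Milvus supports both types of embeddings and offers hybrid search. Hybrid search lets you search across multiple vector fields within the same collection. These vectors can represent different facets of the data, come from different embedding models, or use different data processing methods, and the results are combined using rerankers.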
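
One common way to combine ranked results from several vector fields is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration of that idea under stated assumptions, not Milvus's implementation: the function name, the smoothing constant `k=60`, and the document ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and the contributions are summed.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from two vector fields of the same collection.
dense_hits = ["doc_b", "doc_a", "doc_c"]
sparse_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that rank well in both lists (`doc_a`, `doc_b`) rise above documents that appear in only one, without requiring the two score scales to be comparable.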

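How to Use Our Embedding and Reranking Integrations

In the following sections, we demonstrate three practical examples of using our integrations to generate embeddings and rerank search results.

Example 1: Use the Default Embedding Function to Generate Dense Vectors

To use the embedding and reranking functions with Milvus, you must install the pymilvus client together with the model package.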

pip install "pymilvus[model]"

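This step installs Milvus Lite, which lets you run Milvus locally within your Python application. It also includes the model subpackage, which contains all the utilities for embedding and reranking.

The model subpackage supports various embedding models, including those from OpenAI, Sentence Transformers, BGE-M3, BM25, SPLADE, and Jina AI pre-trained models.

This example uses DefaultEmbeddingFunction, which is based on the all-MiniLM-L6-v2 Sentence Transformers model, for simplicity. The model is about 70MB and is downloaded on first use: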

from pymilvus import model

# This will download "all-MiniLM-L6-v2", a lightweight model.
ef = model.DefaultEmbeddingFunction()

# Data from which embeddings are to be generated
docs = [
   "Artificial intelligence was founded as an academic discipline in 1956.",
   "Alan Turing was the first person to conduct substantial research in AI.",
   "Born in Maida Vale, London, Turing was raised in southern England.",
]

embeddings = ef.encode_documents(docs)

print("Embeddings:", embeddings)
# Print dimension and shape of embeddings
print("Dim:", ef.dim, embeddings[0].shape)

The expected output is similar to the following:

Embeddings: [array([-3.09392996e-02, -1.80662833e-02,  1.34775648e-02,  2.77156215e-02,
      -4.86349640e-03, -3.12581174e-02, -3.55921760e-02,  5.76934684e-03,
       2.80773244e-03,  1.35783911e-01,  3.59678417e-02,  6.17732145e-02,
...
      -4.61330153e-02, -4.85207550e-02,  3.13997865e-02,  7.82178566e-02,
      -4.75336798e-02,  5.21207601e-02,  9.04406682e-02, -5.36676683e-02],
     dtype=float32)]
Dim: 384 (384,)

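Example 2: Generate Sparse Vectors Using the BM25 Model

BM25 is a well-known method that uses word occurrence frequencies to determine the relevance between queries and documents. In this example, we show how to use BM25EmbeddingFunction to generate sparse embeddings for queries and documents.

In BM25, it is important to calculate statistics over your documents to obtain the IDF (Inverse Document Frequency), which can represent the patterns in your documents. The IDF measures how much information a word provides, that is, whether it is common or rare across all documents.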

from pymilvus.model.sparse import BM25EmbeddingFunction

# 1. Prepare a small corpus to search
docs = [
   "Artificial intelligence was founded as an academic discipline in 1956.",
   "Alan Turing was the first person to conduct substantial research in AI.",
   "Born in Maida Vale, London, Turing was raised in southern England.",
]
query = "Where was Turing born?"
bm25_ef = BM25EmbeddingFunction()

# 2. Fit the corpus to get BM25 model parameters on your documents.
bm25_ef.fit(docs)

# 3. Store the fitted parameters to expedite future processing.
bm25_ef.save("bm25_params.json")

# 4. Load the saved params
new_bm25_ef = BM25EmbeddingFunction()
new_bm25_ef.load("bm25_params.json")

docs_embeddings = new_bm25_ef.encode_documents(docs)
query_embeddings = new_bm25_ef.encode_queries([query])
print("Dim:", new_bm25_ef.dim, list(docs_embeddings)[0].shape)
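
To make the IDF statistic described above more concrete, here is a small standalone sketch that computes it for the same three documents. It uses one common smoothed IDF variant and a naive tokenizer; both are assumptions for illustration and are not necessarily what BM25EmbeddingFunction does internally. The helper names `tokenize` and `idf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]


def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace, strip '.' and ','."""
    return [tok.strip(".,") for tok in text.lower().split() if tok.strip(".,")]


# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

n_docs = len(docs)


def idf(term):
    """Smoothed IDF (one common BM25 variant): rarer terms score higher."""
    n_t = doc_freq.get(term, 0)
    return math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)


for term in ["was", "turing", "dartmouth"]:
    print(term, doc_freq.get(term, 0), round(idf(term), 4))
```

A word such as "was", which occurs in every document, receives a low IDF, while a word that is rare or absent in the corpus receives a high one. This is the sense in which IDF measures how much information a word provides.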

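Example 3: Using a Reranker

A search system aims to find the most relevant results quickly and efficiently. Traditionally, methods such as BM25 or TF-IDF have been used to rank search results based on keyword matching. More recent methods, such as embedding-based cosine similarity, are straightforward but can sometimes miss the subtleties of language and, most importantly, the interaction between the documents and the intent of a query.

This is where a reranker helps. A reranker is an advanced AI model that takes the initial set of results from a search, often provided by an embedding-based or token-based search, and re-evaluates them so that they align more closely with the user's intent. It looks beyond surface-level term matching to consider the deeper interaction between the search query and the content of the documents.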

In this example, we use the Jina AI Reranker:

from pymilvus.model.reranker import JinaRerankFunction

jina_api_key = "<YOUR_JINA_API_KEY>"

rf = JinaRerankFunction("jina-reranker-v1-base-en", jina_api_key)

query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
   "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
   "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
   "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
   "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]

results = rf(query, documents)

for result in results:
   print(f"Index: {result.index}")
   print(f"Score: {result.score:.6f}")
   print(f"Text: {result.text}\n")

The expected output is similar to the following:

Index: 1
Score: 0.937096
Text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.

Index: 3
Score: 0.354210
Text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.

Index: 0
Score: 0.349866
Text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.

Index: 2
Score: 0.272896
Text: In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.
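
To show where a reranker sits in a search pipeline, here is a minimal two-stage sketch in plain Python. Everything in it is an assumption for illustration: the 3-dimensional embeddings are made up, and `toy_rerank_score` is a hypothetical stand-in that counts shared terms, whereas a real reranker such as the Jina AI model scores each query and document pair with a neural network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def toy_rerank_score(query, text):
    """Hypothetical stand-in for a reranker model: the fraction of query
    terms found in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


# Made-up 3-dimensional embeddings, for illustration only.
corpus = {
    "turing test paper 1950": [0.9, 0.1, 0.2],
    "dartmouth conference 1956": [0.7, 0.4, 0.1],
    "chess program 1951": [0.1, 0.9, 0.3],
}
query_text = "conference 1956"
query_vec = [0.9, 0.2, 0.1]

# Stage 1: a cheap vector search keeps the top-2 candidates.
candidates = sorted(
    corpus, key=lambda t: cosine(query_vec, corpus[t]), reverse=True
)[:2]

# Stage 2: rerank only those candidates by scoring each (query, document) pair.
reranked = sorted(
    candidates, key=lambda t: toy_rerank_score(query_text, t), reverse=True
)

print("stage 1:", candidates)
print("stage 2:", reranked)
```

The first stage narrows the corpus cheaply; the second stage spends more effort on a small candidate set and can reorder it, which is the role JinaRerankFunction plays in the example above.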

Star Us on GitHub, Join Our Discord!

If you liked this blog post, consider starring Milvus on GitHub, and feel free to join our Discord! 💙
