
BM25

BM25 is a ranking function used in information retrieval to estimate the relevance of documents to a given search query. It enhances the basic term frequency approach by incorporating document length normalization and term frequency saturation. BM25 can generate sparse embeddings by representing documents as vectors of term importance scores, allowing for efficient retrieval and ranking in sparse vector spaces.
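For reference, the standard Okapi BM25 scoring function behind this description (the parameter names k1, b, and avgdl below follow the common convention and are not pymilvus identifiers) is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of D in tokens, avgdl is the average document length in the corpus, and k1 and b are free parameters controlling term frequency saturation and document length normalization, respectively.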

Milvus integrates with the BM25 model through the BM25EmbeddingFunction class. This class handles the computation of the embeddings and returns them in a format compatible with Milvus for indexing and searching. Essential to this process is building an analyzer for tokenization.

To use this feature, install the necessary dependencies:

pip install --upgrade pymilvus
pip install "pymilvus[model]"

To create a tokenizer easily, Milvus offers a default analyzer that only requires specifying the language of the text.

For example:

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

# there are some built-in analyzers for several languages, now we use 'en' for English.
analyzer = build_default_analyzer(language="en")

corpus = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print("tokens:", tokens)

Parameters

  • language (string)

    The language of the text to be tokenized. Valid options are en (English), de (German), fr (French), ru (Russian), sp (Spanish), it (Italian), pt (Portuguese), zh (Chinese), jp (Japanese), kr (Korean).

The expected output is similar to the following:

tokens: ['artifici', 'intellig', 'found', 'academ', 'disciplin', '1956']

When processing text, the BM25 algorithm first breaks it down into tokens using a built-in analyzer, producing English-language tokens such as "artifici", "intellig", and "academ" as shown above. It then gathers statistics on these tokens, evaluating their frequency and distribution across documents. At its core, BM25 calculates a relevance score for each token based on its importance, with rarer tokens receiving higher scores. This concise process enables effective ranking of documents by relevance to a query.

To collect statistics on the corpus, use the fit() method:

# Use the analyzer to instantiate the BM25EmbeddingFunction
bm25_ef = BM25EmbeddingFunction(analyzer)

# Fit the model on the corpus to get the statistics of the corpus
bm25_ef.fit(corpus)

Then, use encode_documents() to create embeddings for the documents:

docs = [
    "The field of artificial intelligence was established as an academic subject in 1956.",
    "Alan Turing was the pioneer in conducting significant research in artificial intelligence.",
    "Originating in Maida Vale, London, Turing grew up in the southern regions of England.",
    "In 1956, artificial intelligence emerged as a scholarly field.",
    "Turing, originally from Maida Vale, London, was brought up in the south of England."
]

# Create embeddings for the documents
docs_embeddings = bm25_ef.encode_documents(docs)

# Print embeddings
print("Embeddings:", docs_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)

The expected output is similar to the following:

Embeddings:   (0, 0)        1.0208816705336425
  (0, 1)        1.0208816705336425
  (0, 3)        1.0208816705336425
...
  (4, 16)        0.9606986899563318
  (4, 17)        0.9606986899563318
  (4, 20)        0.9606986899563318
Sparse dim: 21 (1, 21)
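Each row of the returned csr_array maps vocabulary indices to BM25 weights, one dimension per token in the fitted vocabulary. A minimal sketch for inspecting a single row (illustrative only, not part of the pymilvus API; it relies on the scipy sparse format shown above):

# Pair each nonzero dimension of the first document with its BM25 weight
first_doc = list(docs_embeddings)[0]  # sparse row of shape (1, 21)
coo = first_doc.tocoo()               # COO format exposes indices and values directly
print(dict(zip(coo.col.tolist(), coo.data.tolist())))
# e.g. {0: 1.0208816705336425, 1: 1.0208816705336425, 3: 1.0208816705336425, ...}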

To create embeddings for queries, use the encode_queries() method:

queries = ["When was artificial intelligence founded", 
           "Where was Alan Turing born?"]

query_embeddings = bm25_ef.encode_queries(queries)

# Print embeddings
print("Embeddings:", query_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("Sparse dim:", bm25_ef.dim, list(query_embeddings)[0].shape)

The expected output is similar to the following:

Embeddings:   (0, 0)        0.5108256237659907
  (0, 1)        0.5108256237659907
  (0, 2)        0.5108256237659907
  (1, 6)        0.5108256237659907
  (1, 7)        0.11554389108992644
  (1, 14)        0.5108256237659907
Sparse dim: 21 (1, 21)
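The document and query encoders split the BM25 formula between them, which is why they are not interchangeable (see the note below); as a consequence, the BM25 relevance of a query to a document can be recovered as the inner product of their sparse vectors. A minimal sketch, reusing the docs_embeddings and query_embeddings from above:

import numpy as np

# (num_queries, dim) x (dim, num_docs) -> pairwise BM25 scores
scores = (query_embeddings @ docs_embeddings.T).toarray()
for i, query in enumerate(queries):
    ranking = np.argsort(-scores[i])  # document indices, best match first
    print(query, "->", ranking.tolist())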

Notes:

When using BM25EmbeddingFunction, note that the encode_queries() and encode_documents() operations are not mathematically interchangeable. Therefore, there is no bm25_ef(texts) available.
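Since fit() derives its statistics from a specific corpus, you will usually want to persist the fitted state instead of refitting at query time. A minimal sketch, assuming your pymilvus version exposes the save()/load() methods on BM25EmbeddingFunction (treat the file name as a placeholder):

# Persist the fitted parameters (vocabulary and corpus statistics) ...
bm25_ef.save("bm25_params.json")

# ... and restore them into a fresh instance later, without calling fit()
new_bm25_ef = BM25EmbeddingFunction(build_default_analyzer(language="en"))
new_bm25_ef.load("bm25_params.json")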
