
BM25

BM25 is a ranking function used in information retrieval to estimate the relevance of documents to a given search query. It enhances the basic term frequency approach by incorporating document length normalization and term frequency saturation. BM25 can represent documents as vectors of term importance scores, producing sparse embeddings that allow efficient retrieval and ranking in sparse vector spaces.
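
For reference, the Okapi BM25 score of a document D for a query Q = {q_1, ..., q_n} is commonly written as the sum below, where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are tunable parameters (typically k_1 between 1.2 and 2.0 and b around 0.75). The exact constants and IDF variant can differ between implementations, including pymilvus:

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}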

Milvus integrates with the BM25 model through the BM25EmbeddingFunction class. This class handles the computation of embeddings and returns them in a format compatible with Milvus for indexing and searching. An essential part of this process is building an analyzer for tokenization.

To use this feature, install the necessary dependencies:

pip install --upgrade pymilvus
pip install "pymilvus[model]"

To create the tokenizing analyzer easily, Milvus provides a default analyzer that only requires specifying the language of the text.

Example

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

# there are some built-in analyzers for several languages, now we use 'en' for English.
analyzer = build_default_analyzer(language="en")

corpus = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print("tokens:", tokens)

Parameters

  • language (string)

    The language of the text to be tokenized. Valid options are en (English), de (German), fr (French), ru (Russian), sp (Spanish), it (Italian), pt (Portuguese), zh (Chinese), jp (Japanese), kr (Korean). A short sketch with a non-English analyzer follows the example output below.

The expected output is similar to the following:

tokens: ['artifici', 'intellig', 'found', 'academ', 'disciplin', '1956']
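
The same build_default_analyzer call works for the other languages listed above. Below is a minimal sketch for Chinese text; the sample sentence is illustrative, and non-English analyzers may require additional language-specific dependencies:

# Build a default analyzer for Chinese instead of English
zh_analyzer = build_default_analyzer(language="zh")
print("tokens:", zh_analyzer("人工智慧於 1956 年成為一門學術學科。"))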

The BM25 algorithm processes text by first breaking it into tokens using a built-in analyzer, as shown with English-language tokens such as 'artifici', 'intellig', and 'academ'. It then gathers statistics on these tokens, evaluating their frequency and distribution across documents. The core of BM25 computes a relevance score for each token based on its importance, with rarer tokens receiving higher scores. This concise process enables effective ranking of documents by their relevance to a query.
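
The "rarer tokens score higher" behavior comes from the inverse document frequency (IDF) term in the formula above. In the classic Okapi form it is the expression below, where N is the number of documents in the corpus and n(q_i) is the number of documents containing q_i; pymilvus may use a slightly different variant:

\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)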

To gather statistics on the corpus, use the fit() method:

# Use the analyzer to instantiate the BM25EmbeddingFunction
bm25_ef = BM25EmbeddingFunction(analyzer)

# Fit the model on the corpus to get the statistics of the corpus
bm25_ef.fit(corpus)
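
If the corpus is large, refitting every time can be avoided by persisting the learned statistics. The snippet below is a sketch that assumes the save() and load() helpers exposed by BM25EmbeddingFunction in recent pymilvus.model releases; the file name is illustrative:

# Persist the fitted corpus statistics to disk
# (assumes save()/load() are available in your pymilvus version)
bm25_ef.save("bm25_params.json")

# Later, rebuild the embedding function from the saved parameters instead of refitting
new_bm25_ef = BM25EmbeddingFunction(analyzer)
new_bm25_ef.load("bm25_params.json")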

Then, use encode_documents() to create embeddings for the documents:

docs = [
    "The field of artificial intelligence was established as an academic subject in 1956.",
    "Alan Turing was the pioneer in conducting significant research in artificial intelligence.",
    "Originating in Maida Vale, London, Turing grew up in the southern regions of England.",
    "In 1956, artificial intelligence emerged as a scholarly field.",
    "Turing, originally from Maida Vale, London, was brought up in the south of England."
]

# Create embeddings for the documents
docs_embeddings = bm25_ef.encode_documents(docs)

# Print embeddings
print("Embeddings:", docs_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)

The expected output is similar to the following:

Embeddings:   (0, 0)        1.0208816705336425
  (0, 1)        1.0208816705336425
  (0, 3)        1.0208816705336425
...
  (4, 16)        0.9606986899563318
  (4, 17)        0.9606986899563318
  (4, 20)        0.9606986899563318
Sparse dim: 21 (1, 21)
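
To read a single embedding as explicit token-index/weight pairs rather than the raw csr printout, one option is to convert a row to COO format; this is a small sketch using scipy, which the model package already relies on:

# Show the first document embedding as {column index: BM25 weight}
first_row = list(docs_embeddings)[0].tocoo()
print(dict(zip(first_row.col.tolist(), first_row.data.tolist())))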

To create embeddings for queries, use the encode_queries() method:

queries = ["When was artificial intelligence founded", 
           "Where was Alan Turing born?"]

query_embeddings = bm25_ef.encode_queries(queries)

# Print embeddings
print("Embeddings:", query_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("Sparse dim:", bm25_ef.dim, list(query_embeddings)[0].shape)

The expected output is similar to the following:

Embeddings:   (0, 0)        0.5108256237659907
  (0, 1)        0.5108256237659907
  (0, 2)        0.5108256237659907
  (1, 6)        0.5108256237659907
  (1, 7)        0.11554389108992644
  (1, 14)        0.5108256237659907
Sparse dim: 21 (1, 21)
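
Milvus searches sparse vectors with the inner-product (IP) metric, so the embeddings can be sanity-checked locally by multiplying the document and query matrices. The sketch below reuses the docs, queries, docs_embeddings, and query_embeddings variables from above:

import numpy as np

# The inner product of a document vector and a query vector gives the BM25 relevance score
scores = (docs_embeddings @ query_embeddings.T).toarray()  # shape: (num_docs, num_queries)

for j, query in enumerate(queries):
    best = int(np.argmax(scores[:, j]))
    print(f"Top match for {query!r}: {docs[best]!r}")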

Notes:

When using BM25EmbeddingFunction, note that the encode_documents() and encode_queries() operations are not mathematically interchangeable. Therefore, there is no bm25_ef(text) implemented.
