Rerankers Overview
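In information retrieval and generative AI, a reranker is an essential tool that optimizes the order of results from an initial search. Unlike traditional embedding models, a reranker takes a query and a document as input and directly returns a similarity score instead of embeddings. This score indicates the relevance between the input query and the document.
Rerankers are typically applied after the first stage of retrieval, which is usually performed with vector approximate nearest neighbor (ANN) techniques. While ANN searches efficiently fetch a broad set of potentially relevant results, they do not always order those results by their actual semantic closeness to the query. Here, rerankers refine the result order using deeper contextual analysis, often leveraging advanced machine learning models such as BERT or other Transformer-based models. By doing so, rerankers can significantly improve the accuracy and relevance of the final results presented to the user.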
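The two-stage flow described above can be sketched in plain Python. This is a minimal illustration, not PyMilvus code: `retrieve`, `rerank`, and `toy_relevance` are hypothetical helpers, the two-dimensional vectors are made up, and the word-overlap score only stands in for a real cross-encoder model.

```python
from typing import Callable, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: List[float],
             corpus: List[Tuple[str, List[float]]],
             limit: int) -> List[str]:
    """First stage: rank documents by cheap vector similarity."""
    ranked = sorted(corpus, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:limit]]


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int) -> List[Tuple[str, float]]:
    """Second stage: score each (query, document) pair directly and re-sort."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_relevance(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


# Made-up corpus: (text, embedding) pairs.
corpus = [
    ("Turing proposed the Turing Test in 1950", [0.9, 0.1]),
    ("The Dartmouth Conference was held in 1956", [0.8, 0.3]),
    ("Bananas are rich in potassium", [0.0, 1.0]),
]

query = "Dartmouth Conference 1956"
query_vec = [1.0, 0.0]  # made-up query embedding

candidates = retrieve(query_vec, corpus, limit=2)          # vector order: Turing, Dartmouth
top = rerank(query, candidates, toy_relevance, top_k=2)    # reranked order: Dartmouth, Turing

for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

The first stage only compares precomputed vectors, so it stays cheap on a large corpus; the second stage looks at the query and each candidate together, which is why it is applied to the short candidate list rather than the whole collection.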
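The PyMilvus model library integrates rerank functions to optimize the order of results returned from an initial search. After you retrieve the nearest embeddings from Milvus, you can apply these reranking tools to refine the search results and improve their precision.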
Rerank Function | API or Open-sourced |
---|---|
BGE | Open-sourced |
Cross Encoder | Open-sourced |
Voyage | API |
Cohere | API |
Jina AI | API |
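Before using open-source rerankers, make sure to download and install all required dependencies and models.
For API-based rerankers, obtain an API key from the provider and set it in the appropriate environment variable or argument.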
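One way to keep an API key out of source code is to read it from an environment variable and fail early when it is missing. The sketch below uses only the standard library; the helper `load_api_key` and the variable name `RERANKER_API_KEY` are hypothetical, so substitute whatever name your provider or deployment expects.

```python
import os


def load_api_key(env_var: str) -> str:
    """Return the API key stored in env_var, or raise a clear error if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the reranker."
        )
    return key


# Usage (hypothetical variable name):
#   export RERANKER_API_KEY="..."   # in the shell
#   api_key = load_api_key("RERANKER_API_KEY")
# The returned value can then be passed to the API-based rerank function as an argument.
```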
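Example 1: Use the BGE rerank function to rank documents according to a query
In this example, we demonstrate how to rerank search results with the BGE reranker based on a specific query.
To use a reranker with the PyMilvus model library, start by installing the PyMilvus model library along with the model subpackage, which contains all the necessary reranking utilities: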
pip install pymilvus[model]
# or pip install "pymilvus[model]" for zsh.
To use the BGE reranker, first import the BGERerankFunction class:
from pymilvus.model.reranker import BGERerankFunction
Then, create a BGERerankFunction instance for reranking:
bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
To rerank documents based on a query, use the following code:
query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"
documents = [
"In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
"The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
"In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
"The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]
bge_rf(query, documents)
The expected output is similar to the following:
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
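Each result in the output above carries three fields: `text`, `score`, and `index`, where `index` is the position of the document in the input list. The sketch below shows one way to post-process such a list. The `RerankResult` dataclass is a local stand-in that mirrors those three fields so the snippet runs without PyMilvus, the document texts are shortened labels, the scores are rounded from the output above, and both the helper `keep_relevant` and the 0.5 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RerankResult:
    """Local stand-in with the same three fields shown in the output above."""
    text: str
    score: float
    index: int


def keep_relevant(results: List[RerankResult], min_score: float) -> List[RerankResult]:
    """Keep only results whose score reaches min_score, preserving their order."""
    return [r for r in results if r.score >= min_score]


# Shortened labels for the four input documents, in their original order.
documents = [
    "Turing Test paper (1950)",
    "Dartmouth Conference (1956)",
    "Turing chess program (1951)",
    "Logic Theorist (1955)",
]

# Reranked order and rounded scores taken from the output above.
results = [
    RerankResult(documents[1], 0.9912, 1),
    RerankResult(documents[0], 0.0327, 0),
    RerankResult(documents[3], 0.0065, 3),
    RerankResult(documents[2], 0.0042, 2),
]

relevant = keep_relevant(results, min_score=0.5)  # arbitrary threshold for illustration
positions = [r.index for r in relevant]           # positions in the original input list

for r in relevant:
    print(f"{r.score:.4f}  input position {r.index}: {r.text}")
```

Score ranges differ between rerankers, so a cutoff that suits one model should not be reused for another without checking its output first.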
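Example 2: Use a reranker to enhance the relevance of search results
In this guide, we explore how to use the search() method in PyMilvus to conduct a similarity search, and how to enhance the relevance of the search results with a reranker. Our demonstration uses the following dataset: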
entities = [
{'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."},
{'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."},
{'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'},
{'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]
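Dataset components:
- doc_id: Unique identifier for each document.
- doc_vector: Vector embedding representing the document. For guidance on generating embeddings, refer to Embeddings.
- doc_text: Text content of the document.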
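Preparations
Before initiating a similarity search, you need to connect to Milvus, create a collection, and prepare and insert data into that collection. The following code snippet illustrates these preliminary steps.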
from pymilvus import MilvusClient, DataType
client = MilvusClient(
uri="http://10.102.6.214:19530" # replace with your own Milvus server address
)
client.drop_collection('test_collection')
# define schema
schema = client.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")
# define index params
index_params = client.prepare_index_params()
index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})
# create collection
client.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)
# insert data into collection
client.insert(collection_name="test_collection", data=entities)
# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}
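The schema above declares doc_vector with dim=384, so every vector passed to insert needs exactly 384 values. A small check before inserting can surface a mismatch early. This is a plain-Python sketch; `find_bad_vectors` is a hypothetical helper and the sample entities contain placeholder zeros rather than real embeddings.

```python
from typing import Any, Dict, List


def find_bad_vectors(entities: List[Dict[str, Any]], field: str, dim: int) -> List[int]:
    """Return the doc_id of every entity whose vector length differs from dim."""
    return [e["doc_id"] for e in entities if len(e[field]) != dim]


# Placeholder data: one vector of the right length, one that is too short.
sample = [
    {"doc_id": 0, "doc_vector": [0.0] * 384},
    {"doc_id": 1, "doc_vector": [0.0] * 383},
]

bad = find_bad_vectors(sample, field="doc_vector", dim=384)
if bad:
    print(f"Vectors with unexpected length for doc_id: {bad}")
```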
Conduct a similarity search
After inserting the data, perform a similarity search with the search method.
# search results based on our query
res = client.search(
collection_name="test_collection",
data=[[-0.045217834, 0.035171617, ..., -0.025117004]], # replace with your query vector
limit=3,
output_fields=["doc_id", "doc_text"]
)
for i in res[0]:
    print(f'distance: {i["distance"]}')
    print(f'doc_text: {i["entity"]["doc_text"]}')
The expected output is similar to the following:
distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
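Use a reranker to enhance search results
Then, improve the relevance of the search results with a reranking step. In this example, we use the CrossEncoderRerankFunction built into PyMilvus to rerank the results for improved accuracy.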
# use reranker to rerank search results
from pymilvus.model.reranker import CrossEncoderRerankFunction
ce_rf = CrossEncoderRerankFunction(
model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", # Specify the model name.
device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
reranked_results = ce_rf(
query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',
documents=[
"In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
"The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
"In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
"The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
],
top_k=3
)
# print the reranked results
for result in reranked_results:
    print(f'score: {result.score}')
    print(f'doc_text: {result.text}')
The expected output is similar to the following:
score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
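The reranking call above passes the documents as a hard-coded list. In an application you would more likely feed it the texts returned by the search step. The sketch below shows one way to do that, assuming hits shaped like the entries of `res[0]` (a `distance` plus an `entity` holding `doc_id` and `doc_text`) and a rerank callable that accepts `query`, `documents`, and `top_k` and returns objects with `text`, `score`, and `index`, as in the examples above. `rerank_hits` and `fake_rerank_fn` are hypothetical names, and the fake reranker only counts shared words so the snippet runs without any model.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def rerank_hits(query: str,
                hits: List[Dict[str, Any]],
                rerank_fn: Callable[..., list],
                top_k: int) -> List[Dict[str, Any]]:
    """Rerank search hits by their text and carry doc_id and distance along."""
    documents = [hit["entity"]["doc_text"] for hit in hits]
    results = rerank_fn(query=query, documents=documents, top_k=top_k)
    return [
        {
            "doc_id": hits[r.index]["entity"]["doc_id"],  # r.index points into `documents`
            "distance": hits[r.index]["distance"],
            "score": r.score,
            "doc_text": r.text,
        }
        for r in results
    ]


def fake_rerank_fn(query: str, documents: List[str], top_k: int) -> list:
    """Hypothetical stand-in for a rerank function: scores by number of shared words."""
    query_words = set(query.lower().split())
    scored = [
        SimpleNamespace(text=doc,
                        score=float(len(query_words & set(doc.lower().split()))),
                        index=i)
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda r: r.score, reverse=True)
    return scored[:top_k]


# Made-up hits in the same shape as the entries of res[0].
hits = [
    {"distance": 0.72, "entity": {"doc_id": 7, "doc_text": "alpha beta"}},
    {"distance": 0.63, "entity": {"doc_id": 9, "doc_text": "gamma delta query"}},
]

merged = rerank_hits("gamma query", hits, fake_rerank_fn, top_k=1)
for row in merged:
    print(row)
```

With a real setup, `hits` would be `res[0]` and `rerank_fn` would be the `ce_rf` instance created above; keeping `doc_id` next to the rerank score makes it easy to look the document up again afterwards.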