Milvus Hybrid Search Retriever

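Hybrid search combines the strengths of different search paradigms to improve retrieval accuracy and robustness. It draws on both dense vector search and sparse vector search, as well as combinations of multiple dense vector search strategies, to retrieve results comprehensively and precisely across a wide range of queries.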

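This diagram illustrates the most common hybrid search scenario, dense + sparse hybrid search. In this case, candidates are retrieved using both semantic vector similarity and exact keyword matching. The results from these methods are merged, reranked, and passed to an LLM to generate the final answer. This approach balances precision with semantic understanding, which makes it effective across diverse query scenarios.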

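Besides dense + sparse hybrid search, hybrid strategies can also combine multiple dense vector models. For example, one dense vector model might specialize in capturing semantic nuances, while another focuses on contextual embeddings or domain-specific representations. By merging the results from these models and reranking them, this type of hybrid search makes the retrieval process more nuanced and context-aware.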

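The LangChain Milvus integration provides a flexible way to implement hybrid search. It supports any number of vector fields and any custom dense or sparse embedding models, which lets LangChain Milvus adapt to a variety of hybrid search use cases while staying compatible with the other capabilities of LangChain.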

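In this tutorial, we start with the most common dense + sparse case and then introduce the general approach to hybrid search with any number of vector fields.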

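MilvusCollectionHybridSearchRetriever, another implementation of hybrid search with Milvus and LangChain, is about to be deprecated. Please use the approach in this document to implement hybrid search, because it is more flexible and compatible with LangChain.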

Prerequisites

Before running this notebook, make sure you have the following dependencies installed:

$ pip install --upgrade --quiet  langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai bs4 pymilvus[model] #langchain-voyageai

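If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).

We will use models from OpenAI. You should prepare your OpenAI API key and set it as the environment variable OPENAI_API_KEY.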

import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
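Hardcoding the key in a notebook cell is convenient but easy to leak. As an alternative, here is a minimal sketch that reads the key from the environment and fails early when it is missing. The helper name `require_env` is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before building the vector store surfaces a missing key immediately instead of at the first embedding request.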

Specify your Milvus server URI (and optionally the TOKEN). For how to install and start the Milvus server, refer to this guide.

URI = "http://localhost:19530"
# TOKEN = ...

Prepare some example documents. They are fictional story summaries categorized by theme or genre.

from langchain_core.documents import Document

docs = [
    Document(
        page_content="In 'The Whispering Walls' by Ava Moreno, a young journalist named Sophia uncovers a decades-old conspiracy hidden within the crumbling walls of an ancient mansion, where the whispers of the past threaten to destroy her own sanity.",
        metadata={"category": "Mystery"},
    ),
    Document(
        page_content="In 'The Last Refuge' by Ethan Blackwood, a group of survivors must band together to escape a post-apocalyptic wasteland, where the last remnants of humanity cling to life in a desperate bid for survival.",
        metadata={"category": "Post-Apocalyptic"},
    ),
    Document(
        page_content="In 'The Memory Thief' by Lila Rose, a charismatic thief with the ability to steal and manipulate memories is hired by a mysterious client to pull off a daring heist, but soon finds themselves trapped in a web of deceit and betrayal.",
        metadata={"category": "Heist/Thriller"},
    ),
    Document(
        page_content="In 'The City of Echoes' by Julian Saint Clair, a brilliant detective must navigate a labyrinthine metropolis where time is currency, and the rich can live forever, but at a terrible cost to the poor.",
        metadata={"category": "Science Fiction"},
    ),
    Document(
        page_content="In 'The Starlight Serenade' by Ruby Flynn, a shy astronomer discovers a mysterious melody emanating from a distant star, which leads her on a journey to uncover the secrets of the universe and her own heart.",
        metadata={"category": "Science Fiction/Romance"},
    ),
    Document(
        page_content="In 'The Shadow Weaver' by Piper Redding, a young orphan discovers she has the ability to weave powerful illusions, but soon finds herself at the center of a deadly game of cat and mouse between rival factions vying for control of the mystical arts.",
        metadata={"category": "Fantasy"},
    ),
    Document(
        page_content="In 'The Lost Expedition' by Caspian Grey, a team of explorers ventures into the heart of the Amazon rainforest in search of a lost city, but soon finds themselves hunted by a ruthless treasure hunter and the treacherous jungle itself.",
        metadata={"category": "Adventure"},
    ),
    Document(
        page_content="In 'The Clockwork Kingdom' by Augusta Wynter, a brilliant inventor discovers a hidden world of clockwork machines and ancient magic, where a rebellion is brewing against the tyrannical ruler of the land.",
        metadata={"category": "Steampunk/Fantasy"},
    ),
    Document(
        page_content="In 'The Phantom Pilgrim' by Rowan Welles, a charismatic smuggler is hired by a mysterious organization to transport a valuable artifact across a war-torn continent, but soon finds themselves pursued by deadly assassins and rival factions.",
        metadata={"category": "Adventure/Thriller"},
    ),
    Document(
        page_content="In 'The Dreamwalker's Journey' by Lyra Snow, a young dreamwalker discovers she has the ability to enter people's dreams, but soon finds herself trapped in a surreal world of nightmares and illusions, where the boundaries between reality and fantasy blur.",
        metadata={"category": "Fantasy"},
    ),
]

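Dense embedding + Sparse embedding

Option 1 (Recommended): dense embedding + Milvus BM25 built-in function

Use a dense embedding together with the Milvus BM25 built-in function to assemble the hybrid retrieval vector store instance.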

from langchain_milvus import Milvus, BM25BuiltInFunction
from langchain_openai import OpenAIEmbeddings


vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    builtin_function=BM25BuiltInFunction(),  # output_field_names="sparse"),
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)
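  • When you use BM25BuiltInFunction, note that full-text search is available in Milvus Standalone and Milvus Distributed but not in Milvus Lite, although it is on the roadmap for future inclusion. It will also be available in Zilliz Cloud (fully managed Milvus) soon. For more information, contact support@zilliz.com.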

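In the code above, we define an instance of BM25BuiltInFunction and pass it to the Milvus object. BM25BuiltInFunction is a lightweight wrapper class for Function in Milvus. Together with OpenAIEmbeddings, it lets us initialize a dense + sparse hybrid search Milvus vector store instance.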

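With BM25BuiltInFunction, all corpus handling and training are processed automatically on the Milvus server side, so users do not need to manage any vocabulary or corpus. In addition, users can customize the analyzer to implement custom text processing in BM25.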
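To give a feel for what the sparse side computes, here is a minimal pure-Python sketch of the standard BM25 scoring formula. It is an illustration only: the whitespace tokenizer, the function names, and the `k1` and `b` defaults are assumptions, and the server-side implementation in Milvus may differ in its details.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real analyzer is more sophisticated.
    return text.lower().split()


def bm25_scores(query: str, corpus: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document in a non-empty `corpus` against `query` with BM25."""
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "the memory thief steals memories",
    "a shy astronomer hears a melody",
    "the thief escapes",
]
scores = bm25_scores("memory thief", corpus)
```

The first document matches both query terms and scores highest, the third matches only "thief", and the second scores zero because it shares no terms with the query.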

For more information about BM25BuiltInFunction, refer to Full-Text-Search and Using Full-Text Search with LangChain and Milvus.

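Option 2: Use dense and customized LangChain sparse embedding

You can inherit the class BaseSparseEmbedding from langchain_milvus.utils.sparse and implement the embed_query and embed_documents methods to customize the sparse embedding process. This lets you plug in any sparse embedding method, whether based on term frequency statistics (e.g. BM25) or neural networks (e.g. SPLADE).

Here is an example: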

from typing import Dict, List
from langchain_milvus.utils.sparse import BaseSparseEmbedding


class MyCustomEmbedding(BaseSparseEmbedding):  # inherit from BaseSparseEmbedding
    def __init__(self, model_path): ...  # code to init or load model

    def embed_query(self, query: str) -> Dict[int, float]:
        ...  # code to embed query
        return {  # fake embedding result
            1: 0.1,
            2: 0.2,
            3: 0.3,
            # ...
        }

    def embed_documents(self, texts: List[str]) -> List[Dict[int, float]]:
        ...  # code to embed documents
        return [  # fake embedding results
            {
                1: 0.1,
                2: 0.2,
                3: 0.3,
                # ...
            }
        ] * len(texts)
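The stub above returns fake values. For a more concrete picture of the Dict[int, float] shape, here is a toy term-frequency embedding with the same two methods. The class name and weighting scheme are hypothetical, and it does not inherit from BaseSparseEmbedding so that it runs without langchain-milvus installed; to use it with Milvus you would derive from BaseSparseEmbedding as shown above.

```python
from collections import Counter
from typing import Dict, List


class TermFrequencyEmbedding:
    """Toy sparse embedding: each known token gets an integer dimension and
    its weight is the normalized term frequency. For illustration only."""

    def __init__(self, corpus: List[str]):
        # Build a fixed vocabulary from the corpus: token -> dimension index.
        tokens = sorted({tok for text in corpus for tok in text.lower().split()})
        self.vocab: Dict[str, int] = {tok: i for i, tok in enumerate(tokens)}

    def _embed(self, text: str) -> Dict[int, float]:
        counts = Counter(tok for tok in text.lower().split() if tok in self.vocab)
        total = sum(counts.values())
        # Only non-zero dimensions are stored, which is what makes the vector sparse.
        return {self.vocab[tok]: n / total for tok, n in counts.items()}

    def embed_query(self, query: str) -> Dict[int, float]:
        return self._embed(query)

    def embed_documents(self, texts: List[str]) -> List[Dict[int, float]]:
        return [self._embed(t) for t in texts]


emb = TermFrequencyEmbedding(["red fox", "red red wine"])
vec = emb.embed_query("red wine unknown")
```

Tokens outside the vocabulary are dropped, so a query with no known tokens yields an empty dictionary.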

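We provide a demo class, BM25SparseEmbedding, which inherits from BaseSparseEmbedding in langchain_milvus.utils.sparse. You can pass it into the embedding list when initializing the Milvus vector store instance, just like any other LangChain dense embedding class.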

# BM25SparseEmbedding is inherited from BaseSparseEmbedding
from langchain_milvus.utils.sparse import BM25SparseEmbedding

embedding1 = OpenAIEmbeddings()

corpus = [doc.page_content for doc in docs]
embedding2 = BM25SparseEmbedding(
    corpus=corpus
)  # pass in corpus to initialize the statistics

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=[embedding1, embedding2],
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)

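Although this is one way to use BM25, it requires users to manage the corpus used for term frequency statistics. We recommend the BM25 built-in function (Option 1) instead, because it handles everything on the Milvus server side, so users do not need to manage a corpus or train a vocabulary. For more information, refer to Using Full-Text Search with LangChain and Milvus.

Define multiple arbitrary vector fields

When initializing the Milvus vector store, you can pass in a list of embeddings (and, in the future, a list of built-in functions as well) to implement multi-way retrieval, and then rerank the resulting candidates. Here is an example: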

# from langchain_voyageai import VoyageAIEmbeddings

embedding1 = OpenAIEmbeddings(model="text-embedding-ada-002")
embedding2 = OpenAIEmbeddings(model="text-embedding-3-large")
# embedding3 = VoyageAIEmbeddings(model="voyage-3")  # You can also use embedding from other embedding model providers, e.g VoyageAIEmbeddings


vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=[embedding1, embedding2],  # embedding3],
    builtin_function=BM25BuiltInFunction(output_field_names="sparse"),
    # `sparse` is the output field name of BM25BuiltInFunction, and `dense1` and `dense2` are the output field names of embedding1 and embedding2
    vector_field=["dense1", "dense2", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)

vectorstore.vector_fields
['dense1', 'dense2', 'sparse']

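In this example, we have three vector fields. Among them, sparse is the output field of BM25BuiltInFunction, while the other two, dense1 and dense2, are automatically assigned as the output fields of the two OpenAIEmbeddings models (in order).

Specify the index params for multi-vector fields

By default, the index type of each vector field is determined automatically by the type of the embedding or built-in function. However, you can also specify the index type for each vector field to optimize search performance.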

dense_index_param_1 = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
}
dense_index_param_2 = {
    "metric_type": "IP",
    "index_type": "HNSW",
}
sparse_index_param = {
    "metric_type": "BM25",
    "index_type": "AUTOINDEX",
}

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=[embedding1, embedding2],
    builtin_function=BM25BuiltInFunction(output_field_names="sparse"),
    index_params=[dense_index_param_1, dense_index_param_2, sparse_index_param],
    vector_field=["dense1", "dense2", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)

vectorstore.vector_fields
['dense1', 'dense2', 'sparse']

Keep the order of the index params list consistent with the order of vectorstore.vector_fields to avoid confusion.
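One way to catch an ordering mistake early is to pair the two lists explicitly and inspect the result before creating the collection. The sketch below is a hypothetical helper, not part of langchain-milvus. It only assumes that fields and index params are matched by position, as described above.

```python
from typing import Dict, List


def pair_fields_with_index_params(
    vector_fields: List[str], index_params: List[Dict[str, str]]
) -> Dict[str, Dict[str, str]]:
    """Pair each vector field with the index params at the same position."""
    if len(vector_fields) != len(index_params):
        raise ValueError("vector_fields and index_params must have the same length")
    return dict(zip(vector_fields, index_params))


pairs = pair_fields_with_index_params(
    ["dense1", "dense2", "sparse"],
    [
        {"metric_type": "COSINE", "index_type": "HNSW"},
        {"metric_type": "IP", "index_type": "HNSW"},
        {"metric_type": "BM25", "index_type": "AUTOINDEX"},
    ],
)
for field, params in pairs.items():
    print(field, params)
```

Printing the pairs makes it easy to confirm, for instance, that the BM25 metric lands on the sparse field rather than on a dense one.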

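Rerank the candidates

After the first stage of retrieval, we need to rerank the candidates to get better results. You can choose WeightedRanker or RRFRanker depending on your requirements. Refer to Reranking for more information.

Here is an example of weighted reranking: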

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    builtin_function=BM25BuiltInFunction(),
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)

query = "What are the novels Lila has written and what are their contents?"

vectorstore.similarity_search(
    query, k=1, ranker_type="weighted", ranker_params={"weights": [0.6, 0.4]}
)
[Document(metadata={'pk': 454646931479252186, 'category': 'Heist/Thriller'}, page_content="In 'The Memory Thief' by Lila Rose, a charismatic thief with the ability to steal and manipulate memories is hired by a mysterious client to pull off a daring heist, but soon finds themselves trapped in a web of deceit and betrayal.")]
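As a rough illustration of the idea behind weighted reranking, the sketch below combines per-route scores with a weighted sum. It assumes the scores have already been normalized to the range [0, 1]. The function name and sample scores are hypothetical, and the normalization Milvus applies internally is not reproduced here.

```python
from typing import Dict, List


def weighted_fusion(score_lists: List[Dict[str, float]], weights: List[float]) -> List[str]:
    """Combine per-route scores (assumed already normalized to [0, 1]) by weighted sum."""
    if len(score_lists) != len(weights):
        raise ValueError("one weight is needed per retrieval route")
    combined: Dict[str, float] = {}
    for scores, weight in zip(score_lists, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)


dense_scores = {"doc_a": 0.9, "doc_b": 0.6}
sparse_scores = {"doc_b": 0.95, "doc_c": 0.4}
ranking = weighted_fusion([dense_scores, sparse_scores], weights=[0.6, 0.4])
```

Here doc_b ranks first because it scores reasonably well on both routes, even though doc_a has the highest dense score.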

Here is an example of RRF reranking:

vectorstore.similarity_search(query, k=1, ranker_type="rrf", ranker_params={"k": 100})
[Document(metadata={'category': 'Heist/Thriller', 'pk': 454646931479252186}, page_content="In 'The Memory Thief' by Lila Rose, a charismatic thief with the ability to steal and manipulate memories is hired by a mysterious client to pull off a daring heist, but soon finds themselves trapped in a web of deceit and betrayal.")]
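Reciprocal Rank Fusion ignores raw scores and works on ranks only: each result list contributes 1 / (k + rank) to a document's fused score. The following is a minimal standalone sketch of that formula with hypothetical document ids. It is not the Milvus implementation.

```python
from typing import Dict, List


def rrf_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fusion([dense_ranking, sparse_ranking], k=100)
```

A larger k flattens the difference between ranks, so documents that appear in several lists are favored over documents that rank high in only one.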

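If you do not pass any rerank parameters, the weighted rerank strategy with equal (average) weights is used by default.

Using Hybrid Search and Reranking in RAG

In RAG scenarios, the most prevalent approach to hybrid search is dense + sparse retrieval followed by reranking. The following example demonstrates straightforward end-to-end code.

Prepare the data

We use the LangChain WebBaseLoader to load documents from web sources and split them into chunks with the RecursiveCharacterTextSplitter.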

import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a WebBaseLoader instance to load documents from web sources
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
# Load documents from web sources using the loader
documents = loader.load()
# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)

# Let's take a look at one of the chunks (index 1, i.e. the second chunk)
docs[1]
Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.\nTask decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.\nAnother quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. 
Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains.\nSelf-Reflection#')
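The chunk_size and chunk_overlap parameters control how large each chunk is and how much neighboring chunks share. The sketch below shows the overlap idea with plain fixed-size character windows. It is a simplification: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and line breaks, which this hypothetical helper does not attempt.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Each chunk repeats the last character of the previous one, so content that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.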

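Load the documents into the Milvus vector store

As introduced above, we initialize the Milvus vector store and load the prepared documents into it. The store contains two vector fields: dense holds the OpenAI embeddings, and sparse is produced by the BM25 function.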

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    builtin_function=BM25BuiltInFunction(),
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
    drop_old=False,
)

Build the RAG chain

We prepare the LLM instance and the prompt, then combine them into a RAG pipeline using the LangChain Expression Language.

from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Initialize the OpenAI language model for response generation
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and provides answers to questions by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

Use LCEL (LangChain Expression Language) to build the RAG chain.

# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()
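Conceptually, the chain runs four steps in sequence: retrieve passages, format them into a context string, fill the prompt, and call the model. The sketch below spells out that data flow with plain Python functions. The stub retriever and stub generator are hypothetical stand-ins for the Milvus retriever and the LLM, and this is not how LCEL is implemented internally.

```python
from typing import Callable, List


def format_docs(docs: List[str]) -> str:
    # Join retrieved passages with blank lines (the tutorial's version reads doc.page_content).
    return "\n\n".join(docs)


def build_chain(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    template: str,
) -> Callable[[str], str]:
    """Return a function that runs retrieve -> format -> fill prompt -> generate."""

    def chain(question: str) -> str:
        context = format_docs(retrieve(question))
        prompt = template.format(context=context, question=question)
        return generate(prompt)

    return chain


# Stub components standing in for the Milvus retriever and the LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["PAL offloads computation to a Python interpreter.", "PoT is a related method."]


def fake_generate(prompt: str) -> str:
    return "echo: " + prompt


demo_chain = build_chain(
    fake_retrieve, fake_generate, "Context:\n{context}\n\nQuestion: {question}"
)
result = demo_chain("What is PAL and PoT?")
```

The question is used twice, once as the retrieval query and once inside the prompt, which is the role RunnablePassthrough plays in the LCEL version.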

Invoke the RAG chain with a specific question and retrieve the response.

query = "What is PAL and PoT?"
res = rag_chain.invoke(query)
res
'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively perform these tasks.'

Congratulations! You have built a hybrid (dense vector + sparse BM25 function) search RAG chain powered by Milvus and LangChain.