Retrieval-Augmented Generation (RAG) with Milvus and Haystack
This guide demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Haystack and Milvus.
A RAG system combines a retrieval system with a generative model to produce new text based on a given prompt. The system first retrieves relevant documents from a corpus using Milvus, then uses a generative model to generate new text based on the retrieved documents.
Haystack is the open-source Python framework by deepset for building custom applications with large language models (LLMs). Milvus is the world's most advanced open-source vector database, built for embedding similarity search and AI applications.
Prerequisites
Before running this notebook, make sure you have the following dependencies installed:
! pip install --upgrade --quiet pymilvus milvus-haystack markdown-it-py mdit_plain
If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown).
We will use the models from OpenAI. You should prepare the api key OPENAI_API_KEY as an environment variable.
import os
os.environ["OPENAI_API_KEY"] = "sk-***********"
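If you prefer not to hard-code the key in the notebook, a minimal alternative sketch (assuming you have installed the third-party python-dotenv package and keep the key in a local .env file, neither of which is part of this guide) could look like this:
import os
from dotenv import load_dotenv  # hypothetical dependency: pip install python-dotenv
load_dotenv()  # loads OPENAI_API_KEY from a local .env file, if present
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"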
Prepare the Data
We use an online text about Leonardo Da Vinci as the private knowledge store for our RAG pipeline, which is a good data source for a simple RAG pipeline.
Download it and save it as a local text file.
import os
import urllib.request
url = "https://www.gutenberg.org/cache/epub/7785/pg7785.txt"
file_path = "./davinci.txt"
if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)
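As an optional sanity check, you can preview the beginning of the downloaded file to confirm the download succeeded:
# Optional: print the first few lines of the local copy.
with open(file_path, encoding="utf-8") as f:
    for _ in range(5):
        print(f.readline().rstrip())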
Create the Indexing Pipeline
Create an indexing pipeline that converts the text into documents, splits them into sentences, and embeds them. The documents are then written to the Milvus document store.
from haystack import Pipeline
from haystack.components.converters import MarkdownToDocument
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from milvus_haystack import MilvusDocumentStore
from milvus_haystack.milvus_embedding_retriever import MilvusEmbeddingRetriever
document_store = MilvusDocumentStore(
    connection_args={"uri": "./milvus.db"},
    # connection_args={"uri": "http://localhost:19530"},
    # connection_args={"uri": YOUR_ZILLIZ_CLOUD_URI, "token": Secret.from_env_var("ZILLIZ_CLOUD_API_KEY")},
    drop_old=True,
)
For the connection_args:
- Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server uri, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, set the uri and token to the Public Endpoint and API key of your Zilliz Cloud instance, as sketched below.
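For example, a sketch of the Zilliz Cloud variant (assuming you have exported ZILLIZ_CLOUD_URI and ZILLIZ_CLOUD_API_KEY as environment variables; the variable names are illustrative) could look like this:
import os
from haystack.utils import Secret
from milvus_haystack import MilvusDocumentStore
# Hypothetical managed-cloud setup, mirroring the commented-out line above.
cloud_store = MilvusDocumentStore(
    connection_args={
        "uri": os.environ["ZILLIZ_CLOUD_URI"],  # Public Endpoint from the Zilliz Cloud console
        "token": Secret.from_env_var("ZILLIZ_CLOUD_API_KEY"),  # API key
    },
    drop_old=False,  # keep existing data on the managed cluster
)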
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", MarkdownToDocument())
indexing_pipeline.add_component(
    "splitter", DocumentSplitter(split_by="sentence", split_length=2)
)
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store))
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"converter": {"sources": [file_path]}})
print("Number of documents:", document_store.count_documents())
Converting markdown files to Documents: 100%|█| 1/
Calculating embeddings: 100%|█| 9/9 [00:05<00:00,
E20240516 10:40:32.945937 5309095 milvus_local.cpp:189] [SERVER][GetCollection][] Collecton HaystackCollection not existed
Number of documents: 277
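If you want to spot-check what was written, Haystack document stores expose a filter_documents() method; a minimal sketch (calling it without filters, which returns all documents and is fine at this small scale):
# Optional: peek at one stored chunk to verify the splitting and embedding.
docs = document_store.filter_documents()
print(docs[0].content[:200])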
Create the Retrieval Pipeline
Create a retrieval pipeline that retrieves documents from the Milvus document store using a vector similarity search engine.
question = 'Where is the painting "Warrior" currently stored?'
retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("embedder", OpenAITextEmbedder())
retrieval_pipeline.add_component(
    "retriever", MilvusEmbeddingRetriever(document_store=document_store, top_k=3)
)
retrieval_pipeline.connect("embedder", "retriever")
retrieval_results = retrieval_pipeline.run({"embedder": {"text": question}})
for doc in retrieval_results["retriever"]["documents"]:
    print(doc.content)
    print("-" * 10)
). The
composition of this oil-painting seems to have been built up on the
second cartoon, which he had made some eight years earlier, and which
was apparently taken to France in 1516 and ultimately lost.
----------
This "Baptism of Christ," which is now in the Accademia in Florence
and is in a bad state of preservation, appears to have been a
comparatively early work by Verrocchio, and to have been painted
in 1480-1482, when Leonardo would be about thirty years of age.
To about this period belongs the superb drawing of the "Warrior," now
in the Malcolm Collection in the British Museum.
----------
" Although he
completed the cartoon, the only part of the composition which he
eventually executed in colour was an incident in the foreground
which dealt with the "Battle of the Standard." One of the many
supposed copies of a study of this mural painting now hangs on the
south-east staircase in the Victoria and Albert Museum.
----------
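Each retrieved Document also typically carries a similarity score populated by the retriever, which can be useful when tuning top_k; a small sketch (assuming the score field is set, as Haystack retrievers generally do):
# Print each retrieved chunk together with its similarity score.
for doc in retrieval_results["retriever"]["documents"]:
    print(f"score={doc.score}  {doc.content[:80]!r}")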
Create the RAG Pipeline
Create a RAG pipeline that combines the MilvusEmbeddingRetriever and the OpenAIGenerator to answer the question using the retrieved documents.
from haystack.utils import Secret
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
prompt_template = """Answer the following query based on the provided context. If the context does
not include an answer, reply with 'I don't know'.\n
Query: {{query}}
Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Answer:
"""
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
rag_pipeline.add_component(
    "retriever", MilvusEmbeddingRetriever(document_store=document_store, top_k=3)
)
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipeline.add_component(
    "generator",
    OpenAIGenerator(
        api_key=Secret.from_token(os.getenv("OPENAI_API_KEY")),
        generation_kwargs={"temperature": 0},
    ),
)
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
results = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"query": question},
    }
)
print("RAG answer:", results["generator"]["replies"][0])
RAG answer: The painting "Warrior" is currently stored in the Malcolm Collection in the British Museum.
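To ask follow-up questions without rebuilding the pipeline, you can wrap it in a small helper; a minimal sketch (the ask name is our own):
def ask(query: str) -> str:
    # Re-run the same RAG pipeline for a new question.
    result = rag_pipeline.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
        }
    )
    return result["generator"]["replies"][0]
print(ask('Where is the "Baptism of Christ" currently displayed?'))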
For more information about how to use milvus-haystack, please refer to the milvus-haystack Readme.