Retrieval-Augmented Generation (RAG) with Milvus and LlamaIndex
This guide demonstrates how to build a Retrieval-Augmented Generation (RAG) system using LlamaIndex and Milvus.
The RAG system combines a retrieval system with a generative model to generate new text based on a given prompt. The system first retrieves relevant documents from a corpus using Milvus, and then uses a generative model to generate new text based on the retrieved documents.
LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs). Milvus is the world's most advanced open-source vector database, built to power embedding similarity search and AI applications.
In this notebook we are going to show a quick demo of using the MilvusVectorStore.
Before you begin
Install dependencies
The code snippets on this page require the pymilvus and llamaindex dependencies. You can install them with the following commands:
$ pip install "pymilvus>=2.4.2"
$ pip install llama-index-vector-stores-milvus
$ pip install llama-index
If you are using Google Colab, you may need to restart the runtime to enable the dependencies just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
Set up OpenAI
Let's first begin by adding the openai api key. This will allow us to access chatgpt.
import openai
openai.api_key = "sk-***********"
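If you prefer not to hard-code the key in the notebook, a minimal sketch is to read it from an environment variable instead (the variable name OPENAI_API_KEY used here assumes you have exported the key under that name):
import os
import openai

# Read the API key from the OPENAI_API_KEY environment variable (assumed to be set)
openai.api_key = os.environ["OPENAI_API_KEY"]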
Prepare data
You can download the sample data with the following commands:
! mkdir -p 'data/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'
Getting Started
Generate our data
As a first example, let's generate a document from the file paul_graham_essay.txt. It is a single essay from Paul Graham titled What I Worked On. To generate the document we will use the SimpleDirectoryReader.
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader(
input_files=["./data/paul_graham_essay.txt"]
).load_data()
print("Document ID:", documents[0].doc_id)
Document ID: 95f25e4d-f270-4650-87ce-006d69d82033
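If you want to sanity-check what was loaded, the Document objects expose the raw text. The snippet below is just an optional peek, not part of the original walkthrough:
# Optional: print the beginning of the loaded essay text
print(documents[0].text[:100])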
Create an index across the data
Now that we have a document, we can create an index and insert the document.
Please note that Milvus Lite requires pymilvus>=2.4.2.
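If you are not sure which pymilvus version is installed, a quick check (a minimal sketch) is:
import pymilvus

# Milvus Lite requires pymilvus>=2.4.2
print(pymilvus.__version__)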
# Create an index over the documents
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
Parameters of MilvusVectorStore:
- Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large scale of data, you can set up a more performant Milvus server on docker or kubernetes. In this setup, please use the server uri, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and Api key in Zilliz Cloud (see the sketch after this list).
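The sketch below contrasts the three configurations; the server address, endpoint, and token values are placeholders, not real credentials:
from llama_index.vector_stores.milvus import MilvusVectorStore

# 1. Milvus Lite: store everything in a local file
vector_store = MilvusVectorStore(uri="./milvus.db", dim=1536, overwrite=True)

# 2. Milvus server on docker/kubernetes: point uri at the server (placeholder address)
# vector_store = MilvusVectorStore(uri="http://localhost:19530", dim=1536, overwrite=True)

# 3. Zilliz Cloud: use your Public Endpoint and API key (placeholders shown)
# vector_store = MilvusVectorStore(
#     uri="https://<your-public-endpoint>",
#     token="<your-api-key>",
#     dim=1536,
#     overwrite=True,
# )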
Query the data
Now that we have our document stored in the index, we can ask questions against the index. The index will use the data stored in itself as the knowledge base for chatgpt.
query_engine = index.as_query_engine()
res = query_engine.query("What did the author learn?")
print(res)
The author learned that philosophy courses in college were boring to him, leading him to switch his focus to studying AI.
res = query_engine.query("What challenges did the disease pose for the author?")
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in her losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.
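By default the query engine retrieves only a small number of the most similar chunks for each question. If you want to control how many chunks are passed to the LLM, as_query_engine accepts a similarity_top_k argument; the value 3 below is only an illustrative choice:
# Retrieve the top 3 most similar chunks for each query
query_engine = index.as_query_engine(similarity_top_k=3)
res = query_engine.query("What did the author learn?")
print(res)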
This next test shows that overwriting removes the previous data.
from llama_index.core import Document
vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
[Document(text="The number that is being searched for is ten.")],
storage_context,
)
query_engine = index.as_query_engine()
res = query_engine.query("Who is the author?")
print(res)
The author is the individual who created the context information.
This next test shows adding additional data to an already existing index.
del index, vector_store, storage_context, query_engine
vector_store = MilvusVectorStore(uri="./milvus_demo.db", overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()
res = query_engine.query("What is the number?")
print(res)
The number is ten.
res = query_engine.query("Who is the author?")
print(res)
Paul Graham
Metadata filtering
We can generate results by filtering specific sources. The following example illustrates loading all the documents from the directory and then filtering them based on metadata.
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# Load the two documents downloaded earlier
documents_all = SimpleDirectoryReader("./data/").load_data()
vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents_all, storage_context)
We only want to retrieve documents from the file uber_2021.pdf.
filters = MetadataFilters(
filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query("What challenges did the disease pose for the author?")
print(res)
The disease posed challenges related to the adverse impact on the business and operations, including reduced demand for Mobility offerings globally, affecting travel behavior and demand. Additionally, the pandemic led to driver supply constraints, impacted by concerns regarding COVID-19, with uncertainties about when supply levels would return to normal. The rise of the Omicron variant further affected travel, resulting in advisories and restrictions that could adversely impact both driver supply and consumer demand for Mobility offerings.
We get a different result this time when retrieving from the file paul_graham_essay.txt.
filters = MetadataFilters(
filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query("What challenges did the disease pose for the author?")
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in his mother losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.