🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 整合

使用 Milvus 和 LlamaIndex 的檢索增強世代 (RAG)

Open In Colab GitHub Repository

本指南展示了如何使用 LlamaIndex 和 Milvus 建立一個檢索-增強生成 (RAG) 系統。

RAG 系統結合了檢索系統與生成模型,可根據給定的提示生成新的文字。該系統首先使用 Milvus 從語料庫中檢索相關文件,然後根據檢索到的文件使用生成模型生成新文本。

LlamaIndex是一個簡單、靈活的資料框架,可將自訂資料來源連接至大型語言模型 (LLM)。Milvus是世界上最先進的開放原始碼向量資料庫,是為了強化嵌入式相似性搜尋與 AI 應用程式而建立的。

在本筆記簿中,我們將展示使用 MilvusVectorStore 的快速示範。

開始之前

安裝相依性

本頁面的程式碼片段需要 pymilvus 和 llamaindex 的相依性。您可以使用下列指令安裝它們:

$ pip install pymilvus>=2.4.2
$ pip install llama-index-vector-stores-milvus
$ pip install llama-index

如果您使用 Google Colab,為了啟用剛安裝的相依性,您可能需要重新啟動執行時間。(按一下螢幕上方的「Runtime」功能表,然後從下拉式功能表中選擇「Restart session」)。

設定 OpenAI

讓我們先加入 openai api 金鑰。這將允許我們存取 chatgpt。

import openai

openai.api_key = "sk-***********"

準備資料

您可以使用下列指令下載範例資料:

! mkdir -p 'data/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'

開始

產生我們的資料

作為第一個範例,讓我們從檔案paul_graham_essay.txt 產生一個文件。這是一篇來自 Paul Graham 的單篇文章,標題是What I Worked On 。我們將使用 SimpleDirectoryReader 來產生文件。

from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

print("Document ID:", documents[0].doc_id)
Document ID: 95f25e4d-f270-4650-87ce-006d69d82033

在資料中建立索引

現在我們有了一個文件,我們可以建立一個索引並插入文件。

請注意,Milvus Lite需要pymilvus>=2.4.2

# Create an index over the documents
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore


vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

關於MilvusVectorStore 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存於此檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您要使用Zilliz Cloud,Milvus 的完全管理雲端服務,請調整uritoken ,對應 Zilliz Cloud 的Public Endpoint 和 Api key

查詢資料

現在我們的文件已儲存在索引中,我們可以針對索引提出問題。索引會使用本身儲存的資料作為 chatgpt 的知識庫。

query_engine = index.as_query_engine()
res = query_engine.query("What did the author learn?")
print(res)
The author learned that philosophy courses in college were boring to him, leading him to switch his focus to studying AI.
res = query_engine.query("What challenges did the disease pose for the author?")
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in her losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.

下一個測試顯示覆寫會移除先前的資料。

from llama_index.core import Document


vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text="The number that is being searched for is ten.")],
    storage_context,
)
query_engine = index.as_query_engine()
res = query_engine.query("Who is the author?")
print(res)
The author is the individual who created the context information.

下一個測試顯示在已存在的索引中加入額外的資料。

del index, vector_store, storage_context, query_engine

vector_store = MilvusVectorStore(uri="./milvus_demo.db", overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()
res = query_engine.query("What is the number?")
print(res)
The number is ten.
res = query_engine.query("Who is the author?")
print(res)
Paul Graham

元資料篩選

我們可以透過篩選特定來源來產生結果。以下範例說明從目錄載入所有文件,並隨後根據元資料過濾這些文件。

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Load all the two documents loaded before
documents_all = SimpleDirectoryReader("./data/").load_data()

vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents_all, storage_context)

我們只想擷取檔案uber_2021.pdf 中的文件。

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query("What challenges did the disease pose for the author?")

print(res)
The disease posed challenges related to the adverse impact on the business and operations, including reduced demand for Mobility offerings globally, affecting travel behavior and demand. Additionally, the pandemic led to driver supply constraints, impacted by concerns regarding COVID-19, with uncertainties about when supply levels would return to normal. The rise of the Omicron variant further affected travel, resulting in advisories and restrictions that could adversely impact both driver supply and consumer demand for Mobility offerings.

當從檔案paul_graham_essay.txt 擷取時,我們會得到不同的結果。

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query("What challenges did the disease pose for the author?")

print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in his mother losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?