使用 Ragas 進行評估
本指南展示了如何使用 Ragas 來評估建立在Milvus 之上的檢索-增強生成 (RAG) 管道。
RAG 系統結合了檢索系統與生成模型,可根據給定的提示生成新的文字。該系統首先使用 Milvus 從語料庫中檢索相關文件,然後根據檢索到的文件使用生成模型生成新文本。
Ragas是一個可以幫助您評估 RAG 管道的框架。現有的工具和框架可以幫助您建立這些管道,但是評估它和量化您的管道效能可能很困難。這就是 Ragas (RAG 評估) 的用武之地。
先決條件
在執行本筆記本之前,請確定您已安裝下列依賴項目:
$ pip install --upgrade pymilvus openai requests tqdm pandas ragas
如果您使用的是 Google Colab,為了啟用剛安裝的相依性,您可能需要重新啟動執行時(點選畫面上方的「Runtime」功能表,並從下拉式功能表中選擇「Restart session」)。
在本範例中,我們將使用 OpenAI 作為 LLM。您應該準備api key OPENAI_API_KEY
作為環境變數。
import os
os.environ["OPENAI_API_KEY"] = "sk-***********"
定義 RAG 管道
我們將定義以 Milvus 作為向量儲存、OpenAI 作為 LLM 的 RAG 類。該類包含load
方法 (將文字資料載入 Milvus)、retrieve
方法 (擷取與給定問題最相似的文字資料),以及answer
方法 (使用擷取的知識回答給定問題)。
from typing import List
from tqdm import tqdm
from openai import OpenAI
from pymilvus import MilvusClient
class RAG:
"""
RAG (Retrieval-Augmented Generation) class built upon OpenAI and Milvus.
"""
def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
self._prepare_openai(openai_client)
self._prepare_milvus(milvus_client)
def _emb_text(self, text: str) -> List[float]:
return (
self.openai_client.embeddings.create(input=text, model=self.embedding_model)
.data[0]
.embedding
)
def _prepare_openai(
self,
openai_client: OpenAI,
embedding_model: str = "text-embedding-3-small",
llm_model: str = "gpt-3.5-turbo",
):
self.openai_client = openai_client
self.embedding_model = embedding_model
self.llm_model = llm_model
self.SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
self.USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
def _prepare_milvus(
self, milvus_client: MilvusClient, collection_name: str = "rag_collection"
):
self.milvus_client = milvus_client
self.collection_name = collection_name
if self.milvus_client.has_collection(self.collection_name):
self.milvus_client.drop_collection(self.collection_name)
embedding_dim = len(self._emb_text("foo"))
self.milvus_client.create_collection(
collection_name=self.collection_name,
dimension=embedding_dim,
metric_type="IP", # Inner product distance
consistency_level="Strong", # Strong consistency level
)
def load(self, texts: List[str]):
"""
Load the text data into Milvus.
"""
data = []
for i, line in enumerate(tqdm(texts, desc="Creating embeddings")):
data.append({"id": i, "vector": self._emb_text(line), "text": line})
self.milvus_client.insert(collection_name=self.collection_name, data=data)
def retrieve(self, question: str, top_k: int = 3) -> List[str]:
"""
Retrieve the most similar text data to the given question.
"""
search_res = self.milvus_client.search(
collection_name=self.collection_name,
data=[self._emb_text(question)],
limit=top_k,
search_params={"metric_type": "IP", "params": {}}, # Inner product distance
output_fields=["text"], # Return the text field
)
retrieved_texts = [res["entity"]["text"] for res in search_res[0]]
return retrieved_texts[:top_k]
def answer(
self,
question: str,
retrieval_top_k: int = 3,
return_retrieved_text: bool = False,
):
"""
Answer the given question with the retrieved knowledge.
"""
retrieved_texts = self.retrieve(question, top_k=retrieval_top_k)
user_prompt = self.USER_PROMPT.format(
context="\n".join(retrieved_texts), question=question
)
response = self.openai_client.chat.completions.create(
model=self.llm_model,
messages=[
{"role": "system", "content": self.SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
],
)
if not return_retrieved_text:
return response.choices[0].message.content
else:
return response.choices[0].message.content, retrieved_texts
讓我們用 OpenAI 和 Milvus 客戶端初始化 RAG 類別。
openai_client = OpenAI()
milvus_client = MilvusClient(uri="./milvus_demo.db")
my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)
至於MilvusClient
的參數 :
- 將
uri
設定為本機檔案,例如./milvus.db
,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。 - 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如
http://localhost:19530
,作為您的uri
。 - 如果您想使用Zilliz Cloud(Milvus 的完全管理雲端服務),請調整
uri
和token
,與 Zilliz Cloud 中的Public Endpoint 和 Api key對應。
執行 RAG 管道並獲得結果
我們使用Milvus 開發指南作為 RAG 中的私有知識,這是簡單 RAG 管道的良好資料來源。
下載並載入 RAG 管道。
import os
import urllib.request
url = "https://raw.githubusercontent.com/milvus-io/milvus/master/DEVELOPMENT.md"
file_path = "./Milvus_DEVELOPMENT.md"
if not os.path.exists(file_path):
urllib.request.urlretrieve(url, file_path)
with open(file_path, "r") as file:
file_text = file.read()
# We simply use "# " to separate the content in the file, which can roughly separate the content of each main part of the markdown file.
text_lines = file_text.split("# ")
my_rag.load(text_lines) # Load the text data into RAG pipeline
Creating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:16<00:00, 2.80it/s]
讓我們定義一個關於開發指南文件內容的查詢問題。然後使用answer
方法取得答案和擷取的上下文文字。
question = "what is the hardware requirements specification if I want to build Milvus and run from source code?"
my_rag.answer(question, return_retrieved_text=True)
('The hardware requirements specification to build and run Milvus from source code is 8GB of RAM and 50GB of free disk space.',
['Hardware Requirements\n\nThe following specification (either physical or virtual machine resources) is recommended for Milvus to build and run from source code.\n\n```\n- 8GB of RAM\n- 50GB of free disk space\n```\n\n##',
'Building Milvus on a local OS/shell environment\n\nThe details below outline the hardware and software requirements for building on Linux and MacOS.\n\n##',
"Software Requirements\n\nAll Linux distributions are available for Milvus development. However a majority of our contributor worked with Ubuntu or CentOS systems, with a small portion of Mac (both x86_64 and Apple Silicon) contributors. If you would like Milvus to build and run on other distributions, you are more than welcome to file an issue and contribute!\n\nHere's a list of verified OS types where Milvus can successfully build and run:\n\n- Debian/Ubuntu\n- Amazon Linux\n- MacOS (x86_64)\n- MacOS (Apple Silicon)\n\n##"])
現在讓我們準備一些問題與其相對應的地面真實答案。我們從 RAG 管道取得答案和上下文。
from datasets import Dataset
import pandas as pd
question_list = [
"what is the hardware requirements specification if I want to build Milvus and run from source code?",
"What is the programming language used to write Knowhere?",
"What should be ensured before running code coverage?",
]
ground_truth_list = [
"If you want to build Milvus and run from source code, the recommended hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
"The programming language used to write Knowhere is C++.",
"Before running code coverage, you should make sure that your code changes are covered by unit tests.",
]
contexts_list = []
answer_list = []
for question in tqdm(question_list, desc="Answering questions"):
answer, contexts = my_rag.answer(question, return_retrieved_text=True)
contexts_list.append(contexts)
answer_list.append(answer)
df = pd.DataFrame(
{
"question": question_list,
"contexts": contexts_list,
"answer": answer_list,
"ground_truth": ground_truth_list,
}
)
rag_results = Dataset.from_pandas(df)
df
Answering questions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.29s/it]
問題 | 上下文 | 答案 | 地面真相 | |
---|---|---|---|---|
0 | 什麼是硬體需求規格? | [Hardware Requirements/硬體需求規格]以下是... | 硬體需求規格是什麼? | 如果您想建立 Milvus 並從來源執行... |
1 | 用什麼編程語言來寫... | [CMake & Conan\n\nMilvus 的演算法函式庫... | 用來編寫知乎的程式語言是什麼? | 用什麼編程語言來寫Knowher... |
2 | 在運行代碼覆蓋之前應確保哪些... | [Code coverage\n/nBefore submitting your pull ... | 在執行程式碼覆蓋之前,您應該確保... | 在執行程式碼覆蓋之前,您應該確保 ... |
使用 Ragas 進行評估
我們使用 Ragas 來評估 RAG 管道結果的效能。
Ragas 提供了一套易於使用的度量指標。我們以Answer relevancy
,Faithfulness
,Context recall
, 和Context precision
作為評估 RAG 管道的指標。有關指標的詳細資訊,請參閱Ragas Metrics。
from ragas import evaluate
from ragas.metrics import (
answer_relevancy,
faithfulness,
context_recall,
context_precision,
)
result = evaluate(
rag_results,
metrics=[
answer_relevancy,
faithfulness,
context_recall,
context_precision,
],
)
result
Evaluating: 0%| | 0/12 [00:00<?, ?it/s]
{'answer_relevancy': 0.9445, 'faithfulness': 1.0000, 'context_recall': 1.0000, 'context_precision': 1.0000}