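Build RAG with Milvus and Feast
In this tutorial, we will build a Retrieval-Augmented Generation (RAG) pipeline using Feast and Milvus. Feast is an open-source feature store that streamlines feature management for machine learning, providing efficient storage and retrieval of structured data for both training and real-time inference. Milvus is a high-performance vector database designed for fast similarity search, which makes it well suited to retrieving relevant documents in RAG workflows.
In essence, we will use Feast to inject documents and structured data (i.e., features) into the context of an LLM (Large Language Model) to power a RAG application, with Milvus serving as the online vector database.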
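Why Feast?
Feast solves several common issues in this flow:
- Online retrieval: At inference time, LLMs often need access to data that is not readily available and has to be precomputed from other data sources.
  - Feast manages deployment to a variety of online stores (e.g., Milvus, DynamoDB, Redis, Google Cloud Datastore) and ensures that the necessary features are consistently available and freshly computed at inference time.
- Vector search: Feast has built-in support for vector similarity search that is configured declaratively, so users can focus on their application. Milvus provides powerful and efficient vector similarity search.
- Richer structured data: Along with vector search, users can query standard structured fields to inject into the LLM context for a better user experience.
- Feature/context and versioning: Different teams within an organization are often unable to reuse data across projects and services, which leads to duplicated application logic. Models also have data dependencies that need to be versioned, for example when running A/B tests on model or prompt versions.
  - Feast lets you discover and collaborate on previously used documents and features, and it supports versioning of datasets.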
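We will:
- Deploy a local feature store with a Parquet file offline store and a Milvus online store.
- Write/materialize the data (i.e., feature values) from the offline store (a Parquet file) into the online store (Milvus).
- Serve the features using the Feast SDK with Milvus's vector search capabilities.
- Inject the documents into the LLM's context to answer questions.
This tutorial is based on the official Milvus integration guide from the Feast Repository. We strive to keep this tutorial up to date, but if you encounter any discrepancies, please refer to the official guide and feel free to open an issue in our repository so that we can make any necessary updates.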
Preparation
Dependencies
$ pip install 'feast[milvus]' openai -U -q
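If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
We will use OpenAI as our LLM provider. You can log in to its official website and prepare the OPENAI_API_KEY as an environment variable.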
import os
from openai import OpenAI
os.environ["OPENAI_API_KEY"] = "sk-**************"
llm_client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
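Hardcoding the key in a notebook works for a quick demo, but it is easy to leak. As a minimal sketch, assuming the key has already been exported in your shell, a small helper can fail early with a clear message when the variable is missing. The name `require_env` is hypothetical and is not part of Feast or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the client could be created as `OpenAI(api_key=require_env("OPENAI_API_KEY"))` instead of assigning the key in code.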
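Prepare the Data
We will use the data from the following folder as our example:
Feast RAG Feature Repo
After downloading the data, you will find the following files: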
feature_repo/
│── data/ # Contains pre-processed Wikipedia city data in Parquet format
│── example_repo.py # Defines feature views and entities for the city data
│── feature_store.yaml # Configures Milvus and feature store settings
│── test_workflow.py # Example workflow for Feast operations
Key Configuration Files
1. feature_store.yaml
This file configures the feature store infrastructure:
project: rag
provider: local
registry: data/registry.db
online_store:
  type: milvus # Uses Milvus for vector storage
  path: data/online_store.db
  vector_enabled: true # Enables vector similarity search
  embedding_dim: 384 # Dimension of our embeddings
  index_type: "FLAT" # Vector index type
  metric_type: "COSINE" # Similarity metric
offline_store:
  type: file # Uses file-based offline storage
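This configuration establishes:
- Milvus as the online store for fast vector retrieval
- File-based offline storage for historical data processing
- Vector search capabilities using COSINE similarity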
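To make the `metric_type: "COSINE"` setting concrete, here is a minimal pure-Python sketch of the quantity that metric measures. It illustrates the formula only and is not how Milvus computes it internally. The function name `cosine_similarity` is chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, so a higher value means a closer match regardless of vector length.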
2. example_repo.py
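Contains the feature definitions for our city data, including:
- Entity definitions for cities
- Feature views for city information and embeddings
- Schema specifications for the vector database
3. Data Directory
Contains our pre-processed Wikipedia city data, with:
- City descriptions and summaries
- Pre-computed embeddings (384-dimensional vectors)
- Associated metadata such as city names and states
Together, these files create a feature store that combines Milvus's vector search capabilities with Feast's feature management, so that our RAG application can retrieve relevant city information efficiently.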
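Inspect the Data
The raw feature data in this example is stored in a local Parquet file. The dataset consists of Wikipedia summaries of different cities. Let's inspect the data first.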
import pandas as pd
df = pd.read_parquet(
    "/path/to/feature_repo/data/city_wikipedia_summaries_with_embeddings.parquet"
)
df["vector"] = df["vector"].apply(lambda x: x.tolist())
embedding_length = len(df["vector"][0])
print(f"embedding length = {embedding_length}")
embedding length = 384
from IPython.display import display
display(df.head())
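| | id | item_id | event_timestamp | state | wiki_summary | sentence_chunks | vector |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2025-01-09 13:36:59.280589 | New York, New York | New York, often called New York City or simply... | New York, often called New York City or simply... | [0.1465730518102646, -0.07317650318145752, 0.0... |
| 1 | 1 | 1 | 2025-01-09 13:36:59.280589 | New York, New York | New York, often called New York City or simply... | The city comprises five boroughs, each of whic... | [0.05218901485204697, -0.08449874818325043, 0.... |
| 2 | 2 | 2 | 2025-01-09 13:36:59.280589 | New York, New York | New York, often called New York City or simply... | New York is a global center of finance and com... | [0.06769222766160965, -0.07371102273464203, -0... |
| 3 | 3 | 3 | 2025-01-09 13:36:59.280589 | New York, New York | New York, often called New York City or simply... | New York City is the epicenter of the world's ... | [0.12095861881971359, -0.04279915615916252, 0.... |
| 4 | 4 | 4 | 2025-01-09 13:36:59.280589 | New York, New York | New York, often called New York City or simply... | With an estimated population in 2022 of 8,335,... | [0.17943550646305084, -0.09458263963460922, 0.... |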
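The check above only measures the first embedding. Because `feature_store.yaml` declares `embedding_dim: 384`, it can be worth confirming that every row matches before loading anything into Milvus. The helper below is a small sketch for that; `find_bad_embeddings` is a hypothetical name and works on any iterable of sequences.

```python
from typing import Iterable, List, Sequence


def find_bad_embeddings(
    vectors: Iterable[Sequence[float]], expected_dim: int
) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]
```

Calling `find_bad_embeddings(df["vector"], 384)` on the DataFrame loaded above should return an empty list if the data agrees with the configured dimension.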
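Register Feature Definitions and Deploy the Feature Store
After downloading the feature_repo, we need to run feast apply to register the feature views and entities defined in example_repo.py and to set up the online store tables in Milvus.
Before running the command, make sure you have navigated to the feature_repo directory.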
feast apply
Load Features into Milvus
Now we load the features into Milvus. This step serializes the feature values from the offline store and writes them into Milvus.
from datetime import datetime
from feast import FeatureStore
import warnings
warnings.filterwarnings("ignore")
store = FeatureStore(repo_path="/path/to/feature_repo")
store.write_to_online_store(feature_view_name="city_embeddings", df=df)
Connecting to Milvus in local mode using /Users/jinhonglin/Desktop/feature_repo/data/online_store.db
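Note that there are now two files, online_store.db and registry.db, which store the materialized features and the schema information, respectively. We can take a look at the online_store.db file.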
pymilvus_client = store._provider._online_store._connect(store.config)
COLLECTION_NAME = pymilvus_client.list_collections()[0]
milvus_query_result = pymilvus_client.query(
    collection_name=COLLECTION_NAME,
    filter="item_id == '0'",
)
pd.DataFrame(milvus_query_result[0]).head()
| | item_id_pk | created_ts | event_ts | item_id | sentence_chunks | state | vector | wiki_summary |
|---|---|---|---|---|---|---|---|---|
| 0 | 0100000002000000070000006974656d5f696404000000... | 0 | 1736447819280589 | 0 | New York, often called New York City or simply... | New York, New York | 0.146573 | New York, often called New York City or simply... |
| 1 | 0100000002000000070000006974656d5f696404000000... | 0 | 1736447819280589 | 0 | New York, often called New York City or simply... | New York, New York | -0.073177 | New York, often called New York City or simply... |
| 2 | 0100000002000000070000006974656d5f696404000000... | 0 | 1736447819280589 | 0 | New York, often called New York City or simply... | New York, New York | 0.052114 | New York, often called New York City or simply... |
| 3 | 0100000002000000070000006974656d5f696404000000... | 0 | 1736447819280589 | 0 | New York, often called New York City or simply... | New York, New York | 0.033187 | New York, often called New York City or simply... |
| 4 | 0100000002000000070000006974656d5f696404000000... | 0 | 1736447819280589 | 0 | New York, often called New York City or simply... | New York, New York | 0.012013 | New York, often called New York City or simply... |
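The event timestamp column in this output holds a large integer rather than a readable date. Assuming it is a count of microseconds since the Unix epoch (an assumption, although the value is consistent with it), it can be decoded with the standard library alone. `micros_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_utc(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids the rounding error of float division.
    return EPOCH + timedelta(microseconds=micros)


print(micros_to_utc(1736447819280589).isoformat())
# prints 2025-01-09T18:36:59.280589+00:00
```

The decoded time is five hours ahead of the 13:36:59.280589 shown in the DataFrame earlier, which suggests that a time-zone conversion was applied when the rows were written.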
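Build RAG
1. Embedding a Query Using PyTorch and Sentence Transformers
During inference (e.g., when a user submits a chat message), we need to embed the input text. This can be thought of as a feature transformation of the input data. In this example, we do this with a small Sentence Transformer from Hugging Face.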
import torch
import torch.nn.functional as F
from feast import FeatureStore
from pymilvus import MilvusClient, DataType, FieldSchema
from transformers import AutoTokenizer, AutoModel
from example_repo import city_embeddings_feature_view, item
TOKENIZER = "sentence-transformers/all-MiniLM-L6-v2"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


def run_model(sentences, tokenizer, model):
    encoded_input = tokenizer(
        sentences, padding=True, truncation=True, return_tensors="pt"
    )
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings
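To see what the pooling and normalization steps do without loading a model, the following pure-Python sketch repeats the same arithmetic on a toy input of three two-dimensional token vectors, the last of which is padding. The helper names and the toy numbers are made up for illustration. The tensor code above remains the version the tutorial actually runs.

```python
import math
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    kept = max(sum(attention_mask), 1e-9)
    return [
        sum(tok[d] * m for tok, m in zip(token_embeddings, attention_mask)) / kept
        for d in range(dim)
    ]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]


tokens = [[1.0, 6.0], [5.0, 2.0], [9.0, 9.0]]  # third token is padding
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # prints [3.0, 4.0]
print(l2_normalize(pooled))  # prints [0.6, 0.8]
```

The padding token contributes nothing to the average, and the normalized result has length 1, which is what makes cosine similarity between two such embeddings equal to their dot product.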
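2. Fetching Real-time Vectors and Data for Online Inference
Once the query has been transformed into an embedding, the next step is to retrieve relevant documents from the vector store. At inference time, we use vector similarity search to find the most relevant document embeddings stored in the online feature store by calling retrieve_online_documents_v2(). The retrieved features can then be fed into the context of the LLM.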
question = "Which city has the largest population in New York?"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModel.from_pretrained(MODEL)
query_embedding = run_model(question, tokenizer, model)
query = query_embedding.detach().cpu().numpy().tolist()[0]
from IPython.display import display
# Retrieve top k documents
context_data = store.retrieve_online_documents_v2(
    features=[
        "city_embeddings:vector",
        "city_embeddings:item_id",
        "city_embeddings:state",
        "city_embeddings:sentence_chunks",
        "city_embeddings:wiki_summary",
    ],
    query=query,
    top_k=3,
    distance_metric="COSINE",
).to_df()
display(context_data)
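| | vector | item_id | state | sentence_chunks | wiki_summary | distance |
|---|---|---|---|---|---|---|
| 0 | [0.15548758208751678, -0.08017724752426147, -0... | 0 | New York, New York | New York, often called New York City or simply... | New York, often called New York City or simply... | 0.743023 |
| 1 | [0.15548758208751678, -0.08017724752426147, -0... | 6 | New York, New York | New York is the geographical and demographic c... | New York, often called New York City or simply... | 0.739733 |
| 2 | [0.15548758208751678, -0.08017724752426147, -0... | 7 | New York, New York | With more than 20.1 million people in its metr... | New York, often called New York City or simply... | 0.728218 |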
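Conceptually, a FLAT index answers this query by comparing the query vector against every stored vector and keeping the `top_k` best scores. The sketch below imitates that exhaustive scan in pure Python. It assumes every vector is already unit length, so a dot product stands in for cosine similarity. The function name and the in-memory `docs` dictionary are hypothetical and are unrelated to how Feast or Milvus store data.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_k_by_dot(
    query: Sequence[float], docs: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best (exhaustive scan)."""
    scored = (
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in docs.items()
    )
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


docs = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
print(top_k_by_dot([1.0, 0.0], docs, k=2))  # prints [('a', 1.0), ('b', 0.6)]
```

An exhaustive scan is exact but its cost grows linearly with the number of stored vectors, which is acceptable for a small local dataset like the one in this tutorial.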
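3. Formatting Retrieved Documents for RAG Context
After retrieving the relevant documents, we need to format the data into a structured context that downstream applications can use. This step keeps the retrieved information clean and organized so that it can be dropped into the RAG pipeline.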
def format_documents(context_df):
    output_context = ""
    unique_documents = context_df.drop_duplicates().apply(
        lambda x: "City & State = {"
        + x["state"]
        + "}\nSummary = {"
        + x["wiki_summary"].strip()
        + "}",
        axis=1,
    )
    for i, document_text in enumerate(unique_documents):
        output_context += f"****START DOCUMENT {i}****\n{document_text.strip()}\n****END DOCUMENT {i}****"
    return output_context


RAG_CONTEXT = format_documents(context_data[["state", "wiki_summary"]])
print(RAG_CONTEXT)
****START DOCUMENT 0****
City & State = {New York, New York}
Summary = {New York, often called New York City or simply NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each of which is coextensive with a respective county. New York is a global center of finance and commerce, culture and technology, entertainment and media, academics and scientific output, and the arts and fashion, and, as home to the headquarters of the United Nations, is an important center for international diplomacy. New York City is the epicenter of the world's principal metropolitan economy.
With an estimated population in 2022 of 8,335,897 distributed over 300.46 square miles (778.2 km2), the city is the most densely populated major city in the United States. New York has more than double the population of Los Angeles, the nation's second-most populous city. New York is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the U.S. by both population and urban area. With more than 20.1 million people in its metropolitan statistical area and 23.5 million in its combined statistical area as of 2020, New York City is one of the world's most populous megacities. The city and its metropolitan area are the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. In 2021, the city was home to nearly 3.1 million residents born outside the U.S., the largest foreign-born population of any city in the world.
New York City traces its origins to Fort Amsterdam and a trading post founded on the southern tip of Manhattan Island by Dutch colonists in approximately 1624. The settlement was named New Amsterdam (Dutch: Nieuw Amsterdam) in 1626 and was chartered as a city in 1653. The city came under English control in 1664 and was temporarily renamed New York after King Charles II granted the lands to his brother, the Duke of York. before being permanently renamed New York in November 1674. New York City was the capital of the United States from 1785 until 1790. The modern city was formed by the 1898 consolidation of its five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island, and has been the largest U.S. city ever since.
Anchored by Wall Street in the Financial District of Lower Manhattan, New York City has been called both the world's premier financial and fintech center and the most economically powerful city in the world. As of 2022, the New York metropolitan area is the largest metropolitan economy in the world with a gross metropolitan product of over US$2.16 trillion. If the New York metropolitan area were its own country, it would have the tenth-largest economy in the world. The city is home to the world's two largest stock exchanges by market capitalization of their listed companies: the New York Stock Exchange and Nasdaq. New York City is an established safe haven for global investors. As of 2023, New York City is the most expensive city in the world for expatriates to live. New York City is home to the highest number of billionaires, individuals of ultra-high net worth (greater than US$30 million), and millionaires of any city in the world.}
****END DOCUMENT 0****
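Only one document appears in this output even though three rows were retrieved, because all three chunks share the same state and wiki_summary and the duplicates are dropped before formatting. The pure-Python sketch below shows the same de-duplication on plain dictionaries. `format_unique_documents` is a hypothetical name, and unlike the pandas version above it separates consecutive documents with a newline.

```python
from typing import Dict, Iterable, List


def format_unique_documents(rows: Iterable[Dict[str, str]]) -> str:
    """Format each distinct (state, wiki_summary) pair once, in first-seen order."""
    seen = set()
    blocks: List[str] = []
    for row in rows:
        key = (row["state"], row["wiki_summary"])
        if key in seen:
            continue  # Same document retrieved through a different chunk.
        seen.add(key)
        i = len(blocks)
        body = (
            "City & State = {" + row["state"] + "}\n"
            "Summary = {" + row["wiki_summary"].strip() + "}"
        )
        blocks.append(f"****START DOCUMENT {i}****\n{body}\n****END DOCUMENT {i}****")
    return "\n".join(blocks)
```

Passing `context_data[["state", "wiki_summary"]].to_dict("records")` to this function would yield a single document for the retrieval result shown earlier.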
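4. Generating Responses Using Retrieved Context
Now that we have formatted the retrieved documents, we can integrate them into a structured prompt for response generation. The prompt instructs the assistant to rely only on the retrieved information, which reduces the risk of hallucinated answers.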
FULL_PROMPT = f"""
You are an assistant for answering questions about states. You will be provided documentation from Wikipedia. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Here are document(s) you should use when answering the user's question:
{RAG_CONTEXT}
"""
response = llm_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": FULL_PROMPT},
        {"role": "user", "content": question},
    ],
)
print("\n".join([c.message.content for c in response.choices]))
The city with the largest population in New York is New York City itself, often referred to as NYC. It is the most populous city in the United States, with an estimated population of about 8.3 million in 2022.