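Build RAG with Milvus + PII Masker
PII (Personally Identifiable Information) is a type of sensitive data that can be used to identify an individual.
PII Masker, developed by HydroX AI, is an advanced open-source tool designed to protect your sensitive data using cutting-edge AI models. Whether you are handling customer data, performing data analysis, or ensuring compliance with privacy regulations, PII Masker provides a robust, scalable solution to keep your information secure.
In this tutorial, we show how to use PII Masker with Milvus to protect private data in RAG (Retrieval-Augmented Generation) applications. By combining PII Masker's data masking with Milvus's efficient data retrieval, you can build a secure, privacy-compliant pipeline for handling sensitive data. This approach helps your application meet privacy standards and protect user data.
Preparation
Get started with PII Masker
Follow the PII Masker installation guide to install the required dependencies and download the model. Here is a brief guide: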
$ git clone https://github.com/HydroXai/pii-masker-v1.git
$ cd pii-masker-v1/pii-masker
Download the model from https://huggingface.co/hydroxai/pii_model_weight and use it to replace the files in pii-masker/output_model/deberta3base_1024/
Dependencies and Environment
$ pip install --upgrade pymilvus openai requests tqdm dataset
In this example we use OpenAI as the LLM. Prepare your API key and expose it as the environment variable OPENAI_API_KEY.
$ export OPENAI_API_KEY=sk-***********
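You can then create a Python script or a Jupyter notebook and run the following code.
Prepare the data
For testing and demonstration purposes, let's generate a few fake lines that contain PII.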
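Before running the rest of the code, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch; the helper name `require_env` is hypothetical and not part of any library used in this tutorial.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the notebook fails early with a readable message instead of failing later inside the first API request.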
text_lines = [
"Alice Johnson, a resident of Dublin, Ireland, attended a flower festival at Hyde Park on May 15, 2023. She entered the park at noon using her digital passport, number 23456789. Alice spent the afternoon admiring various flowers and plants, attending a gardening workshop, and having a light snack at one of the food stalls. While there, she met another visitor, Mr. Thompson, who was visiting from London. They exchanged tips on gardening and shared contact information: Mr. Thompson's address was 492, Pine Lane, and his cell phone number was +018.221.431-4517. Alice gave her contact details: home address, Ranch 16",
"Hiroshi Tanaka, a businessman from Tokyo, Japan, went to attend a tech expo at the Berlin Convention Center on November 10, 2023. He registered for the event at 9 AM using his digital passport, number Q-24567680. Hiroshi networked with industry professionals, participated in panel discussions, and had lunch with some potential partners. One of the partners he met was from Munich, and they decided to keep in touch: the partner's office address was given as house No. 12, Road 7, Block E. Hiroshi offered his business card with the address, 654 Sakura Road, Tokyo.",
"In an online forum discussion about culinary exchanges around the world, several participants shared their experiences. One user, Male, with the email 2022johndoe@example.com, shared his insights. He mentioned his ID code 1A2B3C4D5E and reference number L87654321 while residing in Italy but originally from Australia. He provided his +0-777-123-4567 and described his address at 456, Flavorful Lane, Pasta, IT, 00100.",
"Another user joined the conversation on the topic of international volunteering opportunities. Identified as Female, she used the email 2023janedoe@example.com to share her story. She noted her 9876543210123 and M1234567890123 while residing in Germany but originally from Brazil. She provided her +0-333-987-6543 and described her address at 789, Sunny Side Street, Berlin, DE, 10178.",
]
Mask the data with PIIMasker
Let's initialize the PIIMasker object and load the model.
from model import PIIMasker
masker = PIIMasker()
We then mask the PII in each line of the list and print the masked results.
masked_results = []
for full_text in text_lines:
    masked_text, _ = masker.mask_pii(full_text)
    masked_results.append(masked_text)
for res in masked_results:
    print(res + "\n")
Alice [B-NAME] , a resident of Dublin Ireland attended flower festival at Hyde Park on May 15 2023 [B-PHONE_NUM] She entered the park noon using her digital passport number 23 [B-ID_NUM] [B-NAME] afternoon admiring various flowers and plants attending gardening workshop having light snack one food stalls While there she met another visitor Mr Thompson who was visiting from London They exchanged tips shared contact information : ' s address 492 [I-STREET_ADDRESS] his cell phone + [B-PHONE_NUM] [B-NAME] details home Ranch [B-STREET_ADDRESS]
Hiroshi [B-NAME] [I-STREET_ADDRESS] a businessman from Tokyo Japan went to attend tech expo at the Berlin Convention Center on November 10 2023 . He registered for event 9 AM using his digital passport number Q [B-ID_NUM] [B-NAME] with industry professionals participated in panel discussions and had lunch some potential partners One of he met was Munich they decided keep touch : partner ' s office address given as house No [I-STREET_ADDRESS] [B-NAME] business card 654 [B-STREET_ADDRESS]
In an online forum discussion about culinary exchanges around the world [I-STREET_ADDRESS] several participants shared their experiences [I-STREET_ADDRESS] One user Male with email 2022 [B-EMAIL] his insights He mentioned ID code 1 [B-ID_NUM] [I-PHONE_NUM] reference number L [B-ID_NUM] residing in Italy but originally from Australia provided + [B-PHONE_NUM] [I-PHONE_NUM] described address at 456 [I-STREET_ADDRESS]
Another user joined the conversation on topic of international volunteering opportunities . Identified as Female , she used email 2023 [B-EMAIL] share her story She noted 98 [B-ID_NUM] [I-PHONE_NUM] M [B-ID_NUM] residing in Germany but originally from Brazil provided + [B-PHONE_NUM] [I-PHONE_NUM] described address at 789 [I-STREET_ADDRESS] DE 10 178
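To get a quick overview of what was masked, the tags in the output can be counted per entity type. This is a small sketch that relies only on the tag shapes visible in the output above (`[B-NAME]`, `[I-STREET_ADDRESS]`, and so on); treating `B-` and `I-` as interchangeable prefixes is an assumption made for counting purposes, and `count_mask_tags` is a hypothetical helper, not part of PII Masker.

```python
import re
from collections import Counter

# Matches tags such as [B-NAME] or [I-STREET_ADDRESS] seen in the masked output.
TAG_PATTERN = re.compile(r"\[(?:B|I)-([A-Z_]+)\]")


def count_mask_tags(text: str) -> Counter:
    """Count mask tags per entity type, ignoring the B-/I- prefix."""
    return Counter(TAG_PATTERN.findall(text))


sample = (
    "One user Male with email 2022 [B-EMAIL] his insights "
    "He mentioned ID code 1 [B-ID_NUM] [I-PHONE_NUM]"
)
print(count_mask_tags(sample))
```

In the tutorial flow, the same function could be applied to each entry of `masked_results`.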
Prepare the Embedding Model
We initialize the OpenAI client to prepare the embedding model.
from openai import OpenAI
openai_client = OpenAI()
Define a function that generates text embeddings with the OpenAI client. We use the text-embedding-3-small model as an example.
def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
Generate a test embedding and print its dimension and first few elements.
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]
Load data into Milvus
Create the collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
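As for the arguments of MilvusClient:
- Setting the uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in that file.
- If you have large-scale data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server address and port as your uri, e.g. http://localhost:19530. If you enable authentication on Milvus, use "<your_username>:<your_password>" as the token; otherwise, do not set the token.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.
Check whether the collection already exists, and drop it if it does.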
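The three deployment options differ only in the `uri` and `token` values, so one way to switch between them without editing code is to read both from the environment. The sketch below is an illustration under assumptions: the variable names `MILVUS_URI` and `MILVUS_TOKEN` are hypothetical, and the function only builds a dictionary of keyword arguments rather than opening a connection.

```python
import os


def milvus_connection_kwargs(env=None) -> dict:
    """Build keyword arguments for MilvusClient from (hypothetical) env vars."""
    env = os.environ if env is None else env
    # Default to a local Milvus Lite file when no URI is configured.
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    # "<your_username>:<your_password>" for a server with authentication,
    # or an API key for Zilliz Cloud. Left out entirely when not set.
    token = env.get("MILVUS_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs


print(milvus_connection_kwargs({}))
```

The result could then be passed on as `MilvusClient(**milvus_connection_kwargs())`.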
collection_name = "my_rag_collection"
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
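Create a new collection with the specified parameters.
If we don't specify any field information, Milvus automatically creates a default id field as the primary key and a vector field to store the vector data. A reserved JSON field stores any fields that are not defined in the schema, together with their values.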
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
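With the `IP` metric, a larger score means a closer match, and for unit-length vectors the inner product equals cosine similarity. The following pure-Python sketch illustrates that ranking on toy two-dimensional vectors. It is not how Milvus computes or indexes anything internally, and the helper names are hypothetical.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def top_k_by_ip(query, vectors, k=1):
    """Return (index, score) pairs for the k vectors with the highest inner product."""
    scores = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


docs = [normalize([1.0, 0.0]), normalize([1.0, 1.0]), normalize([0.0, 1.0])]
query = normalize([0.9, 0.1])
print(top_k_by_ip(query, docs, k=1))
```

The query points mostly along the first axis, so the first toy document ranks highest.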
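Insert data
Iterate through the masked text lines, create embeddings, and then insert the data into Milvus.
Note the new field text, which is not defined in the collection schema. It is automatically added to the reserved JSON dynamic field, and at a high level it can be treated as a normal field.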
from tqdm import tqdm
data = []
for i, line in enumerate(tqdm(masked_results, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})
milvus_client.insert(collection_name=collection_name, data=data)
Creating embeddings: 100%|██████████| 4/4 [00:01<00:00, 2.60it/s]
{'insert_count': 4, 'ids': [0, 1, 2, 3], 'cost': 0}
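The loop above sends one embedding request per line. For larger datasets, the embeddings endpoint also accepts a list of inputs, so several lines can be embedded in a single request. The sketch below shows one possible shape for such a helper. `emb_texts` is a hypothetical name, and the `FakeEmbeddings` stand-in exists only so the example runs offline; with the real client you would pass `openai_client` instead.

```python
from types import SimpleNamespace


def emb_texts(client, texts, model="text-embedding-3-small"):
    """Embed several texts with one request and return vectors in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Sort by the index field so each vector lines up with its input text.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class FakeEmbeddings:
    """Offline stand-in (assumption): returns a 1-d vector holding the text length."""

    def create(self, input, model):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
print(emb_texts(fake_client, ["ab", "abcd"]))
```

In the tutorial flow, `emb_texts(openai_client, masked_results)` would return one vector per masked line, which can then be zipped with the lines to build `data`.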
Build RAG
Retrieve data for a query
Let's specify a question about the documents.
question = "What was the office address of Hiroshi's partner from Munich?"
Search the collection for the question and retrieve the semantic top-1 match.
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=1,  # Return top 1 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
Let's take a look at the search results for the query.
import json
retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
[
"Hiroshi [B-NAME] [I-STREET_ADDRESS] a businessman from Tokyo Japan went to attend tech expo at the Berlin Convention Center on November 10 2023 . He registered for event 9 AM using his digital passport number Q [B-ID_NUM] [B-NAME] with industry professionals participated in panel discussions and had lunch some potential partners One of he met was Munich they decided keep touch : partner ' s office address given as house No [I-STREET_ADDRESS] [B-NAME] business card 654 [B-STREET_ADDRESS]",
0.6544462442398071
]
]
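Each hit carries its text under `entity` and its score under `distance`, as used in the list comprehension above. When retrieving more than one result, it may be useful to drop weak matches before building the context. The sketch below assumes that hit structure; the threshold of 0.5 is an arbitrary illustrative value, not a recommendation, and `filter_hits` is a hypothetical helper.

```python
def filter_hits(hits, min_score):
    """Keep (text, score) pairs whose inner-product score is at least min_score."""
    return [
        (hit["entity"]["text"], hit["distance"])
        for hit in hits
        if hit["distance"] >= min_score
    ]


# Hand-written hits that mimic the structure of search_res[0].
example_hits = [
    {"entity": {"text": "masked passage A"}, "distance": 0.65},
    {"entity": {"text": "masked passage B"}, "distance": 0.21},
]
print(filter_hits(example_hits, min_score=0.5))
```

Against the real results, this would be called as `filter_hits(search_res[0], min_score=0.5)`.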
Use the LLM to get a RAG response
Convert the retrieved documents into a string.
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
Define the system and user prompts for the LLM.
Note: we tell the LLM to answer "I don't know" if the snippets contain no useful information.
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. If there is no useful information in the snippets, just say "I don't know".
AI:
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
Use the OpenAI chat model to generate a response based on the prompts.
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
I don't know.
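Here we can see that, because the PII has been replaced with masks, the LLM cannot obtain the PII from the context. In this way, we can effectively protect users' privacy.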
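Masking quality depends on the model, and the masked output shown earlier still contains some fragments of the original text (for example, "Mr Thompson" in the first passage). When the original values are known, as they are for test data, a crude verbatim check can flag such leftovers before the text is stored or sent to an LLM. The sketch below is a hypothetical helper for that purpose; it only catches exact substring matches and is not a guarantee that no PII remains.

```python
def find_leaks(text, known_pii):
    """Return the known PII strings that still appear verbatim in text."""
    lowered = text.lower()
    return [item for item in known_pii if item.lower() in lowered]


# Excerpt from the first masked passage printed earlier.
masked_excerpt = (
    "While there she met another visitor Mr Thompson who was visiting from London "
    "They exchanged tips shared contact information : ' s address 492 "
    "[I-STREET_ADDRESS] his cell phone + [B-PHONE_NUM]"
)
known_values = ["Thompson", "Pine Lane", "+018.221.431-4517"]
print(find_leaks(masked_excerpt, known_values))
```

The same check could be run on `response.choices[0].message.content` to confirm that an answer does not repeat a known value.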