🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 整合

使用 Milvus 和擁抱臉回答問題

Open In Colab GitHub Repository

基於語意搜尋的問題回答系統的工作方式,是針對給定的查詢問題,從問答對資料集中找出最相似的問題。一旦找出最相似的問題,資料集中相應的答案就會被視為查詢的答案。此方法依賴語意相似性量測來判斷問題之間的相似性,並擷取相關的答案。

本教學說明如何使用Hugging Face作為資料載入與嵌入產生器來處理資料,並使用Milvus作為向量資料庫來進行語意搜尋,以建立一個問題回答系統。

開始之前

您需要確認所有所需的相依性都已安裝:

  • pymilvus: python 套件可與 Milvus 或 Zilliz Cloud 所提供的向量資料庫服務搭配使用。
  • datasets,transformers: Hugging Face 套件可管理資料並運用模型。
  • torch:一個功能強大的函式庫提供高效的張量計算和深度學習工具。
$ pip install --upgrade pymilvus transformers datasets torch

如果您使用的是 Google Colab,為了啟用剛安裝的相依性,您可能需要重新啟動運行時間。(按一下螢幕上方的「Runtime」功能表,並從下拉式功能表中選擇「Restart session」)。

準備資料

在本節中,我們將載入擁抱臉資料集中的範例問題-答案對。作為示範,我們只從SQuAD 的驗證分割中抽取部分資料。

from datasets import load_dataset


DATASET = "squad"  # Name of dataset from HuggingFace Datasets
INSERT_RATIO = 0.001  # Ratio of example dataset to be inserted

data = load_dataset(DATASET, split="validation")
# Generates a fixed subset. To generate a random subset, remove the seed.
data = data.train_test_split(test_size=INSERT_RATIO, seed=42)["test"]
# Clean up the data structure in the dataset.
data = data.map(
    lambda val: {"answer": val["answers"]["text"][0]},
    remove_columns=["id", "answers", "context"],
)

# View summary of example data
print(data)
Dataset({
    features: ['title', 'question', 'answer'],
    num_rows: 11
})

要產生問題的嵌入模型,您可以從 Hugging Face Models 中選擇一個文字嵌入模型。在本教程中,我們將以一個小型的句子嵌入模型all-MiniLM-L6-v2為例。

from transformers import AutoTokenizer, AutoModel
import torch

MODEL = (
    "sentence-transformers/all-MiniLM-L6-v2"  # Name of model from HuggingFace Models
)
INFERENCE_BATCH_SIZE = 64  # Batch size of model inference

# Load tokenizer & model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)


def encode_text(batch):
    # Tokenize sentences
    encoded_input = tokenizer(
        batch["question"], padding=True, truncation=True, return_tensors="pt"
    )

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    token_embeddings = model_output[0]
    attention_mask = encoded_input["attention_mask"]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    sentence_embeddings = torch.sum(
        token_embeddings * input_mask_expanded, 1
    ) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Normalize embeddings
    batch["question_embedding"] = torch.nn.functional.normalize(
        sentence_embeddings, p=2, dim=1
    )
    return batch


data = data.map(encode_text, batched=True, batch_size=INFERENCE_BATCH_SIZE)
data_list = data.to_list()

插入資料

現在我們已經準備好問題嵌入的問題-答案對。下一步就是將它們插入向量資料庫。

我們首先需要連線到 Milvus 服務,並建立一個 Milvus 套件。

from pymilvus import MilvusClient


MILVUS_URI = "./huggingface_milvus_test.db"  # Connection URI
COLLECTION_NAME = "huggingface_test"  # Collection name
DIMENSION = 384  # Embedding dimension depending on model

milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    dimension=DIMENSION,
    auto_id=True,  # Enable auto id
    enable_dynamic_field=True,  # Enable dynamic fields
    vector_field_name="question_embedding",  # Map vector field name and embedding column in dataset
    consistency_level="Strong",  # To enable search with latest data
)

至於MilvusClient 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您想使用Zilliz Cloud(Milvus 的完全管理雲端服務),請調整uritoken ,與 Zilliz Cloud 中的Public Endpoint 和 Api key對應。

將所有資料插入收集:

milvus_client.insert(collection_name=COLLECTION_NAME, data=data_list)
{'insert_count': 11,
 'ids': [450072488481390592, 450072488481390593, 450072488481390594, 450072488481390595, 450072488481390596, 450072488481390597, 450072488481390598, 450072488481390599, 450072488481390600, 450072488481390601, 450072488481390602],
 'cost': 0}

提出問題

一旦所有資料都插入 Milvus,我們就可以提出問題,看看最接近的答案是什麼。

questions = {
    "question": [
        "What is LGM?",
        "When did Massachusetts first mandate that children be educated in schools?",
    ]
}

# Generate question embeddings
question_embeddings = [v.tolist() for v in encode_text(questions)["question_embedding"]]

# Search across Milvus
search_results = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=question_embeddings,
    limit=3,  # How many search results to output
    output_fields=["answer", "question"],  # Include these fields in search results
)

# Print out results
for q, res in zip(questions["question"], search_results):
    print("Question:", q)
    for r in res:
        print(
            {
                "answer": r["entity"]["answer"],
                "score": r["distance"],
                "original question": r["entity"]["question"],
            }
        )
    print("\n")
Question: What is LGM?
{'answer': 'Last Glacial Maximum', 'score': 0.956273078918457, 'original question': 'What does LGM stands for?'}
{'answer': 'coordinate the response to the embargo', 'score': 0.2120140939950943, 'original question': 'Why was this short termed organization created?'}
{'answer': '"Reducibility Among Combinatorial Problems"', 'score': 0.1945795714855194, 'original question': 'What is the paper written by Richard Karp in 1972 that ushered in a new era of understanding between intractability and NP-complete problems?'}


Question: When did Massachusetts first mandate that children be educated in schools?
{'answer': '1852', 'score': 0.9709997177124023, 'original question': 'In what year did Massachusetts first require children to be educated in schools?'}
{'answer': 'several regional colleges and universities', 'score': 0.34164726734161377, 'original question': 'In 1890, who did the university decide to team up with?'}
{'answer': '1962', 'score': 0.1931006908416748, 'original question': 'When were stromules discovered?'}

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?