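Question Answering Using Milvus and Cohere
This page describes how to build a question-answering system based on the SQuAD dataset, using Milvus as the vector database and Cohere as the embedding system.
Before you begin
The code snippets on this page require pymilvus, cohere, pandas, numpy, and tqdm. Among these packages, pymilvus is the client for Milvus. If any of them is missing from your system, run the following command to install them: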
pip install pymilvus cohere pandas numpy tqdm
Then load the modules used in this guide.
import time
import cohere
import pandas
import numpy as np
from tqdm import tqdm
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
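Parameters
The parameters used in the snippets below are listed here. Some of them need to be changed to match your environment. Each parameter has a description next to it.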
FILE = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json' # The SQuAD dataset url
COLLECTION_NAME = 'question_answering_db' # Collection name
DIMENSION = 1024 # Embedding size; must match the model output (embed-multilingual-v3.0 returns 1024-dimensional vectors)
COUNT = 5000 # How many questions to embed and insert into Milvus
BATCH_SIZE = 96 # How large of batches to use for embedding and insertion
MILVUS_HOST = 'localhost' # Milvus server host
MILVUS_PORT = '19530' # Milvus server port
COHERE_API_KEY = 'replace-this-with-the-cohere-api-key' # API key obtained from Cohere
To learn more about the model and dataset used on this page, see co:here and SQuAD.
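Since the API key has to be changed per environment, one option is to read it from the environment rather than editing the source. The following is a minimal sketch, assuming the key has been exported in an environment variable named COHERE_API_KEY; that variable name is a hypothetical choice for illustration, not something this page prescribes.

```python
import os

# Assumption: the key is exported as the environment variable COHERE_API_KEY.
# If it is not set, fall back to the placeholder used in the parameter list.
COHERE_API_KEY = os.environ.get(
    'COHERE_API_KEY',
    'replace-this-with-the-cohere-api-key',
)
```

This keeps the key out of files that may end up in version control.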
Prepare the dataset
In this example, we use the Stanford Question Answering Dataset (SQuAD) as the source of truth for answering questions. The dataset comes as a JSON file, which we load with pandas.
# Download the dataset
dataset = pandas.read_json(FILE)
# Clean up the dataset by grabbing all the question answer pairs
simplified_records = []
for x in dataset['data']:
    for y in x['paragraphs']:
        for z in y['qas']:
            if len(z['answers']) != 0:
                simplified_records.append({'question': z['question'], 'answer': z['answers'][0]['text']})
# Grab the amount of records based on COUNT
simplified_records = pandas.DataFrame.from_records(simplified_records)
simplified_records = simplified_records.sample(n=min(COUNT, len(simplified_records)), random_state = 42)
# Check the length of the cleaned dataset matches count
print(len(simplified_records))
The output should be the number of records kept in the dataset:
5000
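To make the clean-up step easier to follow, here is a minimal sketch that runs the same flattening logic on a tiny hand-written sample instead of the downloaded file. The sample data and the helper name flatten_squad are made up for illustration; only the nesting (articles, paragraphs, question-answer entries) mirrors what the loop above traverses.

```python
# Hypothetical sample that mimics the nesting traversed above:
# data -> paragraphs -> qas -> answers.
sample = {
    'data': [
        {
            'title': 'Example article',
            'paragraphs': [
                {
                    'context': 'Example context.',
                    'qas': [
                        {'question': 'Q1?', 'answers': [{'text': 'A1'}, {'text': 'A1 alt'}]},
                        # Entries with an empty answer list are skipped.
                        {'question': 'Q2?', 'answers': []},
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(articles):
    """Return one {'question', 'answer'} record per answerable question."""
    records = []
    for article in articles:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                if qa['answers']:
                    # Keep only the first listed answer, as the loop above does.
                    records.append({'question': qa['question'], 'answer': qa['answers'][0]['text']})
    return records


records = flatten_squad(sample['data'])
print(records)
```

Only the answerable question survives, paired with its first answer.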
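Create a collection
This section introduces Milvus and sets up the database for this use case. In Milvus, we need to create a collection and build an index on it.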
# Connect to Milvus Database
connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Create a collection that includes the id, original question, answer, and question embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='original_question', dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name='answer', dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name='original_question_embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
# Create an IVF_FLAT index for collection.
index_params = {
    'metric_type': 'IP',
    'index_type': "IVF_FLAT",
    'params': {"nlist": 1024}
}
collection.create_index(field_name="original_question_embedding", index_params=index_params)
collection.load()
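The index above uses the inner product (IP) metric, where a larger score means a closer match. As a minimal sketch, assuming the vectors are scaled to unit length (an assumption, since this page does not say whether the Cohere embeddings are normalized), the inner product equals cosine similarity. The helper names below are illustrative and are not part of pymilvus.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_ip = inner_product(a, b)                         # depends on vector length
unit_ip = inner_product(normalize(a), normalize(b))  # length-independent
cos = cosine(a, b)

print(raw_ip, unit_ip, cos)
```

For vectors that are not unit length, the raw inner product also grows with vector magnitude, so scores are comparable only between vectors of similar norm.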
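Insert data
After creating the collection, we can start inserting data. This takes three steps:
- reading the data,
- embedding the original questions, and
- inserting the data into the collection we just created in Milvus.
In this example, the data consists of the original question, the embedding of the original question, and the answer to the original question.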
# Set up a co:here client.
cohere_client = cohere.Client(COHERE_API_KEY)
# Extract embeddings from questions using Cohere
def embed(texts, input_type):
    res = cohere_client.embed(texts, model='embed-multilingual-v3.0', input_type=input_type)
    return res.embeddings
# Insert each question, answer, and question embedding
num_batches = COUNT // BATCH_SIZE + 1  # Integer batch count; each batch holds at most BATCH_SIZE rows
for batch in tqdm(np.array_split(simplified_records, num_batches)):
    questions = batch['question'].tolist()
    answers = batch['answer'].tolist()
    embeddings = embed(questions, "search_document")
    data = [
        {
            'original_question': question,
            'answer': answers[i],
            'original_question_embedding': embeddings[i]
        } for i, question in enumerate(questions)
    ]
    collection.insert(data=data)
    # Needs `import time`; pauses between batches, presumably to stay under the Cohere API rate limit
    time.sleep(10)
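The batching used for embedding and insertion can also be written without numpy. The following is a minimal sketch of fixed-size batching in plain Python; the helper name batched and the placeholder questions are illustrative, and unlike np.array_split (which produces nearly equal parts) this variant fills every batch to BATCH_SIZE and leaves the remainder in the last one.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


BATCH_SIZE = 96
# Placeholder questions standing in for the 5000 sampled records.
questions = [f'question {i}' for i in range(5000)]

batches = list(batched(questions, BATCH_SIZE))
print(len(batches), len(batches[0]), len(batches[-1]))  # 53 96 8
```

Either approach keeps every embedding request at or below BATCH_SIZE texts.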
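Ask questions
Once all the data is inserted into the Milvus collection, we can ask the system questions: take a question phrase, embed it with Cohere, and search the collection with it.
Searches performed right after inserting data may be slightly slower, because data that has not been indexed yet is searched by brute force. Once the new data is automatically indexed, searches speed up.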
# Search the collection for an answer to a question text
def search(text, top_k = 5):
    # The collection was indexed with IVF_FLAT and the IP metric.
    # nprobe (number of clusters to scan) is an illustrative value; tune it for your data.
    search_params = {'metric_type': 'IP', 'params': {'nprobe': 16}}
    results = collection.search(
        data = embed([text], "search_query"),  # Embed the question
        anns_field='original_question_embedding',
        param=search_params,
        limit = top_k,  # Limit to top_k results per search
        output_fields=['original_question', 'answer']  # Include the original question and answer in the result
    )
    distances = results[0].distances
    entities = [ x.entity.to_dict()['entity'] for x in results[0] ]
    ret = [ {
        "answer": x[1]["answer"],
        "distance": x[0],
        "original_question": x[1]['original_question']
    } for x in zip(distances, entities)]
    return ret
# Ask these questions
search_questions = ['What kills bacteria?', 'What\'s the biggest dog?']
# Collect the candidates for each question; each candidate holds the answer, similarity score, and original question
ret = [ { "question": x, "candidates": search(x) } for x in search_questions ]
The output should be similar to the following:
# Output
#
# [
#     {
#         "question": "What kills bacteria?",
#         "candidates": [
#             {
#                 "answer": "farming",
#                 "distance": 0.6261022090911865,
#                 "original_question": "What makes bacteria resistant to antibiotic treatment?"
#             },
#             {
#                 "answer": "Phage therapy",
#                 "distance": 0.6093736886978149,
#                 "original_question": "What has been talked about to treat resistant bacteria?"
#             },
#             {
#                 "answer": "oral contraceptives",
#                 "distance": 0.5902313590049744,
#                 "original_question": "In therapy, what does the antibacterial interact with?"
#             },
#             {
#                 "answer": "slowing down the multiplication of bacteria or killing the bacteria",
#                 "distance": 0.5874154567718506,
#                 "original_question": "How do antibiotics work?"
#             },
#             {
#                 "answer": "in intensive farming to promote animal growth",
#                 "distance": 0.5667208433151245,
#                 "original_question": "Besides in treating human disease where else are antibiotics used?"
#             }
#         ]
#     },
#     {
#         "question": "What's the biggest dog?",
#         "candidates": [
#             {
#                 "answer": "English Mastiff",
#                 "distance": 0.7875324487686157,
#                 "original_question": "What breed was the largest dog known to have lived?"
#             },
#             {
#                 "answer": "forest elephants",
#                 "distance": 0.5886962413787842,
#                 "original_question": "What large animals reside in the national park?"
#             },
#             {
#                 "answer": "Rico",
#                 "distance": 0.5634892582893372,
#                 "original_question": "What is the name of the dog that could ID over 200 things?"
#             },
#             {
#                 "answer": "Iditarod Trail Sled Dog Race",
#                 "distance": 0.546872615814209,
#                 "original_question": "Which dog-sled race in Alaska is the most famous?"
#             },
#             {
#                 "answer": "part of the family",
#                 "distance": 0.5387814044952393,
#                 "original_question": "Most people today describe their dogs as what?"
#             }
#         ]
#     }
# ]
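As noted earlier, data that has not been indexed yet is searched by brute force. The following is a minimal sketch of what such an exhaustive inner-product search looks like in plain Python; it illustrates the idea only and is not how Milvus implements it. The function name and the toy vectors are made up for this example.

```python
import heapq


def brute_force_search(query, vectors, top_k=2):
    """Score every stored vector against the query and keep the top_k highest scores."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # With the inner product metric, a larger score means a closer match.
    return heapq.nlargest(top_k, scored)


# Toy two-dimensional "embeddings"; real embeddings here have 1024 dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = brute_force_search([1.0, 0.0], vectors)
print(hits)  # [(1.0, 0), (0.6, 2)]
```

Every stored vector is compared against the query, so the cost grows linearly with the number of rows. An index such as IVF_FLAT avoids this by scanning only a subset of clusters.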