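Semantic Search with Milvus and OpenAI
This guide shows how OpenAI's Embedding API can be used with the Milvus vector database to run semantic search over text.
Getting started
Before you begin, make sure you have an OpenAI API key ready, or get one from the OpenAI website.
The data used in this example are book titles. You can download the dataset here and put it in the same directory where you run the following code.
First, install the packages for Milvus and OpenAI: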
pip install --upgrade openai pymilvus
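If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
With that, we are ready to generate embeddings and use the vector database for semantic search.
Searching book titles with OpenAI and Milvus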
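In the following example, we generate vector representations of a few sample sentences with an OpenAI embedding model and store them in the Milvus vector database for semantic search. The same steps apply to book titles loaded from the downloaded CSV file.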
from openai import OpenAI
from pymilvus import MilvusClient
MODEL_NAME = "text-embedding-3-small" # Which model to use, please check https://platform.openai.com/docs/guides/embeddings for available models
DIMENSION = 1536 # Dimension of vector embedding
# Connect to OpenAI with API Key.
openai_client = OpenAI(api_key="<YOUR_OPENAI_API_KEY>")
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
vectors = [
    vec.embedding
    for vec in openai_client.embeddings.create(input=docs, model=MODEL_NAME).data
]
# Prepare data to be stored in Milvus vector database.
# We can store the id, vector representation, raw text and labels such as "subject" in this case in Milvus.
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(docs))
]
# Connect to Milvus, all data is stored in a local file named "milvus_openai_demo.db"
# in current directory. You can also connect to a remote Milvus server following this
# instruction: https://milvus.io/docs/install_standalone-docker.md.
milvus_client = MilvusClient(uri="milvus_openai_demo.db")
COLLECTION_NAME = "demo_collection" # Milvus collection name
# Create a collection to store the vectors and text.
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)
# Insert all data into Milvus vector database.
res = milvus_client.insert(collection_name=COLLECTION_NAME, data=data)
print(res["insert_count"])
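As for the arguments of MilvusClient:
- Setting uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust uri and token to the Public Endpoint and API key of your Zilliz Cloud instance.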
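The three options above differ only in the arguments passed to MilvusClient. The sketch below collects them in one helper. Note that milvus_connection_kwargs is a hypothetical function written for this illustration and is not part of pymilvus; the mode names and the default values are assumptions taken from the examples in the list.

```python
from typing import Dict, Optional


def milvus_connection_kwargs(
    mode: str,
    uri: Optional[str] = None,
    token: Optional[str] = None,
) -> Dict[str, str]:
    """Build keyword arguments for MilvusClient (hypothetical helper)."""
    if mode == "lite":
        # Local file: Milvus Lite stores all data in this file.
        return {"uri": uri or "./milvus.db"}
    if mode == "server":
        # Self-hosted Milvus server, e.g. on Docker or Kubernetes.
        return {"uri": uri or "http://localhost:19530"}
    if mode == "zilliz":
        # Zilliz Cloud needs the Public Endpoint (uri) and the API key (token).
        if not uri or not token:
            raise ValueError("Zilliz Cloud requires both uri and token")
        return {"uri": uri, "token": token}
    raise ValueError(f"unknown mode: {mode!r}")


print(milvus_connection_kwargs("lite"))
print(milvus_connection_kwargs("server"))

# With pymilvus installed, the result can be unpacked into the client:
# from pymilvus import MilvusClient
# milvus_client = MilvusClient(**milvus_connection_kwargs("lite"))
```

Keeping the choice in one place means the rest of the example code stays the same when you move from a local file to a server or to Zilliz Cloud.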
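With all the data in the Milvus vector database, we can now perform semantic search by generating a vector embedding for the query and running a vector search.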
queries = ["When was artificial intelligence founded?"]
query_vectors = [
    vec.embedding
    for vec in openai_client.embeddings.create(input=queries, model=MODEL_NAME).data
]
res = milvus_client.search(
    collection_name=COLLECTION_NAME,  # target collection
    data=query_vectors,  # query vectors
    limit=2,  # number of returned entities
    output_fields=["text", "subject"],  # specifies fields to be returned
)
# res holds one list of hits per query, in the same order as queries.
for q, hits in zip(queries, res):
    print("Query:", q)
    print(hits)
    print("\n")
You should see output similar to the following:
[
    {
        "id": 0,
        "distance": -0.772376537322998,
        "entity": {
            "text": "Artificial intelligence was founded as an academic discipline in 1956.",
            "subject": "history",
        },
    },
    {
        "id": 1,
        "distance": -0.58596271276474,
        "entity": {
            "text": "Alan Turing was the first person to conduct substantial research in AI.",
            "subject": "history",
        },
    },
]
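To make the ranking step concrete, the following sketch runs a brute-force search over toy three-dimensional vectors. It assumes cosine similarity as the metric and is only an illustration: the vectors are made up rather than real OpenAI embeddings, the helper names cosine_similarity and top_k are hypothetical, and this is not how Milvus is implemented internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k stored vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for stored document embeddings (made-up values).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
# Toy stand-in for the query embedding.
toy_query = [0.9, 0.1, 0.0]

for doc_id, score in top_k(toy_query, toy_vectors, k=2):
    print(doc_id, round(score, 3))
```

The limit argument of the search call plays the role of k here: only the best-scoring entries are returned, ordered from most to least similar.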