🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 整合

使用 Milvus 和 VoyageAI 進行語意搜尋

Open In Colab GitHub Repository

本指南展示了如何將VoyageAI 的 Embedding API與 Milvus 向量資料庫搭配使用,在文字上進行語意搜尋。

開始使用

在開始之前,請確認您已準備好 Voyage API 金鑰,或者您可以從VoyageAI 網站取得一個。

本範例使用的資料是書名。您可以在此下載資料集,並將其放在執行下列程式碼的同一個目錄中。

首先,安裝 Milvus 和 Voyage AI 的套件:

$ pip install --upgrade voyageai pymilvus

如果您使用的是 Google Colab,為了啟用剛安裝的依賴項目,您可能需要重新啟動運行時間。(點選螢幕上方的「Runtime」功能表,並從下拉式功能表中選擇「Restart session」)。

有了這些,我們就可以產生 embeddings 並使用向量資料庫來進行語意搜尋了。

使用 VoyageAI 和 Milvus 搜尋書名

在下面的範例中,我們從下載的 CSV 檔案載入書名資料,使用 Voyage AI 嵌入模型產生向量表示,並將其儲存於 Milvus 向量資料庫,以進行語意搜尋。

import voyageai
from pymilvus import MilvusClient

MODEL_NAME = "voyage-law-2"  # Which model to use, please check https://docs.voyageai.com/docs/embeddings for available models
DIMENSION = 1024  # Dimension of vector embedding

# Connect to VoyageAI with API Key.
voyage_client = voyageai.Client(api_key="<YOUR_VOYAGEAI_API_KEY>")

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = voyage_client.embed(texts=docs, model=MODEL_NAME, truncation=False).embeddings

# Prepare data to be stored in Milvus vector database.
# We can store the id, vector representation, raw text and labels such as "subject" in this case in Milvus.
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(docs))
]


# Connect to Milvus, all data is stored in a local file named "milvus_voyage_demo.db"
# in current directory. You can also connect to a remote Milvus server following this
# instruction: https://milvus.io/docs/install_standalone-docker.md.
milvus_client = MilvusClient(uri="milvus_voyage_demo.db")
COLLECTION_NAME = "demo_collection"  # Milvus collection name
# Create a collection to store the vectors and text.
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)

# Insert all data into Milvus vector database.
res = milvus_client.insert(collection_name="demo_collection", data=data)

print(res["insert_count"])

至於MilvusClient 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您想使用Zilliz Cloud,Milvus 的完全管理雲端服務,請調整uritoken ,對應 Zilliz Cloud 的Public Endpoint 和 Api key

有了 Milvus 向量資料庫中的所有資料,我們現在就可以透過為查詢產生向量嵌入來執行語意搜尋,並進行向量搜尋。

queries = ["When was artificial intelligence founded?"]

query_vectors = voyage_client.embed(
    texts=queries, model=MODEL_NAME, truncation=False
).embeddings

res = milvus_client.search(
    collection_name=COLLECTION_NAME,  # target collection
    data=query_vectors,  # query vectors
    limit=2,  # number of returned entities
    output_fields=["text", "subject"],  # specifies fields to be returned
)

for q in queries:
    print("Query:", q)
    for result in res:
        print(result)
    print("\n")
Query: When was artificial intelligence founded?
[{'id': 0, 'distance': 0.7196218371391296, 'entity': {'text': 'Artificial intelligence was founded as an academic discipline in 1956.', 'subject': 'history'}}, {'id': 1, 'distance': 0.6297335028648376, 'entity': {'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}}]

使用 VoyageAI 與 Milvus 搜尋圖像

import base64
import voyageai
from pymilvus import MilvusClient
import urllib.request
import matplotlib.pyplot as plt
from io import BytesIO
import urllib.request
import fitz  # PyMuPDF
from PIL import Image
def pdf_url_to_screenshots(url: str, zoom: float = 1.0) -> list[Image]:

    # Ensure that the URL is valid
    if not url.startswith("http") and url.endswith(".pdf"):
        raise ValueError("Invalid URL")

    # Read the PDF from the specified URL
    with urllib.request.urlopen(url) as response:
        pdf_data = response.read()
    pdf_stream = BytesIO(pdf_data)
    pdf = fitz.open(stream=pdf_stream, filetype="pdf")

    images = []

    # Loop through each page, render as pixmap, and convert to PIL Image
    mat = fitz.Matrix(zoom, zoom)
    for n in range(pdf.page_count):
        pix = pdf[n].get_pixmap(matrix=mat)

        # Convert pixmap to PIL Image
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(img)

    # Close the document
    pdf.close()

    return images


def image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue())
    return img_str.decode("utf-8")

DIMENSION = 1024  # Dimension of vector embedding

接下來我們需要為 Milvus 準備輸入資料。讓我們重複使用上一章所建立的 VoyageAI 用戶端。有關可用的 VoyageAI 多模式嵌入模型,請查看此頁面

pages = pdf_url_to_screenshots("https://www.fdrlibrary.org/documents/356632/390886/readingcopy.pdf", zoom=3.0)
inputs = [[img] for img in pages]

vectors = client.multimodal_embed(inputs, model="voyage-multimodal-3")

inputs = [i[0] if isinstance(i[0], str) else image_to_base64(i[0]) for i in inputs]
# Prepare data to be stored in Milvus vector database.
# We can store the id, vector representation, raw text and labels such as "subject" in this case in Milvus.
data = [
    {"id": i, "vector": vectors.embeddings[i], "data": inputs[i], "subject": "fruits"}
    for i in range(len(inputs))
]

接下來,我們建立 Milvus 資料庫連線,並將嵌入資料插入 Milvus 資料庫。

milvus_client = MilvusClient(uri="milvus_voyage_multi_demo.db")
COLLECTION_NAME = "demo_collection"  # Milvus collection name
# Create a collection to store the vectors and text.
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)

# Insert all data into Milvus vector database.
res = milvus_client.insert(collection_name="demo_collection", data=data)

print(res["insert_count"])

現在我們準備搜尋圖像。這裡的查詢是字串,但我們也可以使用影像進行查詢。(我們使用 matplotlib顯示結果圖像。

queries = [["The consequences of a dictator's peace"]]

query_vectors = client.multimodal_embed(
    inputs=queries, model="voyage-multimodal-3", truncation=False
).embeddings

res = milvus_client.search(
    collection_name=COLLECTION_NAME,  # target collection
    data=query_vectors,  # query vectors
    limit=4,  # number of returned entities
    output_fields=["data", "subject"],  # specifies fields to be returned
)

for q in queries:
    print("Query:", q)
    for result in res:
        fig, axes = plt.subplots(1, len(result), figsize=(66, 6))
        for n, page in enumerate(result):
            page_num = page['id']
            axes[n].imshow(pages[page_num])
            axes[n].axis("off")

    plt.tight_layout()
    plt.show()

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?