🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 教學

向量可視化

Open In Colab GitHub Repository

在這個範例中,我們將展示如何使用t-SNE 將 Milvus 中的嵌入(向量)可視化。

減維技術,例如 t-SNE,對於在二維或三維空間可視化複雜的高維資料,同時保留局部結構,是非常有價值的。這有助於模式識別、增強對特徵關係的理解,並方便解釋機器學習模型的結果。此外,t-SNE 還能透過直觀地比較聚類結果來協助演算法評估,簡化非專業觀眾的資料呈現,並透過低維表示來降低計算成本。透過這些應用,t-SNE 不僅有助於深入瞭解資料集,還能支援更明智的決策過程。

準備工作

依賴與環境

$ pip install --upgrade pymilvus openai requests tqdm matplotlib seaborn

在本範例中,我們將使用 OpenAI 的嵌入模型。您應該準備 api key OPENAI_API_KEY 作為環境變數。

import os

os.environ["OPENAI_API_KEY"] = "sk-***********"

準備資料

我們使用 MilvusDocumentation 2.4.x中的 FAQ 頁面作為 RAG 中的私有知識,對於簡單的 RAG 管道而言,這是一個很好的資料來源。

下載 zip 檔案並解壓縮文件到資料夾milvus_docs

$ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
$ unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

我們從資料夾milvus_docs/en/faq 載入所有 markdown 檔案。對於每個文件,我們只需簡單地使用「#」來分隔文件中的內容,這樣就可以大致分隔出 markdown 檔案中每個主要部分的內容。

from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")

準備嵌入模型

我們初始化 OpenAI 用戶端以準備嵌入模型。

from openai import OpenAI

openai_client = OpenAI()

定義一個使用 OpenAI client 產生文字嵌入的函式。我們使用text-embedding-3-large模型作為範例。

def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-large")
        .data[0]
        .embedding
    )

產生測試嵌入,並列印其尺寸和前幾個元素。

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
3072
[-0.015370666049420834, 0.00234124343842268, -0.01011690590530634, 0.044725317507982254, -0.017235849052667618, -0.02880779094994068, -0.026678944006562233, 0.06816216558218002, -0.011376636102795601, 0.021659553050994873]

將資料載入 Milvus

建立集合

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"

至於MilvusClient 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您想使用Zilliz Cloud(Milvus 的完全管理雲端服務),請調整uritoken ,與 Zilliz Cloud 中的Public Endpoint 和 Api key對應。

檢查集合是否已經存在,如果已經存在,請將其刪除。

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

使用指定的參數建立新的集合。

如果我們沒有指定任何欄位資訊,Milvus 會自動建立一個預設的id 欄位做為主索引鍵,以及一個vector 欄位來儲存向量資料。保留的 JSON 欄位用來儲存非結構描述定義的欄位及其值。

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

插入資料

遍歷文字行,建立嵌入,然後將資料插入 Milvus。

這裡有一個新欄位text ,它是集合模式中的非定義欄位。它會自動加入保留的 JSON 動態欄位,在高層次上可視為一般欄位。

from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
Creating embeddings: 100%|██████████| 72/72 [00:20<00:00,  3.60it/s]





{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}

在本節中,我們會執行 milvus 搜尋,然後將查詢向量和擷取向量一起以縮小的維度可視化。

擷取查詢資料

讓我們為搜尋準備一個問題。

# Modify the question to test it with your own query!

question = "How is data stored in Milvus?"

在資料集中搜尋該問題,並擷取語義上前十名的符合資料。

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=10,  # Return top 10 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

讓我們來看看查詢的搜尋結果

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.7675539255142212
    ],
    [
        "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###",
        0.6210848689079285
    ],
    [
        "Does the query perform in memory? What are incremental data and historical data?\n\nYes. When a query request comes, Milvus searches both incremental data and historical data by loading them into memory. Incremental data are in the growing segments, which are buffered in memory before they reach the threshold to be persisted in storage engine, while historical data are from the sealed segments that are stored in the object storage. Incremental data and historical data together constitute the whole dataset to search.\n\n###",
        0.585393488407135
    ],
    [
        "Why is there no vector data in etcd?\n\netcd stores Milvus module metadata; MinIO stores entities.\n\n###",
        0.579704999923706
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.5777501463890076
    ],
    [
        "What is the maximum dataset size Milvus can handle?\n\n  \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###",
        0.5655910968780518
    ],
    [
        "Does Milvus support inserting and searching data simultaneously?\n\nYes. Insert operations and query operations are handled by two separate modules that are mutually independent. From the client\u2019s perspective, an insert operation is complete when the inserted data enters the message queue. However, inserted data are unsearchable until they are loaded to the query node. If the segment size does not reach the index-building threshold (512 MB by default), Milvus resorts to brute-force search and query performance may be diminished.\n\n###",
        0.5618637204170227
    ],
    [
        "What data types does Milvus support on the primary key field?\n\nIn current release, Milvus supports both INT64 and string.\n\n###",
        0.5561620593070984
    ],
    [
        "Is Milvus available for concurrent search?\n\nYes. For queries on the same collection, Milvus concurrently searches the incremental and historical data. However, queries on different collections are conducted in series. Whereas the historical data can be an extremely huge dataset, searches on the historical data are relatively more time-consuming and essentially performed in series.\n\n###",
        0.529681921005249
    ],
    [
        "Can vectors with duplicate primary keys be inserted into Milvus?\n\nYes. Milvus does not check if vector primary keys are duplicates.\n\n###",
        0.528809666633606
    ]
]

透過 t-SNE 將維度降低為 2-d

讓我們透過 t-SNE 將嵌入的維度降低為 2-d。我們將使用sklearn 函式庫來執行 t-SNE 轉換。

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

data.append({"id": len(data), "vector": emb_text(question), "text": question})
embeddings = []
for gp in data:
    embeddings.append(gp["vector"])

X = np.array(embeddings, dtype=np.float32)
tsne = TSNE(random_state=0, max_iter=1000)
tsne_results = tsne.fit_transform(X)

df_tsne = pd.DataFrame(tsne_results, columns=["TSNE1", "TSNE2"])
df_tsne
TSNE1 TSNE2
0 -3.877362 0.866726
1 -5.923084 0.671701
2 -0.645954 0.240083
3 0.444582 1.222875
4 6.503896 -4.984684
... ... ...
69 6.354055 1.264959
70 6.055961 1.266211
71 -1.516003 1.328765
72 3.971772 -0.681780
73 3.971772 -0.681780

74 行 × 2 列

在 2d 平面上可視化 Milvus 搜尋結果

我們將以綠色繪製查詢向量,紅色繪製檢索向量,藍色繪製剩餘向量。

import matplotlib.pyplot as plt
import seaborn as sns

# Extract similar ids from search results
similar_ids = [gp["id"] for gp in search_res[0]]

df_norm = df_tsne[:-1]

df_query = pd.DataFrame(df_tsne.iloc[-1]).T

# Filter points based on similar ids
similar_points = df_tsne[df_tsne.index.isin(similar_ids)]

# Create the plot
fig, ax = plt.subplots(figsize=(8, 6))  # Set figsize

# Set the style of the plot
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

# Plot all points in blue
sns.scatterplot(
    data=df_tsne, x="TSNE1", y="TSNE2", color="blue", label="All knowledge", ax=ax
)

# Overlay similar points in red
sns.scatterplot(
    data=similar_points,
    x="TSNE1",
    y="TSNE2",
    color="red",
    label="Similar knowledge",
    ax=ax,
)

sns.scatterplot(
    data=df_query, x="TSNE1", y="TSNE2", color="green", label="Query", ax=ax
)

# Set plot titles and labels
plt.title("Scatter plot of knowledge using t-SNE")
plt.xlabel("TSNE1")
plt.ylabel("TSNE2")

# Set axis to be equal
plt.axis("equal")

# Display the legend
plt.legend()

# Show the plot
plt.show()

png

我們可以看到,查詢向量與檢索向量很接近。雖然擷取的向量不在以查詢向量為中心、半徑固定的標準圓內,但我們可以看到它們在 2D 平面上仍然非常接近查詢向量。

使用降維技術可以促進對向量的理解和疑難排解。希望您能透過本教學對向量有更深入的了解。

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?