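Build RAG with Milvus and Docling

Docling streamlines document parsing and understanding across diverse formats for AI applications. With advanced PDF understanding and a unified document representation, Docling makes unstructured document data ready for downstream workflows.

In this tutorial, we'll show you how to build a Retrieval-Augmented Generation (RAG) pipeline using Milvus and Docling. The pipeline integrates Docling for document parsing, Milvus for vector storage, and OpenAI for generating insightful, context-aware responses.

Preparation

Dependencies and Environment

To get started, install the required dependencies by running the following command: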

$ pip install --upgrade pymilvus milvus-lite docling openai

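If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).

Setting Up API Keys

We will use OpenAI as the LLM in this example. You should provide your API key as the OPENAI_API_KEY environment variable.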

import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
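A key pasted directly into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch, the helper below reads the variable from the environment and fails early with a clear message if it is missing. The name require_env is hypothetical and not part of the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Possible usage before creating the client (assumes the key was exported in the shell):
# api_key = require_env("OPENAI_API_KEY")
```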

Prepare the LLM and Embedding Model

We initialize the OpenAI client to prepare the embedding model.

from openai import OpenAI

openai_client = OpenAI()

Define a function that generates text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.

def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )

Generate a test embedding and print its dimension and first few elements.

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
1536
[0.00988506618887186, -0.005540902726352215, 0.0068014683201909065, -0.03810417652130127, -0.018254263326525688, -0.041231658309698105, -0.007651153020560741, 0.03220026567578316, 0.01892443746328354, 0.00010708322952268645]
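The emb_text function sends one request per text. The OpenAI embeddings endpoint also accepts a list of strings as input, so several texts can be embedded in a single request. The sketch below is one way to do that. The name emb_texts is hypothetical, and the client is passed in as a parameter so the function does not depend on a global.

```python
from typing import List, Sequence


def emb_texts(
    client, texts: Sequence[str], model: str = "text-embedding-3-small"
) -> List[List[float]]:
    """Embed several texts in one request and return the vectors in input order."""
    if not texts:
        return []
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to;
    # sorting by it keeps vectors aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Possible usage with the client defined earlier:
# vectors = emb_texts(openai_client, ["first text", "second text"])
```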

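Process Data Using Docling

Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to the official documentation.

In this tutorial, we will use a Markdown file (source) as the input. We will process the document with the HierarchicalChunker provided by Docling to produce structured, hierarchical chunks suitable for downstream RAG tasks.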

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

converter = DocumentConverter()
chunker = HierarchicalChunker()

# Convert the input file to Docling Document
source = "https://milvus.io/docs/overview.md"
doc = converter.convert(source).document

# Perform hierarchical chunking
texts = [chunk.text for chunk in chunker.chunk(doc)]

for i, text in enumerate(texts[:5]):
    print(f"Chunk {i+1}:\n{text}\n{'-'*50}")
Chunk 1:
Milvus is a high-performance, highly scalable vector database that runs efficiently across a wide range of environments, from a laptop to large-scale distributed systems. It is available as both open-source software and a cloud service.
--------------------------------------------------
Chunk 2:
Milvus is an open-source project under LF AI & Data Foundation distributed under the Apache 2.0 license. Most contributors are experts from the high-performance computing (HPC) community, specializing in building large-scale systems and optimizing hardware-aware code. Core contributors include professionals from Zilliz, ARM, NVIDIA, AMD, Intel, Meta, IBM, Salesforce, Alibaba, and Microsoft.
--------------------------------------------------
Chunk 3:
Unstructured data, such as text, images, and audio, varies in format and carries rich underlying semantics, making it challenging to analyze. To manage this complexity, embeddings are used to convert unstructured data into numerical vectors that capture its essential characteristics. These vectors are then stored in a vector database, enabling fast and scalable searches and analytics.
--------------------------------------------------
Chunk 4:
Milvus offers robust data modeling capabilities, enabling you to organize your unstructured or multi-modal data into structured collections. It supports a wide range of data types for different attribute modeling, including common numerical and character types, various vector types, arrays, sets, and JSON, saving you from the effort of maintaining multiple database systems.
--------------------------------------------------
Chunk 5:
Untructured data, embeddings, and Milvus
--------------------------------------------------
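To make the idea of hierarchical chunking concrete, here is a toy illustration in plain Python. It is not Docling's implementation and ignores most of Markdown (for example, it would misread a "#" comment inside a code block as a heading). It only shows the general principle: each paragraph becomes a chunk and keeps the path of headings it sits under. The function name toy_hierarchical_chunks is hypothetical.

```python
from typing import Dict, List


def toy_hierarchical_chunks(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown into paragraph chunks, each tagged with its heading path."""
    chunks: List[Dict[str, object]] = []
    heading_path: List[str] = []  # current headings, outermost first
    paragraph: List[str] = []  # lines of the paragraph being collected

    def flush() -> None:
        # Emit the collected paragraph (if any) as one chunk.
        if paragraph:
            chunks.append(
                {"headings": list(heading_path), "text": " ".join(paragraph)}
            )
            paragraph.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            title = line[level:].strip()
            # Drop headings at this level or deeper, then add the new one.
            del heading_path[level - 1:]
            heading_path.append(title)
        elif not line:
            flush()
        else:
            paragraph.append(line)
    flush()
    return chunks
```

Keeping the heading path next to each chunk is what lets a retriever later show where a passage came from, which flat fixed-size splitting loses.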

Load Data into Milvus

Create the Collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

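As for the arguments of MilvusClient:

  • Setting the uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
  • If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. http://localhost:19530, as your uri.
  • If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.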
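The three options differ only in the arguments passed to MilvusClient, so the choice can be moved into configuration. The sketch below builds those arguments from environment-style settings. The variable names MILVUS_URI and MILVUS_TOKEN and the helper name are assumptions made for this example, not names that pymilvus reads on its own.

```python
import os
from typing import Dict, Mapping, Optional


def milvus_connection_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for MilvusClient from environment-style settings."""
    env = os.environ if env is None else env
    # Fall back to a local Milvus Lite file when no URI is configured.
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:
        # A token is only needed for Zilliz Cloud or an auth-enabled server.
        kwargs["token"] = token
    return kwargs


# Possible usage:
# from pymilvus import MilvusClient
# milvus_client = MilvusClient(**milvus_connection_kwargs())
```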

Check whether the collection already exists, and drop it if it does.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

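Create a new collection with the specified parameters.

If we don't specify any field information, Milvus automatically creates a default id field as the primary key and a vector field to store the vector data. A reserved JSON field is used to store fields that are not defined in the schema, along with their values.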

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Bounded",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/tune_consistency.md#Consistency-Level for more details.
)
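The collection uses the inner product (IP) metric, where a larger score means a closer match. For vectors of unit length, the inner product equals cosine similarity. OpenAI embeddings are reported to be normalized to length 1, but treat that as an assumption and check it for the model you use. The small pure-Python sketch below (helper names are illustrative) shows the relationship.

```python
import math
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(inner_product(v, v))
    if length == 0:
        raise ValueError("Cannot normalize a zero vector.")
    return [x / length for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the inner product of the unit vectors."""
    return inner_product(normalize(a), normalize(b))


a, b = [3.0, 4.0], [1.0, 0.0]
print(inner_product(a, b))  # raw inner product depends on vector length
print(cosine_similarity(a, b))  # cosine similarity does not
```

If your embedding model does not return unit-length vectors, the raw inner product will also reward longer vectors, and a cosine metric may be the safer choice.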

Insert Data

from tqdm import tqdm

data = []

for i, chunk in enumerate(tqdm(texts, desc="Processing chunks")):
    embedding = emb_text(chunk)
    data.append({"id": i, "vector": embedding, "text": chunk})

milvus_client.insert(collection_name=collection_name, data=data)
Processing chunks: 100%|██████████| 36/36 [00:18<00:00,  1.96it/s]

{'insert_count': 36, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], 'cost': 0}
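With 36 chunks, a single insert call is fine. For much larger documents it may be preferable to insert in batches so that each request stays small. The helper below is a minimal sketch; the name batched and the batch size in the usage comment are arbitrary choices for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive.")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Possible usage with the objects defined above:
# for batch in batched(data, 100):
#     milvus_client.insert(collection_name=collection_name, data=batch)
```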

Build RAG

Retrieve Data for a Query

Let's specify a query question about the website we just scraped.

question = (
    "What are the three deployment modes of Milvus, and what are their differences?"
)

Search the collection for the question and retrieve the top 3 semantic matches.

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)

Let's take a look at the search results for the query.

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        "Milvus offers three deployment modes, covering a wide range of data scales\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:",
        0.6503741145133972
    ],
    [
        "Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.",
        0.6281254291534424
    ],
    [
        "What is Milvus?\nUnstructured Data, Embeddings, and Milvus\nWhat Makes Milvus so Fast\uff1f\nWhat Makes Milvus so Scalable\nTypes of Searches Supported by Milvus\nComprehensive Feature Set",
        0.6117545962333679
    ]
]
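The third hit above is a list of section titles rather than an answer passage, and its score is the lowest of the three. One simple way to keep weak matches out of the prompt is a minimum-score cutoff. The sketch below assumes hits shaped like the ones used above (each with a "distance" key). The helper name is illustrative, and any cutoff value would need tuning on your own data.

```python
from typing import Any, Dict, List


def filter_hits(hits: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep hits whose inner-product score is at least `min_score`."""
    # With the IP metric, a larger "distance" value means a closer match.
    return [hit for hit in hits if hit["distance"] >= min_score]


# Possible usage with the search result above:
# strong_hits = filter_hits(search_res[0], min_score=0.62)
```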

Use the LLM to Get a RAG Response

Convert the retrieved documents into a string format.

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

Define the system and user prompts for the language model. The user prompt is assembled with the documents retrieved from Milvus.

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use OpenAI ChatGPT to generate a response based on the prompts.

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
The three deployment modes of Milvus are Milvus Lite, Milvus Standalone, and Milvus Distributed. 

1. **Milvus Lite**: This is a Python library designed for easy integration into applications. It is lightweight and ideal for quick prototyping in Jupyter Notebooks or for use on edge devices with limited resources.

2. **Milvus Standalone**: This deployment mode involves a single-machine server with all components bundled into a single Docker image for convenient deployment.

3. **Milvus Distributed**: This mode can be deployed on Kubernetes clusters and is built for larger-scale scenarios, including managing billions of vectors. It features a cloud-native architecture that ensures redundancy in critical components, making it suited for extensive scalability.
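In the walkthrough, USER_PROMPT is an f-string, so it is filled in once when its cell runs. Asking a different question means re-running the search and rebuilding the prompt. The sketch below wraps the steps into reusable functions. The names build_user_prompt and rag_answer are hypothetical, and the embed, search, and generate arguments are callables you would supply yourself, for example thin wrappers around emb_text, milvus_client.search, and the chat completion call.

```python
from typing import Callable, List, Sequence


def build_user_prompt(context: str, question: str) -> str:
    """Fill the tutorial's user prompt template for one question."""
    return (
        "Use the following pieces of information enclosed in <context> tags "
        "to provide an answer to the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
    )


def rag_answer(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Embed the question, retrieve passages, build the prompt, and generate."""
    passages = search(embed(question), top_k)
    prompt = build_user_prompt("\n".join(passages), question)
    return generate(prompt)
```

Passing the three steps in as callables keeps the function independent of any particular client, which also makes it easy to try out with stand-in functions before spending API calls.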