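LangExtract + Milvus Integration
This guide shows how to use LangExtract with Milvus to build an intelligent document processing and retrieval system.
LangExtract is a Python library that uses large language models (LLMs) to extract structured information from unstructured text documents, with precise source grounding. The system combines LangExtract's extraction capabilities with Milvus's vector storage, supporting both semantic similarity search and exact metadata filtering.
This integration is particularly valuable for content management, semantic search, knowledge discovery, and building recommendation systems based on extracted document attributes.
Prerequisites
Before running this notebook, make sure you have the following dependencies installed: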
$ pip install --upgrade pymilvus milvus-lite langextract google-genai requests tqdm pandas
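If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
In this example, we use Gemini as the LLM. You should prepare your API key and set it as the GEMINI_API_KEY environment variable.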
import os
os.environ["GEMINI_API_KEY"] = "AIza*****************"
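Hardcoding a key in a notebook makes it easy to leak. As a minimal sketch, assuming the key has already been exported in the shell or set through Colab secrets, the check below fails early with a clear message when the variable is missing. The helper name `require_env` is hypothetical and not part of LangExtract or the Gemini SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("GEMINI_API_KEY")` once at the top of the notebook surfaces a missing key before any extraction or embedding request is made.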
Define the LangExtract + Milvus Pipeline
We will define a pipeline that uses LangExtract for structured information extraction and Milvus as the vector store.
import langextract as lx
import textwrap
from google import genai
from google.genai.types import EmbedContentConfig
from pymilvus import MilvusClient, DataType
import uuid
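Configuration and Setup
Let's configure the global parameters for the integration. We will use Gemini's embedding model to generate vector representations of our documents.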
genai_client = genai.Client()
COLLECTION_NAME = "document_extractions"
EMBEDDING_MODEL = "gemini-embedding-001"
EMBEDDING_DIM = 3072 # Default dimension for gemini-embedding-001
Initialize the Milvus Client
Now let's initialize the Milvus client. For simplicity we use a local database file, but this extends easily to a full Milvus server deployment.
client = MilvusClient(uri="./milvus_demo.db")
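As for the arguments of MilvusClient:
- Setting uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.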
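The three deployment options differ only in the connection arguments. The sketch below collects them in one place. The function name `milvus_connection_args` and the mode labels are hypothetical. The default values are the example file name and server address used in this guide.

```python
from typing import Optional


def milvus_connection_args(
    mode: str, uri: Optional[str] = None, token: Optional[str] = None
) -> dict:
    """Build keyword arguments for MilvusClient for one of three deployment modes."""
    if mode == "lite":
        # Local file: Milvus Lite stores all data in this file.
        return {"uri": uri or "./milvus_demo.db"}
    if mode == "server":
        # Self-hosted Milvus on Docker or Kubernetes.
        return {"uri": uri or "http://localhost:19530"}
    if mode == "zilliz":
        # Zilliz Cloud needs both the Public Endpoint and the API key.
        if not uri or not token:
            raise ValueError("Zilliz Cloud requires both uri and token")
        return {"uri": uri, "token": token}
    raise ValueError(f"Unknown mode: {mode}")
```

With pymilvus installed, `MilvusClient(**milvus_connection_args("lite"))` would be equivalent to the client created above.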
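Sample Data Preparation
For this demo, we use movie descriptions as sample documents. This showcases LangExtract's ability to extract structured information such as genres, characters, and themes from unstructured text.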
sample_documents = [
"John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed thriller features intense gunfights and explosive scenes.",
"A young wizard named Harry Potter discovers his magical abilities at Hogwarts School. The fantasy adventure includes magical creatures and epic battles.",
"Tony Stark builds an advanced suit of armor to become Iron Man. The superhero movie showcases cutting-edge technology and spectacular action sequences.",
"A group of friends get lost in a haunted forest where supernatural creatures lurk. The horror film creates a terrifying atmosphere with jump scares.",
"Two detectives investigate a series of mysterious murders in New York City. The crime thriller features suspenseful plot twists and dramatic confrontations.",
"A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller explores the dangers of advanced technology and human survival.",
"A romantic comedy about two friends who fall in love during a cross-country road trip. The drama explores personal growth and relationship dynamics.",
"An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and master ancient magic to save the fantasy world.",
"Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic weapons and intense combat in space.",
"A detective investigates supernatural crimes in Victorian London. The horror thriller combines period drama with paranormal investigation themes.",
]
print("=== LangExtract + Milvus Integration Demo ===")
print(f"Preparing to process {len(sample_documents)} documents")
=== LangExtract + Milvus Integration Demo ===
Preparing to process 10 documents
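Set Up the Milvus Collection
Before storing the extracted data, we need to create a Milvus collection with an appropriate schema. The collection stores the original document text, the vector embedding, and the extracted metadata fields. Only the ID, text, and embedding are declared in the schema; because enable_dynamic_field is True, the metadata fields are stored as dynamic fields.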
print("\n1. Setting up Milvus collection...")
# Drop existing collection if it exists
if client.has_collection(collection_name=COLLECTION_NAME):
    client.drop_collection(collection_name=COLLECTION_NAME)
    print(f"Dropped existing collection: {COLLECTION_NAME}")
# Create collection schema
schema = client.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
    description="Document extraction results and vector storage",
)
# Add the fixed fields; extracted metadata is stored as dynamic fields
schema.add_field(
    field_name="id", datatype=DataType.VARCHAR, max_length=100, is_primary=True
)
schema.add_field(
    field_name="document_text", datatype=DataType.VARCHAR, max_length=10000
)
schema.add_field(
    field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM
)
# Create collection
client.create_collection(collection_name=COLLECTION_NAME, schema=schema)
print(f"Collection '{COLLECTION_NAME}' created successfully")
# Create vector index
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="AUTOINDEX",
    metric_type="COSINE",
)
client.create_index(collection_name=COLLECTION_NAME, index_params=index_params)
print("Vector index created successfully")
1. Setting up Milvus collection...
Dropped existing collection: document_extractions
Collection 'document_extractions' created successfully
Vector index created successfully
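Because the schema fixes the ID length, the text length, and the vector dimension, it can help to check records locally before inserting them. The following is a rough sketch, not a Milvus feature: `validate_record` is a hypothetical helper, and it assumes the VARCHAR max_length limits are counted in bytes.

```python
EMBEDDING_DIM = 3072    # same value as the embedding field in the schema
MAX_ID_BYTES = 100      # max_length of the "id" field
MAX_TEXT_BYTES = 10000  # max_length of the "document_text" field


def validate_record(record: dict, dim: int = EMBEDDING_DIM) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    record_id = record.get("id")
    if not isinstance(record_id, str) or not record_id:
        problems.append("id must be a non-empty string")
    elif len(record_id.encode("utf-8")) > MAX_ID_BYTES:
        problems.append("id exceeds max_length")
    text = record.get("document_text")
    if not isinstance(text, str):
        problems.append("document_text must be a string")
    elif len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        problems.append("document_text exceeds max_length")
    embedding = record.get("embedding")
    if not isinstance(embedding, (list, tuple)) or len(embedding) != dim:
        problems.append(f"embedding must have exactly {dim} values")
    return problems
```

Any extra keys in a record are ignored by this check, which matches how dynamic fields accept arbitrary metadata.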
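Define the Extraction Schema
LangExtract uses prompts and examples to guide the LLM in extracting structured information. Let's define the extraction schema for movie descriptions, specifying which information to extract and how to categorize it.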
print("\n2. Extracting tags from documents...")
# Define extraction prompt - for movie descriptions, specify attribute value ranges
prompt = textwrap.dedent(
"""\
Extract movie genre, main characters, and key themes from movie descriptions.
Use exact text for extractions. Do not paraphrase or overlap entities.
For each extraction, provide attributes with values from these predefined sets:
Genre attributes:
- primary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
- secondary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
Character attributes:
- role: ["protagonist", "antagonist", "supporting"]
- type: ["hero", "villain", "detective", "military", "wizard", "scientist", "friends", "investigator"]
Theme attributes:
- theme_type: ["conflict", "investigation", "personal_growth", "technology", "magic", "survival", "romance"]
- setting: ["urban", "space", "fantasy_world", "school", "forest", "victorian", "america", "future"]
Focus on identifying key elements that would be useful for movie search and filtering."""
)
2. Extracting tags from documents...
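The prompt asks the model to pick attribute values from predefined sets, but an LLM may still return a value outside them. As a minimal sketch, the hypothetical helper below maps any out-of-set value to "unknown" so that later filters only see the expected vocabulary. The sets are copied from the prompt above; the function itself is an illustration, not part of LangExtract.

```python
GENRES = {"action", "comedy", "drama", "horror", "sci-fi", "fantasy",
          "thriller", "crime", "superhero"}

ALLOWED = {
    "primary_genre": GENRES,
    "secondary_genre": GENRES,
    "role": {"protagonist", "antagonist", "supporting"},
    "type": {"hero", "villain", "detective", "military", "wizard",
             "scientist", "friends", "investigator"},
    "theme_type": {"conflict", "investigation", "personal_growth",
                   "technology", "magic", "survival", "romance"},
    "setting": {"urban", "space", "fantasy_world", "school", "forest",
                "victorian", "america", "future"},
}


def normalize_attributes(attrs: dict) -> dict:
    """Lowercase and trim known attributes; replace out-of-set values with 'unknown'."""
    cleaned = {}
    for key, value in attrs.items():
        if key not in ALLOWED:
            cleaned[key] = value  # keys outside the prompt's sets pass through unchanged
            continue
        candidate = str(value).strip().lower()
        cleaned[key] = candidate if candidate in ALLOWED[key] else "unknown"
    return cleaned
```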
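Provide Examples to Improve Extraction
To improve the quality and consistency of extraction, we provide LangExtract with a few examples. These demonstrate the expected format and help the model understand our extraction requirements.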
# Provide examples to guide the model - n-shot examples for movie descriptions
# Unify attribute keys to ensure consistency in extraction results
examples = [
    lx.data.ExampleData(
        text="A space marine battles alien creatures on a distant planet. The sci-fi action movie features futuristic weapons and intense combat scenes.",
        extractions=[
            lx.data.Extraction(
                extraction_class="genre",
                extraction_text="sci-fi action",
                attributes={"primary_genre": "sci-fi", "secondary_genre": "action"},
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="space marine",
                attributes={"role": "protagonist", "type": "military"},
            ),
            lx.data.Extraction(
                extraction_class="theme",
                extraction_text="battles alien creatures",
                attributes={"theme_type": "conflict", "setting": "space"},
            ),
        ],
    ),
    lx.data.ExampleData(
        text="A detective investigates supernatural murders in Victorian London. The horror thriller film combines period drama with paranormal elements.",
        extractions=[
            lx.data.Extraction(
                extraction_class="genre",
                extraction_text="horror thriller",
                attributes={"primary_genre": "horror", "secondary_genre": "thriller"},
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="detective",
                attributes={"role": "protagonist", "type": "detective"},
            ),
            lx.data.Extraction(
                extraction_class="theme",
                extraction_text="supernatural murders",
                attributes={"theme_type": "investigation", "setting": "victorian"},
            ),
        ],
    ),
    lx.data.ExampleData(
        text="Two friends embark on a road trip adventure across America. The comedy drama explores friendship and self-discovery through humorous situations.",
        extractions=[
            lx.data.Extraction(
                extraction_class="genre",
                extraction_text="comedy drama",
                attributes={"primary_genre": "comedy", "secondary_genre": "drama"},
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="two friends",
                attributes={"role": "protagonist", "type": "friends"},
            ),
            lx.data.Extraction(
                extraction_class="theme",
                extraction_text="friendship and self-discovery",
                attributes={"theme_type": "personal_growth", "setting": "america"},
            ),
        ],
    ),
]
# Extract from each document
extraction_results = []
for doc in sample_documents:
    result = lx.extract(
        text_or_documents=doc,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash",
    )
    extraction_results.append(result)
    print(f"Successfully extracted from document: {doc[:50]}...")
print(f"Completed tag extraction, processed {len(extraction_results)} documents")
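Each call in the loop above goes to a remote model, so a transient failure (rate limit, timeout) would abort the whole run. One possible mitigation, shown here only as a sketch, is a small retry wrapper with exponential backoff. `call_with_retry` is a hypothetical helper, and catching every Exception is a simplifying assumption; a real pipeline would likely narrow it to the errors worth retrying.

```python
import time


def call_with_retry(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on any exception with delays of base_delay * 2**n seconds."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

In the loop, the extraction call could then be written as `call_with_retry(lx.extract, text_or_documents=doc, prompt_description=prompt, examples=examples, model_id="gemini-2.0-flash")`.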
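Process and Vectorize the Results
Now we need to process the extraction results and generate a vector embedding for each document. We also flatten the extracted attributes into separate fields so that they are easy to search in Milvus.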
print("\n3. Processing extraction results and generating vectors...")
processed_data = []
for result in extraction_results:
    # Generate vectors for documents
    embedding_response = genai_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=[result.text],
        config=EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT",
            output_dimensionality=EMBEDDING_DIM,
        ),
    )
    embedding = embedding_response.embeddings[0].values
    print(f"Successfully generated vector: {result.text[:30]}...")
    # Initialize data structure, flatten attributes into separate fields
    data_entry = {
        "id": result.document_id or str(uuid.uuid4()),
        "document_text": result.text,
        "embedding": embedding,
        # Initialize all possible fields with default values
        "genre": "unknown",
        "primary_genre": "unknown",
        "secondary_genre": "unknown",
        "character_role": "unknown",
        "character_type": "unknown",
        "theme_type": "unknown",
        "theme_setting": "unknown",
    }
    # Process extraction results, flatten attributes
    for extraction in result.extractions:
        if extraction.extraction_class == "genre":
            # Flatten genre attributes
            data_entry["genre"] = extraction.extraction_text
            attrs = extraction.attributes or {}
            data_entry["primary_genre"] = attrs.get("primary_genre", "unknown")
            data_entry["secondary_genre"] = attrs.get("secondary_genre", "unknown")
        elif extraction.extraction_class == "character":
            # Flatten character attributes (take first main character's attributes)
            attrs = extraction.attributes or {}
            if (
                data_entry["character_role"] == "unknown"
            ):  # Only take first character's attributes
                data_entry["character_role"] = attrs.get("role", "unknown")
                data_entry["character_type"] = attrs.get("type", "unknown")
        elif extraction.extraction_class == "theme":
            # Flatten theme attributes (take first main theme's attributes)
            attrs = extraction.attributes or {}
            if (
                data_entry["theme_type"] == "unknown"
            ):  # Only take first theme's attributes
                data_entry["theme_type"] = attrs.get("theme_type", "unknown")
                data_entry["theme_setting"] = attrs.get("setting", "unknown")
    processed_data.append(data_entry)
print(f"Completed data processing, ready to insert {len(processed_data)} records")
3. Processing extraction results and generating vectors...
Successfully generated vector: John McClane fights terrorists...
Successfully generated vector: A young wizard named Harry Pot...
Successfully generated vector: Tony Stark builds an advanced ...
Successfully generated vector: A group of friends get lost in...
Successfully generated vector: Two detectives investigate a s...
Successfully generated vector: A brilliant scientist creates ...
Successfully generated vector: A romantic comedy about two fr...
Successfully generated vector: An evil sorcerer threatens to ...
Successfully generated vector: Space marines battle alien inv...
Successfully generated vector: A detective investigates super...
Completed data processing, ready to insert 10 records
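The flattening rules are easier to reason about, and to test without calling any API, when pulled out into a pure function. The sketch below mirrors the loop above: the genre takes the last extraction seen, while character and theme keep the first one whose role or theme type is known. `DemoExtraction` is a hypothetical stand-in that exposes only the three attributes the loop reads; it is not the LangExtract class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoExtraction:
    """Hypothetical stand-in for an extraction object."""
    extraction_class: str
    extraction_text: str
    attributes: Optional[dict] = None


def flatten_extractions(extractions) -> dict:
    """Flatten a list of extractions into one flat metadata dict."""
    flat = {
        "genre": "unknown",
        "primary_genre": "unknown",
        "secondary_genre": "unknown",
        "character_role": "unknown",
        "character_type": "unknown",
        "theme_type": "unknown",
        "theme_setting": "unknown",
    }
    for ext in extractions:
        attrs = ext.attributes or {}
        if ext.extraction_class == "genre":
            flat["genre"] = ext.extraction_text
            flat["primary_genre"] = attrs.get("primary_genre", "unknown")
            flat["secondary_genre"] = attrs.get("secondary_genre", "unknown")
        elif ext.extraction_class == "character" and flat["character_role"] == "unknown":
            flat["character_role"] = attrs.get("role", "unknown")
            flat["character_type"] = attrs.get("type", "unknown")
        elif ext.extraction_class == "theme" and flat["theme_type"] == "unknown":
            flat["theme_type"] = attrs.get("theme_type", "unknown")
            flat["theme_setting"] = attrs.get("setting", "unknown")
    return flat
```

Keeping one character and one theme per document is a simplification: additional characters or themes found in the same text are dropped from the flat record.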
Insert Data into Milvus
With the processed data ready, let's insert it into the Milvus collection. This enables both semantic search and exact metadata filtering.
print("\n4. Inserting data into Milvus...")
if processed_data:
    res = client.insert(collection_name=COLLECTION_NAME, data=processed_data)
    print(f"Successfully inserted {len(processed_data)} documents into Milvus")
    print(f"Insert result: {res}")
else:
    print("No data to insert")
4. Inserting data into Milvus...
Successfully inserted 10 documents into Milvus
Insert result: {'insert_count': 10, 'ids': ['doc_f8797155', 'doc_78c7e586', 'doc_fa3a3ab5', 'doc_64981815', 'doc_3ab18cb2', 'doc_1ea42b18', 'doc_f0779243', 'doc_386590b7', 'doc_3b3ae1ab', 'doc_851089d6']}
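Ten records fit comfortably in one insert call, but with 3072-dimensional vectors a larger corpus is usually sent in chunks. The following is a minimal sketch of such chunking. The helper `batched` and the default batch size are illustrative assumptions, not Milvus requirements.

```python
def batched(records, batch_size=100):
    """Yield consecutive slices of records, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Each slice could then be passed to `client.insert(collection_name=COLLECTION_NAME, data=batch)` in a loop over `batched(processed_data)`.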
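Demonstrate Metadata Filtering
One of the key advantages of combining LangExtract with Milvus is the ability to filter precisely on the extracted metadata. Let's demonstrate this with a few filter-expression queries.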
print("\n=== Filter Expression Search Examples ===")
# Load collection into memory for querying
print("Loading collection into memory...")
client.load_collection(collection_name=COLLECTION_NAME)
print("Collection loaded successfully")
# Search for thriller movies
print("\n1. Searching for thriller movies:")
results = client.query(
    collection_name=COLLECTION_NAME,
    filter='secondary_genre == "thriller"',
    output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
    limit=5,
)
for result in results:
    print(f"- {result['document_text'][:100]}...")
    print(
        f" Genre: {result['genre']} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
    )
# Search for movies with military characters
print("\n2. Searching for movies with military characters:")
results = client.query(
    collection_name=COLLECTION_NAME,
    filter='character_type == "military"',
    output_fields=["document_text", "genre", "character_role", "character_type"],
    limit=5,
)
for result in results:
    print(f"- {result['document_text'][:100]}...")
    print(f" Genre: {result['genre']}")
    print(
        f" Character: {result.get('character_role')} ({result.get('character_type')})"
    )
=== Filter Expression Search Examples ===
Loading collection into memory...
Collection loaded successfully
1. Searching for thriller movies:
- A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
- Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- A detective investigates supernatural crimes in Victorian London. The horror thriller combines perio...
Genre: horror thriller (horror-thriller)
- John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
2. Searching for movies with military characters:
- Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic...
Genre: action sci-fi
Character: protagonist (military)
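The filter strings above are written by hand. When the conditions come from user input, it is safer to build the expression programmatically. The sketch below joins equality conditions with `and`, in the same shape as the filters used in this guide. `build_filter` is a hypothetical helper, and the backslash escaping of quotes is an assumption about how the filter parser treats string literals.

```python
import re

_FIELD_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_filter(conditions: dict) -> str:
    """Build an 'a == "x" and b == "y"' style filter from field/value pairs."""
    parts = []
    for field_name, value in conditions.items():
        if not _FIELD_NAME.match(field_name):
            raise ValueError(f"Invalid field name: {field_name!r}")
        # Escape backslashes first, then double quotes, so the literal stays intact.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{field_name} == "{escaped}"')
    return " and ".join(parts)
```

For example, `build_filter({"character_type": "military"})` produces the same expression as the second query above.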
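Combine Semantic Search with Metadata Filtering
The real power of this integration comes from combining semantic vector search with exact metadata filtering. This lets us find semantically similar content while applying specific constraints based on the extracted attributes.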
print("\n=== Semantic Search Examples ===")
# 1. Search for action-related content + only thriller genre
print("\n1. Searching for action-related content + only thriller genre:")
query_text = "action fight combat battle explosion"
query_embedding_response = genai_client.models.embed_content(
    model=EMBEDDING_MODEL,
    contents=[query_text],
    config=EmbedContentConfig(
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=EMBEDDING_DIM,
    ),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
    collection_name=COLLECTION_NAME,
    data=[query_embedding],
    anns_field="embedding",
    limit=3,
    filter='secondary_genre == "thriller"',
    output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
    search_params={"metric_type": "COSINE"},
)
if results:
    for result in results[0]:
        print(f"- Similarity: {result['distance']:.4f}")
        print(f" Text: {result['document_text'][:100]}...")
        print(
            f" Genre: {result.get('genre')} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
        )
# 2. Search for magic-related content + fantasy genre + conflict theme
print("\n2. Searching for magic-related content + fantasy genre + conflict theme:")
query_text = "magic wizard spell fantasy magical"
query_embedding_response = genai_client.models.embed_content(
    model=EMBEDDING_MODEL,
    contents=[query_text],
    config=EmbedContentConfig(
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=EMBEDDING_DIM,
    ),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
    collection_name=COLLECTION_NAME,
    data=[query_embedding],
    anns_field="embedding",
    limit=3,
    filter='primary_genre == "fantasy" and theme_type == "conflict"',
    output_fields=[
        "document_text",
        "genre",
        "primary_genre",
        "theme_type",
        "theme_setting",
    ],
    search_params={"metric_type": "COSINE"},
)
if results:
    for result in results[0]:
        print(f"- Similarity: {result['distance']:.4f}")
        print(f" Text: {result['document_text'][:100]}...")
        print(f" Genre: {result.get('genre')} ({result.get('primary_genre')})")
        print(f" Theme: {result.get('theme_type')} ({result.get('theme_setting')})")
print("\n=== Demo Complete ===")
=== Semantic Search Examples ===
1. Searching for action-related content + only thriller genre:
- Similarity: 0.6947
Text: John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
- Similarity: 0.6128
Text: Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- Similarity: 0.5889
Text: A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
2. Searching for magic-related content + fantasy genre + conflict theme:
- Similarity: 0.6986
Text: An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and maste...
Genre: fantasy (fantasy)
Theme: conflict (fantasy_world)
=== Demo Complete ===
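To make the "Similarity" scores above concrete, the sketch below reproduces the idea of a filtered search in plain Python: keep only the records that pass a metadata predicate, then rank them by cosine similarity to the query vector. This is a brute-force illustration under simplifying assumptions, not how Milvus executes a search internally (Milvus uses a vector index). The names `cosine_similarity` and `filtered_search` are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(query_vec, records, predicate, limit=3):
    """Return up to `limit` (score, record) pairs that pass predicate, best first."""
    candidates = [r for r in records if predicate(r)]
    scored = [(cosine_similarity(query_vec, r["embedding"]), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

With the COSINE metric, a higher score means a closer match, which is why the results above are listed in descending order of similarity.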