Intégration LangExtract + Milvus
Ce guide montre comment utiliser LangExtract avec Milvus pour construire un système intelligent de traitement et d'extraction de documents.
LangExtract est une bibliothèque Python qui utilise de grands modèles de langage (LLM) pour extraire des informations structurées à partir de documents textuels non structurés avec une référence précise à la source. Le système combine les capacités d'extraction de LangExtract avec le stockage vectoriel de Milvus pour permettre à la fois la recherche de similarités sémantiques et le filtrage précis des métadonnées.
Cette intégration est particulièrement utile pour la gestion de contenu, la recherche sémantique, la découverte de connaissances et la construction de systèmes de recommandation basés sur les attributs des documents extraits.
Conditions préalables
Avant d'exécuter ce bloc-notes, assurez-vous que les dépendances suivantes sont installées :
$ pip install --upgrade pymilvus milvus-lite langextract google-genai requests tqdm pandas
Si vous utilisez Google Colab, pour activer les dépendances qui viennent d'être installées, vous devrez peut-être redémarrer le runtime (cliquez sur le menu "Runtime" en haut de l'écran, et sélectionnez "Restart session" dans le menu déroulant).
Nous utiliserons Gemini comme LLM dans cet exemple. Vous devez préparer la clé api GEMINI_API_KEY en tant que variable d'environnement.
import os
os.environ["GEMINI_API_KEY"] = "AIza*****************"
Définir le pipeline LangExtract + Milvus
Nous allons définir le pipeline qui utilise LangExtract pour l'extraction d'informations structurées et Milvus comme magasin de vecteurs.
import langextract as lx
import textwrap
from google import genai
from google.genai.types import EmbedContentConfig
from pymilvus import MilvusClient, DataType
import uuid
Configuration et mise en place
Configurons nos paramètres globaux pour l'intégration. Nous utiliserons le modèle d'intégration de Gemini pour générer des représentations vectorielles pour nos documents.
genai_client = genai.Client()
COLLECTION_NAME = "document_extractions"
EMBEDDING_MODEL = "gemini-embedding-001"
EMBEDDING_DIM = 3072 # Default dimension for gemini-embedding-001
Initialisation du client Milvus
Initialisons maintenant notre client Milvus. Nous utiliserons un fichier de base de données local pour des raisons de simplicité, mais cela peut facilement être étendu à un déploiement complet du serveur Milvus.
client = MilvusClient(uri="./milvus_demo.db")
En ce qui concerne l'argument de MilvusClient:
- Définir
uricomme un fichier local, par exemple./milvus.db, est la méthode la plus pratique, car elle utilise automatiquement Milvus Lite pour stocker toutes les données dans ce fichier. - Si vous avez des données à grande échelle, vous pouvez configurer un serveur Milvus plus performant sur docker ou kubernetes. Dans cette configuration, veuillez utiliser l'uri du serveur, par exemple
http://localhost:19530, comme votreuri. - Si vous souhaitez utiliser Zilliz Cloud, le service cloud entièrement géré pour Milvus, ajustez les adresses
uriettoken, qui correspondent au point de terminaison public et à la clé Api dans Zilliz Cloud.
Préparation des données de l'échantillon
Pour cette démonstration, nous utiliserons des descriptions de films comme documents d'exemple. Cela permet de montrer la capacité de LangExtract à extraire des informations structurées telles que les genres, les personnages et les thèmes à partir d'un texte non structuré.
sample_documents = [
"John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed thriller features intense gunfights and explosive scenes.",
"A young wizard named Harry Potter discovers his magical abilities at Hogwarts School. The fantasy adventure includes magical creatures and epic battles.",
"Tony Stark builds an advanced suit of armor to become Iron Man. The superhero movie showcases cutting-edge technology and spectacular action sequences.",
"A group of friends get lost in a haunted forest where supernatural creatures lurk. The horror film creates a terrifying atmosphere with jump scares.",
"Two detectives investigate a series of mysterious murders in New York City. The crime thriller features suspenseful plot twists and dramatic confrontations.",
"A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller explores the dangers of advanced technology and human survival.",
"A romantic comedy about two friends who fall in love during a cross-country road trip. The drama explores personal growth and relationship dynamics.",
"An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and master ancient magic to save the fantasy world.",
"Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic weapons and intense combat in space.",
"A detective investigates supernatural crimes in Victorian London. The horror thriller combines period drama with paranormal investigation themes.",
]
print("=== LangExtract + Milvus Integration Demo ===")
print(f"Preparing to process {len(sample_documents)} documents")
=== LangExtract + Milvus Integration Demo ===
Preparing to process 10 documents
Configuration de la collection Milvus
Avant de pouvoir stocker les données extraites, nous devons créer une collection Milvus avec le schéma approprié. Cette collection stockera le texte du document original, les encastrements vectoriels et les champs de métadonnées extraits.
print("\n1. Setting up Milvus collection...")
# Drop existing collection if it exists
if client.has_collection(collection_name=COLLECTION_NAME):
client.drop_collection(collection_name=COLLECTION_NAME)
print(f"Dropped existing collection: {COLLECTION_NAME}")
# Create collection schema
schema = client.create_schema(
auto_id=False,
enable_dynamic_field=True,
description="Document extraction results and vector storage",
)
# Add fields - simplified to 3 main metadata fields
schema.add_field(
field_name="id", datatype=DataType.VARCHAR, max_length=100, is_primary=True
)
schema.add_field(
field_name="document_text", datatype=DataType.VARCHAR, max_length=10000
)
schema.add_field(
field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM
)
# Create collection
client.create_collection(collection_name=COLLECTION_NAME, schema=schema)
print(f"Collection '{COLLECTION_NAME}' created successfully")
# Create vector index
index_params = client.prepare_index_params()
index_params.add_index(
field_name="embedding",
index_type="AUTOINDEX",
metric_type="COSINE",
)
client.create_index(collection_name=COLLECTION_NAME, index_params=index_params)
print("Vector index created successfully")
1. Setting up Milvus collection...
Dropped existing collection: document_extractions
Collection 'document_extractions' created successfully
Vector index created successfully
Définition du schéma d'extraction
LangExtract utilise des invites et des exemples pour guider le LLM dans l'extraction d'informations structurées. Définissons notre schéma d'extraction pour les descriptions de films, en spécifiant les informations à extraire et la manière de les classer.
print("\n2. Extracting tags from documents...")
# Define extraction prompt - for movie descriptions, specify attribute value ranges
prompt = textwrap.dedent(
"""\
Extract movie genre, main characters, and key themes from movie descriptions.
Use exact text for extractions. Do not paraphrase or overlap entities.
For each extraction, provide attributes with values from these predefined sets:
Genre attributes:
- primary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
- secondary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
Character attributes:
- role: ["protagonist", "antagonist", "supporting"]
- type: ["hero", "villain", "detective", "military", "wizard", "scientist", "friends", "investigator"]
Theme attributes:
- theme_type: ["conflict", "investigation", "personal_growth", "technology", "magic", "survival", "romance"]
- setting: ["urban", "space", "fantasy_world", "school", "forest", "victorian", "america", "future"]
Focus on identifying key elements that would be useful for movie search and filtering."""
)
2. Extracting tags from documents...
Fournir des exemples pour une meilleure extraction
Pour améliorer la qualité et la cohérence des extractions, nous fournirons à LangExtract quelques exemples. Ces exemples illustrent le format attendu et aident le modèle à comprendre nos exigences en matière d'extraction.
# Provide examples to guide the model - n-shot examples for movie descriptions
# Unify attribute keys to ensure consistency in extraction results
examples = [
lx.data.ExampleData(
text="A space marine battles alien creatures on a distant planet. The sci-fi action movie features futuristic weapons and intense combat scenes.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="sci-fi action",
attributes={"primary_genre": "sci-fi", "secondary_genre": "action"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="space marine",
attributes={"role": "protagonist", "type": "military"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="battles alien creatures",
attributes={"theme_type": "conflict", "setting": "space"},
),
],
),
lx.data.ExampleData(
text="A detective investigates supernatural murders in Victorian London. The horror thriller film combines period drama with paranormal elements.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="horror thriller",
attributes={"primary_genre": "horror", "secondary_genre": "thriller"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="detective",
attributes={"role": "protagonist", "type": "detective"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="supernatural murders",
attributes={"theme_type": "investigation", "setting": "victorian"},
),
],
),
lx.data.ExampleData(
text="Two friends embark on a road trip adventure across America. The comedy drama explores friendship and self-discovery through humorous situations.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="comedy drama",
attributes={"primary_genre": "comedy", "secondary_genre": "drama"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="two friends",
attributes={"role": "protagonist", "type": "friends"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="friendship and self-discovery",
attributes={"theme_type": "personal_growth", "setting": "america"},
),
],
),
]
# Extract from each document
extraction_results = []
for doc in sample_documents:
result = lx.extract(
text_or_documents=doc,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.0-flash",
)
extraction_results.append(result)
print(f"Successfully extracted from document: {doc[:50]}...")
print(f"Completed tag extraction, processed {len(extraction_results)} documents")
Traitement et vectorisation des résultats
Nous devons maintenant traiter les résultats de l'extraction et générer des encastrements vectoriels pour chaque document. Nous allons également aplatir les attributs extraits dans des champs distincts afin de les rendre facilement consultables dans Milvus.
print("\n3. Processing extraction results and generating vectors...")
processed_data = []
for result in extraction_results:
# Generate vectors for documents
embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[result.text],
config=EmbedContentConfig(
task_type="RETRIEVAL_DOCUMENT",
output_dimensionality=EMBEDDING_DIM,
),
)
embedding = embedding_response.embeddings[0].values
print(f"Successfully generated vector: {result.text[:30]}...")
# Initialize data structure, flatten attributes into separate fields
data_entry = {
"id": result.document_id or str(uuid.uuid4()),
"document_text": result.text,
"embedding": embedding,
# Initialize all possible fields with default values
"genre": "unknown",
"primary_genre": "unknown",
"secondary_genre": "unknown",
"character_role": "unknown",
"character_type": "unknown",
"theme_type": "unknown",
"theme_setting": "unknown",
}
# Process extraction results, flatten attributes
for extraction in result.extractions:
if extraction.extraction_class == "genre":
# Flatten genre attributes
data_entry["genre"] = extraction.extraction_text
attrs = extraction.attributes or {}
data_entry["primary_genre"] = attrs.get("primary_genre", "unknown")
data_entry["secondary_genre"] = attrs.get("secondary_genre", "unknown")
elif extraction.extraction_class == "character":
# Flatten character attributes (take first main character's attributes)
attrs = extraction.attributes or {}
if (
data_entry["character_role"] == "unknown"
): # Only take first character's attributes
data_entry["character_role"] = attrs.get("role", "unknown")
data_entry["character_type"] = attrs.get("type", "unknown")
elif extraction.extraction_class == "theme":
# Flatten theme attributes (take first main theme's attributes)
attrs = extraction.attributes or {}
if (
data_entry["theme_type"] == "unknown"
): # Only take first theme's attributes
data_entry["theme_type"] = attrs.get("theme_type", "unknown")
data_entry["theme_setting"] = attrs.get("setting", "unknown")
processed_data.append(data_entry)
print(f"Completed data processing, ready to insert {len(processed_data)} records")
3. Processing extraction results and generating vectors...
Successfully generated vector: John McClane fights terrorists...
Successfully generated vector: A young wizard named Harry Pot...
Successfully generated vector: Tony Stark builds an advanced ...
Successfully generated vector: A group of friends get lost in...
Successfully generated vector: Two detectives investigate a s...
Successfully generated vector: A brilliant scientist creates ...
Successfully generated vector: A romantic comedy about two fr...
Successfully generated vector: An evil sorcerer threatens to ...
Successfully generated vector: Space marines battle alien inv...
Successfully generated vector: A detective investigates super...
Completed data processing, ready to insert 10 records
Insertion des données dans Milvus
Nos données traitées étant prêtes, nous allons les insérer dans la collection Milvus. Cela nous permettra d'effectuer des recherches sémantiques et un filtrage précis des métadonnées.
print("\n4. Inserting data into Milvus...")
if processed_data:
res = client.insert(collection_name=COLLECTION_NAME, data=processed_data)
print(f"Successfully inserted {len(processed_data)} documents into Milvus")
print(f"Insert result: {res}")
else:
print("No data to insert")
4. Inserting data into Milvus...
Successfully inserted 10 documents into Milvus
Insert result: {'insert_count': 10, 'ids': ['doc_f8797155', 'doc_78c7e586', 'doc_fa3a3ab5', 'doc_64981815', 'doc_3ab18cb2', 'doc_1ea42b18', 'doc_f0779243', 'doc_386590b7', 'doc_3b3ae1ab', 'doc_851089d6']}
Démonstration du filtrage des métadonnées
L'un des principaux avantages de la combinaison de LangExtract et de Milvus est la possibilité d'effectuer un filtrage précis basé sur les métadonnées extraites. Nous allons en faire la démonstration avec quelques recherches par expression de filtre.
print("\n=== Filter Expression Search Examples ===")
# Load collection into memory for querying
print("Loading collection into memory...")
client.load_collection(collection_name=COLLECTION_NAME)
print("Collection loaded successfully")
# Search for thriller movies
print("\n1. Searching for thriller movies:")
results = client.query(
collection_name=COLLECTION_NAME,
filter='secondary_genre == "thriller"',
output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
limit=5,
)
for result in results:
print(f"- {result['document_text'][:100]}...")
print(
f" Genre: {result['genre']} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
)
# Search for movies with military characters
print("\n2. Searching for movies with military characters:")
results = client.query(
collection_name=COLLECTION_NAME,
filter='character_type == "military"',
output_fields=["document_text", "genre", "character_role", "character_type"],
limit=5,
)
for result in results:
print(f"- {result['document_text'][:100]}...")
print(f" Genre: {result['genre']}")
print(
f" Character: {result.get('character_role')} ({result.get('character_type')})"
)
=== Filter Expression Search Examples ===
Loading collection into memory...
Collection loaded successfully
1. Searching for thriller movies:
- A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
- Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- A detective investigates supernatural crimes in Victorian London. The horror thriller combines perio...
Genre: horror thriller (horror-thriller)
- John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
2. Searching for movies with military characters:
- Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic...
Genre: action sci-fi
Character: protagonist (military)
Combinaison de la recherche sémantique et du filtrage des métadonnées
La véritable puissance de cette intégration réside dans la combinaison de la recherche sémantique vectorielle et du filtrage précis des métadonnées. Cela nous permet de trouver des contenus sémantiquement similaires tout en appliquant des contraintes spécifiques basées sur les attributs extraits.
print("\n=== Semantic Search Examples ===")
# 1. Search for action-related content + only thriller genre
print("\n1. Searching for action-related content + only thriller genre:")
query_text = "action fight combat battle explosion"
query_embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[query_text],
config=EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=EMBEDDING_DIM,
),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
collection_name=COLLECTION_NAME,
data=[query_embedding],
anns_field="embedding",
limit=3,
filter='secondary_genre == "thriller"',
output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
search_params={"metric_type": "COSINE"},
)
if results:
for result in results[0]:
print(f"- Similarity: {result['distance']:.4f}")
print(f" Text: {result['document_text'][:100]}...")
print(
f" Genre: {result.get('genre')} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
)
# 2. Search for magic-related content + fantasy genre + conflict theme
print("\n2. Searching for magic-related content + fantasy genre + conflict theme:")
query_text = "magic wizard spell fantasy magical"
query_embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[query_text],
config=EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=EMBEDDING_DIM,
),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
collection_name=COLLECTION_NAME,
data=[query_embedding],
anns_field="embedding",
limit=3,
filter='primary_genre == "fantasy" and theme_type == "conflict"',
output_fields=[
"document_text",
"genre",
"primary_genre",
"theme_type",
"theme_setting",
],
search_params={"metric_type": "COSINE"},
)
if results:
for result in results[0]:
print(f"- Similarity: {result['distance']:.4f}")
print(f" Text: {result['document_text'][:100]}...")
print(f" Genre: {result.get('genre')} ({result.get('primary_genre')})")
print(f" Theme: {result.get('theme_type')} ({result.get('theme_setting')})")
print("\n=== Demo Complete ===")
=== Semantic Search Examples ===
1. Searching for action-related content + only thriller genre:
- Similarity: 0.6947
Text: John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
- Similarity: 0.6128
Text: Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- Similarity: 0.5889
Text: A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
2. Searching for magic-related content + fantasy genre + conflict theme:
- Similarity: 0.6986
Text: An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and maste...
Genre: fantasy (fantasy)
Theme: conflict (fantasy_world)
=== Demo Complete ===