Integración de LangExtract + Milvus
Esta guía demuestra cómo utilizar LangExtract con Milvus para construir un sistema inteligente de procesamiento y recuperación de documentos.
LangExtract es una biblioteca Python que utiliza modelos lingüísticos amplios (LLM) para extraer información estructurada de documentos de texto no estructurados con una base de origen precisa. El sistema combina las capacidades de extracción de LangExtract con el almacenamiento vectorial de Milvus para permitir tanto la búsqueda por similitud semántica como el filtrado preciso de metadatos.
Esta integración es especialmente valiosa para la gestión de contenidos, la búsqueda semántica, el descubrimiento de conocimientos y la creación de sistemas de recomendación basados en los atributos de los documentos extraídos.
Requisitos previos
Antes de ejecutar este cuaderno, asegúrate de tener instaladas las siguientes dependencias:
$ pip install --upgrade pymilvus milvus-lite langextract google-genai requests tqdm pandas
Si utilizas Google Colab, para activar las dependencias que acabas de instalar, es posible que tengas que reiniciar el tiempo de ejecución (haz clic en el menú "Tiempo de ejecución" en la parte superior de la pantalla y selecciona "Reiniciar sesión" en el menú desplegable).
En este ejemplo utilizaremos Gemini como LLM. Deberá preparar la clave api GEMINI_API_KEY como variable de entorno.
import os
os.environ["GEMINI_API_KEY"] = "AIza*****************"
Definir el proceso LangExtract + Milvus
Definiremos el pipeline que utiliza LangExtract para la extracción de información estructurada y Milvus como almacén de vectores.
import langextract as lx
import textwrap
from google import genai
from google.genai.types import EmbedContentConfig
from pymilvus import MilvusClient, DataType
import uuid
Configuración
Vamos a configurar nuestros parámetros globales para la integración. Utilizaremos el modelo de incrustación de Gemini para generar representaciones vectoriales para nuestros documentos.
genai_client = genai.Client()
COLLECTION_NAME = "document_extractions"
EMBEDDING_MODEL = "gemini-embedding-001"
EMBEDDING_DIM = 3072 # Default dimension for gemini-embedding-001
Inicializar el cliente Milvus
Ahora vamos a inicializar nuestro cliente Milvus. Utilizaremos un archivo de base de datos local por simplicidad, pero esto puede escalarse fácilmente a un despliegue de servidor Milvus completo.
client = MilvusClient(uri="./milvus_demo.db")
En cuanto al argumento de MilvusClient:
- Establecer el
uricomo un archivo local, por ejemplo./milvus.db, es el método más conveniente, ya que utiliza automáticamente Milvus Lite para almacenar todos los datos en este archivo. - Si tiene una gran escala de datos, puede configurar un servidor Milvus más eficiente en docker o kubernetes. En esta configuración, por favor utilice la uri del servidor, por ejemplo
http://localhost:19530, como suuri. - Si desea utilizar Zilliz Cloud, el servicio en la nube totalmente gestionado para Milvus, ajuste
uriytoken, que corresponden al punto final público y a la clave Api en Zilliz Cloud.
Ejemplo de preparación de datos
Para esta demostración, utilizaremos descripciones de películas como documentos de muestra. Esto demuestra la capacidad de LangExtract para extraer información estructurada como géneros, personajes y temas a partir de texto no estructurado.
sample_documents = [
"John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed thriller features intense gunfights and explosive scenes.",
"A young wizard named Harry Potter discovers his magical abilities at Hogwarts School. The fantasy adventure includes magical creatures and epic battles.",
"Tony Stark builds an advanced suit of armor to become Iron Man. The superhero movie showcases cutting-edge technology and spectacular action sequences.",
"A group of friends get lost in a haunted forest where supernatural creatures lurk. The horror film creates a terrifying atmosphere with jump scares.",
"Two detectives investigate a series of mysterious murders in New York City. The crime thriller features suspenseful plot twists and dramatic confrontations.",
"A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller explores the dangers of advanced technology and human survival.",
"A romantic comedy about two friends who fall in love during a cross-country road trip. The drama explores personal growth and relationship dynamics.",
"An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and master ancient magic to save the fantasy world.",
"Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic weapons and intense combat in space.",
"A detective investigates supernatural crimes in Victorian London. The horror thriller combines period drama with paranormal investigation themes.",
]
print("=== LangExtract + Milvus Integration Demo ===")
print(f"Preparing to process {len(sample_documents)} documents")
=== LangExtract + Milvus Integration Demo ===
Preparing to process 10 documents
Configuración de la colección Milvus
Antes de almacenar los datos extraídos, debemos crear una colección Milvus con el esquema adecuado. Esta colección almacenará el texto del documento original, las incrustaciones vectoriales y los campos de metadatos extraídos.
print("\n1. Setting up Milvus collection...")
# Drop existing collection if it exists
if client.has_collection(collection_name=COLLECTION_NAME):
client.drop_collection(collection_name=COLLECTION_NAME)
print(f"Dropped existing collection: {COLLECTION_NAME}")
# Create collection schema
schema = client.create_schema(
auto_id=False,
enable_dynamic_field=True,
description="Document extraction results and vector storage",
)
# Add fields - simplified to 3 main metadata fields
schema.add_field(
field_name="id", datatype=DataType.VARCHAR, max_length=100, is_primary=True
)
schema.add_field(
field_name="document_text", datatype=DataType.VARCHAR, max_length=10000
)
schema.add_field(
field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM
)
# Create collection
client.create_collection(collection_name=COLLECTION_NAME, schema=schema)
print(f"Collection '{COLLECTION_NAME}' created successfully")
# Create vector index
index_params = client.prepare_index_params()
index_params.add_index(
field_name="embedding",
index_type="AUTOINDEX",
metric_type="COSINE",
)
client.create_index(collection_name=COLLECTION_NAME, index_params=index_params)
print("Vector index created successfully")
1. Setting up Milvus collection...
Dropped existing collection: document_extractions
Collection 'document_extractions' created successfully
Vector index created successfully
Definición del esquema de extracción
LangExtract utiliza indicaciones y ejemplos para guiar al LLM en la extracción de información estructurada. Vamos a definir nuestro esquema de extracción para las descripciones de películas, especificando qué información extraer y cómo categorizarla.
print("\n2. Extracting tags from documents...")
# Define extraction prompt - for movie descriptions, specify attribute value ranges
prompt = textwrap.dedent(
"""\
Extract movie genre, main characters, and key themes from movie descriptions.
Use exact text for extractions. Do not paraphrase or overlap entities.
For each extraction, provide attributes with values from these predefined sets:
Genre attributes:
- primary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
- secondary_genre: ["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "crime", "superhero"]
Character attributes:
- role: ["protagonist", "antagonist", "supporting"]
- type: ["hero", "villain", "detective", "military", "wizard", "scientist", "friends", "investigator"]
Theme attributes:
- theme_type: ["conflict", "investigation", "personal_growth", "technology", "magic", "survival", "romance"]
- setting: ["urban", "space", "fantasy_world", "school", "forest", "victorian", "america", "future"]
Focus on identifying key elements that would be useful for movie search and filtering."""
)
2. Extracting tags from documents...
Ejemplos para mejorar la extracción
Para mejorar la calidad y la coherencia de las extracciones, proporcionaremos a LangExtract algunos ejemplos. Estos ejemplos muestran el formato esperado y ayudan al modelo a comprender nuestros requisitos de extracción.
# Provide examples to guide the model - n-shot examples for movie descriptions
# Unify attribute keys to ensure consistency in extraction results
examples = [
lx.data.ExampleData(
text="A space marine battles alien creatures on a distant planet. The sci-fi action movie features futuristic weapons and intense combat scenes.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="sci-fi action",
attributes={"primary_genre": "sci-fi", "secondary_genre": "action"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="space marine",
attributes={"role": "protagonist", "type": "military"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="battles alien creatures",
attributes={"theme_type": "conflict", "setting": "space"},
),
],
),
lx.data.ExampleData(
text="A detective investigates supernatural murders in Victorian London. The horror thriller film combines period drama with paranormal elements.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="horror thriller",
attributes={"primary_genre": "horror", "secondary_genre": "thriller"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="detective",
attributes={"role": "protagonist", "type": "detective"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="supernatural murders",
attributes={"theme_type": "investigation", "setting": "victorian"},
),
],
),
lx.data.ExampleData(
text="Two friends embark on a road trip adventure across America. The comedy drama explores friendship and self-discovery through humorous situations.",
extractions=[
lx.data.Extraction(
extraction_class="genre",
extraction_text="comedy drama",
attributes={"primary_genre": "comedy", "secondary_genre": "drama"},
),
lx.data.Extraction(
extraction_class="character",
extraction_text="two friends",
attributes={"role": "protagonist", "type": "friends"},
),
lx.data.Extraction(
extraction_class="theme",
extraction_text="friendship and self-discovery",
attributes={"theme_type": "personal_growth", "setting": "america"},
),
],
),
]
# Extract from each document
extraction_results = []
for doc in sample_documents:
result = lx.extract(
text_or_documents=doc,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.0-flash",
)
extraction_results.append(result)
print(f"Successfully extracted from document: {doc[:50]}...")
print(f"Completed tag extraction, processed {len(extraction_results)} documents")
Procesamiento y vectorización de los resultados
Ahora tenemos que procesar los resultados de la extracción y generar incrustaciones vectoriales para cada documento. También aplanaremos los atributos extraídos en campos separados para facilitar su búsqueda en Milvus.
print("\n3. Processing extraction results and generating vectors...")
processed_data = []
for result in extraction_results:
# Generate vectors for documents
embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[result.text],
config=EmbedContentConfig(
task_type="RETRIEVAL_DOCUMENT",
output_dimensionality=EMBEDDING_DIM,
),
)
embedding = embedding_response.embeddings[0].values
print(f"Successfully generated vector: {result.text[:30]}...")
# Initialize data structure, flatten attributes into separate fields
data_entry = {
"id": result.document_id or str(uuid.uuid4()),
"document_text": result.text,
"embedding": embedding,
# Initialize all possible fields with default values
"genre": "unknown",
"primary_genre": "unknown",
"secondary_genre": "unknown",
"character_role": "unknown",
"character_type": "unknown",
"theme_type": "unknown",
"theme_setting": "unknown",
}
# Process extraction results, flatten attributes
for extraction in result.extractions:
if extraction.extraction_class == "genre":
# Flatten genre attributes
data_entry["genre"] = extraction.extraction_text
attrs = extraction.attributes or {}
data_entry["primary_genre"] = attrs.get("primary_genre", "unknown")
data_entry["secondary_genre"] = attrs.get("secondary_genre", "unknown")
elif extraction.extraction_class == "character":
# Flatten character attributes (take first main character's attributes)
attrs = extraction.attributes or {}
if (
data_entry["character_role"] == "unknown"
): # Only take first character's attributes
data_entry["character_role"] = attrs.get("role", "unknown")
data_entry["character_type"] = attrs.get("type", "unknown")
elif extraction.extraction_class == "theme":
# Flatten theme attributes (take first main theme's attributes)
attrs = extraction.attributes or {}
if (
data_entry["theme_type"] == "unknown"
): # Only take first theme's attributes
data_entry["theme_type"] = attrs.get("theme_type", "unknown")
data_entry["theme_setting"] = attrs.get("setting", "unknown")
processed_data.append(data_entry)
print(f"Completed data processing, ready to insert {len(processed_data)} records")
3. Processing extraction results and generating vectors...
Successfully generated vector: John McClane fights terrorists...
Successfully generated vector: A young wizard named Harry Pot...
Successfully generated vector: Tony Stark builds an advanced ...
Successfully generated vector: A group of friends get lost in...
Successfully generated vector: Two detectives investigate a s...
Successfully generated vector: A brilliant scientist creates ...
Successfully generated vector: A romantic comedy about two fr...
Successfully generated vector: An evil sorcerer threatens to ...
Successfully generated vector: Space marines battle alien inv...
Successfully generated vector: A detective investigates super...
Completed data processing, ready to insert 10 records
Inserción de datos en Milvus
Con nuestros datos procesados listos, vamos a insertarlos en la colección Milvus. Esto nos permitirá realizar tanto búsquedas semánticas como un filtrado preciso de metadatos.
print("\n4. Inserting data into Milvus...")
if processed_data:
res = client.insert(collection_name=COLLECTION_NAME, data=processed_data)
print(f"Successfully inserted {len(processed_data)} documents into Milvus")
print(f"Insert result: {res}")
else:
print("No data to insert")
4. Inserting data into Milvus...
Successfully inserted 10 documents into Milvus
Insert result: {'insert_count': 10, 'ids': ['doc_f8797155', 'doc_78c7e586', 'doc_fa3a3ab5', 'doc_64981815', 'doc_3ab18cb2', 'doc_1ea42b18', 'doc_f0779243', 'doc_386590b7', 'doc_3b3ae1ab', 'doc_851089d6']}
Demostración del filtrado de metadatos
Una de las principales ventajas de combinar LangExtract con Milvus es la posibilidad de realizar un filtrado preciso basado en los metadatos extraídos. Vamos a demostrarlo con algunas búsquedas de expresiones de filtrado.
print("\n=== Filter Expression Search Examples ===")
# Load collection into memory for querying
print("Loading collection into memory...")
client.load_collection(collection_name=COLLECTION_NAME)
print("Collection loaded successfully")
# Search for thriller movies
print("\n1. Searching for thriller movies:")
results = client.query(
collection_name=COLLECTION_NAME,
filter='secondary_genre == "thriller"',
output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
limit=5,
)
for result in results:
print(f"- {result['document_text'][:100]}...")
print(
f" Genre: {result['genre']} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
)
# Search for movies with military characters
print("\n2. Searching for movies with military characters:")
results = client.query(
collection_name=COLLECTION_NAME,
filter='character_type == "military"',
output_fields=["document_text", "genre", "character_role", "character_type"],
limit=5,
)
for result in results:
print(f"- {result['document_text'][:100]}...")
print(f" Genre: {result['genre']}")
print(
f" Character: {result.get('character_role')} ({result.get('character_type')})"
)
=== Filter Expression Search Examples ===
Loading collection into memory...
Collection loaded successfully
1. Searching for thriller movies:
- A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
- Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- A detective investigates supernatural crimes in Victorian London. The horror thriller combines perio...
Genre: horror thriller (horror-thriller)
- John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
2. Searching for movies with military characters:
- Space marines battle alien invaders on a distant planet. The action sci-fi movie features futuristic...
Genre: action sci-fi
Character: protagonist (military)
Combinar la búsqueda semántica con el filtrado de metadatos
La verdadera potencia de esta integración proviene de la combinación de la búsqueda semántica vectorial con el filtrado preciso de metadatos. Esto nos permite encontrar contenido semánticamente similar mientras aplicamos restricciones específicas basadas en atributos extraídos.
print("\n=== Semantic Search Examples ===")
# 1. Search for action-related content + only thriller genre
print("\n1. Searching for action-related content + only thriller genre:")
query_text = "action fight combat battle explosion"
query_embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[query_text],
config=EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=EMBEDDING_DIM,
),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
collection_name=COLLECTION_NAME,
data=[query_embedding],
anns_field="embedding",
limit=3,
filter='secondary_genre == "thriller"',
output_fields=["document_text", "genre", "primary_genre", "secondary_genre"],
search_params={"metric_type": "COSINE"},
)
if results:
for result in results[0]:
print(f"- Similarity: {result['distance']:.4f}")
print(f" Text: {result['document_text'][:100]}...")
print(
f" Genre: {result.get('genre')} ({result.get('primary_genre')}-{result.get('secondary_genre')})"
)
# 2. Search for magic-related content + fantasy genre + conflict theme
print("\n2. Searching for magic-related content + fantasy genre + conflict theme:")
query_text = "magic wizard spell fantasy magical"
query_embedding_response = genai_client.models.embed_content(
model=EMBEDDING_MODEL,
contents=[query_text],
config=EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=EMBEDDING_DIM,
),
)
query_embedding = query_embedding_response.embeddings[0].values
results = client.search(
collection_name=COLLECTION_NAME,
data=[query_embedding],
anns_field="embedding",
limit=3,
filter='primary_genre == "fantasy" and theme_type == "conflict"',
output_fields=[
"document_text",
"genre",
"primary_genre",
"theme_type",
"theme_setting",
],
search_params={"metric_type": "COSINE"},
)
if results:
for result in results[0]:
print(f"- Similarity: {result['distance']:.4f}")
print(f" Text: {result['document_text'][:100]}...")
print(f" Genre: {result.get('genre')} ({result.get('primary_genre')})")
print(f" Theme: {result.get('theme_type')} ({result.get('theme_setting')})")
print("\n=== Demo Complete ===")
=== Semantic Search Examples ===
1. Searching for action-related content + only thriller genre:
- Similarity: 0.6947
Text: John McClane fights terrorists in a Los Angeles skyscraper during Christmas Eve. The action-packed t...
Genre: action-packed thriller (action-thriller)
- Similarity: 0.6128
Text: Two detectives investigate a series of mysterious murders in New York City. The crime thriller featu...
Genre: crime thriller (crime-thriller)
- Similarity: 0.5889
Text: A brilliant scientist creates artificial intelligence that becomes self-aware. The sci-fi thriller e...
Genre: sci-fi thriller (sci-fi-thriller)
2. Searching for magic-related content + fantasy genre + conflict theme:
- Similarity: 0.6986
Text: An evil sorcerer threatens to destroy the magical kingdom. A brave hero must gather allies and maste...
Genre: fantasy (fantasy)
Theme: conflict (fantasy_world)
=== Demo Complete ===