Building RAG with Milvus and Firecrawl
Firecrawl enables developers to build AI applications with clean data extracted from any website. With advanced scraping, crawling, and data extraction capabilities, Firecrawl simplifies converting website content into clean markdown or structured data for downstream AI workflows.
In this tutorial, we show how to build a Retrieval-Augmented Generation (RAG) pipeline using Milvus and Firecrawl. The pipeline uses Firecrawl for web scraping, Milvus for vector storage, and OpenAI for generating context-aware responses.
Dependencies and Environment
To get started, install the required dependencies by running the following command:
$ pip install firecrawl-py pymilvus openai requests tqdm
If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
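After a restart, it can help to confirm that the packages are importable before continuing. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical, and the import names listed are assumed to match the pip packages installed above.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names assumed to correspond to the pip packages installed above.
required = ["firecrawl", "pymilvus", "openai", "requests", "tqdm"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```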
Setting Up API Keys
To use Firecrawl to scrape data from the specified URL, you need to obtain a FIRECRAWL_API_KEY and set it as an environment variable. This example also uses OpenAI as the LLM, so prepare the OPENAI_API_KEY as an environment variable as well.
import os
os.environ["FIRECRAWL_API_KEY"] = "fc-***********"
os.environ["OPENAI_API_KEY"] = "sk-***********"
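Hardcoding keys is convenient in a throwaway notebook. If the keys are instead exported in the shell, a small check can fail early when one is missing. This is a sketch only, and the helper name missing_env_vars is hypothetical.

```python
import os


def missing_env_vars(names, env=None):
    """Return the variable names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(["FIRECRAWL_API_KEY", "OPENAI_API_KEY"])
if missing:
    print("Set these variables before continuing:", ", ".join(missing))
```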
Prepare the LLM and Embedding Model
We initialize the OpenAI client to prepare the embedding model.
from openai import OpenAI
openai_client = OpenAI()
We define a function that generates text embeddings with the OpenAI client, using the text-embedding-3-small model as an example.
def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
Generate a test embedding and print its dimension and first few elements.
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]
Scrape Data Using Firecrawl
Initialize the Firecrawl Application
We will use the firecrawl library to scrape data from the specified URL in markdown format. Begin by initializing the Firecrawl application:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
Scrape the Target Website
Scrape the content from the target URL. The website LLM-powered Autonomous Agents provides an in-depth exploration of autonomous agent systems built with large language models (LLMs). We will use this content to build a RAG system.
# Scrape a website:
scrape_status = app.scrape_url(
    "<target-url>",  # URL of the "LLM-powered Autonomous Agents" page
    params={"formats": ["markdown"]},
)
markdown_content = scrape_status["markdown"]
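The last line above assumes the scrape result is a dictionary with a "markdown" key. A defensive sketch follows. The helper extract_markdown is hypothetical, and the attribute-style fallback is an assumption about other firecrawl-py versions, not documented behavior.

```python
from types import SimpleNamespace


def extract_markdown(result):
    """Return the markdown text from a scrape result, or raise if it is missing."""
    if isinstance(result, dict):
        markdown = result.get("markdown")
    else:
        # Assumption: some client versions may return an object with a .markdown attribute.
        markdown = getattr(result, "markdown", None)
    if not markdown:
        raise ValueError("Scrape result contains no markdown content")
    return markdown


# Stand-in results for illustration only (not real Firecrawl output).
print(extract_markdown({"markdown": "# Title\nBody"}))
print(extract_markdown(SimpleNamespace(markdown="# Title\nBody")))
```

Failing here with a clear message is easier to debug than an empty collection later in the pipeline.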
Process the Scraped Content
To make the scraped content manageable for insertion into Milvus, we simply split it on "# ", which roughly separates the main parts of the scraped markdown file.
def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]

# Process the scraped markdown content
sections = split_markdown_content(markdown_content)

# Print the first few sections to understand the structure
for i, section in enumerate(sections[:3]):
    print(f"Section {i+1}:")
    print(section[:300] + "...")
    print("-" * 50)
Section 1:
Table of Contents
- [Agent System Overview](#agent-system-overview)
- [Component One: Planning](#component-one-planning) - [Task Decomposition](#task-decomposition)
- [Self-Reflection](#self-reflection)
- [Component Two: Memory](#component-two-memory) - [Types of Memory](#types-of-memory)
- [...
Section 2:
Agent System Overview [\#](\#agent-system-overview)
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:
- **Planning**
- Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling effi...
Section 3:
Component One: Planning [\#](\#component-one-planning)
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
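Splitting on the literal string "# " is deliberately rough: it also fires on "## " subheadings and on any "# " inside running text, and it drops the heading marker. As an alternative sketch (not the method used in this tutorial), a regular expression can split only at lines that begin with a markdown heading. The function name split_on_headings and the sample text are made up for illustration.

```python
import re


def split_on_headings(content):
    """Split markdown at lines that start with 1 to 6 '#' characters followed by a space."""
    # (?m) lets ^ match at every line start; the lookahead keeps each heading with its section.
    parts = re.split(r"(?m)^(?=#{1,6} )", content)
    return [part.strip() for part in parts if part.strip()]


sample = "Intro text with C# code.\n# Title\nBody\n## Sub\nMore # not a heading\n"
for section in split_on_headings(sample):
    print(repr(section))
```

Here the "C# code" in the intro and the "# " in the middle of a line do not create new sections, while each heading stays attached to its body.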
Load Data into Milvus
Create the Collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"
Regarding the arguments of MilvusClient:
Setting the uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI as your uri.
If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.
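The three options differ only in the connection arguments, so they can be selected from configuration. A minimal sketch: the environment variable names MILVUS_URI and MILVUS_TOKEN and the helper milvus_client_kwargs are hypothetical, chosen only for this example.

```python
def milvus_client_kwargs(env):
    """Build keyword arguments for MilvusClient from an environment-like mapping."""
    # MILVUS_URI and MILVUS_TOKEN are hypothetical variable names chosen for this sketch.
    # Without MILVUS_URI, fall back to the local Milvus Lite file used in this tutorial.
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs


print(milvus_client_kwargs({}))
print(milvus_client_kwargs({"MILVUS_URI": "https://example.invalid", "MILVUS_TOKEN": "secret"}))
```

The result could then be passed on as MilvusClient(**milvus_client_kwargs(os.environ)).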
Check whether the collection already exists and drop it if it does.
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.
If we do not specify any field information, Milvus automatically creates a default id field for the primary key and a vector field to store the vector data. A reserved JSON field is used to store fields not defined in the schema, along with their values.
milvus_client.create_collection(
    collection_name=collection_name, dimension=embedding_dim,
    metric_type="IP", consistency_level="Strong",  # inner product distance, strong consistency
)
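The collection uses the inner product (IP) metric, where a higher score means a closer match. For vectors of unit length, the inner product equals cosine similarity. The sketch below checks this with small made-up vectors. Whether a given embedding model returns unit-length vectors is an assumption worth verifying for the model you use.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector norms."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


# Made-up vectors for illustration, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])
print(inner_product(a, b), cosine_similarity(a, b))
```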
Insert Data
from tqdm import tqdm

data = []

# Embed each section and collect the rows to insert
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    embedding = emb_text(section)
    data.append({"id": i, "vector": embedding, "text": section})

# Insert data into Milvus
milvus_client.insert(collection_name=collection_name, data=data)
Processing sections: 100%|██████████| 17/17 [00:08<00:00, 2.09it/s]
{'insert_count': 17, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], 'cost': 0}
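The loop above makes one embedding request per section, which is fine for 17 sections. For larger inputs, sections can be embedded in batches. The sketch below shows only the batching logic: embed_in_batches and fake_embed_batch are hypothetical names, and the stand-in embedder returns made-up vectors so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=8):
    """Embed texts batch by batch; embed_batch maps a list of strings to a list of vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder for illustration: returns one fake 2-d vector per text.
def fake_embed_batch(batch):
    return [[float(len(text)), 0.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

With the OpenAI client, embed_batch could pass the whole batch as input to openai_client.embeddings.create and read one embedding per item from the response's data list.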
Build RAG
Retrieve Data for a Query
Let's specify a query question about the website we just scraped.
question = "What are the main components of autonomous agents?"
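Search for the question in the collection and retrieve the top 3 semantic matches.
search_res = milvus_client.search(
    collection_name=collection_name, data=[emb_text(question)], limit=3,
    search_params={"metric_type": "IP", "params": {}}, output_fields=["text"],
)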
Let's take a look at the search results for the query.
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
"Agent System Overview [\\#](\\#agent-system-overview)\n\nIn a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:\n\n- **Planning**\n - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\n - Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n- **Memory**\n - Short-term memory: I would consider all the in-context learning (See [Prompt Engineering]( as utilizing short-term memory of the model to learn.\n - Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n- **Tool use**\n - The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\n\nFig. 1. Overview of a LLM-powered autonomous agent system.",
"Table of Contents\n\n- [Agent System Overview](#agent-system-overview)\n- [Component One: Planning](#component-one-planning) - [Task Decomposition](#task-decomposition)\n - [Self-Reflection](#self-reflection)\n- [Component Two: Memory](#component-two-memory) - [Types of Memory](#types-of-memory)\n - [Maximum Inner Product Search (MIPS)](#maximum-inner-product-search-mips)\n- [Component Three: Tool Use](#component-three-tool-use)\n- [Case Studies](#case-studies) - [Scientific Discovery Agent](#scientific-discovery-agent)\n - [Generative Agents Simulation](#generative-agents-simulation)\n - [Proof-of-Concept Examples](#proof-of-concept-examples)\n- [Challenges](#challenges)\n- [Citation](#citation)\n- [References](#references)\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as [AutoGPT](, [GPT-Engineer]( and [BabyAGI](, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.",
"Challenges [\\#](\\#challenges)\n\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:\n\n- **Finite context length**: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.\n\n- **Challenges in long-term planning and task decomposition**: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.\n\n- **Reliability of natural language interface**: Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.",
Use the LLM to Get a RAG Response
Convert the retrieved documents into a string.
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
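Not every retrieved passage is equally useful. One of the three hits above is mostly a table of contents. As an optional sketch, low-scoring passages could be dropped before joining, since a larger score means a closer match under the IP metric. The helper filter_by_score, the threshold, and the scores below are all made up for illustration.

```python
def filter_by_score(pairs, min_score):
    """Keep (text, score) pairs whose score is at least min_score."""
    return [(text, score) for text, score in pairs if score >= min_score]


# Made-up (text, score) pairs in the same shape as retrieved_lines_with_distances.
example_pairs = [
    ("relevant section", 0.62),
    ("table of contents", 0.41),
    ("unrelated section", 0.18),
]
kept = filter_by_score(example_pairs, min_score=0.4)
print("\n".join(text for text, _ in kept))
```

A fixed threshold is a blunt tool. A sensible value depends on the embedding model and the data, so it should be tuned on real queries.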
Define the system and user prompts for the language model. The user prompt is assembled from the documents retrieved from Milvus.
SYSTEM_PROMPT = """Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided."""
USER_PROMPT = f"""Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>{context}</context>
<question>{question}</question>"""
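Retrieved sections can be long, so the assembled prompt may grow quickly. The sketch below builds the same kind of prompt from a list of passages while capping the total context size. The function name build_user_prompt and the max_chars budget are assumptions for illustration, and a character count is only a rough proxy for tokens.

```python
def build_user_prompt(passages, question, max_chars=6000):
    """Wrap passages in <context> tags, adding them in order until max_chars would be exceeded.

    The first passage is always kept, even if it alone is longer than max_chars.
    """
    selected, used = [], 0
    for passage in passages:
        if selected and used + len(passage) > max_chars:
            break
        selected.append(passage)
        used += len(passage)
    context = "\n".join(selected)
    return (
        "Use the following pieces of information enclosed in <context> tags "
        "to provide an answer to the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>"
    )


print(build_user_prompt(["first passage", "second passage"], "What is this about?", max_chars=20))
```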
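Use OpenAI ChatGPT to generate a response based on the prompts.
response = openai_client.chat.completions.create(
    model="gpt-4o",  # assumption: substitute any chat model available to your account
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": USER_PROMPT}],
)
print(response.choices[0].message.content)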
The main components of a LLM-powered autonomous agent system are the Planning, Memory, and Tool use.
1. Planning: The agent breaks down large tasks into smaller, manageable subgoals, and can self-reflect and learn from past mistakes, refining its actions for future steps.
2. Memory: This includes short-term memory, which the model uses for in-context learning, and long-term memory, which allows the agent to retain and recall information over extended periods.
3. Tool use: This component allows the agent to call external APIs for additional information that is not available in the model weights, like current information, code execution capacity, and access to proprietary information sources.
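The steps in this section (embed the question, retrieve passages, build the prompt, generate) can be wrapped in one function. This sketch passes each step in as a callable so it runs offline with stand-ins. The name answer_question and the fake_* helpers are hypothetical, and real versions would wrap the OpenAI and Milvus calls shown earlier.

```python
def answer_question(question, embed, search, generate, top_k=3):
    """Run retrieve-then-generate: embed the question, fetch passages, ask the model."""
    query_vector = embed(question)
    passages = search(query_vector, top_k)
    context = "\n".join(passages)
    prompt = f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>"
    return generate(prompt)


# Stand-ins so the sketch runs without network access or API keys.
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, top_k):
    return ["Planning", "Memory", "Tool use"][:top_k]


def fake_generate(prompt):
    return f"echo: {prompt}"


print(answer_question("What are the main components?", fake_embed, fake_search, fake_generate))
```

Keeping the three dependencies as parameters also makes it easy to swap the embedding model, the vector store, or the LLM independently.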