
Build RAG with Milvus and Firecrawl

Firecrawl empowers developers to build AI applications with clean data scraped from any website. With advanced scraping, crawling, and data extraction capabilities, Firecrawl simplifies the process of converting website content into clean markdown or structured data for downstream AI workflows.

In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline using Milvus and Firecrawl. The pipeline integrates Firecrawl for web data scraping, Milvus for vector storage, and OpenAI for generating insightful, context-aware responses.

Preparation

Dependencies and environment

To get started, install the required dependencies by running the following command:

$ pip install firecrawl-py pymilvus openai requests tqdm

If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

Setting up API keys

To use Firecrawl to scrape data from the specified URL, you need to obtain a FIRECRAWL_API_KEY and set it as an environment variable. We will also use OpenAI as the LLM in this example, so you should prepare the OPENAI_API_KEY as an environment variable as well.

import os

os.environ["FIRECRAWL_API_KEY"] = "fc-***********"
os.environ["OPENAI_API_KEY"] = "sk-***********"
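If you prefer not to hardcode keys in the notebook, a minimal alternative sketch is to prompt for them at runtime instead (getpass is part of the Python standard library):

import os
from getpass import getpass

# Prompt for the keys interactively rather than embedding them in the code
os.environ["FIRECRAWL_API_KEY"] = getpass("Firecrawl API key: ")
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")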

Prepare the LLM and embedding model

We initialize the OpenAI client to prepare the embedding model.

from openai import OpenAI

openai_client = OpenAI()

Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.

def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )

Generate a test embedding and print its dimension and first few elements.

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]

Scrape data using Firecrawl

Initialize the Firecrawl application

We will use the firecrawl library to scrape data from the specified URL in markdown format. Start by initializing the Firecrawl application:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

Scrape the target website

Scrape the content from the target URL. The website LLM Powered Autonomous Agents provides an in-depth exploration of autonomous agent systems built using large language models (LLMs). We will use this content to build a RAG system.

# Scrape a website:
scrape_status = app.scrape_url(
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    params={"formats": ["markdown"]},
)

markdown_content = scrape_status["markdown"]
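Before moving on, it can be useful to sanity-check what was scraped. A quick look at the size and the beginning of the markdown content (using the markdown_content variable defined above):

# Quick sanity check on the scraped content
print(f"Scraped {len(markdown_content)} characters of markdown")
print(markdown_content[:200])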

Process the scraped content

To make the scraped content manageable for insertion into Milvus, we simply use "# " to split the content, which roughly separates the content of each main section in the scraped markdown file.

def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]


# Process the scraped markdown content
sections = split_markdown_content(markdown_content)

# Print the first few sections to understand the structure
for i, section in enumerate(sections[:3]):
    print(f"Section {i+1}:")
    print(section[:300] + "...")
    print("-" * 50)
Section 1:
Table of Contents

- [Agent System Overview](#agent-system-overview)
- [Component One: Planning](#component-one-planning)  - [Task Decomposition](#task-decomposition)
  - [Self-Reflection](#self-reflection)
- [Component Two: Memory](#component-two-memory)  - [Types of Memory](#types-of-memory)
  - [...
--------------------------------------------------
Section 2:
Agent System Overview [\#](\#agent-system-overview)

In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

- **Planning**
  - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling effi...
--------------------------------------------------
Section 3:
Component One: Planning [\#](\#component-one-planning)

A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.

#...
--------------------------------------------------

Load data into Milvus

Create the collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

As for the argument of MilvusClient:

  • Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.

  • If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server uri, e.g. http://localhost:19530, as your uri.

  • If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud (see the sketch below).
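As a reference, here is a minimal sketch of the two alternative setups described above; the endpoint and token values are placeholders you would replace with your own:

from pymilvus import MilvusClient

# Self-hosted Milvus server (Docker/Kubernetes); uri points at the server
server_client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud; uri/token are placeholders for your Public Endpoint and API key
cloud_client = MilvusClient(
    uri="https://<your-public-endpoint>",
    token="<your-api-key>",
)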

Check if the collection already exists and drop it if it does.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with the specified parameters.

If we don't specify any field information, Milvus will automatically create a default id field for the primary key, and a vector field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
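If you want to verify the schema Milvus generated, you can inspect the collection (describe_collection is part of the MilvusClient API):

# Inspect the auto-generated schema (default "id" and "vector" fields)
print(milvus_client.describe_collection(collection_name))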

Insert data

from tqdm import tqdm

data = []

for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    embedding = emb_text(section)
    data.append({"id": i, "vector": embedding, "text": section})

# Insert data into Milvus
milvus_client.insert(collection_name=collection_name, data=data)
Processing sections: 100%|██████████| 17/17 [00:08<00:00,  2.09it/s]

{'insert_count': 17, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], 'cost': 0}

Build RAG

Retrieve data for a query

Let's specify a query question about the website we just scraped.

question = "What are the main components of autonomous agents?"

Search for the question in the collection and retrieve the top-3 semantic matches.

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)

Let's take a look at the search results of the query.

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        "Agent System Overview [\\#](\\#agent-system-overview)\n\nIn a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:\n\n- **Planning**\n  - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\n  - Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n- **Memory**\n  - Short-term memory: I would consider all the in-context learning (See [Prompt Engineering](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)) as utilizing short-term memory of the model to learn.\n  - Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n- **Tool use**\n  - The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\n\n![](agent-overview.png)Fig. 1. Overview of a LLM-powered autonomous agent system.",
        0.6343474388122559
    ],
    [
        "Table of Contents\n\n- [Agent System Overview](#agent-system-overview)\n- [Component One: Planning](#component-one-planning)  - [Task Decomposition](#task-decomposition)\n  - [Self-Reflection](#self-reflection)\n- [Component Two: Memory](#component-two-memory)  - [Types of Memory](#types-of-memory)\n  - [Maximum Inner Product Search (MIPS)](#maximum-inner-product-search-mips)\n- [Component Three: Tool Use](#component-three-tool-use)\n- [Case Studies](#case-studies)  - [Scientific Discovery Agent](#scientific-discovery-agent)\n  - [Generative Agents Simulation](#generative-agents-simulation)\n  - [Proof-of-Concept Examples](#proof-of-concept-examples)\n- [Challenges](#challenges)\n- [Citation](#citation)\n- [References](#references)\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT), [GPT-Engineer](https://github.com/AntonOsika/gpt-engineer) and [BabyAGI](https://github.com/yoheinakajima/babyagi), serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.",
        0.5715497732162476
    ],
    [
        "Challenges [\\#](\\#challenges)\n\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:\n\n- **Finite context length**: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.\n\n- **Challenges in long-term planning and task decomposition**: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.\n\n- **Reliability of natural language interface**: Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.",
        0.5009307265281677
    ]
]

Use the LLM to get a RAG response

Convert the retrieved documents into a string format.

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus.

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use OpenAI ChatGPT to generate a response based on the prompts.

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
The main components of a LLM-powered autonomous agent system are the Planning, Memory, and Tool use. 

1. Planning: The agent breaks down large tasks into smaller, manageable subgoals, and can self-reflect and learn from past mistakes, refining its actions for future steps.

2. Memory: This includes short-term memory, which the model uses for in-context learning, and long-term memory, which allows the agent to retain and recall information over extended periods. 

3. Tool use: This component allows the agent to call external APIs for additional information that is not available in the model weights, like current information, code execution capacity, and access to proprietary information sources.
