Construir um pipeline RAG: Carregando dados do S3 para o Milvus
Este tutorial orienta-o no processo de construção de um pipeline RAG (Retrieval-Augmented Generation) utilizando o Milvus e o Amazon S3. Você aprenderá como carregar documentos de forma eficiente a partir de um bucket do S3, dividi-los em pedaços gerenciáveis e armazenar seus embeddings vetoriais no Milvus para uma recuperação rápida e escalável. Para simplificar esse processo, usaremos o LangChain como uma ferramenta para carregar dados do S3 e facilitar seu armazenamento no Milvus.
Preparação
Dependências e ambiente
$ pip install --upgrade --quiet pymilvus milvus-lite openai requests tqdm boto3 langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai bs4
Se estiver a utilizar o Google Colab, para ativar as dependências que acabou de instalar, poderá ter de reiniciar o tempo de execução (clique no menu "Tempo de execução" na parte superior do ecrã e selecione "Reiniciar sessão" no menu pendente).
Neste exemplo, vamos utilizar o OpenAI como LLM. Deve preparar a chave api OPENAI_API_KEY como uma variável de ambiente.
import os
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
Configuração do S3
Para carregar documentos do S3, é necessário o seguinte:
- AWS Access Key e Secret Key: Armazene-as como variáveis de ambiente para acessar com segurança seu bucket do S3:
os.environ["AWS_ACCESS_KEY_ID"] = "your-aws-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-aws-secret-access-key"
- Bucket e documento S3: Especifique o nome do bucket e o nome do documento como argumentos para a classe
S3FileLoader.
from langchain_community.document_loaders import S3FileLoader
loader = S3FileLoader(
bucket="milvus-s3-example", # Replace with your S3 bucket name
key="WhatIsMilvus.docx", # Replace with your document file name
aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
- Carregar documentos: Depois de configurado, você pode carregar o documento do S3 no seu pipeline:
documents = loader.load()
Esta etapa garante que seus documentos sejam carregados com êxito do S3 e estejam prontos para processamento no pipeline do RAG.
Dividir documentos em partes
Depois de carregar o documento, use o RecursiveCharacterTextSplitter do LangChain para dividir o conteúdo em partes gerenciáveis:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)
# Let's take a look at the first document
docs[1]
Document(metadata={'source': 's3://milvus-s3-example/WhatIsMilvus.docx'}, page_content='Milvus offers three deployment modes, covering a wide range of data scales—from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors: \n\nMilvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it’s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more. \n\nWhat Makes Milvus so Fast\U0010fc00 \n\nMilvus was designed from day one to be a highly efficient vector database system. In most cases, Milvus outperforms other vector databases by 2-5x (see the VectorDBBench results). This high performance is the result of several key design decisions: \n\nHardware-aware Optimization: To accommodate Milvus in various hardware environments, we have optimized its performance specifically for many hardware architectures and platforms, including AVX512, SIMD, GPUs, and NVMe SSD. \n\nAdvanced Search Algorithms: Milvus supports a wide range of in-memory and on-disk indexing/search algorithms, including IVF, HNSW, DiskANN, and more, all of which have been deeply optimized. Compared to popular implementations like FAISS and HNSWLib, Milvus delivers 30%-70% better performance.')
Nesta fase, os seus documentos são carregados a partir do S3, divididos em partes mais pequenas e prontos para serem processados no pipeline RAG (Retrieval-Augmented Generation).
Criar cadeia RAG com o Milvus Vetor Store
Iremos inicializar um armazenamento de vectores Milvus com os documentos, que carregam os documentos no armazenamento de vectores Milvus e constroem um índice sob o capô.
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vectorstore = Milvus.from_documents(
documents=docs,
embedding=embeddings,
connection_args={
"uri": "./milvus_demo.db",
},
drop_old=False, # Drop the old Milvus collection if it exists
)
Para o connection_args:
Definir o
uricomo um ficheiro local, por exemplo,./milvus.db, é o método mais conveniente, uma vez que utiliza automaticamente o Milvus Lite para armazenar todos os dados neste ficheiro.Se tiver uma grande escala de dados, pode configurar um servidor Milvus mais eficiente em docker ou kubernetes. Nesta configuração, utilize o uri do servidor, por exemplo,
http://localhost:19530, como o seuuri.Se pretender utilizar o Zilliz Cloud, o serviço de nuvem totalmente gerido para o Milvus, ajuste os endereços
urietoken, que correspondem ao Public Endpoint e à chave Api no Zilliz Cloud.
Pesquise os documentos no armazenamento de vectores do Milvus utilizando uma pergunta de teste. Vejamos o documento 1 de topo.
query = "How can Milvus be deployed"
vectorstore.similarity_search(query, k=1)
[Document(metadata={'pk': 455631712233193487, 'source': 's3://milvus-s3-example/WhatIsMilvus.docx'}, page_content='Milvus offers three deployment modes, covering a wide range of data scales—from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors: \n\nMilvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it’s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more. \n\nWhat Makes Milvus so Fast\U0010fc00 \n\nMilvus was designed from day one to be a highly efficient vector database system. In most cases, Milvus outperforms other vector databases by 2-5x (see the VectorDBBench results). This high performance is the result of several key design decisions: \n\nHardware-aware Optimization: To accommodate Milvus in various hardware environments, we have optimized its performance specifically for many hardware architectures and platforms, including AVX512, SIMD, GPUs, and NVMe SSD. \n\nAdvanced Search Algorithms: Milvus supports a wide range of in-memory and on-disk indexing/search algorithms, including IVF, HNSW, DiskANN, and more, all of which have been deeply optimized. Compared to popular implementations like FAISS and HNSWLib, Milvus delivers 30%-70% better performance.')]
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
# Initialize the OpenAI language model for response generation
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and provides answers to questions by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>
<question>
{question}
</question>
The response should be specific and use statistics or numbers when possible.
Assistant:"""
# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()
# Define a function to format the retrieved documents
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
Utilize a LCEL (LangChain Expression Language) para criar uma cadeia RAG.
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
res = rag_chain.invoke(query)
res
'Milvus can be deployed in three different modes: Milvus Lite for local prototyping and edge devices, Milvus Standalone for single-machine server deployment, and Milvus Distributed for deployment on Kubernetes clusters. These deployment modes cover a wide range of data scales, from small-scale prototyping to massive clusters managing tens of billions of vectors.'