Open In Colab GitHub Repository

RAG 파이프라인 구축하기: S3에서 Milvus로 데이터 로드하기

이 튜토리얼에서는 Milvus와 Amazon S3를 사용하여 검색 증강 생성(RAG) 파이프라인을 구축하는 과정을 안내합니다. S3 버킷에서 문서를 효율적으로 로드하고, 관리 가능한 청크로 분할하고, 빠르고 확장 가능한 검색을 위해 벡터 임베딩을 Milvus에 저장하는 방법을 배우게 됩니다. 이 과정을 간소화하기 위해 LangChain을 도구로 사용하여 S3에서 데이터를 로드하고 Milvus에 쉽게 저장할 수 있도록 할 것입니다.

준비

종속성 및 환경

$ pip install --upgrade --quiet pymilvus milvus-lite openai requests tqdm boto3 langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai bs4

Google Colab을 사용하는 경우, 방금 설치한 종속성을 활성화하려면 런타임을 다시 시작해야 할 수 있습니다(화면 상단의 "런타임" 메뉴를 클릭하고 드롭다운 메뉴에서 "세션 다시 시작"을 선택합니다).

이 예제에서는 OpenAI를 LLM으로 사용하겠습니다. 환경 변수로 OPENAI_API_KEY API 키를 준비해야 합니다.

import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

S3 구성

S3에서 문서를 로드하려면 다음이 필요합니다:

  1. AWS 액세스 키와 비밀 키: 이를 환경 변수로 저장하여 S3 버킷에 안전하게 액세스하세요:
os.environ["AWS_ACCESS_KEY_ID"] = "your-aws-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-aws-secret-access-key"
  1. S3 버킷 및 문서: S3FileLoader 클래스의 인수로 버킷 이름과 문서 이름을 지정합니다.
from langchain_community.document_loaders import S3FileLoader

loader = S3FileLoader(
    bucket="milvus-s3-example",  # Replace with your S3 bucket name
    key="WhatIsMilvus.docx",  # Replace with your document file name
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
  1. 문서 로드: 구성이 완료되면 S3에서 파이프라인으로 문서를 로드할 수 있습니다:
documents = loader.load()

이 단계에서는 문서가 S3에서 성공적으로 로드되어 RAG 파이프라인에서 처리할 준비가 되었는지 확인합니다.

문서를 청크로 분할

문서를 로드한 후 LangChain의 RecursiveCharacterTextSplitter 을 사용하여 콘텐츠를 관리하기 쉬운 청크로 분할합니다:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)

# Let's take a look at the first document
docs[1]
Document(metadata={'source': 's3://milvus-s3-example/WhatIsMilvus.docx'}, page_content='Milvus offers three deployment modes, covering a wide range of data scales—from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors: \n\nMilvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it’s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more. \n\nWhat Makes Milvus so Fast\U0010fc00 \n\nMilvus was designed from day one to be a highly efficient vector database system. In most cases, Milvus outperforms other vector databases by 2-5x (see the VectorDBBench results). This high performance is the result of several key design decisions: \n\nHardware-aware Optimization: To accommodate Milvus in various hardware environments, we have optimized its performance specifically for many hardware architectures and platforms, including AVX512, SIMD, GPUs, and NVMe SSD. \n\nAdvanced Search Algorithms: Milvus supports a wide range of in-memory and on-disk indexing/search algorithms, including IVF, HNSW, DiskANN, and more, all of which have been deeply optimized. Compared to popular implementations like FAISS and HNSWLib, Milvus delivers 30%-70% better performance.')

이 단계에서 문서는 S3에서 로드되고, 더 작은 청크로 분할되며, 검색 증강 생성(RAG) 파이프라인에서 추가 처리할 준비가 됩니다.

Milvus 벡터 스토어로 RAG 체인 구축하기

문서로 Milvus 벡터 스토어를 초기화하여 Milvus 벡터 스토어에 문서를 로드하고 내부적으로 인덱스를 구축합니다.

from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=False,  # Drop the old Milvus collection if it exists
)

connection_args:

  • uri 을 로컬 파일(예:./milvus.db)로 설정하는 것이 가장 편리한 방법인데, 이 파일에 모든 데이터를 저장하기 위해 Milvus Lite를 자동으로 활용하기 때문입니다.

  • 데이터 규모가 큰 경우, 도커나 쿠버네티스에 더 성능이 좋은 Milvus 서버를 설정할 수 있습니다. 이 설정에서는 서버 URL(예:http://localhost:19530)을 uri 으로 사용하세요.

  • 밀버스의 완전 관리형 클라우드 서비스인 질리즈 클라우드를 사용하려면, 질리즈 클라우드의 퍼블릭 엔드포인트와 API 키에 해당하는 uritoken 을 조정해 주세요.

테스트 쿼리 문제를 사용하여 Milvus 벡터 스토어에서 문서를 검색합니다. 상위 1개 문서를 살펴보겠습니다.

query = "How can Milvus be deployed"
vectorstore.similarity_search(query, k=1)
[Document(metadata={'pk': 455631712233193487, 'source': 's3://milvus-s3-example/WhatIsMilvus.docx'}, page_content='Milvus offers three deployment modes, covering a wide range of data scales—from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors: \n\nMilvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it’s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more. \n\nWhat Makes Milvus so Fast\U0010fc00 \n\nMilvus was designed from day one to be a highly efficient vector database system. In most cases, Milvus outperforms other vector databases by 2-5x (see the VectorDBBench results). This high performance is the result of several key design decisions: \n\nHardware-aware Optimization: To accommodate Milvus in various hardware environments, we have optimized its performance specifically for many hardware architectures and platforms, including AVX512, SIMD, GPUs, and NVMe SSD. \n\nAdvanced Search Algorithms: Milvus supports a wide range of in-memory and on-disk indexing/search algorithms, including IVF, HNSW, DiskANN, and more, all of which have been deeply optimized. Compared to popular implementations like FAISS and HNSWLib, Milvus delivers 30%-70% better performance.')]
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Initialize the OpenAI language model for response generation
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and provides answers to questions by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

LCEL(LangChain 표현 언어)을 사용하여 RAG 체인을 구축합니다.

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


res = rag_chain.invoke(query)
res
'Milvus can be deployed in three different modes: Milvus Lite for local prototyping and edge devices, Milvus Standalone for single-machine server deployment, and Milvus Distributed for deployment on Kubernetes clusters. These deployment modes cover a wide range of data scales, from small-scale prototyping to massive clusters managing tens of billions of vectors.'

Try Managed Milvus for Free

Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.

Get Started
피드백

이 페이지가 도움이 되었나요?