
Evaluation with Ragas


This guide demonstrates how to use Ragas to evaluate a Retrieval-Augmented Generation (RAG) pipeline built upon Milvus.

The RAG system combines a retrieval system with a generative model to generate new text based on a given prompt. The system first retrieves relevant documents from a corpus using a vector similarity search engine like Milvus, and then uses a generative model to generate new text based on the retrieved documents.

Ragas is a framework that helps you evaluate your RAG pipelines. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in.


Before running this notebook, make sure you have the following dependencies installed:

$ pip install --upgrade pymilvus openai requests tqdm pandas ragas

If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies.

We will use OpenAI as the LLM in this example. You should prepare the API key OPENAI_API_KEY as an environment variable.

import os

os.environ["OPENAI_API_KEY"] = "sk-***********"

Define the RAG pipeline

We will define the RAG class that uses Milvus as the vector store and OpenAI as the LLM. The class contains the load method, which loads text data into Milvus; the retrieve method, which retrieves the text data most similar to a given question; and the answer method, which answers the given question with the retrieved knowledge.

from typing import List
from tqdm import tqdm
from openai import OpenAI
from pymilvus import MilvusClient

class RAG:
    """
    RAG (Retrieval-Augmented Generation) class built upon OpenAI and Milvus.
    """

    def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
        self._prepare_openai(openai_client)
        self._prepare_milvus(milvus_client)

    def _emb_text(self, text: str) -> List[float]:
        return (
            self.openai_client.embeddings.create(input=text, model=self.embedding_model)
            .data[0]
            .embedding
        )

    def _prepare_openai(
        self,
        openai_client: OpenAI,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-3.5-turbo",
    ):
        self.openai_client = openai_client
        self.embedding_model = embedding_model
        self.llm_model = llm_model
        self.SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
        self.USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

    def _prepare_milvus(
        self, milvus_client: MilvusClient, collection_name: str = "rag_collection"
    ):
        self.milvus_client = milvus_client
        self.collection_name = collection_name
        if self.milvus_client.has_collection(self.collection_name):
            self.milvus_client.drop_collection(self.collection_name)
        embedding_dim = len(self._emb_text("foo"))
        self.milvus_client.create_collection(
            collection_name=self.collection_name,
            dimension=embedding_dim,
            metric_type="IP",  # Inner product distance
            consistency_level="Strong",  # Strong consistency level
        )

    def load(self, texts: List[str]):
        """
        Load the text data into Milvus.
        """
        data = []
        for i, line in enumerate(tqdm(texts, desc="Creating embeddings")):
            data.append({"id": i, "vector": self._emb_text(line), "text": line})

        self.milvus_client.insert(collection_name=self.collection_name, data=data)

    def retrieve(self, question: str, top_k: int = 3) -> List[str]:
        """
        Retrieve the most similar text data to the given question.
        """
        search_res = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self._emb_text(question)],
            limit=top_k,
            search_params={"metric_type": "IP", "params": {}},  # Inner product distance
            output_fields=["text"],  # Return the text field
        )
        retrieved_texts = [res["entity"]["text"] for res in search_res[0]]
        return retrieved_texts[:top_k]

    def answer(
        self,
        question: str,
        retrieval_top_k: int = 3,
        return_retrieved_text: bool = False,
    ):
        """
        Answer the given question with the retrieved knowledge.
        """
        retrieved_texts = self.retrieve(question, top_k=retrieval_top_k)
        user_prompt = self.USER_PROMPT.format(
            context="\n".join(retrieved_texts), question=question
        )
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        if not return_retrieved_text:
            return response.choices[0].message.content
        else:
            return response.choices[0].message.content, retrieved_texts
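The collection uses inner product (IP) as its similarity metric. Assuming the embedding vectors are L2-normalized to unit length (as OpenAI embedding vectors reportedly are), inner product coincides with cosine similarity; a minimal sketch with made-up toy vectors:

```python
import math

# Sketch: for unit-length vectors, inner product equals cosine similarity,
# which is why metric_type="IP" is a reasonable choice for these embeddings.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two toy unit-length vectors (0.6^2 + 0.8^2 = 1)
a = [0.6, 0.8]
b = [0.8, 0.6]
print(dot(a, b))     # 0.96
print(cosine(a, b))  # 0.96 -- identical because both vectors have norm 1
```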

Let's initialize the RAG class with OpenAI and Milvus clients.

openai_client = OpenAI()
milvus_client = MilvusClient("./milvus_demo.db")

my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)

Run the RAG pipeline and get results

We use the Milvus development guide as the private knowledge in our RAG pipeline; it is a good data source for a simple RAG pipeline.

Download it and load it into the RAG pipeline.

import os
import urllib.request

url = ""
file_path = "./"

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)
with open(file_path, "r") as file:
    file_text = file.read()

# We simply use "# " to separate the content in the file, which can roughly separate the content of each main part of the markdown file.
text_lines = file_text.split("# ")
my_rag.load(text_lines)  # Load the text data into RAG pipeline
Creating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:16<00:00,  2.80it/s]
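As a quick sanity check of this chunking heuristic, here is what the split produces on a toy markdown string (the sample content below is made up for illustration). Note that splitting on "# " also drops the heading marker from each chunk and can split inside code blocks, so it is only a rough heuristic:

```python
# Toy illustration of the naive "# " chunking used above.
sample_md = "# Intro\nSome text.\n# Hardware Requirements\n- 8GB of RAM\n# Software\nDetails."

chunks = sample_md.split("# ")
print(len(chunks))       # 4: the (empty) text before the first heading plus 3 sections
print(repr(chunks[2]))   # 'Hardware Requirements\n- 8GB of RAM\n' -- "# " marker is gone
```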

Let's define a query question about the content of the development guide documentation, and then use the answer method to get the answer and the retrieved context texts.

question = "what is the hardware requirements specification if I want to build Milvus and run from source code?"
my_rag.answer(question, return_retrieved_text=True)
('The hardware requirements specification to build and run Milvus from source code is 8GB of RAM and 50GB of free disk space.',
 ['Hardware Requirements\n\nThe following specification (either physical or virtual machine resources) is recommended for Milvus to build and run from source code.\n\n```\n- 8GB of RAM\n- 50GB of free disk space\n```\n\n##',
  'Building Milvus on a local OS/shell environment\n\nThe details below outline the hardware and software requirements for building on Linux and MacOS.\n\n##',
  "Software Requirements\n\nAll Linux distributions are available for Milvus development. However a majority of our contributor worked with Ubuntu or CentOS systems, with a small portion of Mac (both x86_64 and Apple Silicon) contributors. If you would like Milvus to build and run on other distributions, you are more than welcome to file an issue and contribute!\n\nHere's a list of verified OS types where Milvus can successfully build and run:\n\n- Debian/Ubuntu\n- Amazon Linux\n- MacOS (x86_64)\n- MacOS (Apple Silicon)\n\n##"])

Now let's prepare some questions with their corresponding ground truth answers. Then we get answers and contexts from our RAG pipeline.

from datasets import Dataset
import pandas as pd

question_list = [
    "what is the hardware requirements specification if I want to build Milvus and run from source code?",
    "What is the programming language used to write Knowhere?",
    "What should be ensured before running code coverage?",
]
ground_truth_list = [
    "If you want to build Milvus and run from source code, the recommended hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
    "The programming language used to write Knowhere is C++.",
    "Before running code coverage, you should make sure that your code changes are covered by unit tests.",
]
contexts_list = []
answer_list = []
for question in tqdm(question_list, desc="Answering questions"):
    answer, contexts = my_rag.answer(question, return_retrieved_text=True)
    contexts_list.append(contexts)
    answer_list.append(answer)

df = pd.DataFrame(
    {
        "question": question_list,
        "contexts": contexts_list,
        "answer": answer_list,
        "ground_truth": ground_truth_list,
    }
)
rag_results = Dataset.from_pandas(df)
Answering questions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.29s/it]
question contexts answer ground_truth
0 what is the hardware requirements specificatio... [Hardware Requirements\n\nThe following specif... The hardware requirements specification for bu... If you want to build Milvus and run from sourc...
1 What is the programming language used to write... [CMake & Conan\n\nThe algorithm library of Mil... The programming language used to write the Kno... The programming language used to write Knowher...
2 What should be ensured before running code cov... [Code coverage\n\nBefore submitting your pull ... Before running code coverage, you should ensur... Before running code coverage, you should make ...

Evaluation with Ragas

We use Ragas to evaluate the performance of our RAG pipeline results.

Ragas provides a set of metrics that are easy to use. We take Answer relevancy, Faithfulness, Context recall, and Context precision as the metrics to evaluate our RAG pipeline. For more information about the metrics, please refer to the Ragas Metrics documentation.
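As a rough intuition for one of these metrics (this is not Ragas's actual implementation, which uses an LLM to extract and verify claims), faithfulness can be thought of as the fraction of claims in the answer that are supported by the retrieved context. A toy keyword-containment sketch with hypothetical data:

```python
# Hypothetical sketch of the intuition behind faithfulness: the share of
# statements in the answer that the retrieved context supports. Ragas uses
# an LLM judge for this; here we just check substring containment.
def toy_faithfulness(claims, context):
    supported = sum(1 for claim in claims if claim.lower() in context.lower())
    return supported / len(claims)

context = "Milvus needs 8GB of RAM and 50GB of free disk space."
claims = ["8GB of RAM", "50GB of free disk space", "a GPU"]
print(toy_faithfulness(claims, context))  # 2 of 3 claims supported -> 2/3
```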

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

result = evaluate(
    rag_results,
    metrics=[
        answer_relevancy,
        faithfulness,
        context_recall,
        context_precision,
    ],
)

result

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

{'answer_relevancy': 0.9445, 'faithfulness': 1.0000, 'context_recall': 1.0000, 'context_precision': 1.0000}

