
How do we evaluate a RAG system on domains where no standard dataset exists (for example, a company’s internal documents)? What steps are needed to create a meaningful test set in such cases?

To evaluate a RAG system on domains without standard datasets, such as internal company documents, you must create a custom test set that reflects real-world use cases. Start by defining the scope and goals of the system. For example, if the RAG application answers questions about internal HR policies or engineering guidelines, identify common user queries and the expected outputs. Collaborate with domain experts to curate a list of representative questions and validate the correct answers. This ensures the test set aligns with actual user needs and domain-specific knowledge. Without this step, the evaluation risks being misaligned with practical scenarios, leading to unreliable metrics.
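The scope definition above can be captured as a small, machine-readable artifact so that every proposed test query is checked against it before entering the test set. This is a minimal sketch; the domain names, user roles, and query types are hypothetical placeholders for whatever you and your domain experts agree on.

```python
# Hypothetical scope definition for a RAG evaluation over internal documents.
# All category names below are illustrative, not prescribed.
evaluation_scope = {
    "domains": ["hr_policies", "engineering_guidelines"],
    "target_users": ["new_hires", "engineers", "managers"],
    "query_types": ["factual_lookup", "procedural", "policy_interpretation"],
    "out_of_scope": ["legal_advice", "payroll_disputes"],
}

def is_in_scope(query_type: str) -> bool:
    """Return True if a proposed test query's type falls inside the agreed scope."""
    return query_type in evaluation_scope["query_types"]
```

Keeping the scope explicit makes it easy to reject out-of-scope queries during curation and to revisit the scope with domain experts as the system's use cases evolve.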

Next, build the test set by sampling documents and generating query-answer pairs. Extract key topics from the internal documents (e.g., project reports, compliance manuals) and manually craft questions that users might ask. For instance, a query like “What’s the process for submitting a security incident report?” should map to specific sections in the documents. Include variations of questions (e.g., rephrased, ambiguous, or multi-hop queries) to test robustness. Annotate each query with the expected answer and the document passages that support it. To ensure quality, have domain experts review a subset of these pairs and refine them based on feedback. This process mimics real-world complexity and helps the test set capture edge cases.
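A simple schema helps keep these annotations consistent. Below is one possible sketch of a test-case record; the field names and the example document reference are assumptions, not a fixed format.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One annotated query-answer pair in the custom test set (illustrative schema)."""
    query: str
    expected_answer: str
    supporting_passages: list[str]               # document sections that ground the answer
    variations: list[str] = field(default_factory=list)  # rephrased/ambiguous forms
    reviewed_by_expert: bool = False             # flipped after domain-expert review

# Hypothetical example; the document reference is a placeholder.
case = TestCase(
    query="What's the process for submitting a security incident report?",
    expected_answer="File a report through the internal security portal.",
    supporting_passages=["security_handbook#incident-reporting"],
    variations=[
        "How do I report a security incident?",
        "Where do I file an incident report?",
    ],
)
```

Storing test cases in a structured form like this makes expert review auditable (via the `reviewed_by_expert` flag) and makes it trivial to expand coverage by appending new variations.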

Finally, design evaluation metrics tailored to the domain. Use retrieval metrics like precision@k (how many of the top-k results are relevant) and answer quality metrics like accuracy, completeness, and relevance. For example, if the system retrieves three documents but only two are relevant, precision@3 would be 2/3, or about 67%. For answer generation, manually assess whether the output correctly addresses the query and cites the right sources. Automate checks where possible—e.g., using semantic similarity scores between generated and reference answers—but prioritize human evaluation for nuanced judgments. Iterate by testing the system on the custom set, identifying failures (e.g., missed documents or incorrect summaries), and refining the model or retrieval pipeline. This approach balances rigor with practicality, even in the absence of standardized benchmarks.
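The precision@k calculation described above can be sketched in a few lines. The document IDs are illustrative; in practice they would come from your retriever's output and the test set's annotated supporting passages.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are annotated as relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Worked example from the text: 3 documents retrieved, 2 relevant.
p = precision_at_k(
    retrieved_ids=["doc_a", "doc_b", "doc_c"],  # hypothetical retriever output
    relevant_ids={"doc_a", "doc_c"},            # from the test-set annotations
    k=3,
)
# p == 2/3, about 0.67
```

For the semantic-similarity check, a common choice is cosine similarity between sentence embeddings of the generated and reference answers, but any scorer you trust can slot into the same loop; the key is to log per-query scores so failures (missed documents, incorrect summaries) can be traced back and fixed.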
