To evaluate retrieval performance in Retrieval-Augmented Generation (RAG) systems, several standard benchmarks and datasets are commonly used. These focus on testing how effectively a system retrieves relevant documents or passages from a large corpus, which is critical for accurate answer generation. Open-domain question answering (QA) tasks are a primary application, with datasets like Natural Questions (NQ), WebQuestions (WebQ), TriviaQA, MS MARCO, and HotpotQA being widely adopted. Each dataset varies in scale, question complexity, and retrieval requirements, enabling developers to test different aspects of retrieval quality.
Natural Questions (NQ) and WebQuestions (WebQ) are two foundational benchmarks. NQ contains real user queries from Google search logs, paired with human-annotated answers drawn from Wikipedia. Retrieval systems are tested on their ability to find short or long answers from a Wikipedia corpus of millions of passages (roughly 21 million in the widely used DPR split). WebQuestions is smaller, with questions sourced from the Google Suggest API and answers tied to Freebase entities. Both datasets measure retrieval accuracy through metrics like recall@k (whether a correct document appears in the top k results) and exact match (EM) for answer correctness. TriviaQA adds complexity with trivia-style questions requiring multi-paragraph evidence from Wikipedia or web sources. These datasets stress a system's ability to handle diverse query types and large-scale document search.
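Both metrics are straightforward to compute once you have ranked document IDs and gold labels per query. Here is a minimal sketch; the function names and data layout are illustrative, not part of any benchmark's official toolkit:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of queries where at least one relevant document
    appears in the top-k retrieved results."""
    hits = sum(
        1 for docs, gold in zip(retrieved_ids, relevant_ids)
        if set(docs[:k]) & set(gold)  # any overlap in the top k counts as a hit
    )
    return hits / len(retrieved_ids)

def exact_match(predictions, answers):
    """Fraction of predicted answers that exactly match a gold answer
    after simple normalization (lowercasing, whitespace collapsing)."""
    def norm(s):
        return " ".join(s.lower().split())
    matches = sum(
        1 for pred, golds in zip(predictions, answers)
        if any(norm(pred) == norm(g) for g in golds)
    )
    return matches / len(predictions)
```

In practice, official EM scripts also strip punctuation and articles ("a", "the"); the normalization above is deliberately simplified.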
Beyond QA-specific benchmarks, MS MARCO and BEIR are widely used. MS MARCO (Microsoft Machine Reading Comprehension) includes real Bing search queries and focuses on passage retrieval for answers. Its large corpus (8.8 million passages) tests scalability. BEIR (Benchmarking IR) is a heterogeneous benchmark covering 18 datasets across tasks like fact-checking and citation prediction, making it useful for evaluating retrieval robustness across domains. HotpotQA introduces multi-hop reasoning, where answering requires retrieving multiple interconnected documents. For example, answering “Who founded the company that acquired DeepMind?” requires retrieving both the fact that Google acquired DeepMind and details about Google’s founders. These benchmarks collectively test retrieval systems on precision, scalability, and reasoning, giving developers clear metrics to optimize against.
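MS MARCO's passage-ranking task is conventionally scored with MRR@10 (mean reciprocal rank), which rewards placing the first relevant passage as high as possible in the ranking. A sketch of the computation, again with an illustrative data layout:

```python
def mrr_at_k(retrieved_ids, relevant_ids, k=10):
    """Mean reciprocal rank: for each query, score 1/rank of the first
    relevant passage within the top k, or 0 if none appears."""
    total = 0.0
    for docs, gold in zip(retrieved_ids, relevant_ids):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved_ids)
```

Unlike recall@k, MRR is rank-sensitive: a system that surfaces the right passage at rank 1 scores twice as high as one that surfaces it at rank 2, which matters when a downstream generator mostly reads the top few passages.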
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.