To evaluate multi-step retrieval effectively, datasets must be explicitly designed around questions that require combining information from multiple documents. The key consideration is ensuring each question demands information from at least two distinct documents, with clear annotations indicating which documents are necessary to answer it. For example, a question like “How did Company X’s 2023 revenue compare to industry growth trends?” would require retrieving both Company X’s financial report and a separate industry analysis document. Without explicit markings of these source documents, it becomes impossible to measure whether the system correctly identifies and synthesizes the required information. Datasets must also avoid “self-contained” questions (answerable with a single document) to isolate the evaluation of multi-document reasoning.
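The annotation requirement above can be sketched as a simple dataset schema. The field names and document IDs here are illustrative assumptions, not from any standard benchmark; the point is that each entry lists the documents it requires, which lets a filter exclude self-contained questions automatically:

```python
# Hypothetical multi-hop dataset entries; field names are illustrative.
dataset = [
    {
        "question": "How did Company X's 2023 revenue compare to industry growth trends?",
        # Explicit annotation of every document needed to answer.
        "required_docs": ["companyx_2023_financials", "industry_analysis_2023"],
    },
    {
        # A "self-contained" question: answerable from one document,
        # so it should be excluded from multi-hop evaluation.
        "question": "What was Company X's 2023 revenue?",
        "required_docs": ["companyx_2023_financials"],
    },
]

def is_multi_hop(entry: dict) -> bool:
    """Keep only questions that require at least two distinct
    annotated source documents."""
    return len(set(entry["required_docs"])) >= 2

multi_hop_entries = [e for e in dataset if is_multi_hop(e)]
```

Here only the first entry survives the filter; the second is dropped because one document suffices to answer it.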
The dataset must include controlled document relationships and intentional ambiguity. Documents should have overlapping themes but distinct details, forcing the system to identify complementary information. For instance, two product manuals might describe overlapping features but differ in technical specifications, requiring the system to merge details from both. Additionally, the dataset should incorporate “distractor” documents—irrelevant or partially related texts that share keywords with the question but do not contribute to the answer. This tests whether the system can filter noise. For example, a question about “side effects of Drug A in elderly patients” might have distractor documents discussing Drug A’s efficacy in adults or Drug B’s side effects. Explicit annotations of required documents help developers verify if the system avoids false positives.
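With required documents and known distractors both annotated, verifying false positives becomes a set comparison. The function and document IDs below are a minimal sketch under those assumptions, using the Drug A example from above:

```python
def audit_retrieval(retrieved, required, distractors):
    """Compare a retrieved document set against dataset annotations:
    which required documents were missed, and which known distractors
    slipped through as false positives."""
    retrieved = set(retrieved)
    return {
        "missing_required": set(required) - retrieved,
        "distractors_retrieved": retrieved & set(distractors),
    }

# Question: "side effects of Drug A in elderly patients"
required = ["druga_elderly_trial", "druga_safety_report"]
distractors = ["druga_adult_efficacy", "drugb_side_effects"]

report = audit_retrieval(
    retrieved=["druga_elderly_trial", "druga_adult_efficacy"],
    required=required,
    distractors=distractors,
)
# The system missed one required document and was fooled by one
# keyword-overlapping distractor:
# report["missing_required"]      -> {"druga_safety_report"}
# report["distractors_retrieved"] -> {"druga_adult_efficacy"}
```

Tracking distractor hits separately from ordinary misses is the useful part: it distinguishes a retriever that ignores noise from one that is actively misled by keyword overlap.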
Finally, evaluation metrics must account for multi-document dependencies. Traditional metrics like recall@k or precision@k are insufficient because they treat documents as independent units. Instead, metrics should measure whether all required documents are retrieved, and in a logical sequence when order matters. For example, a question about “causes of Event Y” might require retrieving a policy document first to contextualize a later case study. The dataset should track retrieval order and document combinations, not just individual relevance. Datasets like HotpotQA, which annotate supporting documents for multi-hop questions, provide a template for structuring such evaluations. Developers can extend this by adding document relationship metadata (e.g., hyperlinks, citations) to simulate real-world scenarios where systems must navigate interconnected information.
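The contrast between per-document recall and an all-or-nothing multi-hop metric can be shown in a few lines. This is a sketch, not a standard library implementation; the "joint success" name is an assumption for what is sometimes called complete or joint recall:

```python
def recall_at_k(retrieved, required, k):
    """Standard recall@k: fraction of required documents found in
    the top-k, treating each document as an independent unit."""
    return len(set(retrieved[:k]) & set(required)) / len(required)

def joint_success_at_k(retrieved, required, k):
    """All-or-nothing variant: 1.0 only if *every* required document
    appears in the top-k, since a multi-hop answer fails when any
    supporting document is missing."""
    return float(set(required) <= set(retrieved[:k]))

retrieved = ["policy_doc", "unrelated_memo", "case_study"]
required = ["policy_doc", "case_study"]

recall_at_k(retrieved, required, k=2)       # 0.5 -- looks "half right"
joint_success_at_k(retrieved, required, 2)  # 0.0 -- answer is unreachable
joint_success_at_k(retrieved, required, 3)  # 1.0 -- both documents present
```

The divergence at k=2 is exactly the failure mode described above: recall@k rewards partial retrieval even though the question cannot be answered without both documents.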