User expectations for multi-hop questions differ from simple queries in three key ways: depth, logical coherence, and source reliability. Unlike single-step questions that require direct fact retrieval, multi-hop questions demand connecting information from multiple sources or reasoning steps. For example, answering “What was the average temperature in Paris when the first SpaceX rocket landed?” requires finding both the rocket landing date (2015) and Paris weather data for that period. Users expect answers to explicitly show how these pieces connect, not just present the final result. They also anticipate explanations of potential ambiguities, like whether “first successful landing” refers to Falcon 9’s 2015 milestone or earlier attempts.
Evaluation metrics must prioritize different factors than those used for simple QA. Traditional metrics like BLEU or exact match fail here because they reward surface-level text similarity rather than reasoning validity. A better approach checks whether each intermediate fact is correct, whether each step is grounded in a reliable source, and whether the steps connect logically.
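To make the failure mode concrete, here is a minimal sketch (the strings and normalization are illustrative assumptions) showing how exact match scores a correct multi-hop answer as wrong simply because its wording differs from the reference:

```python
# Minimal sketch: why exact match fails for multi-hop answers.
# Both strings below convey the same fact; only the phrasing differs.

def exact_match(prediction: str, reference: str) -> int:
    """Return 1 only if the strings match exactly after light normalization."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return int(normalize(prediction) == normalize(reference))

reference = "the first falcon 9 landing was in december 2015"
prediction = "Falcon 9 first landed in December 2015"

print(exact_match(prediction, reference))   # 0, despite the answer being correct
print(exact_match(reference, reference))    # 1, only verbatim matches score
```

Any metric this brittle says nothing about whether the reasoning chain behind the answer is valid, which is exactly what multi-hop evaluation needs to measure.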
For developers, this means moving beyond single-score metrics. A multi-hop evaluation system might combine several signals, such as the factual accuracy of each intermediate step, evidence attribution for each hop, and the logical coherence of the overall reasoning chain.
For example, an answer to “How did GDPR affect the adoption of AWS in German healthcare?” should explicitly link GDPR’s data residency rules to AWS’s Frankfurt region rollout, then connect that to healthcare migration patterns. An evaluation would penalize answers that correctly state both facts but fail to show causation, even if all facts are technically accurate. This granular approach better reflects user satisfaction with complex reasoning than traditional metrics.
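The penalty described above can be sketched as a simple rubric-based scorer. Everything here is a hedged illustration, not a standard metric: the `MultiHopRubric` structure, the weights, and the sentence-level co-occurrence test for causal links are all assumptions chosen to show the idea that stating both facts without connecting them earns only partial credit.

```python
# Hypothetical multi-hop scorer: combines fact coverage (are the required
# facts present?) with link coverage (are required cause-effect pairs
# stated together in a single sentence?). All names/weights are illustrative.

from dataclasses import dataclass


@dataclass
class MultiHopRubric:
    required_facts: list[str]               # substrings that must appear
    required_links: list[tuple[str, str]]   # (cause, effect) pairs to co-occur


def score_answer(answer: str, rubric: MultiHopRubric,
                 fact_weight: float = 0.5, link_weight: float = 0.5) -> float:
    text = answer.lower()

    # Fact coverage: fraction of required facts mentioned anywhere.
    facts_hit = sum(f.lower() in text for f in rubric.required_facts)
    fact_cov = facts_hit / len(rubric.required_facts) if rubric.required_facts else 1.0

    # Link coverage: a link counts only if cause and effect share a sentence,
    # a crude proxy for the answer making the connection explicit.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    links_hit = sum(
        any(a.lower() in s and b.lower() in s for s in sentences)
        for a, b in rubric.required_links
    )
    link_cov = links_hit / len(rubric.required_links) if rubric.required_links else 1.0

    return fact_weight * fact_cov + link_weight * link_cov


rubric = MultiHopRubric(
    required_facts=["GDPR", "Frankfurt"],
    required_links=[("GDPR", "Frankfurt")],
)

# States both facts but never connects them -> partial credit only.
disconnected = "GDPR set data residency rules. AWS opened a Frankfurt region."
# Makes the causal link explicit in one sentence -> full credit.
connected = "Because of GDPR residency rules, AWS launched its Frankfurt region."

print(score_answer(disconnected, rubric))  # 0.5
print(score_answer(connected, rubric))     # 1.0
```

A production system would replace the substring and sentence checks with entailment models or LLM judges, but the scoring shape stays the same: separate sub-scores for facts and for the links between them.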
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.