

How should a dataset for evaluating hallucination be structured? (For example, include questions where the answer is not in the knowledge base to see if the system correctly abstains or indicates uncertainty.)

To build a dataset for evaluating hallucination in language models, the structure should focus on testing the system’s ability to identify when it lacks sufficient information to answer a question. The dataset must include three core components: questions with clear answers (to verify accuracy), questions with no answers in the knowledge base (to test abstention), and ambiguous or underspecified questions (to check how uncertainty is communicated). Each question should be paired with ground-truth metadata indicating whether an answer exists, the correct response (if applicable), and the context or knowledge base used. This setup ensures the model’s behavior can be measured objectively.
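The three components above can be captured in a simple record schema. The sketch below assumes a JSON-lines file; the field names (`answer_exists`, `reference_answer`, `context_ids`, `knowledge_cutoff`) are illustrative, not a standard:

```python
import json

# Each record pairs a question with ground-truth metadata: whether an answer
# exists, the correct response (if any), and the grounding context.
records = [
    {
        "question": "What is the capital of France?",
        "answer_exists": True,           # answer is in the knowledge base
        "reference_answer": "Paris",
        "context_ids": ["doc_17"],       # documents the answer is grounded in
        "knowledge_cutoff": "2023",
    },
    {
        "question": "What is the population of Mars in 2050?",
        "answer_exists": False,          # the system should abstain
        "reference_answer": None,
        "context_ids": [],
        "knowledge_cutoff": "2023",
    },
]

with open("hallucination_eval.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

Keeping one record per line makes it easy to stream the dataset through an evaluation harness and to filter by the `answer_exists` flag.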

The dataset should balance answerable and unanswerable questions to avoid bias. For example, include straightforward factual queries like “What is the capital of France?” (answerable) alongside questions like “What is the population of Mars in 2050?” (unanswerable, as no reliable data exists). Ambiguous cases, such as “Who won the 2022 Nobel Prize in Physics?” (answerable if the knowledge cutoff is post-2022), can test how the model handles temporal constraints. To ensure realism, unanswerable questions should mirror real-world scenarios—e.g., “What are the side effects of [obscure drug not in the knowledge base]?”—and avoid synthetic or overly contrived examples. Metadata should explicitly flag whether answers are present, partially present, or absent, and specify the knowledge boundaries (e.g., “Data valid up to 2023”).
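A quick tally helps verify the balance of question types before running an evaluation. This sketch assumes each record carries an `answer_status` flag with the three values described above ("present", "partial", "absent"); the helper name is hypothetical:

```python
from collections import Counter

# Toy dataset: each entry is flagged with whether its answer is present,
# partially present, or absent from the knowledge base.
dataset = [
    {"question": "What is the capital of France?", "answer_status": "present"},
    {"question": "Who won the 2022 Nobel Prize in Physics?", "answer_status": "partial"},
    {"question": "What is the population of Mars in 2050?", "answer_status": "absent"},
    {"question": "What are the side effects of drug X?", "answer_status": "absent"},
]

def status_balance(records):
    """Return the fraction of records in each answer-status bucket."""
    counts = Counter(r["answer_status"] for r in records)
    total = len(records)
    return {s: counts[s] / total for s in ("present", "partial", "absent")}

print(status_balance(dataset))
# → {'present': 0.25, 'partial': 0.25, 'absent': 0.5}
```

If one bucket dominates, the evaluation will over-reward either answering or abstaining, so rebalancing before scoring keeps the metrics meaningful.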

Validation and metrics are critical. For answerable questions, measure precision and recall to ensure the model answers correctly. For unanswerable ones, track the abstention rate (how often the model says “I don’t know”) and the hallucination rate (how often it fabricates an answer instead of abstaining). Include edge cases like conflicting information (e.g., “Is coffee good for you?” with mixed sources) to test how the model handles uncertainty. Human reviewers should verify the dataset’s accuracy by confirming that unanswerable questions truly lack supporting data. Iterate by testing the model on the dataset, refining question clarity, and adjusting the balance of question types based on performance gaps. This structured approach ensures the evaluation reflects real-world reliability and minimizes overconfidence in ungrounded responses.
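The scoring pass described above can be sketched as follows. This assumes each example is labeled with `answer_exists` and that the model signals abstention with a sentinel token (here `"ABSTAIN"`); both the token and the exact-match comparison are simplifying assumptions, not a fixed protocol:

```python
ABSTAIN = "ABSTAIN"  # hypothetical sentinel the model emits when it declines to answer

def score(examples, predictions):
    """examples: dicts with 'answer_exists' and 'reference_answer';
    predictions: model outputs aligned with examples."""
    answerable = [(e, p) for e, p in zip(examples, predictions) if e["answer_exists"]]
    unanswerable = [(e, p) for e, p in zip(examples, predictions) if not e["answer_exists"]]

    correct = sum(p == e["reference_answer"] for e, p in answerable)
    abstained = sum(p == ABSTAIN for _, p in unanswerable)
    hallucinated = len(unanswerable) - abstained  # any answer to an unanswerable question

    return {
        "answer_accuracy": correct / max(len(answerable), 1),
        "abstention_rate": abstained / max(len(unanswerable), 1),
        "hallucination_rate": hallucinated / max(len(unanswerable), 1),
    }

examples = [
    {"answer_exists": True, "reference_answer": "Paris"},
    {"answer_exists": False, "reference_answer": None},
    {"answer_exists": False, "reference_answer": None},
]
predictions = ["Paris", "ABSTAIN", "About 1 million"]  # last one is a hallucination
print(score(examples, predictions))
# → {'answer_accuracy': 1.0, 'abstention_rate': 0.5, 'hallucination_rate': 0.5}
```

In practice the exact-match check would be replaced by a semantic comparison, but the split between answerable and unanswerable subsets, and the abstention-versus-hallucination accounting, stays the same.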
