To determine whether a RAG system’s answer is hallucinated or grounded, human judges can evaluate three criteria: alignment with the source documents, internal consistency and plausibility, and the logical structure of the answer. Judges should cross-reference the system’s output against the retrieved documents, and where necessary against external knowledge, to identify unsupported claims. Here’s how this might work in practice.
First, judges should verify whether the answer aligns with the information in the source documents provided to the RAG system. For example, if the system claims “Company X was founded in 1995,” but the retrieved documents state the founding year as 1990, the answer contradicts its own sources and is a clear hallucination. Judges can annotate each factual claim in the answer (e.g., dates, statistics, events) and check whether it is stated in the sources or follows logically from them. Vague attributions like “studies suggest” can also indicate hallucination if no supporting study appears in the retrieved content. Tools such as text highlighting and side-by-side source comparison can streamline this process.
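The claim-by-claim annotation described above can be recorded in a small structure. The following is a minimal sketch, not a standard tool: the verdict labels, the `ClaimAnnotation` fields, and the `summarize` helper are all hypothetical names chosen for illustration, and the second claim is made-up example data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Verdict(Enum):
    # Hypothetical label set; a real annotation guide may use different labels.
    SUPPORTED = "supported"        # stated in, or follows logically from, a source
    CONTRADICTED = "contradicted"  # a retrieved source says something different
    NOT_FOUND = "not_found"        # no retrieved source addresses the claim


@dataclass
class ClaimAnnotation:
    claim: str
    verdict: Verdict
    source_id: Optional[str] = None  # retrieved document the judge relied on


def summarize(annotations: List[ClaimAnnotation]) -> Tuple[float, List[str]]:
    """Return the share of supported claims and the claims that need attention."""
    if not annotations:
        return 0.0, []
    supported = sum(a.verdict is Verdict.SUPPORTED for a in annotations)
    flagged = [a.claim for a in annotations if a.verdict is not Verdict.SUPPORTED]
    return supported / len(annotations), flagged


# Illustrative annotations for one answer (the second claim is invented).
annotations = [
    ClaimAnnotation("Company X was founded in 1995", Verdict.CONTRADICTED, "doc-1"),
    ClaimAnnotation("Company X is headquartered in Berlin", Verdict.SUPPORTED, "doc-2"),
]
share, flagged = summarize(annotations)
print(f"supported share: {share:.2f}; flagged: {flagged}")
```

Keeping contradicted and not-found claims as separate verdicts lets reviewers distinguish answers that conflict with the sources from answers that merely go beyond them.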
Second, judges should assess internal consistency and plausibility. A grounded answer should avoid contradictions, both within the answer itself and with widely accepted knowledge. For instance, if a RAG system states, “The CEO founded two companies in the same year,” judges would flag this as suspicious unless the sources explicitly confirm it. Similarly, overly precise but unsupported claims (e.g., “Revenue grew by 27.3%”) might signal fabrication if the source documents only mention “significant growth.” Judges can use external references (e.g., Wikipedia, official reports) to validate high-stakes claims, though this requires balancing efficiency with thoroughness. Note that an external check confirms factual accuracy rather than grounding: a claim can be true and still be unsupported by the retrieved documents.
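As a triage aid for the "overly precise" pattern above, a script could list the numbers that appear in an answer but in none of the retrieved sources, so judges know where to look first. This is a rough sketch under simple assumptions: the function names are hypothetical, matching is purely textual, and it will miss numbers written out as words or expressed in different units.

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. 1,200 or 27.3).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text):
    """Return the set of numeric strings in text, with commas removed."""
    return {m.replace(",", "") for m in NUMBER.findall(text)}


def unsupported_numbers(answer, sources):
    """Return numbers that occur in the answer but in no source document."""
    seen = set()
    for source in sources:
        seen |= numbers_in(source)
    return sorted(numbers_in(answer) - seen)


answer = "Revenue grew by 27.3% in 2021."
sources = ["Revenue saw significant growth in 2021."]
print(unsupported_numbers(answer, sources))
```

A number flagged this way is only a candidate for review; the judge still decides whether it is fabricated or legitimately derived from the sources.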
Finally, judges should evaluate the logical structure of the answer. Grounded answers typically follow a coherent flow derived from the source material, while hallucinated content may include irrelevant tangents or leaps in reasoning. For example, if a RAG answer jumps from “Company Y invested in solar energy” to “Company Y will dominate the EV market by 2025” without intermediate evidence, this could indicate speculation. Judges might use rubrics to score answers based on how well each sentence connects to the retrieved context. Training judges to recognize common hallucination patterns (e.g., mixing facts from different sources incorrectly) can improve reliability in evaluations.
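A sentence-level rubric like the one mentioned above might be tallied as follows. The 0 to 2 scale, the `score_answer` helper, and the disagreement threshold are assumptions made for this sketch rather than an established scoring scheme.

```python
from statistics import mean

# Hypothetical rubric: how well a sentence connects to the retrieved context.
RUBRIC = {
    0: "no connection to the retrieved context",
    1: "partially supported, or requires an unsupported leap",
    2: "directly supported by the retrieved context",
}


def score_answer(ratings, max_gap=1):
    """Score one answer from per-sentence ratings.

    ratings: one list per sentence, holding each judge's rubric score.
    Returns (overall score in [0, 1], indices of sentences where judges
    disagree by more than max_gap points).
    """
    if not ratings or any(not r for r in ratings):
        raise ValueError("every sentence needs at least one rating")
    per_sentence = [mean(r) for r in ratings]
    disputed = [i for i, r in enumerate(ratings) if max(r) - min(r) > max_gap]
    overall = mean(per_sentence) / max(RUBRIC)  # normalize by the top rubric score
    return overall, disputed


# Three sentences, each rated by three judges (illustrative values).
ratings = [[2, 2, 2], [2, 1, 2], [0, 2, 0]]
overall, disputed = score_answer(ratings)
print(f"overall: {overall:.2f}; sentences to re-review: {disputed}")
```

Flagging sentences with large rating gaps gives a simple way to spot where judges may need further training or a clearer rubric.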
