The choice of underlying large language model (LLM) plays a critical role in a system's tendency to produce hallucinations, that is, outputs that are factually incorrect or unsupported by the input data. This tendency is influenced by factors like the model's training data, architecture, and decoding strategy. For example, models trained on vast, uncurated datasets may inadvertently learn biases or inaccuracies, leading them to generate plausible-sounding but incorrect information. Architectures with weaker context retention (e.g., smaller models or those with limited attention mechanisms) may struggle to stay grounded in the input, especially when processing long or complex queries. Decoding methods like high-temperature sampling can also amplify hallucinations by prioritizing fluency and variety over accuracy. For instance, a model might invent a fictional study to answer a medical question if its training data includes unreliable sources or if it lacks mechanisms to verify claims against the input.
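To make the decoding point concrete, here is a minimal sketch using the Hugging Face `transformers` generation API. The model name (`gpt2`) and prompt are placeholders for illustration only; the same comparison between greedy decoding and high-temperature sampling applies to any causal LLM.

```python
# Sketch: how decoding settings change hallucination risk.
# Assumes the Hugging Face `transformers` package; gpt2 is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "According to the provided passage, the study found that"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: deterministic, picks the most probable token at each step.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=40)

# High-temperature sampling: flattens the token distribution, which tends to
# increase fluency and variety but also the chance of unsupported claims.
sampled = model.generate(
    **inputs, do_sample=True, temperature=1.3, top_p=0.95, max_new_tokens=40
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

In practice, teams often compare outputs across several temperature settings on the same grounded prompt to see where unsupported details start to appear.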
To evaluate different LLMs on the same retrieval data for grounding performance, developers can use a combination of automated metrics and human assessment. First, define a benchmark dataset with queries paired with verified, context-specific retrieval data (e.g., a set of questions and corresponding passages from a knowledge base). Run each LLM to generate responses based on this data, then measure metrics like factual consistency (e.g., using tools like BERTScore or QuestEval to compare generated text against the source) and hallucination rate (counting unsupported claims). For example, if a model claims “Study X found Y” but the retrieval data only mentions “Study X observed Z,” this counts as a hallucination. Additionally, human evaluators can rate outputs for coherence, relevance, and adherence to the source material. Stress-testing with ambiguous or incomplete retrieval data can reveal how models handle uncertainty—e.g., whether they over-infer or default to generic statements.
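As a starting point for the automated side of this evaluation, the sketch below uses the `bert-score` package (`pip install bert-score`) to compare each model's answer against the retrieval passage it was supposed to stay grounded in. The model names, passages, and answers are hypothetical; a low F1 score is only a flag for human review, not a definitive hallucination verdict.

```python
# Sketch of an automated grounding check using BERTScore.
from bert_score import score

# Verified retrieval data the answers must stay grounded in (hypothetical).
retrieval_passages = [
    "Study X observed Z in a cohort of 200 patients.",
]

# Candidate answers from two hypothetical models on the same query.
model_outputs = {
    "model_a": ["Study X found Y in its participants."],    # unsupported claim
    "model_b": ["Study X observed Z among 200 patients."],  # grounded claim
}

for name, answers in model_outputs.items():
    # F1 is the usual headline number; low scores indicate answers that drift
    # from the source passage and should be checked by a human evaluator.
    precision, recall, f1 = score(answers, retrieval_passages, lang="en", verbose=False)
    print(f"{name}: BERTScore F1 = {f1.mean().item():.3f}")
```

Pairing scores like these with manually counted unsupported claims gives both a quick screening signal and a ground-truth hallucination rate per model.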
Improving grounding performance often involves both model-specific adjustments and evaluation refinements. For instance, models fine-tuned with reinforcement learning from human feedback (RLHF) or trained on datasets emphasizing citation (e.g., “always reference the provided source”) may show fewer hallucinations. Developers can also compare architectures: a model like GPT-4 might outperform smaller variants in complex grounding tasks due to its ability to parse longer contexts, while retrieval-augmented models like RETRO might excel by design. To ensure fair evaluation, use controlled inputs (e.g., identical prompts and retrieval contexts across tests) and track edge cases, such as how models handle conflicting information in the retrieval data. For example, if two sources contradict each other, does the model acknowledge the conflict or pick a side arbitrarily? By systematically analyzing these behaviors, developers can identify which LLMs strike the best balance between creativity and factual reliability for their specific use case.
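The following sketch shows one way to keep such comparisons controlled: every model receives the identical prompt template and retrieval context, including a deliberately conflicting-source case. `call_model` stands in for whatever inference API each LLM exposes; all names here are hypothetical.

```python
# Sketch of a controlled comparison harness for grounding behavior.
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "Answer using ONLY the context below. If the sources conflict or the "
    "answer is not present, say so explicitly.\n\nContext:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

# Hypothetical edge cases: conflicting sources and a missing answer.
test_cases = [
    {
        "question": "What did Study X find?",
        "context": "Source A: Study X observed Z.\nSource B: Study X found no effect.",
        "note": "conflicting sources: the model should flag the disagreement",
    },
    {
        "question": "What dosage was tested?",
        "context": "Source A: Study X observed Z.",
        "note": "answer absent: the model should decline rather than invent one",
    },
]

def run_suite(models: Dict[str, Callable[[str], str]]) -> Dict[str, List[str]]:
    """Run every model on identical prompts and collect raw outputs."""
    results: Dict[str, List[str]] = {}
    for name, call_model in models.items():
        outputs = []
        for case in test_cases:
            prompt = PROMPT_TEMPLATE.format(
                context=case["context"], question=case["question"]
            )
            outputs.append(call_model(prompt))
        results[name] = outputs
    return results

# Usage with hypothetical wrappers around each model's API:
# results = run_suite({"model_a": call_model_a, "model_b": call_model_b})
# Outputs are then reviewed or scored for whether each model acknowledged
# conflicts and missing information instead of picking a side arbitrarily.
```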
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.