
What is the concept of “open-book” QA and how does it relate to RAG? How would you evaluate an LLM in an open-book setting differently from a closed-book setting?

Open-Book QA and Its Relation to RAG

Open-book question answering (QA) allows models to access external information sources, such as documents or databases, to answer questions. Unlike closed-book QA, where models rely solely on pre-trained knowledge, open-book systems can retrieve and reference specific data during inference. For example, a medical QA system might search research papers to answer a question about treatment guidelines. This approach is useful when answers depend on up-to-date or domain-specific information not fully memorized during training.

Retrieval-Augmented Generation (RAG) is a specific implementation of open-book QA. RAG combines two steps: retrieving relevant documents and generating answers using that context. For instance, a RAG-based chatbot might first query a company’s internal documentation to find policies, then synthesize an answer from those results. The key connection is that RAG explicitly separates retrieval (accessing external data) and generation (producing the final answer), making it a structured way to handle open-book tasks. Both aim to enhance accuracy by grounding answers in external evidence, but RAG formalizes the process.
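As a rough illustration of that two-step structure, the sketch below retrieves context and assembles a grounded prompt. The word-overlap retriever is a toy stand-in for a real vector search, and the commented-out `call_llm` is a hypothetical placeholder for whatever generation API you actually use.

```python
# Minimal open-book / RAG sketch: retrieve relevant text, then ground the
# answer in it. The retriever is a toy word-overlap scorer standing in for
# real vector search; call_llm is a hypothetical placeholder for an LLM API.

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

corpus = [
    "Remote employees may expense home-office equipment up to $500 per year.",
    "The travel policy requires booking flights at least 14 days in advance.",
    "Security training must be completed annually by all staff.",
]
question = "How much can remote employees expense for home-office equipment?"

passages = retrieve(question, corpus)
prompt = build_prompt(question, passages)
print(prompt)
# answer = call_llm(prompt)  # hypothetical: replace with your LLM client of choice
```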

Evaluating LLMs in Open-Book vs. Closed-Book Settings

In closed-book settings, evaluation focuses on the model’s ability to recall factual knowledge from its training data. Metrics like accuracy on trivia questions (e.g., “What year was the first moon landing?”) measure memorization. If the model answers incorrectly, it indicates gaps in its training data or retention.
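A closed-book evaluation can be as simple as normalized exact-match accuracy over question/answer pairs, with no retrieval step involved. In this sketch the questions, gold answers, and model outputs are all made up for illustration.

```python
# Closed-book evaluation sketch: the model sees only the question, and we
# score recall of memorized facts with normalized exact match.

def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

eval_set = [
    {"question": "What year was the first moon landing?", "gold": "1969"},
    {"question": "Who wrote 'On the Origin of Species'?", "gold": "Charles Darwin"},
]

# In practice these would come from the model; hard-coded here for illustration.
model_answers = ["1969", "Darwin"]

correct = sum(
    normalize(answer) == normalize(item["gold"])
    for answer, item in zip(model_answers, eval_set)
)
print(f"Exact-match accuracy: {correct / len(eval_set):.2f}")  # 0.50
```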

In open-book settings, evaluation shifts to assessing how well the model retrieves and uses external information. For example, if a model answers a question about COVID-19 variants by citing outdated studies, the retrieval component failed. Metrics here include retrieval precision (e.g., how many retrieved documents are relevant) and answer faithfulness (whether the generated answer correctly reflects the sources). Additionally, open-book evaluation must test scenarios where no relevant data exists—does the model admit uncertainty or hallucinate?
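One way to operationalize these two metrics is sketched below. Retrieval precision is the fraction of retrieved documents judged relevant; faithfulness is approximated here with a crude lexical check that the answer's content words appear in the retrieved sources, whereas real evaluations typically rely on human judgments or an LLM-based grader. All IDs and text are made up.

```python
# Open-book evaluation sketch: score the retrieval step and the generation
# step separately. Faithfulness uses a crude lexical proxy; production
# evaluations usually rely on human or LLM-based judgments instead.

def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in retrieved_ids) / len(retrieved_ids)

def lexical_faithfulness(answer: str, sources: list[str]) -> float:
    """Share of the answer's longer words that appear somewhere in the sources."""
    source_words = set(" ".join(sources).lower().split())
    answer_words = [w for w in answer.lower().split() if len(w) > 3]  # skip short, stopword-ish tokens
    if not answer_words:
        return 0.0
    return sum(w in source_words for w in answer_words) / len(answer_words)

retrieved = ["doc_12", "doc_07", "doc_31"]          # what the retriever returned
relevant = {"doc_12", "doc_31"}                      # human-labeled relevant documents
answer = "Remote employees can expense equipment up to $500 per year."
sources = ["Remote employees may expense home-office equipment up to $500 per year."]

print(f"Retrieval precision: {retrieval_precision(retrieved, relevant):.2f}")       # 0.67
print(f"Faithfulness (lexical proxy): {lexical_faithfulness(answer, sources):.2f}")
```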

Practical Differences in Evaluation Design

Closed-book benchmarks often use static datasets (e.g., SQuAD for QA), where answers are known to exist in the training corpus. Open-book evaluations require dynamic datasets paired with external corpora. For example, a test might provide a question and a large document set, checking if the model can pinpoint the correct passage and generate a coherent answer.
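The difference in dataset shape can be made concrete: a closed-book item is just a question/answer pair, while an open-book item also carries the document pool and the identity of the passage(s) that support the answer. The field names below are illustrative, not a standard schema.

```python
# Illustrative dataset shapes for the two settings (field names are
# assumptions, not a standard benchmark schema).
from dataclasses import dataclass, field

@dataclass
class ClosedBookItem:
    question: str
    gold_answer: str

@dataclass
class OpenBookItem:
    question: str
    gold_answer: str
    corpus: list[str]                                          # external documents the model may consult
    gold_passage_ids: list[int] = field(default_factory=list)  # which documents actually support the answer

item = OpenBookItem(
    question="What is the expense limit for home-office equipment?",
    gold_answer="$500 per year",
    corpus=[
        "Remote employees may expense home-office equipment up to $500 per year.",
        "The travel policy requires booking flights at least 14 days in advance.",
    ],
    gold_passage_ids=[0],
)
print(item.question, "->", item.gold_answer)
```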

Another difference is error analysis. In closed-book, errors might prompt retraining with more data. In open-book, errors could stem from poor retrieval (e.g., a faulty search algorithm) or flawed synthesis (e.g., misinterpreting a retrieved table). Developers might test retrieval separately—say, by measuring if the ground-truth source is in the top-5 retrieved documents—before assessing the generator’s ability to use it. This layered approach isolates weaknesses in the open-book pipeline.
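The retrieval-only check mentioned above is often reported as hit@k: does at least one ground-truth source appear in the top-k retrieved documents? A minimal sketch with made-up document IDs:

```python
# Layered error analysis sketch: evaluate the retriever alone with hit@k
# before scoring the generator. Document IDs below are made up.

def hit_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> bool:
    """True if any ground-truth document appears in the top-k retrieved results."""
    return any(doc_id in gold_ids for doc_id in retrieved_ids[:k])

# One entry per evaluation question: (ranked retrieval results, ground-truth sources).
results = [
    (["doc_03", "doc_12", "doc_44", "doc_07", "doc_19"], {"doc_12"}),
    (["doc_88", "doc_02", "doc_61", "doc_05", "doc_30"], {"doc_77"}),
    (["doc_21", "doc_09", "doc_33", "doc_40", "doc_15"], {"doc_40", "doc_50"}),
]

hits = sum(hit_at_k(retrieved, gold, k=5) for retrieved, gold in results)
print(f"hit@5: {hits / len(results):.2f}")  # 2 of 3 questions -> 0.67
```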
