AI Quick Reference
Looking for fast answers or a quick refresher on AI-related topics? The AI Quick Reference has everything you need—straightforward explanations, practical solutions, and insights on the latest trends like LLMs, vector databases, RAG, and more to supercharge your AI projects!
- What is multi-step retrieval (or multi-hop retrieval) in the context of RAG, and can you give an example of a question that would require this approach?
- What biases might be present in a RAG evaluation dataset if the knowledge source is fixed (e.g., Wikipedia) and how do we account for those in judging performance?
- Why might a RAG-generated answer score well on BLEU/ROUGE against a reference answer but still be considered a poor response in practice?
- Why might an embedding model fine-tuned on domain-specific data outperform a general-purpose embedding model in a specialized RAG application (for example, legal documents or medical texts)?
- What is a hallucination in the context of RAG, and how does it differ from a simple error or omission in the answer?
- Why might a high-performing retriever still result in a hallucinated answer from the LLM? (Think about the LLM’s behavior and the possibility of it ignoring or misinterpreting context.)
- How would you compare a system that uses a smaller but highly relevant private knowledge base to one that searches a broad corpus like the entire web? (Consider answer accuracy, trustworthiness, and response time.)
- How would you evaluate the benefit of adding a second-stage retriever (e.g., a broad-recall first pass followed by a precise re-ranker) against just using a single-stage retriever with tuned parameters? (A two-stage retrieve-then-re-rank sketch appears after this list.)
- How might adding metadata filters to retrieval queries (e.g., only retrieve from certain document types or date ranges) affect the performance of the vector store, and how would you evaluate that overhead?
- How can an LLM be used to perform multi-step retrieval? (For instance, the LLM uses the initial query to retrieve something, then formulates a new query based on what was found, etc.)
- What are the pros and cons of an architecture where the LLM generates an answer and then a separate verification step checks and potentially corrects it using retrieval again?
- How do approximate nearest neighbor settings (like search accuracy vs speed configurations) influence the end-to-end RAG latency and possibly the answer quality?
- What are BERTScore and other embedding-based metrics, and can they be helpful in evaluating the similarity between a generated answer and a reference answer or source text?
- How do batch processing or asynchronous calls improve the throughput of a RAG system, and what is the effect on single-query latency?
- How can caching mechanisms be used in RAG to reduce latency, and what types of data might we cache (embeddings, retrieved results for frequent queries, etc.)? (A small caching sketch appears after this list.)
- How do metrics like contextual precision and contextual recall (such as those in certain RAG evaluation frameworks) work, and what do they indicate about a system’s performance?
- How do cross-encoder re-rankers complement a bi-encoder embedding model in retrieval, and what does this imply about the initial embedding model’s limitations?
- What are “custom retrieval-based metrics” one could use for RAG evaluation? (Think of metrics that check if the answer contains information from the retrieved text or if all answer sentences can be found in sources.)
- How do you decide how many retrieval rounds to allow (depth of multi-step) before just answering with what’s gathered? Where do diminishing returns set in, and how do you measure them?
- How do different retrieval strategies affect the interpretability or explainability of a RAG system’s answers (for example, an answer with cited sources vs. an answer from an opaque model memory), and how might you evaluate user trust in each approach?
- What role does embedding dimensionality play in balancing semantic expressiveness and computational efficiency, and how do you determine the “right” dimension for a RAG system?
- How would you evaluate a RAG system that employs multi-step retrieval differently than one that uses a single retrieval step? (Consider tracking intermediate retrieval accuracy and final answer correctness.)
- What kinds of evaluation metrics or criteria could capture the success of a multi-hop QA (for example, does the answer correctly integrate information from two separate documents)?
- How can few-shot examples be utilized in a RAG prompt to demonstrate how the model should use retrieved information (for instance, providing an example question, the context, and the answer as a guide)?
- How can fine-tuning an LLM on retrieved data (like feeding it lots of examples of using documents to answer questions) potentially improve performance, and how would you validate the improvement?
- How does fine-tuning a model on a domain (so it “knows” a lot of the answers) compare to using an external retrieval system for that domain? What evaluation would highlight the differences (like evaluating on questions outside the fine-tuned knowledge)?
- Why is it useful to have a variety of question types (factoid, explanatory, boolean, etc.) in a RAG evaluation set, and how might each stress the system differently?
- What are the pros and cons of using high-dimensional embeddings versus lower-dimensional embeddings in terms of retrieval accuracy and system performance?
- If the retrieval step is found to be slow, what optimizations might you consider? (Think indexing technique changes, hardware acceleration, or reducing vector size, and how you would decide which to try based on measurements.)
- In a RAG system, should the original question be repeated or rephrased in the prompt along with the retrieved text, and what effect might that have on the answer?
- In a RAG system, when might you choose to use an advanced re-ranking model on retrieved passages before feeding to the LLM, and what does that trade off in terms of latency or complexity?
- What is an acceptable latency for a RAG system in an interactive setting (e.g., a chatbot), and how do we ensure both retrieval and generation phases meet this target?
- How does introducing a retrieval step in a QA system affect end-to-end latency compared to a standalone LLM answer generation, and how can we measure this impact?
- What are the challenges of keeping a generation grounded when using multi-step retrieval, and how might errors compound over multiple steps?
- How could you use the LLM itself to improve retrieval — for example, by generating a better search query or re-ranking the retrieved results? How would you measure the impact of such techniques?
- How does listing multiple retrieved documents in the prompt (perhaps with titles or sources) help or hinder the LLM in generating an answer?
- How does multi-step retrieval impact latency, and how can a system decide whether the improved answer quality is worth the extra time spent retrieving multiple rounds?
- What role do negative examples (questions paired with irrelevant documents) play in evaluating the robustness of a RAG system?
- How does network latency play a role when the vector store or the LLM is a remote service (for instance, calling a cloud API), and how can we mitigate this in evaluation or production?
- Why is it important to prepare a dedicated evaluation dataset for RAG, and what should the key components of such a dataset be?
- How might query caching or prefetching frequently asked questions improve the apparent efficiency of the vector store in a RAG system, and what are the pros and cons of evaluating a system with such caching enabled?
- What are the limitations of using ROUGE or METEOR for RAG evaluation, especially considering there may be multiple correct ways to answer a question with the retrieved info?
- In what cases might retrieval actually save time overall in getting an answer (consider when the alternative is the LLM reasoning through facts it doesn’t know versus quickly looking them up)?
- How does retrieval-augmented generation help with the issue of an LLM’s static knowledge cutoff or memory limitations?
- What are the advantages and disadvantages of retrieving a large number of documents (say top-10 or top-20) versus only a few top relevant ones (top-3) as context for the LLM?
- What are some known metrics or scores (such as “faithfulness” scores from tools like RAGAS) that aim to quantify how well an answer sticks to the provided documents?
- What is the benefit of splitting an evaluation into retrieval evaluation and generation evaluation components using the same dataset (i.e., first evaluate how many answers can be found in the docs, then how well the model uses them)?
- How can synthetic data generation help in building a RAG evaluation dataset, and what are the risks of using synthetic queries or documents?
- What is the RAG “triad” of metrics sometimes discussed (e.g., answer relevance, support relevance, and correctness), and how do these provide a comprehensive picture of system performance?
- What is the ReAct (Reason+Act) framework in relation to multi-step retrieval, and how would you determine if an agent-like RAG system is doing the right reasoning steps?
- How does the distance metric used (cosine vs. L2) interact with the embedding model choice, and could a mismatch lead to suboptimal retrieval results?
- How does embedding model choice affect the size and speed of the vector database component, and what trade-offs might this introduce for real-time RAG systems?
- How does the choice of embedding model (e.g., SBERT vs. GPT-3 embeddings vs. custom-trained models) influence the effectiveness of retrieval in a RAG system?
- Why is the efficiency of the vector store important in a RAG system, and how does it affect the overall user experience (consider both latency and throughput)?
- What are the individual components of latency in a RAG pipeline (e.g., time to embed the query, search the vector store, and generate the answer), and how can each be optimized? (A per-stage timing sketch appears after this list.)
- How does the length of retrieved context fed into the prompt affect the LLM’s performance and the risk of it ignoring some parts of the context?
- What is the impact of embedding dimension and index type on the performance of the vector store, and how might that influence design choices for a RAG system requiring quick retrievals?
- What is the role of the prompt or instructions given to the LLM in ensuring the answer is well-formed and coherent, and how would you evaluate different prompt styles on answer quality?
- What is the impact of embedding quality on downstream generation — for example, can a poorer embedding that misses nuances cause the LLM to hallucinate or get answers wrong?
- How does model size or type (e.g., GPT-3 vs smaller open-source models) affect how you design the RAG pipeline, and what metrics would show these differences (for example, one model might need more context documents than another)?
- How does the specificity of the prompt (e.g., “Using only the information below, answer…” vs. a generic instruction) influence the generation, and how might we measure which prompt yields more grounded answers?
- What is the trade-off between answer completeness and hallucination risk, and how can a system find the right balance (for example, being more conservative in answering if unsure)?
- What is the trade-off between complexity and benefit in multi-step retrieval — in what cases might a simpler one-shot retrieval actually perform just as well?
- What are the trade-offs between doing retrieval on the fly for each query (real-time) versus precomputing possible question-answer pairs or passages (offline) in terms of system design and evaluation?
- What role does the underlying LLM play in hallucination tendencies, and how might one evaluate different LLMs on the same retrieval data for their grounding performance?
- What strategies can improve the coherence of a RAG answer if the retrieved passages are from different sources or have different writing styles (the “Frankenstein” answer problem)?
- How can we ask the model to provide sources or cite the documents it used in its answer, and what are the challenges in evaluating such citations for correctness?
- How can we assess the coherence and fluency of answers generated by a RAG system, aside from just checking factual correctness?
- How might one assess whether an embedding model is capturing the nuances needed for a particular task (e.g., does it cluster questions with their correct answers in vector space)?
- How might one architect a RAG system to handle high-concurrency scenarios without significant latency degradation (e.g., scaling the vector database, using multiple LLM instances)?
- How should a dataset for evaluating hallucination be structured? (For example, include questions where the answer is not in the knowledge base to see if the system correctly abstains or indicates uncertainty.)
- What monitoring would you put in place to catch when either the retrieval step or the generation step is becoming a bottleneck in latency during production usage?
- How can the number of retrieved documents (top-k) be chosen to balance vector store load and generation effectiveness, and what experiments would you run to find the sweet spot?
- Which natural language generation metrics (e.g., BLEU, ROUGE, METEOR) can be used to compare a RAG system’s answers to reference answers, and what are the limitations of these metrics in this context?
- How can one use an evaluation metric to compare two RAG systems that might have different strengths (e.g., one retrieves better but the other has a stronger generator)? What composite or multi-dimensional evaluation would you do?
- How would you go about creating a test set for RAG that includes questions, relevant context documents, and ground-truth answers? (Consider using existing QA datasets and adding context references.)
- How could you design a metric to penalize ungrounded content in an answer? (For example, a precision-like metric that counts the proportion of answer content supported by docs; a minimal sketch appears after this list.)
- What techniques can be used to detect hallucinations in a RAG-generated answer (for example, checking if all factual claims have support in the retrieved text)?
- In an evaluation setting, how could human judges determine if a RAG system’s answer is hallucinated or grounded? What criteria might they use?
- How do we ensure that the LLM’s answer fully addresses the user’s query in a RAG setup? (For example, if multiple points are asked, does the answer cover them all?)
- How do we ensure that the test dataset truly requires retrieval augmentation (i.e., the answers are not already memorized by the model or trivial without external info)?
- What metrics would you track to ensure the vector store is performing well under load (for instance, the QPS it is handling, average search time, and recall at a given latency)?
- How do we ensure that the introduction of retrieval does not introduce new biases or issues in the LLM’s responses? Can evaluation reveal cases where the model over-trusts or misuses retrieved information?
- How do we evaluate a RAG system on domains where no standard dataset exists (for example, a company’s internal documents)? What steps are needed to create a meaningful test set in such cases?
- How would you evaluate the performance of a RAG system over time or after updates? (Consider setting up a continuous evaluation pipeline with key metrics to catch regressions in either retrieval or generation.)
- How can we evaluate different embedding models to decide which yields the best retrieval performance for our specific RAG use case? (A recall@k/MRR sketch appears after this list.)
- How can we evaluate factual correctness of an answer when a reference answer is available? (Consider exact match or F1 as used in QA benchmarks like SQuAD; a small EM/F1 sketch appears after this list.)
- If we were to evaluate multi-step retrieval, what special dataset considerations would we need (perhaps questions that require joining information from two documents, with those documents marked)?
- How can we evaluate whether an answer from the LLM is fully supported by the retrieval context? (Consider methods like answer verification against sources or using a secondary model to cross-check facts.)
- How can an LLM be guided to ask a follow-up question when the retrieved information is insufficient? (Think in terms of conversational RAG or an agent that can perform multiple retrieve-then-read cycles.)
- What modifications to the prompt or system might be needed to handle multi-step retrieval (such as giving the model the ability to output a follow-up question), and how to evaluate this new ability?
- How might we incorporate multiple modalities in RAG (say retrieving an image or a table) and still use an LLM for generation? What additional evaluation considerations does this bring?
- How can we incorporate user feedback or real user queries into building a dataset for RAG evaluation, and what are the challenges with using real-world queries?
- When integrating retrieval into multi-turn conversations, how can the prompt incorporate newly retrieved context while keeping the relevant parts of the conversation history?
- How can one leverage existing QA datasets like TriviaQA or Natural Questions for RAG evaluation, and what modifications are needed to adapt them to a retrieval setting?
- How can we explicitly measure “supporting evidence coverage,” i.e., whether all parts of the answer can be traced back to some retrieved document?
- How can the success of intermediate retrieval steps be measured? (For example, if the first retrieval should find a clue that helps the second retrieval, how do we verify the clue was found?)
- What are some methods to obtain ground truth for which document or passage contains the answer to a question (e.g., using annotated datasets like SQuAD which point to evidence)?
- What modifications might be needed to the LLM’s input formatting or architecture to best take advantage of retrieved documents (for example, adding special tokens or segments to separate context)?
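
The sketches below illustrate a few of the techniques referenced in the questions above. They are minimal outlines under stated assumptions, not production implementations. First, for the two-stage retrieval and re-ranking questions: this sketch assumes the sentence-transformers package, uses an illustrative cross-encoder model name, and stands in a placeholder `vector_search` for whatever vector store client you use.

```python
from sentence_transformers import CrossEncoder


def vector_search(query: str, top_k: int) -> list[str]:
    """Placeholder for the first-stage (bi-encoder / vector store) retrieval."""
    return [f"candidate passage {i} for: {query}" for i in range(top_k)]


# Illustrative model choice; any cross-encoder trained for passage ranking works similarly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def retrieve_and_rerank(query: str, recall_k: int = 50, final_k: int = 5) -> list[str]:
    candidates = vector_search(query, recall_k)                   # cheap, broad recall stage
    scores = reranker.predict([(query, p) for p in candidates])   # slower, precise scoring stage
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:final_k]]
```

The cross-encoder pass adds latency roughly linear in recall_k, which is exactly the trade-off the two-stage questions ask you to weigh against a well-tuned single-stage retriever.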
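
For the caching questions, a minimal sketch using Python's functools.lru_cache to memoize retrieval results per query string. The `embed` and `vector_search` functions are stand-ins for your real embedding model and vector store calls, and the cache size is an arbitrary illustrative choice.

```python
from functools import lru_cache


def embed(query: str) -> list[float]:
    """Stand-in for a real embedding model call."""
    return [float(ord(c)) for c in query[:8]]


def vector_search(embedding: list[float], top_k: int) -> list[str]:
    """Stand-in for a real vector store query."""
    return [f"doc_{i}" for i in range(top_k)]


@lru_cache(maxsize=10_000)
def cached_retrieval(query: str, top_k: int = 5) -> tuple[str, ...]:
    # A cache hit skips both the embedding call and the vector store round trip.
    return tuple(vector_search(embed(query), top_k))


print(cached_retrieval("what is a vector database?"))  # miss: computed and stored
print(cached_retrieval("what is a vector database?"))  # hit: served from the cache
print(cached_retrieval.cache_info())                   # hits=1, misses=1
```

Caching lowers latency for repeated queries but can hide index updates, so cache expiry and whether the cache is enabled during evaluation both matter.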
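
For the latency-breakdown questions, a sketch that times the embed, search, and generate stages separately so you can see which one dominates. The three stage functions are stubs that merely sleep; replace them with your own pipeline calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000  # milliseconds


def embed_query(query: str) -> list[float]:
    time.sleep(0.01)                     # stand-in for embedding-model latency
    return [0.0] * 768


def vector_search(vec: list[float], top_k: int) -> list[str]:
    time.sleep(0.03)                     # stand-in for vector store latency
    return ["doc"] * top_k


def generate(query: str, docs: list[str]) -> str:
    time.sleep(0.20)                     # stand-in for LLM generation latency
    return "answer"


def answer(query: str) -> str:
    with timed("embed"):
        vec = embed_query(query)
    with timed("search"):
        docs = vector_search(vec, top_k=5)
    with timed("generate"):
        result = generate(query, docs)
    return result


answer("what is approximate nearest neighbor search?")
print({stage: f"{ms:.1f} ms" for stage, ms in timings.items()})
```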
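
For the questions on penalizing ungrounded content and measuring supporting-evidence coverage, a minimal precision-like metric: an answer sentence counts as supported if enough of its content words appear in at least one retrieved document. The regex tokenizer and the 0.6 threshold are illustrative assumptions; real systems often use entailment models or embedding similarity instead.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, ignoring very short words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def grounded_precision(answer: str, retrieved_docs: list[str], threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose content words mostly appear in some retrieved doc."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    doc_vocab = [content_words(d) for d in retrieved_docs]
    supported = 0
    for sent in sentences:
        words = content_words(sent)
        if not words:
            supported += 1  # treat contentless sentences (e.g., "Yes.") as grounded
            continue
        best = max((len(words & vocab) / len(words) for vocab in doc_vocab), default=0.0)
        if best >= threshold:
            supported += 1
    return supported / len(sentences) if sentences else 0.0


docs = ["The Eiffel Tower is 330 metres tall and located in Paris."]
generated = "The Eiffel Tower is located in Paris. It was painted blue in 2001."
print(grounded_precision(generated, docs))  # 0.5: one grounded sentence, one ungrounded
```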
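
For the questions on comparing embedding models or retriever configurations independently of generation, a retrieval-only evaluation sketch computing recall@k and MRR. The `search` callable is a placeholder for whatever vector store client you use, and the qrels format (query id mapped to relevant doc ids) is an assumption made for illustration.

```python
from typing import Callable


def evaluate_retriever(
    search: Callable[[str, int], list[str]],  # returns ranked doc ids for a query text
    queries: dict[str, str],                  # query id -> query text
    qrels: dict[str, set[str]],               # query id -> ids of relevant docs
    k: int = 5,
) -> dict[str, float]:
    recall_hits, reciprocal_ranks = 0, []
    for qid, text in queries.items():
        ranked = search(text, k)
        relevant = qrels.get(qid, set())
        if relevant & set(ranked):
            recall_hits += 1
        rank = next((i + 1 for i, doc in enumerate(ranked) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(queries)
    return {"recall@k": recall_hits / n, "mrr@k": sum(reciprocal_ranks) / n}


def toy_search(text: str, k: int) -> list[str]:
    """Fake retriever that always returns the same two doc ids."""
    return ["doc_a", "doc_b"][:k]


print(evaluate_retriever(toy_search, {"q1": "who wrote Hamlet?"}, {"q1": {"doc_b"}}, k=2))
# -> {'recall@k': 1.0, 'mrr@k': 0.5}
```

Running the same queries and qrels through two retrievers (for example, the same index built with two different embedding models) yields directly comparable numbers.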
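
Finally, for the question on factual correctness against a reference answer, a sketch of SQuAD-style exact match and token-level F1. The normalization (lowercasing, stripping punctuation and articles) follows the common SQuAD convention but may differ in small ways from any particular benchmark's official script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Treaty of Versailles", "Treaty of Versailles"))  # 1.0 after normalization
print(token_f1("It was signed in 1919 at Versailles", "The treaty was signed in 1919"))
```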