To optimize LLMs for leveraging retrieved documents, three key modifications are needed: input formatting adjustments, architectural changes, and training strategy updates. These modifications help the model distinguish between different information sources, process longer contexts, and learn to prioritize relevant content.
Input Formatting
The first step is to structure the input to clearly separate the retrieved documents from the original query. Special tokens like [DOC], [CONTEXT], or [SEP] can demarcate document boundaries. For example, a query might be formatted as [QUERY] What causes climate change? [CONTEXT] [DOC1] ...text... [DOC2] ...text.... This helps the model recognize where external knowledge begins and ends. Additionally, positional embeddings or attention masks can be adjusted to prioritize document segments. For lengthy documents, chunking or sliding-window approaches can avoid truncation losses. Tools like Longformer’s sparse attention patterns or hierarchical summarization (e.g., compressing each document into a vector before full processing) can also manage context length. For instance, a system might first encode each document separately, then combine them with the query using cross-attention layers.
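To make the formatting step concrete, here is a minimal sketch in Python that assembles a query and its retrieved documents into a single marked-up prompt and applies a simple sliding-window chunker to long documents. The marker tokens ([QUERY], [CONTEXT], [DOC1], ...), the chunk sizes, and the helper names are illustrative assumptions rather than a fixed convention; adapt them to your tokenizer and your model's special-token vocabulary.

```python
# Sketch: building a marked-up RAG prompt with sliding-window chunking.
# Token names and sizes are illustrative, not a standard.

from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split a long document into overlapping word-level chunks (sliding window)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks


def format_rag_input(query: str, documents: List[str], max_doc_words: int = 200) -> str:
    """Assemble the query and retrieved documents into a single marked-up prompt."""
    parts = [f"[QUERY] {query}", "[CONTEXT]"]
    for i, doc in enumerate(documents, start=1):
        # Keep only the first window per document here for brevity; a real system
        # might keep several windows or summarize each document instead.
        chunks = chunk_text(doc, chunk_size=max_doc_words)
        snippet = chunks[0] if chunks else ""
        parts.append(f"[DOC{i}] {snippet}")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = format_rag_input(
        "What causes climate change?",
        ["Greenhouse gases such as CO2 trap heat in the atmosphere ...",
         "Deforestation reduces the planet's capacity to absorb carbon ..."],
    )
    print(prompt)
```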
Architectural Adjustments
Modifying the model’s architecture to handle retrieved documents often involves enhancing its ability to process multiple context sources. Cross-attention mechanisms—like those in Fusion-in-Decoder (FiD)—allow the model to process documents and queries in parallel. For example, documents can be encoded independently, and their representations are fused during decoding. Sparse or blockwise attention (used in models like Sparse Transformer) can reduce computational overhead when processing long documents. Another approach is adding adapter layers that specialize in integrating external context. For instance, a lightweight adapter module could be inserted between transformer layers to refine document representations before they interact with the query. Multi-head attention layers might also be reconfigured to weigh document tokens differently—e.g., using a bias term to upweight document-related attention scores.
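The PyTorch sketch below illustrates the FiD-style fusion pattern under simplified assumptions: each (query, document) pair is encoded independently, the encoder outputs are concatenated along the sequence axis, and the decoder's cross-attention fuses evidence from all documents during decoding. Dimensions, layer counts, and the class name are illustrative choices, not values from any published implementation.

```python
# Sketch of Fusion-in-Decoder (FiD)-style context integration in plain PyTorch.
# Hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn


class FiDStyleFusion(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, doc_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        # doc_ids: (batch, n_docs, doc_len) -- each row holds "[QUERY] ... [DOC] ..." token ids
        batch, n_docs, doc_len = doc_ids.shape

        # 1. Encode every (query, document) pair independently.
        flat = doc_ids.reshape(batch * n_docs, doc_len)
        encoded = self.encoder(self.embed(flat))            # (batch*n_docs, doc_len, d_model)

        # 2. Fuse: concatenate all document encodings along the sequence dimension.
        fused = encoded.reshape(batch, n_docs * doc_len, -1)

        # 3. The decoder cross-attends over the fused representation of all documents.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(target_ids.size(1))
        dec = self.decoder(self.embed(target_ids), memory=fused, tgt_mask=tgt_mask)
        return self.lm_head(dec)                            # (batch, tgt_len, vocab_size)


if __name__ == "__main__":
    model = FiDStyleFusion()
    docs = torch.randint(0, 32000, (2, 3, 64))    # 2 queries, 3 retrieved docs each
    targets = torch.randint(0, 32000, (2, 16))    # target answer tokens
    print(model(docs, targets).shape)             # torch.Size([2, 16, 32000])
```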
Training Strategies
Fine-tuning the model on tasks that require document use is critical. This includes training with multitask objectives—such as jointly predicting answers and document relevance scores—or using contrastive learning to distinguish useful documents from distractors. For example, during training, the model could receive a mix of relevant and irrelevant documents, learning to ignore noise. Datasets like MS MARCO or Natural Questions, which pair queries with retrieved passages, are ideal for this. Additionally, positional encoding schemes can be retrained to emphasize document-order importance (e.g., giving higher weight to the first document when the retrieved documents are ranked by relevance). Loss functions might also be adjusted to penalize over-reliance on the base model’s parametric knowledge when documents are provided, ensuring the model learns to “trust” the retrieved context.
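The snippet below sketches one way such a multitask objective could look in PyTorch, combining the usual answer-generation cross-entropy with a binary document-relevance loss so that distractor documents receive label 0. The loss weights, the label convention, and the assumption that the model exposes per-document relevance logits are illustrative, not prescribed by any particular framework.

```python
# Sketch of a multitask loss for retrieval-augmented fine-tuning:
# answer generation + document relevance scoring. Weights are illustrative.

import torch
import torch.nn.functional as F


def rag_multitask_loss(answer_logits: torch.Tensor,     # (batch, tgt_len, vocab)
                       answer_targets: torch.Tensor,    # (batch, tgt_len)
                       relevance_logits: torch.Tensor,  # (batch, n_docs)
                       relevance_labels: torch.Tensor,  # (batch, n_docs), 1 = relevant
                       alpha: float = 1.0,
                       beta: float = 0.5) -> torch.Tensor:
    """Combine answer-generation loss with a document-relevance loss."""
    # Token-level cross-entropy for answer generation (padding positions set to -100).
    gen_loss = F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        answer_targets.reshape(-1),
        ignore_index=-100,
    )
    # Binary relevance loss pushes the model to separate useful docs from distractors.
    rel_loss = F.binary_cross_entropy_with_logits(
        relevance_logits, relevance_labels.float()
    )
    return alpha * gen_loss + beta * rel_loss


if __name__ == "__main__":
    batch, tgt_len, vocab, n_docs = 2, 16, 32000, 4
    loss = rag_multitask_loss(
        torch.randn(batch, tgt_len, vocab),
        torch.randint(0, vocab, (batch, tgt_len)),
        torch.randn(batch, n_docs),
        torch.tensor([[1, 0, 0, 1], [0, 1, 0, 0]]),
    )
    print(loss.item())
```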
These changes collectively enable the model to efficiently parse, prioritize, and integrate external documents, improving accuracy and reducing hallucination.