Handling document updates in LlamaIndex involves managing changes to your source data while ensuring your indexes stay accurate and efficient. When documents are modified, you generally have three options: rebuild the index from scratch, update specific parts incrementally, or use versioning to track changes. The best approach depends on factors like the size of your dataset, how often updates occur, and whether you need historical data access. LlamaIndex provides tools for each scenario, but you’ll need to choose the method that aligns with your performance and maintenance requirements.
For small datasets or infrequent updates, rebuilding the entire index is straightforward. Use SimpleDirectoryReader to reload documents, detect changes (e.g., via file modification times), and recreate the index with VectorStoreIndex.from_documents().
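Here is a minimal sketch of that rebuild-on-change pattern. The latest_mtime helper and the last_build_time variable are illustrative, not LlamaIndex APIs, and a real pipeline would persist the timestamp between runs:

import os

from llama_index import SimpleDirectoryReader, VectorStoreIndex

def latest_mtime(path):
    # Newest modification time across every file under the data directory
    return max(
        os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )

last_build_time = 0.0  # persist this between runs in a real pipeline

if latest_mtime("data") > last_build_time:
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)  # full rebuild
    last_build_time = latest_mtime("data")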
However, rebuilding from scratch becomes inefficient for large datasets. For larger or frequently updated data, use incremental updates instead. LlamaIndex allows inserting or deleting documents in an existing index. For example, after initial indexing, use index.insert(Document(text="new content")) to add a document without rebuilding everything. If your storage backend (like a vector database) supports upserts, you can update embeddings for specific documents.
Versioning is another option: store document versions as metadata (e.g., version=2 or timestamp=2023-10-01) and filter queries by version to ensure you're accessing the latest data. This avoids overwriting old indexes, which is useful for auditing or rollbacks.
Here’s a practical example for incremental updates:
from llama_index import Document, SimpleDirectoryReader, VectorStoreIndex

# Initial indexing
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Replace an existing document: delete the stale version first if necessary,
# then insert the updated content under the same document ID
index.delete_ref_doc("doc_123", delete_from_docstore=True)
updated_doc = Document(text="updated content", doc_id="doc_123")
index.insert(updated_doc)
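When the updated document keeps the same ID, LlamaIndex can also perform the delete-and-reinsert for you in a single call:

index.update_ref_doc(updated_doc)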
For versioning, attach metadata during ingestion:

documents = [Document(text="content", metadata={"version": 1})]

When querying, pass a metadata filter to the query engine so only the desired version is retrieved. Note that LlamaIndex expects a MetadataFilters object here rather than a plain dict.
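A sketch of version-filtered querying, assuming the legacy llama_index package layout (in newer releases these classes live under llama_index.core.vector_stores) and a vector store that supports metadata filtering:

from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Restrict retrieval to documents tagged with the desired version
query_engine = index.as_query_engine(
    filters=MetadataFilters(filters=[ExactMatchFilter(key="version", value=1)])
)
response = query_engine.query("What does this version of the document say?")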
Choose full rebuilds for simplicity with small data, incremental updates for scalability, and versioning for traceability. Always test performance trade-offs, as re-indexing large datasets can be time-consuming.