LlamaIndex handles long-term storage of indexed documents by separating the storage of raw documents from the indexes built on top of them. This approach ensures scalability and flexibility, allowing developers to choose storage solutions that fit their needs. Documents are stored in a “document store,” which can be a local file system, cloud storage (like AWS S3 or Google Cloud Storage), or a database. Indexes, which are lightweight metadata structures optimized for querying, are stored separately in an “index store,” such as a vector database or a simple file-based system. This separation ensures that large documents don’t burden the indexing layer and that updates to documents or indexes can be managed independently.
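This separation can be sketched in plain Python. The two classes below are illustrative stand-ins (not LlamaIndex APIs) for a document store holding raw content and an index store holding only lightweight, query-optimized metadata, here a simple keyword map:

```python
class DocumentStore:
    """Stand-in for a document store (file system, S3, database).
    Holds raw document content keyed by a unique doc ID."""
    def __init__(self):
        self._docs = {}

    def put(self, doc_id, text):
        self._docs[doc_id] = text

    def get(self, doc_id):
        return self._docs[doc_id]


class IndexStore:
    """Stand-in for an index store. Holds only lightweight metadata,
    here a keyword -> set-of-doc-IDs map, never the raw documents."""
    def __init__(self):
        self._keyword_to_ids = {}

    def index(self, doc_id, text):
        for word in set(text.lower().split()):
            self._keyword_to_ids.setdefault(word, set()).add(doc_id)

    def lookup(self, keyword):
        return self._keyword_to_ids.get(keyword.lower(), set())


doc_store = DocumentStore()
index_store = IndexStore()

doc_store.put("doc-1", "LlamaIndex separates storage concerns")
index_store.index("doc-1", "LlamaIndex separates storage concerns")

# The index answers with references only; content is fetched separately.
ids = index_store.lookup("storage")
texts = [doc_store.get(i) for i in ids]
print(texts)  # ['LlamaIndex separates storage concerns']
```

Because the index holds only references, either side can be swapped out (e.g., the dict replaced by S3 or a vector database) without touching the other.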
For example, if a developer uses AWS S3 as the document store, all raw PDFs, text files, or other data formats are uploaded to an S3 bucket. LlamaIndex references these documents via unique identifiers or paths, while the actual indexing—like vector embeddings or keyword-based metadata—is stored in a separate system like Pinecone or Chroma. When a query is executed, LlamaIndex retrieves relevant document references from the index store and fetches the actual content from the document store. This decoupling allows developers to scale storage for raw data (e.g., using cost-effective cloud storage) while keeping indexes in high-performance databases optimized for fast retrieval.
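The two-step query path described above (look up references in the index store, then fetch content from the document store) can be illustrated with hand-made stand-ins; the embedding vectors, S3-style paths, and store dicts below are toy assumptions, not real Pinecone or S3 clients:

```python
import math

# Index store: doc reference -> embedding vector (toy 3-dimensional vectors).
index_store = {
    "s3://bucket/a.txt": [1.0, 0.0, 0.0],
    "s3://bucket/b.txt": [0.0, 1.0, 0.0],
}
# Document store: doc reference -> raw content.
document_store = {
    "s3://bucket/a.txt": "Invoice details for Q1",
    "s3://bucket/b.txt": "Meeting notes from March",
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def query(query_vector, top_k=1):
    # Step 1: rank document references in the index store by similarity.
    ranked = sorted(index_store,
                    key=lambda d: cosine(index_store[d], query_vector),
                    reverse=True)
    # Step 2: fetch the actual content from the document store.
    return [document_store[d] for d in ranked[:top_k]]

results = query([0.9, 0.1, 0.0])
print(results)  # ['Invoice details for Q1']
```

The index store never sees document bodies, so it can live in a fast, memory-priced vector database while the bulky raw files sit in cheap object storage.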
LlamaIndex also supports versioning and updates through incremental indexing. When a document is modified, the system can detect changes (e.g., via file timestamps or checksums) and update only the affected parts of the index. For instance, if a user edits a text file stored in a PostgreSQL database, LlamaIndex can re-index the updated sections without rebuilding the entire index. Additionally, developers can configure retention policies or backup strategies for both document and index stores, ensuring data durability. By abstracting storage backends, LlamaIndex lets teams adapt to evolving requirements—like switching from a local file system to cloud storage—without overhauling their entire indexing pipeline.
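A minimal sketch of checksum-based change detection, the mechanism mentioned above for deciding what to re-index; the function and store names are hypothetical simplifications, and a real pipeline would re-embed only the flagged documents:

```python
import hashlib

def checksum(text):
    """Content fingerprint; any edit changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Checksums recorded the last time each document was indexed.
known_checksums = {}

def docs_to_reindex(documents):
    """Return IDs of documents that are new or whose content changed,
    updating the recorded checksums as a side effect."""
    changed = []
    for doc_id, text in documents.items():
        digest = checksum(text)
        if known_checksums.get(doc_id) != digest:
            changed.append(doc_id)
            known_checksums[doc_id] = digest
    return changed

docs = {"a.txt": "hello", "b.txt": "world"}
first_pass = docs_to_reindex(docs)   # everything is new on the first run
print(first_pass)                    # ['a.txt', 'b.txt']

docs["b.txt"] = "world, revised"
second_pass = docs_to_reindex(docs)  # only the edited file is flagged
print(second_pass)                   # ['b.txt']
```

File timestamps would work the same way as the digest here, trading accuracy (a touch without an edit triggers re-indexing) for the cost of hashing content.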