To implement real-time updates in Haystack, you need to ensure your document store supports immediate indexing and use appropriate methods for adding or modifying data. Haystack’s architecture separates document storage from retrieval, so real-time updates depend on the document store you choose (e.g., Elasticsearch, OpenSearch, or InMemoryDocumentStore). For most production scenarios, Elasticsearch or OpenSearch are recommended because they natively support near-real-time indexing, typically making documents searchable within 1-2 seconds after insertion. The key is to use the document store’s write methods correctly and handle updates atomically.
Start with the write_documents() method provided by your document store class. For example, with ElasticsearchDocumentStore, call document_store.write_documents(docs, duplicate_documents="overwrite") to add or update documents. Setting the duplicate_documents parameter to “overwrite” ensures that existing documents with matching IDs are replaced instead of duplicated. After writing, trigger an explicit index refresh using document_store.refresh(), if your Haystack version exposes it, to make changes immediately visible to search pipelines; in Haystack 1.x, ElasticsearchDocumentStore also accepts a refresh_type constructor argument that controls whether a write waits for the refresh. If you’re using a version of Elasticsearch older than 7.0, you may need to adjust the index refresh interval setting (index.refresh_interval) in Elasticsearch itself for faster visibility. For deletions, use document_store.delete_documents(ids=[...]) followed by a refresh.
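To make the write, refresh, and delete sequence concrete, here is a minimal sketch of the visibility semantics using only the Python standard library. ToyNearRealTimeStore is a hypothetical stand-in written for this illustration, not a Haystack or Elasticsearch class; it only borrows the method names and the duplicate_documents option values ("overwrite", "skip", "fail") mentioned above. Writes are buffered and become searchable only after refresh(), which is roughly how a near-real-time index behaves.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNearRealTimeStore:
    """Hypothetical stand-in for a near-real-time index (not a Haystack class)."""

    _visible: dict = field(default_factory=dict)  # documents that searches can see
    _pending: dict = field(default_factory=dict)  # buffered changes: id -> text, or None for a delete

    def write_documents(self, docs, duplicate_documents="overwrite"):
        for doc in docs:
            doc_id = doc["id"]
            # A document "exists" if it is visible or pending and not marked deleted.
            exists = self._pending.get(doc_id, self._visible.get(doc_id)) is not None
            if exists and duplicate_documents == "skip":
                continue
            if exists and duplicate_documents == "fail":
                raise ValueError(f"duplicate document id: {doc_id}")
            self._pending[doc_id] = doc["content"]  # "overwrite" replaces by ID

    def delete_documents(self, ids):
        for doc_id in ids:
            self._pending[doc_id] = None  # tombstone, applied on the next refresh

    def refresh(self):
        # Apply all buffered changes so they become visible to search.
        for doc_id, content in self._pending.items():
            if content is None:
                self._visible.pop(doc_id, None)
            else:
                self._visible[doc_id] = content
        self._pending.clear()

    def search(self, term):
        # Naive substring match; returns the IDs of matching visible documents.
        return sorted(i for i, text in self._visible.items() if term.lower() in text.lower())


store = ToyNearRealTimeStore()
store.write_documents([{"id": "a", "content": "Haystack basics"}])
before_refresh = store.search("haystack")  # written but not yet refreshed
store.refresh()
after_refresh = store.search("haystack")

# Same ID with new content: the old text is replaced, not duplicated.
store.write_documents([{"id": "a", "content": "Updated guide"}], duplicate_documents="overwrite")
store.refresh()
old_term_hits = store.search("haystack")
new_term_hits = store.search("guide")

store.delete_documents(ids=["a"])
store.refresh()
after_delete = store.search("guide")
```

In this toy model the first search returns nothing until refresh() runs, which is the gap an explicit refresh is meant to close. A real document store handles buffering and refresh internally, so treat this only as a mental model of the behavior.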
If you’re using an in-memory store like InMemoryDocumentStore, updates are visible immediately but volatile (data is lost on restart). For FAISS-backed stores, newly written documents cannot be retrieved by embedding until their vectors are computed, typically with document_store.update_embeddings(retriever), and you need to call document_store.save() to persist the index to disk, so updates aren’t truly real-time. In such cases, consider combining a SQL database for metadata with a vector store, using database triggers or a message queue (e.g., RabbitMQ) to notify your application of changes. Always test update visibility by querying immediately after writes in your pipeline; this confirms that retriever components like BM25Retriever or EmbeddingRetriever have access to the latest data.
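One way to test update visibility is to poll a search right after a write and measure how long the new document takes to appear. The sketch below uses only the standard library; wait_until_visible and fake_search are hypothetical names introduced for this example. In a real setup, the search callable would wrap your retriever or pipeline query and return the IDs of the documents it found.

```python
import time
from typing import Callable, Iterable


def wait_until_visible(search: Callable[[], Iterable[str]], doc_id: str,
                       timeout: float = 5.0, interval: float = 0.1) -> float:
    """Poll `search` until `doc_id` appears and return the seconds waited.

    Raises TimeoutError if the document is still missing after `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        if doc_id in set(search()):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"{doc_id!r} not searchable after {timeout}s")
        time.sleep(interval)


# Simulated index: the document becomes searchable about 0.3 s after the "write".
visible_at = time.monotonic() + 0.3


def fake_search():
    return ["doc-1"] if time.monotonic() >= visible_at else []


waited = wait_until_visible(fake_search, "doc-1", timeout=2.0, interval=0.05)
print(f"doc-1 became searchable after {waited:.2f}s")
```

Running a check like this against your actual store after each kind of change (insert, overwrite, delete) gives a measured visibility delay rather than an assumed one, and the timeout turns a silent indexing problem into a clear failure.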