Haystack manages batch processing of documents through its pipeline architecture and built-in components designed to handle multiple documents efficiently. The framework processes batches by passing groups of documents through a sequence of nodes (like converters, preprocessors, or retrievers) in a structured workflow. Each node in a pipeline is optimized to operate on batches, reducing overhead from repeated operations and leveraging hardware acceleration where possible. For example, when using text embedding models, processing multiple documents at once allows better utilization of GPU resources, as neural networks often compute predictions faster in batches.
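A minimal sketch of such a pipeline, assuming the Haystack 1.x node API (the file paths and split settings below are illustrative placeholders, not values from a real project):

```python
# Minimal indexing pipeline sketch, assuming Haystack 1.x ("farm-haystack").
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_node(component=TextConverter(), name="Converter", inputs=["File"])
indexing.add_node(
    component=PreProcessor(split_by="word", split_length=200, split_overlap=20),
    name="PreProcessor",
    inputs=["Converter"],
)
indexing.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

# Every file in the list flows through the same Converter -> PreProcessor ->
# DocumentStore sequence as one workload, rather than one file at a time.
indexing.run(file_paths=["report_a.txt", "report_b.txt", "report_c.txt"])
```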
A key example is Haystack’s `PreProcessor` class, which splits documents into smaller text chunks. If you pass 100 documents to a pipeline with a `PreProcessor` node configured with a `batch_size` of 10, the node will process 10 documents at a time, splitting each into paragraphs or sentences. This approach balances memory usage and speed, avoiding bottlenecks from processing all documents at once. Similarly, when using embedding models in a `Retriever`, batch processing allows converting text chunks into vector embeddings in bulk, which is significantly faster than embedding one chunk at a time. Developers can control batch sizes using parameters like `batch_size` in nodes or adjust them based on available memory.
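For the embedding case, a sketch under the same Haystack 1.x assumption (the sentence-transformers model name and batch size are illustrative choices):

```python
# Batched embedding sketch, assuming Haystack 1.x.
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = InMemoryDocumentStore(embedding_dim=384)
document_store.write_documents(
    [Document(content=f"text chunk {i}") for i in range(100)]
)

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=10,  # embed 10 texts per model forward pass
)

# Compute and store embeddings in batches instead of one chunk at a time.
document_store.update_embeddings(retriever, batch_size=10)
```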
Haystack also integrates with document stores (like Elasticsearch or FAISS) that support batch indexing. After processing, documents or embeddings are saved in batches rather than individually, reducing the number of write operations. For instance, calling `DocumentStore.write_documents()` with a list of 500 documents inserts them in bulk, which is faster than 500 separate insertions. This is particularly useful when ingesting large datasets. By combining pipeline-level batching with storage optimizations, Haystack handles document processing efficiently at scale while letting developers fine-tune batch sizes for their specific hardware and use case.
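For concreteness, a bulk-write sketch, assuming Haystack 1.x and an Elasticsearch instance running locally (the host, index name, and batch size are placeholders):

```python
# Bulk indexing sketch, assuming Haystack 1.x with Elasticsearch on localhost.
from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="docs")

docs = [Document(content=f"document {i}") for i in range(500)]

# One write_documents() call issues bulk requests (batch_size documents per
# request) instead of 500 individual insertions.
document_store.write_documents(docs, batch_size=250)
```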