To perform data ingestion in Haystack, you need to convert raw data into a structured format and load it into a document store. Haystack handles data through `Document` objects, which represent individual pieces of content (like text, PDFs, or images) along with metadata. The process typically involves three steps: preprocessing the data, converting it into `Document` objects, and writing these to a document store. For example, if you have a folder of text files, you'd first read the files, split them into manageable chunks, and create `Document` instances with content and metadata such as filenames or timestamps.
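A minimal sketch of this first step, assuming Haystack 1.x and a hypothetical `data/` folder of `.txt` files (the folder name and metadata fields are illustrative, not part of any required schema):

```python
import os
from datetime import datetime, timezone

from haystack import Document  # Haystack 1.x import path

docs = []
for filename in os.listdir("data"):  # "data" is a placeholder directory of .txt files
    if not filename.endswith(".txt"):
        continue
    path = os.path.join("data", filename)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Attach the raw text plus metadata such as the filename and an ingestion timestamp
    docs.append(
        Document(
            content=text,
            meta={
                "filename": filename,
                "ingested_at": datetime.now(timezone.utc).isoformat(),
            },
        )
    )
```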
Start by preprocessing your data to fit your use case. Use Haystack's `PreProcessor` class to clean text, split long documents into smaller chunks, or remove unnecessary elements. For instance, you might split a 10-page PDF into 10 individual `Document` objects, each representing a page. The `PreProcessor` lets you define parameters like `split_length` (number of words per chunk) and `split_overlap` (overlap between chunks). If your data comes from multiple sources (APIs, databases, files), use Haystack's `convert_files_to_docs` utility for file-based data or custom scripts to handle structured data. For example, `convert_files_to_docs(dir_path="data", split_paragraphs=True)` converts all files in a directory into documents with paragraph-level splits.
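A short sketch of this preprocessing step, again assuming Haystack 1.x; the chunk sizes are illustrative defaults, not recommendations:

```python
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs

# Convert every file in the (placeholder) "data" directory into Document objects,
# splitting on paragraph boundaries as a first pass
raw_docs = convert_files_to_docs(dir_path="data", split_paragraphs=True)

# Clean and chunk the documents further; split_length/split_overlap values are illustrative
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=200,   # words per chunk
    split_overlap=20,   # words shared between consecutive chunks
    split_respect_sentence_boundary=True,
)
processed_docs = preprocessor.process(raw_docs)
```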
Finally, write the processed `Document` objects to a Haystack-supported document store such as Elasticsearch, FAISS, or `InMemoryDocumentStore`. Use the `document_store.write_documents()` method to add data. If you're using a retrieval pipeline, ensure the document store is compatible with your retriever (e.g., `ElasticsearchRetriever` requires Elasticsearch). For example, after initializing an `ElasticsearchDocumentStore`, call `document_store.write_documents(docs)` to index the data. To automate ingestion, combine these steps into a pipeline using Haystack's `Pipeline` class, adding components like file converters, preprocessors, and document writers. This structured approach ensures your data is search-ready and optimized for tasks like question answering or semantic search.
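Here is a minimal sketch of the final step, using `InMemoryDocumentStore` for simplicity (swap in `ElasticsearchDocumentStore(host="localhost", port=9200)` or another store as needed); the file path is a placeholder:

```python
from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, TextConverter

document_store = InMemoryDocumentStore()

# Write documents directly, e.g. the processed chunks from the previous step
document_store.write_documents([Document(content="Haystack ingestion example.")])

# Or automate the whole flow with an indexing pipeline:
# file converter -> preprocessor -> document store
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=TextConverter(), name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(
    component=PreProcessor(split_by="word", split_length=200, split_overlap=20),
    name="PreProcessor",
    inputs=["TextConverter"],
)
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
indexing_pipeline.run(file_paths=["data/example.txt"])  # placeholder file path
```

Running the pipeline converts, preprocesses, and indexes the listed files in one call, which is the pattern to reuse whenever new files arrive.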