How do I perform data ingestion in Haystack?

To perform data ingestion in Haystack, you convert raw data into a structured format and load it into a document store. Haystack handles data through Document objects, each representing an individual piece of content (such as text extracted from plain files, PDFs, or images) together with metadata. The process typically involves three steps: converting raw data into Document objects, preprocessing those documents (cleaning and splitting them), and writing the results to a document store. For example, if you have a folder of text files, you'd read each file, split it into manageable chunks, and create Document instances holding the content plus metadata like the filename or a timestamp.
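As a minimal sketch of these three steps, assuming Haystack 1.x, a data/ folder of plain-text files, and an illustrative metadata key:

```python
import os

from haystack import Document
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

docs = []
for filename in os.listdir("data"):
    with open(os.path.join("data", filename), encoding="utf-8") as f:
        text = f.read()
    # Wrap the raw text in a Document, attaching the filename as metadata.
    docs.append(Document(content=text, meta={"name": filename}))

# Load the structured documents into the store.
document_store.write_documents(docs)
```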

Start by converting your data into Document objects and preprocessing them to fit your use case. For file-based data, Haystack's convert_files_to_docs utility handles conversion in one call: convert_files_to_docs(dir_path="data", split_paragraphs=True) turns every file in a directory into documents split at the paragraph level. If your data comes from structured sources such as APIs or databases, write a custom script that builds Document objects directly. Then use Haystack's PreProcessor class to clean text, remove unnecessary elements, and split long documents into smaller chunks; for instance, you might split a 10-page PDF into 10 individual Document objects, each representing a page. The PreProcessor lets you define parameters like split_length (number of words per chunk) and split_overlap (number of words shared between consecutive chunks).
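A sketch of this conversion-plus-preprocessing step using Haystack 1.x's convert_files_to_docs and PreProcessor; the parameter values here are illustrative, not tuned recommendations:

```python
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs

# Turn every file in the directory into Document objects,
# splitting on paragraph boundaries.
docs = convert_files_to_docs(dir_path="data", split_paragraphs=True)

preprocessor = PreProcessor(
    clean_whitespace=True,                 # strip redundant whitespace
    split_by="word",                       # chunk by word count
    split_length=200,                      # ~200 words per chunk
    split_overlap=20,                      # 20-word overlap between chunks
    split_respect_sentence_boundary=True,  # avoid cutting sentences mid-way
)
chunks = preprocessor.process(docs)
```

The overlap means context that straddles a chunk boundary remains retrievable from either side.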

Finally, write the processed Document objects to a Haystack-supported document store like Elasticsearch, FAISS, or InMemoryDocumentStore. Use the document_store.write_documents() method to add data. If you’re using a retrieval pipeline, ensure the document store is compatible with your retriever (e.g., ElasticsearchRetriever requires Elasticsearch). For example, after initializing an ElasticsearchDocumentStore, call document_store.write_documents(docs) to index the data. To automate ingestion, combine these steps into a pipeline using Haystack’s Pipeline class, adding components like file converters, preprocessors, and document writers. This structured approach ensures your data is search-ready and optimized for tasks like question answering or semantic search.
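One way such an indexing pipeline might look in Haystack 1.x, assuming an Elasticsearch instance reachable at localhost:9200; the index name and the file path data/report.txt are placeholders:

```python
from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TextConverter

document_store = ElasticsearchDocumentStore(
    host="localhost", port=9200, index="documents"
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=TextConverter(),
                           name="Converter", inputs=["File"])
indexing_pipeline.add_node(component=PreProcessor(split_by="word", split_length=200),
                           name="PreProcessor", inputs=["Converter"])
# The document store itself acts as the final writer node.
indexing_pipeline.add_node(component=document_store,
                           name="DocumentStore", inputs=["PreProcessor"])

# Each file is converted, chunked, and indexed in one pass.
indexing_pipeline.run(file_paths=["data/report.txt"])
```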
