To use Haystack for document search with natural language queries, you start by setting up a pipeline that connects a document store, a retriever, and optionally a reader. Haystack is designed to handle large-scale document collections and allows you to perform semantic search using transformer-based models. The core workflow involves storing documents in a search-optimized database, using a retriever to find relevant passages, and applying a reader (like a QA model) to extract precise answers if needed. This approach combines traditional keyword-based search with modern neural methods for accurate results.
First, prepare your documents and load them into a Haystack DocumentStore, such as ElasticsearchDocumentStore, FAISSDocumentStore, or InMemoryDocumentStore. For example, you might split a collection of PDFs or text files into smaller chunks using Haystack's preprocessing tools. Next, choose a retriever: options include BM25Retriever (keyword-based) or dense retrievers like EmbeddingRetriever (which uses sentence-transformers models). If you need answers extracted directly from text, add a Reader component (e.g., a RoBERTa QA model). A basic pipeline looks like this in code:
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader

# all-MiniLM-L6-v2 produces 384-dim embeddings, so match the store's embedding_dim
document_store = InMemoryDocumentStore(embedding_dim=384)
# Load documents into the store (e.g., using a TextIndexingPipeline or write_documents)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)  # compute and store document embeddings
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
This pipeline retrieves documents using semantic similarity and extracts answers from them.
Finally, run queries by passing natural language questions to the pipeline. For example:
results = pipeline.run(query="What is climate change?", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}})
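The result is a dictionary whose "answers" list holds the extracted answers; a minimal sketch of inspecting them, assuming Haystack 1.x's Answer objects with answer and score fields:

# Print each extracted answer alongside its confidence score
for answer in results["answers"]:
    print(answer.answer, answer.score)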
Adjust parameters like top_k to balance speed and accuracy. If you don't need answer extraction, skip the reader and use the retriever alone. For better performance, preprocess text (remove noise, split long documents) and experiment with models: try multi-qa-mpnet-base-dot-v1 for retrieval or larger readers like BERT-large. Haystack's flexibility allows integration with custom models and databases, making it adaptable to specific use cases. A retriever-only setup with preprocessing might look like the sketch below.
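This minimal sketch splits raw text into chunks with Haystack's PreProcessor and searches them with multi-qa-mpnet-base-dot-v1; the input file name and split settings are illustrative assumptions rather than required values:

from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, PreProcessor

# Split long documents into overlapping ~200-word chunks (illustrative settings)
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
)
raw_docs = [Document(content=open("my_corpus.txt").read())]  # hypothetical input file
chunks = preprocessor.process(raw_docs)

# mpnet models emit 768-dim vectors and are trained for dot-product similarity,
# which matches InMemoryDocumentStore's defaults
document_store = InMemoryDocumentStore(embedding_dim=768)
document_store.write_documents(chunks)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)

# Retriever-only pipeline: returns ranked documents instead of extracted answers
search = Pipeline()
search.add_node(component=retriever, name="Retriever", inputs=["Query"])
results = search.run(query="What is climate change?", params={"Retriever": {"top_k": 5}})
for doc in results["documents"]:
    print(doc.content[:100])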