To query a document store with the Haystack API, you initialize a document store, load documents into it, and then configure a retrieval pipeline. Haystack provides unified interfaces for different document stores (such as Elasticsearch, FAISS, or InMemory) and for retrievers (sparse or dense). The core workflow is to define your document store, add data, and use a retriever component to search the documents for your query. The design is modular, so you can swap components without rewriting your entire codebase.
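First, set up your document store and populate it. For example, using the InMemoryDocumentStore, you can create documents with text and metadata. The snippets in this article use the Haystack 1.x API (the haystack.nodes module and Pipeline.add_node). Here’s a simplified example: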
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
docs = [
    Document(content="Haystack supports multiple retrievers", meta={"source": "docs"}),
    Document(content="InMemoryDocumentStore is for testing", meta={"source": "tutorial"}),
]
document_store.write_documents(docs)
Next, choose a retriever. To use a sparse retriever such as TF-IDF, link it to the document store:
# In Haystack 1.x the class is spelled TfidfRetriever
from haystack.nodes import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)
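To make the idea of sparse retrieval concrete, here is a minimal pure-Python sketch of TF-IDF ranking over a small list of texts. It is not Haystack's implementation: the function names `tokenize` and `tfidf_rank`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def tfidf_rank(query, docs, top_k=3):
    """Score each text by summed TF-IDF of the query terms; return the top_k."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    def idf(term):
        # Smoothed inverse document frequency, always positive.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    query_terms = tokenize(query)
    scored = []
    for doc, tokens in zip(docs, tokenized):
        tf = Counter(tokens)
        score = sum(tf[t] / len(tokens) * idf(t) for t in query_terms if t in tf)
        scored.append((score, doc))

    # Highest score first, truncated to top_k results.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


texts = [
    "Haystack supports multiple retrievers",
    "InMemoryDocumentStore is for testing",
]
ranked = tfidf_rank("multiple retrievers", texts, top_k=1)
```

A sparse retriever of this kind rewards exact term overlap, which is why it needs no model or embeddings but cannot match synonyms or paraphrases.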
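For dense retrieval (e.g., with EmbeddingRetriever), you would specify an embedding model such as sentence-transformers/all-MiniLM-L6-v2 and compute the document embeddings with document_store.update_embeddings(retriever) before querying.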
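Dense retrieval ranks documents by vector similarity instead of term overlap. The sketch below shows that ranking step with cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real model embeddings, and `cosine` and `dense_rank` are hypothetical helper names, not Haystack functions.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_rank(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (similarity, doc_id) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings produced by a model (invented values).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
top = dense_rank(query_vec, doc_vecs, top_k=1)
```

In a real setup the embedding model produces both the document vectors (at indexing time) and the query vector (at query time), which is why the document embeddings must exist before the first query.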
Finally, execute queries using a pipeline. A basic retrieval pipeline looks like this:
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
results = pipeline.run(query="What document stores are supported?", params={"retriever": {"top_k": 3}})
The results contain the retrieved documents ranked by relevance. results["documents"] is a list of Document objects; read each one's text via document.content and its relevance score via document.score. Adjust top_k to control the number of results. If you need filtering (e.g., by metadata), most document stores support filters={"source": ["docs"]} in the query parameters passed to the retriever, though not every retriever implements filtering.
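The filter shape above maps each metadata key to a list of accepted values. The following sketch illustrates that matching rule on plain dictionaries; `matches` and `filter_docs` are hypothetical helpers written for this example, and real document stores may support richer filter syntax than shown here.

```python
def matches(meta, filters):
    # A document passes if, for every filter key, its meta value is in the allowed list.
    return all(meta.get(key) in allowed for key, allowed in filters.items())


def filter_docs(docs, filters):
    """Keep only the documents whose metadata satisfies every filter."""
    return [d for d in docs if matches(d["meta"], filters)]


# Plain dicts standing in for Document objects (an assumption for this sketch).
docs = [
    {"content": "Haystack supports multiple retrievers", "meta": {"source": "docs"}},
    {"content": "InMemoryDocumentStore is for testing", "meta": {"source": "tutorial"}},
]
kept = filter_docs(docs, {"source": ["docs"]})
```

Filtering narrows the candidate set before ranking, so top_k applies only to documents that pass the filter.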
Key considerations include matching the retriever type to your document store (for example, sparse keyword retrieval pairs naturally with Elasticsearch, while FAISS stores only support dense, embedding-based retrieval) and making sure your documents are preprocessed (split into paragraphs and cleaned). For advanced use cases, combine retrievers in an ensemble or add a reader node for full question answering. Check Haystack’s documentation for store-specific configuration such as indexing settings in Elasticsearch or GPU acceleration for dense retrievers.
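As a rough illustration of the preprocessing step, the sketch below splits raw text into paragraphs on blank lines and normalizes whitespace. `split_paragraphs` is a hypothetical helper written for this example and does not reproduce Haystack's own preprocessing component or its options.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines, collapse whitespace, and drop empty chunks."""
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return [p for p in cleaned if p]


raw = "First  para\nline two.\n\n\n  Second para.  \n"
chunks = split_paragraphs(raw)
```

Each resulting chunk can then be wrapped in its own document, so the retriever returns focused passages rather than whole files.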