Haystack handles full-text search by combining document processing, indexing, and query pipelines to enable efficient text-based retrieval. At its core, Haystack relies on inverted indexes—a data structure that maps terms (like words or phrases) to the documents containing them. These indexes are typically managed by integrated search engines like Elasticsearch, OpenSearch, or databases such as SQLite or PostgreSQL with full-text extensions. For example, when a document is added to Haystack, it is split into smaller text units (e.g., paragraphs), tokenized, and indexed. This allows the system to quickly locate documents matching user queries by scanning the index instead of parsing raw text files repeatedly.
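To make the idea concrete, here is a minimal, framework-agnostic sketch of an inverted index in Python. The sample documents and query are invented for illustration; this is not Haystack's internal implementation, just the underlying data structure it relies on.

```python
from collections import defaultdict

# Toy corpus: document ID -> raw text (placeholders for illustration).
docs = {
    1: "Haystack builds search pipelines",
    2: "Inverted indexes map terms to documents",
    3: "Search engines scan the index instead of raw text",
}

# Build the index: term -> set of document IDs containing that term.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Query: look up each term's posting list and intersect them.
query_terms = "search index".lower().split()
matches = set.intersection(*(inverted_index.get(t, set()) for t in query_terms))
print(matches)  # {3}
```

Because the lookup touches only the posting lists for the query terms, the cost scales with the number of matching documents rather than the size of the whole corpus.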
Document processing in Haystack involves several steps to prepare text for search. Files in formats like PDFs, Word documents, or HTML are first converted to plain text using built-in converters (e.g., PDFToTextConverter). The text is then split into manageable chunks using splitters, which ensure context is preserved. Tokenization and normalization (e.g., lowercasing, removing stopwords) are applied to standardize terms. For instance, a 100-page PDF might be split into 500 text chunks, each indexed with metadata like the source file name and position. This structured approach ensures efficient storage and retrieval, even for large datasets. Developers can customize preprocessing steps, such as adding custom filters or using language-specific stemmers, to improve relevance.
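As a rough illustration, a conversion and splitting step using the Haystack 1.x API might look like the sketch below. The file name, metadata, and chunk sizes are placeholders, and exact class names and parameters differ between Haystack versions.

```python
from haystack.nodes import PDFToTextConverter, PreProcessor

# Convert a PDF into plain-text Document objects, attaching source metadata.
# "report.pdf" is a placeholder path for this example.
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
docs = converter.convert(file_path="report.pdf", meta={"source": "report.pdf"})

# Split into overlapping word-based chunks so context is preserved across boundaries.
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
)
chunks = preprocessor.process(docs)
print(len(chunks))  # number of indexed text units produced from the PDF
```

Each resulting chunk carries the original document's metadata, which is what later allows results to be traced back to a source file and position.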
When handling queries, Haystack processes the search terms through the same tokenization and normalization steps used during indexing. The system then scans the inverted index for documents containing the query terms and ranks them with algorithms like BM25 for keyword-based search, or applies semantic similarity models (e.g., transformer-based embeddings). For example, a search for “machine learning frameworks” might return documents mentioning “TensorFlow” or “PyTorch” if semantic search is enabled. Haystack’s pipelines let developers combine keyword and vector-based retrieval, apply filters (e.g., date ranges), or rerank results. This flexibility delivers precise results while maintaining performance, even for complex queries across millions of documents.
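For keyword retrieval specifically, a minimal query pipeline under the Haystack 1.x API could look like the following sketch. The sample documents and top_k value are illustrative; an embedding retriever or reranker would slot into the same pipeline pattern.

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever
from haystack.pipelines import DocumentSearchPipeline

# Index a few sample documents in an in-memory store with BM25 scoring enabled.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    Document(content="TensorFlow is a machine learning framework."),
    Document(content="PyTorch is widely used for deep learning."),
    Document(content="Pandas is a data analysis library."),
])

# BM25 ranks documents by keyword relevance to the query terms.
retriever = BM25Retriever(document_store=document_store)
pipeline = DocumentSearchPipeline(retriever)
results = pipeline.run(
    query="machine learning frameworks",
    params={"Retriever": {"top_k": 2}},
)
for doc in results["documents"]:
    print(doc.score, doc.content)
```

Swapping the in-memory store for Elasticsearch or OpenSearch keeps the same pipeline code while delegating the inverted index to the external engine.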
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.