Haystack handles tokenization and text preprocessing through a combination of modular components and integration with established NLP libraries. The framework delegates tokenization tasks to specialized tools like spaCy or Hugging Face’s tokenizers, depending on the pipeline configuration. For example, when processing text for transformer-based models (e.g., BERT), Haystack uses Hugging Face’s AutoTokenizer
to split input into subwords or tokens that match the model’s requirements. This ensures compatibility with pretrained models while avoiding reinventing low-level tokenization logic. Developers can also swap tokenizers—such as using spaCy for rule-based word splitting—to align with specific language rules or domain needs.
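As a rough illustration of the kind of subword tokenization such pipelines rely on under the hood, the snippet below loads a BERT tokenizer directly through the transformers library; the model name and sample text are placeholder assumptions, not values Haystack prescribes.

```python
# Illustrative only: model name and sample text are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Haystack preprocesses documents before retrieval."
tokens = tokenizer.tokenize(text)                 # subword tokens matching BERT's vocabulary
encoded = tokenizer(text, truncation=True, max_length=32)

print(tokens)
print(encoded["input_ids"])                        # the IDs the model actually consumes
```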
Text preprocessing in Haystack is managed through customizable pipelines. The PreProcessor class provides utilities for cleaning and segmenting documents into smaller units, such as paragraphs or sentences, which are critical for tasks like retrieval or question answering. For instance, a document might be split into 150-word chunks with a 50-word overlap to ensure context isn't lost between segments. The preprocessor can remove extra whitespace, filter short texts, or split based on specific criteria (e.g., split_by="word" or split_by="sentence"). Developers can extend these features by adding custom cleanup functions, such as stripping HTML tags or normalizing Unicode characters, to handle specialized data sources like web pages or user-generated content.
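A minimal sketch of this kind of preprocessing is shown below, based on Haystack 1.x's PreProcessor node; exact parameter names can vary between releases, and the HTML-stripping helper is a hypothetical example of a custom cleanup step rather than a built-in feature.

```python
import re

from haystack.nodes import PreProcessor
from haystack.schema import Document


def strip_html(text: str) -> str:
    # Hypothetical custom cleanup step: crude HTML tag removal for web pages
    return re.sub(r"<[^>]+>", " ", text)


raw_text = "<p>Example page content scraped from the web.</p>" * 200

preprocessor = PreProcessor(
    clean_whitespace=True,                  # collapse extra whitespace
    clean_empty_lines=True,
    split_by="word",                        # split on word counts, as described above
    split_length=150,                       # ~150-word chunks
    split_overlap=50,                       # 50-word overlap between consecutive chunks
    split_respect_sentence_boundary=False,
)

chunks = preprocessor.process([Document(content=strip_html(raw_text))])
print(len(chunks), chunks[0].content[:80])
```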
Haystack’s design emphasizes flexibility, allowing tokenization and preprocessing steps to adapt to different components in a workflow. For example, a retriever like Elasticsearch’s BM25 might use simple whitespace tokenization for keyword matching, while a dense retriever like DPR requires the same tokenization used during its model training. By decoupling these steps, Haystack lets developers mix and match tools—such as combining spaCy for sentence splitting with a Hugging Face tokenizer for transformer models—without forcing a one-size-fits-all approach. This modularity ensures that preprocessing remains efficient and tailored to the specific requirements of each pipeline stage, whether indexing documents or parsing queries.
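To illustrate how such tools can be combined, the sketch below uses spaCy for sentence splitting and a Hugging Face tokenizer (a DPR question-encoder tokenizer is used here purely as an example) to count how many subword tokens each sentence would occupy in a dense retriever's input. The component choices are assumptions for illustration, not a fixed Haystack recipe.

```python
# Assumed, illustrative component choices; not a prescribed Haystack pipeline.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
dpr_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)

text = (
    "Sparse retrievers can match on simple keyword tokens. "
    "Dense retrievers must reuse the tokenization their model was trained with."
)

# spaCy handles linguistically informed sentence splitting ...
sentences = [sent.text for sent in nlp(text).sents]

# ... while the model's own tokenizer reports how many subword tokens each
# sentence will take up in the dense retriever's input window.
for sent in sentences:
    n_subwords = len(dpr_tokenizer.tokenize(sent))
    print(f"{n_subwords:3d} subword tokens | {sent}")
```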