To encode long documents with Sentence Transformers, developers typically split the text into smaller chunks or use sliding windows to stay within the model’s token limit. Most transformer-based models, including those underlying Sentence Transformers, have a maximum input length (e.g., 512 tokens for many BERT-based models), and text beyond that limit is simply truncated, so encoding a long document in one pass discards content. Splitting the text into segments ensures each chunk fits within the model’s constraints. For example, a 10,000-token document could be divided into 20 chunks of 500 tokens each. This approach requires defining logical boundaries, such as splitting at sentence or paragraph breaks, or using fixed-size chunks. Developers often use libraries like transformers to count tokens and split text programmatically. However, this method risks losing context between chunks if the splits aren’t aligned with semantic units.
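As a rough illustration, the sketch below splits a document into fixed-size token chunks using a transformers tokenizer. The model name and the 500-token chunk size are placeholders; in practice they should match the Sentence Transformers model actually in use.

```python
# Sketch: fixed-size chunking by token count (model name and chunk size are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_by_tokens(text, chunk_size=500):
    # Tokenize once without special tokens, slice the token IDs into
    # fixed-size pieces, and decode each piece back into text.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]

long_document = "..."  # placeholder: the full document text
chunks = chunk_by_tokens(long_document)
```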
A sliding window approach addresses context loss by creating overlapping chunks. For instance, a 512-token window might advance by 256 tokens each step, so adjacent chunks share half their content. This preserves local context better than non-overlapping splits. While effective, sliding windows increase computational load: with a 256-token stride, a 10,000-token document yields roughly (10,000 − 512) / 256 + 1 ≈ 39 chunks instead of 20, nearly doubling the number of encoding passes. Developers must balance context retention against resource usage. This method is particularly useful for tasks like question answering, where answers might span chunk boundaries. Tools like Hugging Face’s tokenizers can automate sliding-window generation with a configurable stride. However, even with overlaps, global document context (e.g., themes spanning the entire text) may still be diluted.
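One way to generate overlapping windows, sketched below, is the fast tokenizer’s overflow support, where stride sets how many tokens consecutive windows share. Again, the model name and window sizes are illustrative rather than prescribed.

```python
# Sketch: overlapping windows via return_overflowing_tokens (sizes are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def sliding_windows(text, window=512, overlap=256):
    # truncation + return_overflowing_tokens splits the text into windows of at
    # most `window` tokens; `stride` is the overlap between consecutive windows.
    encoded = tokenizer(
        text,
        max_length=window,
        stride=overlap,
        truncation=True,
        return_overflowing_tokens=True,
    )
    return [
        tokenizer.decode(ids, skip_special_tokens=True)
        for ids in encoded["input_ids"]
    ]
```

Each returned window can then be passed to the usual encoding call just like any other chunk.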
After splitting, developers encode each chunk separately and aggregate the results. Common strategies include averaging chunk embeddings or taking element-wise max-pooled values. For example, averaging produces a single vector representing the document’s overall meaning, while max-pooling emphasizes dominant features. The choice depends on the use case: averaging works well for semantic search, while max-pooling might better capture key terms for classification. Some implementations use hierarchical methods, first encoding paragraphs and then combining the paragraph-level embeddings. Libraries like spaCy or NLTK can help preprocess text into meaningful units before splitting. Critical considerations include aligning chunk sizes with the model’s limits, testing overlap ratios, and validating that the aggregation method preserves task-relevant information. Experimentation is key, as optimal settings vary with document structure and application goals.
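A minimal sketch of this encode-then-aggregate step is shown below, assuming the chunks produced earlier and an illustrative model name; mean- and max-pooled document vectors are computed side by side.

```python
# Sketch: encode chunks and aggregate into one document vector
# (model name is illustrative; `chunks` comes from the splitting step above).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunk_embeddings = model.encode(chunks)    # shape: (num_chunks, dim)

doc_mean = chunk_embeddings.mean(axis=0)   # overall meaning, suited to semantic search
doc_max = chunk_embeddings.max(axis=0)     # dominant features, e.g. key terms

# Normalizing is a common extra step when comparing vectors with cosine similarity.
doc_mean = doc_mean / np.linalg.norm(doc_mean)
```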
Zilliz Cloud is a managed vector database built on Milvus, perfect for building GenAI applications.