To improve throughput when generating embeddings with Sentence Transformers, batch processing is essential. Instead of processing sentences one at a time, you group them into batches and process them together. This approach maximizes hardware efficiency, especially on GPUs, which excel at parallel computation. By reducing the overhead of repeated model calls and leveraging parallel processing, you can significantly speed up embedding generation for large datasets.
The primary method involves the `encode` function provided by the Sentence Transformers library, which supports batching out of the box. For example, if you have a list of 1,000 sentences, you can pass the entire list to `model.encode()` with a specified `batch_size` parameter (e.g., `batch_size=64`). The library automatically splits the list into batches and processes them sequentially. Larger batch sizes generally improve throughput, but you must balance this against GPU memory constraints. For instance, a batch of 64 sentences might process in roughly the same time as a single sentence, depending on the model and hardware. Testing different batch sizes is key to finding the optimal balance.
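As a minimal sketch of that call (the model name and synthetic corpus here are illustrative assumptions, not taken from the article):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small, fast model used purely for illustration
model = SentenceTransformer("all-MiniLM-L6-v2")

# A synthetic corpus standing in for your 1,000 sentences
sentences = [f"This is example sentence number {i}." for i in range(1000)]

# encode() splits the list into batches of 64 internally and runs each
# batch through the model in a single forward pass
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

print(embeddings.shape)  # (1000, 384) for this model
```

To tune `batch_size`, you can simply re-run the call with increasing values and time it, backing off when you hit an out-of-memory error.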
Several optimizations can further enhance performance. First, pre-sorting sentences by length minimizes padding within batches, since the tokenizer pads every sequence to the longest sentence in its batch; grouping sentences of similar lengths reduces wasted computation. Second, mixed-precision inference (FP16) can cut memory usage and speed up processing if your GPU supports it. Third, keeping data on the GPU by setting `convert_to_tensor=True` avoids costly transfers between CPU and GPU; for example, `model.encode(sentences, batch_size=128, convert_to_tensor=True, device='cuda')` processes batches directly on the GPU. Finally, for extremely large datasets, splitting the data into chunks and processing them sequentially with appropriate batch sizes prevents out-of-memory errors while maintaining throughput.
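A minimal sketch tying these tips together, assuming a CUDA-capable GPU and the same illustrative model; the `encode_in_chunks` helper and its `chunk_size` are hypothetical choices for this example, not library features. (One caveat on the first tip: recent versions of the library already sort inputs by length inside `encode()`, so manual pre-sorting matters mainly if you build your own batching loop.)

```python
import torch
from sentence_transformers import SentenceTransformer

# Assumes a CUDA GPU; falls back to CPU so the script still runs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Optional FP16 inference; .half() is a standard PyTorch call, but verify
# your GPU and model behave well in half precision before relying on it
if device == "cuda":
    model.half()

def encode_in_chunks(model, sentences, chunk_size=10_000, batch_size=128):
    """Encode a large corpus chunk by chunk to avoid out-of-memory errors."""
    parts = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        # convert_to_tensor=True keeps each chunk's result on the GPU
        # while it is being computed, avoiding per-batch CPU round trips
        emb = model.encode(chunk, batch_size=batch_size, convert_to_tensor=True)
        parts.append(emb.cpu())  # move finished chunks off the GPU to free memory
    return torch.cat(parts)

sentences = [f"Example sentence {i}" for i in range(50_000)]
embeddings = encode_in_chunks(model, sentences)
print(embeddings.shape)  # torch.Size([50000, 384])
```

Moving each finished chunk to the CPU keeps peak GPU memory bounded by one chunk's embeddings rather than the whole corpus, which is what lets throughput stay high without risking out-of-memory errors.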