When encoding sentences with Sentence Transformers, batch size directly impacts throughput and memory usage. Larger batches generally increase throughput by letting the GPU process more sentences in parallel. However, this comes at the cost of higher memory consumption, because the GPU must store intermediate activations for every sentence in the batch. For example, doubling the batch size from 16 to 32 might nearly double the memory required, but it can also nearly double the number of sentences processed per second, up to the point where GPU resources are saturated.
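The batch size is controlled through the batch_size parameter of SentenceTransformer.encode. The snippet below is a minimal sketch, assuming the all-mpnet-base-v2 model and a synthetic list of sentences:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical workload; replace with your own sentences.
sentences = ["Milvus is a vector database built for scalable similarity search."] * 1_000

model = SentenceTransformer("all-mpnet-base-v2")

# batch_size controls how many sentences go through the model per forward pass.
# Larger values usually raise throughput, but also raise peak GPU memory.
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (1000, 768) for all-mpnet-base-v2
```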
Throughput improves with larger batches because the fixed overhead of moving data and running the model is spread across more sentences. For instance, if encoding 16 sentences takes 100ms, encoding 32 might take 120ms, raising throughput from 160 to about 267 sentences per second, roughly a 1.7× improvement. However, gains diminish as the batch size approaches the GPU's capacity. If a GPU handles 64 sentences in 200ms (320 sentences/second), pushing to 128 might cause out-of-memory errors or take nearly twice as long (e.g., 380ms, about 337 sentences/second), a marginal gain that makes the trade-off less worthwhile. Developers should test their hardware to find the "sweet spot" where throughput plateaus while memory usage remains stable.
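One straightforward way to find that sweet spot is to time the same workload at several batch sizes and see where throughput stops improving. This is a rough sketch, assuming a CUDA GPU and a synthetic corpus; the exact numbers will differ on your hardware:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
sentences = ["A moderately long example sentence for benchmarking."] * 2_000  # hypothetical workload

for batch_size in (16, 32, 64, 128):
    model.encode(sentences[:batch_size], batch_size=batch_size)  # warm-up pass
    start = time.perf_counter()
    model.encode(sentences, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:>3}: {len(sentences) / elapsed:,.0f} sentences/sec")
```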
Memory usage scales roughly linearly with batch size, assuming fixed sentence lengths. For example, with a model like all-mpnet-base-v2 on a 16GB GPU, a batch size of 64 might consume 12GB, leaving room for other processes; exceeding this risks out-of-memory crashes. To optimize, developers can adjust batch sizes dynamically based on input lengths (e.g., smaller batches for longer texts) or sort sentences by length to minimize padding waste. Tools like PyTorch's DataLoader with collate_fn can help automate this, as sketched below. Always monitor GPU memory (e.g., with nvidia-smi) during testing to avoid instability.
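As an illustration of those suggestions, here is a minimal sketch that sorts a hypothetical mixed-length corpus by length, batches it with a DataLoader and a trivial collate_fn, and reports peak GPU memory per batch via torch.cuda as a programmatic complement to watching nvidia-smi. The corpus, batch size, and model choice are assumptions for the example, not a prescribed setup:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

# Hypothetical corpus with a mix of short and long texts.
corpus = ["short search query"] * 500 + [
    "a much longer passage about vector databases and embeddings " * 20
] * 500

# Sort by length so each batch pads to a similar size, reducing wasted compute.
order = np.argsort([len(text) for text in corpus])
sorted_corpus = [corpus[i] for i in order]

# The collate_fn just gathers raw strings; model.encode handles tokenization and padding.
loader = DataLoader(sorted_corpus, batch_size=64, collate_fn=list)

chunks = []
for batch in loader:
    torch.cuda.reset_peak_memory_stats()
    chunks.append(model.encode(batch, batch_size=len(batch)))
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    # Watch this alongside nvidia-smi; shrink the batch size if it nears the GPU's limit.
    print(f"peak GPU memory for this batch: {peak_gb:.2f} GB")

# Undo the length sort so embeddings line up with the original corpus order.
embeddings = np.concatenate(chunks)[np.argsort(order)]
print(embeddings.shape)
```

If the long-text batches push peak memory toward the card's limit, reduce the batch size for those batches or tighten the model's max_seq_length.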
Zilliz Cloud is a managed vector database built on Milvus, perfect for building GenAI applications.