To use a GPU for faster embedding generation with Sentence Transformers, you need to ensure the model and data are moved to the GPU using PyTorch's CUDA support. Sentence Transformers is built on PyTorch and automatically leverages GPU acceleration once the device is specified. The primary code change is setting the model to use cuda instead of cpu during initialization: when loading a model, pass device='cuda' to the SentenceTransformer constructor. If the model is already loaded on the CPU, you can move it to the GPU with model.to('cuda'). Input data, such as text sentences, is then automatically processed on the GPU during embedding generation because the model is configured for CUDA.
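Both placement options look roughly like this (a minimal sketch, assuming a CUDA-capable GPU is present; the all-MiniLM-L6-v2 model is just an example, and any Sentence Transformers model works the same way):

from sentence_transformers import SentenceTransformer

# Option 1: place the model on the GPU at load time.
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

# Option 2: load first, then move the model to the GPU afterwards.
model = SentenceTransformer('all-MiniLM-L6-v2')
model.to('cuda')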
A key consideration is ensuring your inputs are compatible with GPU processing. When using the encode() method to generate embeddings, Sentence Transformers internally converts text inputs into tensors and moves them to the same device as the model. For example, embeddings = model.encode(sentences) will handle device placement automatically if the model is on the GPU. However, if you're processing large datasets, batching is critical for maximizing GPU utilization. You can set the batch_size parameter in encode() to a higher value (e.g., 32 or 64) to parallelize computation. Larger batches exploit the GPU's parallelism but must stay within memory limits; monitor VRAM usage to avoid out-of-memory errors.
Additionally, explicit device management gives you more control. For instance, you can check GPU availability with torch.cuda.is_available() and set the device conditionally. Here's a code snippet:
from sentence_transformers import SentenceTransformer
import torch

# Fall back to the CPU when no CUDA-capable GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

sentences = ["This is an example sentence.", "Another example."]
# convert_to_tensor=True returns a tensor on the same device as the model.
embeddings = model.encode(sentences, batch_size=32, convert_to_tensor=True)
Setting convert_to_tensor=True keeps embeddings on the GPU, which is useful for downstream GPU-accelerated tasks. If you need NumPy arrays, omit this parameter or move the tensor back to the CPU with .cpu() before conversion. No other code changes are required; the library handles underlying operations like data transfer and kernel optimizations.
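As a sketch of both paths, the snippet below keeps the embeddings on the GPU for a cosine-similarity computation and then copies them back to the CPU as a NumPy array; the util.cos_sim call is just one example of a downstream GPU task, and the model name is again illustrative:

from sentence_transformers import SentenceTransformer, util

# Assumes a CUDA GPU is available, as in the earlier examples.
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
sentences = ["This is an example sentence.", "Another example."]

embeddings = model.encode(sentences, convert_to_tensor=True)  # stays on the GPU
scores = util.cos_sim(embeddings, embeddings)                 # GPU-accelerated similarity

embeddings_np = embeddings.cpu().numpy()                      # copy back for NumPy/CPU use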