Handling very large datasets that exceed memory capacity requires strategies that process data incrementally rather than loading everything at once. The Sentence Transformers library doesn’t provide built-in streaming features, but you can achieve this by manually implementing chunked processing or using generators. The key is to load, process, and discard data in batches, avoiding memory overload while maintaining efficiency. For both embedding generation and model training, the approach involves iterating through the data in manageable segments, often with help from utilities in Python or PyTorch.
For embedding tasks, you can process text in chunks by reading from disk incrementally. For example, read a CSV file in batches using pandas with chunksize, or iterate through a text file line by line. Each batch is passed to model.encode() to generate embeddings, which are immediately saved to disk (e.g., as numpy files or in a database). This avoids holding all embeddings in memory. Here's a simplified example that reads the file in fixed-size batches:
from itertools import islice

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

with open('large_data.txt', 'r') as f:
    batch_idx = 0
    while True:
        lines = [line.strip() for line in islice(f, 1000)]  # Read up to 1k lines
        if not lines:  # islice returns nothing at end of file
            break
        embeddings = model.encode(lines)
        np.save(f'embeddings_batch_{batch_idx}.npy', embeddings)  # one file per batch
        batch_idx += 1
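If the data lives in a CSV file instead, the same pattern works with pandas' chunksize argument; here is a minimal sketch, where the file name and the 'text' column are placeholder assumptions:

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# pd.read_csv with chunksize yields one DataFrame at a time instead of loading the whole file
for i, chunk in enumerate(pd.read_csv('large_data.csv', chunksize=1000)):
    embeddings = model.encode(chunk['text'].tolist())
    np.save(f'embeddings_batch_{i}.npy', embeddings)  # write each chunk's embeddings separately

Because each chunk is encoded and written out before the next one is read, only one batch of text and embeddings is ever held in memory.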
For training, PyTorch's DataLoader and custom Dataset classes enable on-the-fly loading. Use the datasets library from Hugging Face to stream data directly from disk or remote sources without loading everything into memory. For example, the load_dataset function can read a large file in chunks:
import torch
from datasets import load_dataset

# streaming=True returns an iterable dataset that reads rows lazily from disk
dataset = load_dataset('csv', data_files='large_data.csv', streaming=True)
train_loader = torch.utils.data.DataLoader(dataset['train'], batch_size=32)
During training, each batch is loaded only when needed. Sentence Transformers' training API (e.g., SentenceTransformer.fit) can consume such DataLoaders, provided each item is an InputExample that the library's loss functions understand. Additionally, techniques like gradient checkpointing or mixed precision can reduce memory usage further. For very large datasets, consider saving intermediate model checkpoints so training can resume if interrupted.
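As a rough sketch of how these pieces can fit together (the CSV file name, its 'sentence1'/'sentence2' columns, and the step counts are illustrative assumptions), a streaming Hugging Face dataset can be wrapped so that it yields the InputExample objects fit() expects:

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

class StreamingPairs(torch.utils.data.IterableDataset):
    """Wraps a streaming Hugging Face dataset and yields InputExample objects."""
    def __init__(self, hf_stream):
        self.hf_stream = hf_stream
    def __iter__(self):
        for row in self.hf_stream:  # rows are read from disk lazily
            yield InputExample(texts=[row['sentence1'], row['sentence2']])

stream = load_dataset('csv', data_files='large_pairs.csv', streaming=True)['train']
train_loader = torch.utils.data.DataLoader(StreamingPairs(stream), batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    steps_per_epoch=1000,            # set explicitly: a streamed dataset has no known length
    warmup_steps=100,
    use_amp=True,                    # mixed precision to reduce GPU memory usage
    checkpoint_path='checkpoints/',  # save intermediate checkpoints to resume if interrupted
    checkpoint_save_steps=500,
)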
In summary, while Sentence Transformers doesn’t offer native streaming APIs, standard practices like chunked file reading, generator-based batching, and PyTorch utilities let you handle large datasets efficiently. The critical steps are incremental data loading, immediate saving of outputs, and leveraging existing libraries for memory-safe data pipelines.
Zilliz Cloud is a managed vector database built on Milvus, perfect for building GenAI applications.