If you encounter NaN or infinite values in the loss during Sentence Transformer training, start by checking your input data and preprocessing. Invalid inputs, such as empty strings, malformed text, or unexpected characters, can lead to unstable embeddings. For example, if a text entry is empty, the model might produce zero vectors or undefined values during embedding computation. Verify your data pipeline by adding checks for empty strings, special characters, or inconsistent encoding. Use a custom collate function to filter or handle problematic batches. Additionally, ensure text normalization (like lowercasing or removing non-ASCII characters) is applied consistently. If your dataset includes numerical features alongside text, confirm they're scaled appropriately; values with extreme magnitudes can destabilize the model. Tools like torch.utils.data.DataLoader with error logging can help isolate bad batches.
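As a minimal sketch of that kind of pre-batching check (the toy dataset, the "text" field name, and the print-based logging are placeholders, not part of any particular pipeline), a custom collate function can drop and log empty or non-string entries before they ever reach the model:

```python
from torch.utils.data import DataLoader

# Toy stand-in dataset; replace with your own list/Dataset of examples.
train_examples = [
    {"text": "A valid sentence."},
    {"text": ""},  # an empty string that could otherwise destabilize training
    {"text": "Another valid sentence."},
]

def filtering_collate(batch):
    """Drop examples with empty or non-string text and log them for inspection."""
    clean = []
    for example in batch:
        text = example.get("text", "")  # "text" is a hypothetical field name
        if isinstance(text, str) and text.strip():
            clean.append(example)
        else:
            print(f"Skipping invalid example: {example!r}")  # simple error logging
    return clean  # downstream code must tolerate a smaller (possibly empty) batch

loader = DataLoader(train_examples, batch_size=2, collate_fn=filtering_collate)

for batch in loader:
    if not batch:
        continue  # every example in this batch was invalid
    # tokenize/encode the cleaned batch here
    print(f"Clean batch of {len(batch)} examples")
```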
Next, inspect the model architecture and loss function. Exploding gradients or unstable operations in the loss calculation often cause NaN values. For instance, contrastive loss or triplet loss involves pairwise distance calculations; if embeddings collapse (e.g., all vectors become identical), divisions by near-zero values or invalid similarities may occur. Add gradient clipping (e.g., torch.nn.utils.clip_grad_norm_) to limit extreme weight updates. Review custom loss implementations for numerical edge cases: logarithms, divisions, or exponentials can fail if inputs are negative or zero. For example, a cosine similarity score of -1 might cause issues in certain loss formulations. Test the loss function with synthetic data (e.g., random embeddings) to reproduce the error. Also, check embedding initialization: poorly scaled initial weights (too large or small) can amplify instability in early training steps.
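One way to run such a test, sketched below in plain PyTorch (the toy loss and the dimensions are stand-ins for your own implementation), is to feed the loss random and deliberately collapsed embeddings and check for non-finite outputs and gradients; the commented lines show where clip_grad_norm_ would sit in a manual training loop:

```python
import torch
import torch.nn.functional as F

def toy_contrastive_loss(anchor, positive):
    """Stand-in loss: replace with the loss you actually train with."""
    sims = F.cosine_similarity(anchor, positive, dim=-1)
    return -torch.log((sims + 1) / 2 + 1e-9).mean()  # epsilon guards against log(0)

def check_loss_stability(loss_fn, dim=384, batch_size=16, trials=50):
    for _ in range(trials):
        anchor = torch.randn(batch_size, dim, requires_grad=True)
        # Collapsed case: positives identical to anchors, mimicking embedding collapse
        positive = anchor.detach().clone()
        loss = loss_fn(anchor, positive)
        loss.backward()
        if not torch.isfinite(loss):
            print("Non-finite loss:", loss.item())
        if not torch.isfinite(anchor.grad).all():
            print("Non-finite gradients detected")

check_loss_stability(toy_contrastive_loss)

# In a manual training loop, clip gradients before the optimizer step:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
```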
Finally, adjust training hyperparameters and environment settings. A high learning rate can cause abrupt weight updates, leading to numerical overflow. Reduce the learning rate (e.g., from 1e-3 to 1e-5) and use a learning rate scheduler. If using mixed-precision training (fp16), switch to fp32 temporarily; reduced precision can underflow or overflow gradients. For optimizers like Adam, confirm eps (a stability term) isn't set too low (the default of 1e-8 usually works). Monitor batch statistics (e.g., gradient norms, embedding magnitudes) with tools like TensorBoard. If using pooling layers (e.g., mean/max pooling), ensure they handle variable-length sequences correctly; a sequence length of 0 due to truncation can cause division-by-zero errors. For domain-specific data, consider pretraining on a smaller, cleaner subset to identify stability issues before full-scale training.
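For the pooling case specifically, here is a minimal sketch of mean pooling that clamps the valid-token count so an all-padding (zero-length) sequence cannot trigger a division by zero; the shapes assumed are the usual [batch, seq_len, dim] token embeddings and [batch, seq_len] attention mask, and the clamp is a guard similar to what common mean-pooling implementations apply:

```python
import torch

def safe_mean_pooling(token_embeddings, attention_mask):
    """Mean-pool token embeddings, guarding against sequences with zero valid tokens.

    token_embeddings: [batch, seq_len, dim]; attention_mask: [batch, seq_len] of 0/1.
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # [batch, seq_len, 1]
    summed = (token_embeddings * mask).sum(dim=1)                   # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoids division by zero
    return summed / counts

# Quick check with a deliberately empty sequence (all-zero mask):
emb = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 0, 0], [0, 0, 0, 0]])
pooled = safe_mean_pooling(emb, mask)
print(torch.isfinite(pooled).all())  # tensor(True): the empty row yields zeros, not NaN
```

Logging the gradient norm each step (for example, the total norm returned by torch.nn.utils.clip_grad_norm_) to TensorBoard also gives early warning before the loss actually turns NaN.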