
What could cause a Sentence Transformer model to produce very low similarity scores for pairs of sentences that are obviously similar in meaning?

A Sentence Transformer model might produce low similarity scores for semantically similar sentences due to three primary factors: training data mismatch, inadequate preprocessing, or suboptimal model configuration. First, these models are trained on specific datasets and tasks, so if the input sentences differ significantly from the training data in domain, style, or vocabulary, the embeddings may fail to capture meaningful relationships. For example, a model trained on formal news articles might struggle with colloquial social media text, even if the meanings align. Second, preprocessing steps like tokenization, casing, or handling special characters can distort sentence structure. A sentence pair like “It’s a rainy day” and “It is raining today” might receive low scores if contractions or word forms aren’t standardized. Third, the model’s pooling strategy (e.g., mean pooling vs. CLS token) or similarity metric (e.g., cosine vs. dot product) might not align with the use case, leading to unintuitive results.
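To see the metric and normalization issue concretely, here is a minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint discussed later; the sentence pair is illustrative. It compares cosine similarity with a raw dot product and shows how L2 normalization makes the two agree:

```python
# Minimal sketch: how the similarity metric interacts with embedding normalization.
# Model name and sentence pair are illustrative choices, not prescriptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["It's a rainy day", "It is raining today"]
emb = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity is scale-invariant; a raw dot product is not.
print("cosine:", util.cos_sim(emb[0], emb[1]).item())
print("dot:   ", util.dot_score(emb[0], emb[1]).item())

# With L2-normalized embeddings, dot product and cosine agree, removing one
# source of unintuitive score differences.
emb_norm = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
print("dot (normalized):", util.dot_score(emb_norm[0], emb_norm[1]).item())
```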

Consider a scenario where a model trained for paraphrase detection is used for general similarity tasks. Paraphrase models are optimized to distinguish near-identical rephrasings from slight variations, which might cause them to score “The cat sat on the mat” and “A feline rested on the rug” lower than expected, despite clear semantic overlap. Similarly, if sentences contain domain-specific jargon—like medical terms in a model trained on movie reviews—the embeddings won’t reflect meaningful connections. Another example is multilingual models: translating “Hello, how are you?” to Spanish (“Hola, ¿cómo estás?”) might yield low scores if the model isn’t fine-tuned to align cross-lingual embeddings. These issues stem from mismatches between the model’s training objectives and the actual use case.
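A quick way to check the cross-lingual case is to compare an English-focused checkpoint with a multilingual one on the same pair. In the sketch below, paraphrase-multilingual-MiniLM-L12-v2 is used as an assumed example of a model trained to align embeddings across languages; substitute whichever models you are actually evaluating:

```python
# Hedged sketch: compare an English-focused model with a multilingual one on a
# cross-lingual pair. Model names are common sentence-transformers checkpoints
# chosen for illustration.
from sentence_transformers import SentenceTransformer, util

pair = ["Hello, how are you?", "Hola, ¿cómo estás?"]

for name in [
    "all-MiniLM-L6-v2",                       # English-focused training data
    "paraphrase-multilingual-MiniLM-L12-v2",  # trained to align languages
]:
    model = SentenceTransformer(name)
    emb = model.encode(pair, convert_to_tensor=True)
    print(f"{name}: cos_sim = {util.cos_sim(emb[0], emb[1]).item():.3f}")
```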

To diagnose the problem, developers should first verify the model’s training context. For instance, the all-MiniLM-L6-v2 model is tuned for fast, general-purpose similarity, while all-mpnet-base-v2 trades speed for higher accuracy. If the task involves short phrases, a model trained on sentence pairs (e.g., nli-roberta-base) might perform better. Next, inspect preprocessing: ensure tokenization aligns with the model’s expected input (e.g., BERT-style WordPiece). Test the model with canonical examples like ["I love programming", "Coding is my passion"] to check baseline performance. Finally, experiment with normalization: applying mean pooling with L2 normalization often stabilizes cosine scores. If all else fails, fine-tuning the model on a small dataset of labeled sentence pairs from the target domain can recalibrate its similarity thresholds.
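One simple way to run that baseline test is sketched below: encode one pair that should obviously score high and one that should score low, using L2-normalized embeddings, and compare the results. The model name and pairs are illustrative defaults:

```python
# Diagnostic sketch: canonical pairs separate model/preprocessing problems from
# domain-mismatch problems. Model name and sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("I love programming", "Coding is my passion"),     # should score high
    ("I love programming", "The weather is terrible"),  # should score low
]

for a, b in pairs:
    emb = model.encode([a, b], convert_to_tensor=True, normalize_embeddings=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")

# If even the canonical positive pair scores low, suspect the model choice or
# preprocessing; if it scores high but your domain pairs don't, suspect domain mismatch.
```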

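If fine-tuning does become necessary, a minimal sketch using the sentence-transformers training API might look like the following. The labeled pairs and similarity labels here are made up purely for illustration; a real recalibration would need a few hundred to a few thousand labeled pairs from the target domain.

```python
# Minimal fine-tuning sketch with the classic sentence-transformers training API.
# The example pairs and labels are fabricated for illustration only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Small set of labeled pairs from the target domain (labels in [0, 1]).
train_examples = [
    InputExample(texts=["The cat sat on the mat", "A feline rested on the rug"], label=0.9),
    InputExample(texts=["The cat sat on the mat", "Quarterly revenue rose 4%"], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)  # regress cosine similarity toward the labels

# One epoch over a handful of pairs is only a smoke test of the pipeline;
# scale up the data before trusting the recalibrated scores.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```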