Semantic search using Sentence Transformers can return poor results due to three main factors: suboptimal model selection or embedding handling, inadequate data preprocessing, and insufficient retrieval configuration. Let’s break these down and outline solutions for each.
First, model choice and embedding handling matter. If you’re using a general-purpose embedding model (e.g., all-MiniLM-L6-v2) for a domain-specific task (e.g., medical or legal documents), it may fail to capture niche semantic relationships. For example, a model trained on Wikipedia text might not understand technical jargon in engineering reports. Switch to a more specialized model (like multi-qa-mpnet-base-dot-v1 for QA tasks) or fine-tune a base model on your data. Additionally, ensure embeddings are normalized before computing similarity—models like all-mpnet-base-v2 assume cosine similarity, and unnormalized vectors can skew results. Use sklearn.preprocessing.normalize or equivalent to scale embeddings to unit length.
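As a minimal sketch (the documents and query here are illustrative): sentence-transformers’ normalize_embeddings=True scales each vector to unit length at encode time, which is equivalent to applying sklearn.preprocessing.normalize afterwards and makes a plain dot product equal cosine similarity.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative documents; substitute your own corpus.
docs = [
    "Patient presented with acute myocardial infarction.",
    "The turbine blade showed fatigue cracking after 10,000 load cycles.",
]

model = SentenceTransformer("all-mpnet-base-v2")

# normalize_embeddings=True scales each vector to unit length,
# so a plain dot product between vectors equals cosine similarity.
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode("heart attack symptoms", normalize_embeddings=True)

scores = np.dot(doc_emb, query_emb)  # cosine similarities, shape (len(docs),)
print(scores)
```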
Second, poor data preprocessing harms embedding quality. If your input text contains noise (HTML tags, typos, or irrelevant phrases), the embeddings will encode that noise. For instance, a product description cluttered with boilerplate text like “Terms and conditions apply” might match unrelated documents. Clean inputs by removing non-text elements, standardizing casing, and segmenting long documents into chunks (e.g., 512-token sections). For asymmetric tasks (e.g., query vs. paragraph), preprocess queries and documents differently—queries might need keyword extraction, while documents require summarization. Tools like spaCy or NLTK can help with tokenization and stopword removal.
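A minimal preprocessing sketch along these lines might look as follows. Note that clean_text and chunk_words are illustrative helpers, not library functions, and the word-count limit is only a rough stand-in for a tokenizer-accurate 512-token budget.

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, collapse whitespace, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text.lower()

def chunk_words(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks.
    350 words is a rough proxy for a 512-token budget; verify with your tokenizer."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, max(len(words), 1), step)]

raw = "<p>Terms and conditions apply.</p><p>This pump delivers 50 L/min at 3 bar.</p>"
chunks = chunk_words(clean_text(raw))
```

Overlapping chunks reduce the chance that a relevant passage is split across a chunk boundary and lost to retrieval.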
Third, retrieval methods need tuning. A single-pass nearest-neighbor search over bi-encoder embeddings often underperforms because the query and each document are scored independently, without joint context. At scale, use approximate nearest neighbor (ANN) libraries like FAISS or Annoy with parameters tuned for your dataset—for example, increasing the nlist parameter of a FAISS IVF index for larger datasets. Post-retrieval re-ranking with a cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) can refine results by scoring top candidates more accurately. For example, retrieve 100 candidates with ANN, then re-rank them with a cross-encoder to surface the 20 most contextually relevant. Also, experiment with hybrid approaches combining BM25 (keyword-based) and semantic scores to balance precision and recall.
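As a rough end-to-end sketch of this retrieve-then-re-rank pipeline (it assumes the chunks list from the preprocessing example above; the query string is illustrative, and for a toy corpus smaller than nlist a flat IndexFlatIP index is more appropriate than IVF):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

# Assumes `chunks` holds preprocessed document chunks and embeddings are
# normalized, so inner product equals cosine similarity.
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
doc_emb = model.encode(chunks, normalize_embeddings=True).astype("float32")

d = doc_emb.shape[1]
nlist = 128  # number of IVF partitions; increase for larger datasets
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_emb)  # IVF indexes must be trained before vectors are added
index.add(doc_emb)
index.nprobe = 16     # probe more partitions for higher recall, at some latency cost

query = "how to fix pump cavitation"  # illustrative query
q_emb = model.encode([query], normalize_embeddings=True).astype("float32")
sims, ids = index.search(q_emb, 100)  # stage 1: retrieve 100 candidates

mask = ids[0] != -1                   # FAISS pads missing results with -1
candidates = [chunks[i] for i in ids[0][mask]]
sem_scores = sims[0][mask]

# Stage 2: re-rank the candidates with a cross-encoder, then keep the top 20.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, c) for c in candidates])
top20 = [candidates[i] for i in np.argsort(ce_scores)[::-1][:20]]
```

For the hybrid approach, one minimal option (assuming the rank_bm25 package; the 0.7 weight is illustrative and should be tuned on a validation set) is to min-max normalize both signals and take a weighted sum:

```python
from rank_bm25 import BM25Okapi

# BM25 over the same candidate pool; whitespace tokenization is deliberately naive.
bm25 = BM25Okapi([c.split() for c in candidates])
bm25_scores = np.array(bm25.get_scores(query.split()))

def minmax(x):
    """Min-max normalize so keyword and semantic scores are on comparable scales."""
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.7  # weight on the semantic signal
hybrid = alpha * minmax(sem_scores) + (1 - alpha) * minmax(bm25_scores)
best = [candidates[i] for i in np.argsort(hybrid)[::-1][:20]]
```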
By addressing these three areas—model selection, data quality, and retrieval pipeline—you can significantly improve semantic search results. Start by auditing your current setup: test domain-specific models, clean your data, and implement re-ranking to see which adjustments yield the largest gains.