How do you use Sentence Transformers in a multi-lingual setting (for example, loading a multilingual model to encode sentences in different languages)?

To use Sentence Transformers in a multilingual setting, you load a pre-trained multilingual model and then encode text in various languages with that same model. These models are trained on datasets spanning many languages, allowing them to map sentences from different languages into a shared embedding space. This means embeddings for semantically similar sentences, regardless of language, end up close together in that space, enabling cross-lingual tasks like search or clustering.

First, install the sentence-transformers library (pip install sentence-transformers) and load a multilingual model. For example, the paraphrase-multilingual-MiniLM-L12-v2 model supports over 50 languages. Use SentenceTransformer to initialize it:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

Once loaded, pass sentences in any supported language to the model’s encode method. There is no separate language-detection step; the model’s shared multilingual tokenizer handles the input text directly:

sentences = [
 "Hello, world!", # English
 "Bonjour le monde!", # French
 "Hola, mundo!" # Spanish
]
embeddings = model.encode(sentences)

The output is a list of 384-dimensional vectors (for this specific model), which you can compare using cosine similarity to measure cross-lingual semantic similarity.
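
For example, continuing from the snippet above, the library’s util.cos_sim helper can turn those embeddings into a pairwise similarity matrix; a minimal sketch:

from sentence_transformers import util

# Pairwise cosine similarities between the embeddings computed above.
# Semantically equivalent sentences should score high regardless of language.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)

Each entry similarities[i][j] is the cosine similarity between sentence i and sentence j, so the off-diagonal values show how closely the English, French, and Spanish greetings align.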

Key considerations include model selection and input handling. Multilingual models vary in language coverage and performance; some prioritize European languages, while others include Asian or African languages. Check the model’s documentation for supported languages. For languages with complex scripts (e.g., Chinese or Arabic), ensure text is properly tokenized. Some models handle subword tokenization internally, but preprocessing steps like Unicode normalization might improve consistency.

Also, avoid mixing languages within a single sentence unless the model explicitly supports it, as this can degrade performance. Finally, test the model’s output quality for your target languages using domain-specific data if needed. For instance, if building a cross-lingual FAQ matcher, verify that embeddings for equivalent questions in different languages align correctly.
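
As a rough sketch of that last check, you could encode a few equivalent question pairs and confirm that each question’s closest match is its translation rather than an unrelated question. The questions below are hypothetical examples, not taken from any real FAQ:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Hypothetical English FAQ entries and their French equivalents.
english = ["How do I reset my password?", "What payment methods do you accept?"]
french = ["Comment réinitialiser mon mot de passe ?", "Quels moyens de paiement acceptez-vous ?"]

en_emb = model.encode(english)
fr_emb = model.encode(french)

# Each English question should be most similar to its own translation.
scores = util.cos_sim(en_emb, fr_emb)
for i, question in enumerate(english):
    best = int(scores[i].argmax())
    print(question, "->", french[best], round(float(scores[i][best]), 3))

If the highest-scoring French question for an English question is not its translation, that is a signal to try a different multilingual model or to evaluate further with domain-specific data.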
