To use a custom transformer model for generating sentence embeddings, you need to load the model and tokenizer, run inference on your input text, and pool the resulting hidden states. Unlike models loaded through the Sentence Transformers library, which handles these steps for you, custom models require manual tokenization, inference, and pooling to convert token-level outputs into sentence embeddings. This means using a library like Hugging Face Transformers to access the model architecture and utilities, followed by post-processing steps that aggregate token representations into a single vector per sentence.
Start by loading your custom model and its corresponding tokenizer using `AutoModel` and `AutoTokenizer` from the Transformers library. For example, if your model is saved locally, use `model = AutoModel.from_pretrained("path/to/model")` and `tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")`. Tokenize the input sentence(s) with `tokenizer(text, padding=True, truncation=True, return_tensors="pt")` to generate input IDs and attention masks. Pass these tensors to the model via `outputs = model(**tokens)`, which returns hidden states. By default, the last layer's hidden states (accessed via `outputs.last_hidden_state`) are used, but you can customize this based on your model's design.
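Here is a minimal sketch of these loading and inference steps. The `path/to/model` placeholder and the sample sentences are illustrative; substitute your own checkpoint path and inputs:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the custom model and its tokenizer from a local directory
# ("path/to/model" is a placeholder for your own checkpoint path).
model = AutoModel.from_pretrained("path/to/model")
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
model.eval()

sentences = [
    "A custom model can embed this sentence.",
    "Pooling turns token vectors into one sentence vector.",
]

# Tokenize with padding/truncation and return PyTorch tensors.
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**tokens)

# Token-level representations from the final layer:
# shape (batch_size, sequence_length, hidden_size).
token_embeddings = outputs.last_hidden_state
```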
The next step is pooling the token-level embeddings into a sentence embedding. A common approach is mean pooling, where you average the token vectors while ignoring padding tokens. For example, compute `attention_mask = tokens["attention_mask"].unsqueeze(-1).float()` and `sum_embeddings = torch.sum(outputs.last_hidden_state * attention_mask, dim=1)`, then divide by `torch.clamp(attention_mask.sum(dim=1), min=1e-9)` to get the mean-pooled embedding. You can also experiment with other strategies, like taking the [CLS] token's embedding or max pooling. Finally, normalize the embeddings with L2 normalization (e.g., `torch.nn.functional.normalize`) to ensure consistent scales for downstream tasks like similarity comparisons.
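The pooling arithmetic can be written out directly in PyTorch. This snippet continues from the `tokens` and `outputs` variables produced above:

```python
import torch
import torch.nn.functional as F

# Expand the attention mask to (batch, seq_len, 1) so it broadcasts
# across the hidden dimension, zeroing out padding tokens.
attention_mask = tokens["attention_mask"].unsqueeze(-1).float()

# Sum only the real (unmasked) token vectors along the sequence axis.
sum_embeddings = torch.sum(outputs.last_hidden_state * attention_mask, dim=1)

# Divide by the count of real tokens per sentence; the clamp guards
# against division by zero for degenerate inputs.
token_counts = torch.clamp(attention_mask.sum(dim=1), min=1e-9)
sentence_embeddings = sum_embeddings / token_counts

# L2-normalize so cosine similarity reduces to a simple dot product.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
```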
In practice, wrap this logic in a reusable function. For instance, create an `encode` method that accepts a list of sentences, processes them in batches, and returns normalized embeddings. Ensure your custom model is fine-tuned for sentence-level tasks (e.g., with a contrastive loss) so it produces meaningful embeddings. If performance is critical, optimize batching and device placement (e.g., move tensors to the GPU with `.to("cuda")`). This approach offers flexibility but requires careful testing to match the quality of dedicated sentence embedding models.
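Putting the pieces together, a reusable `encode` function might look like the sketch below. The batch size and device-selection defaults are illustrative choices, not requirements:

```python
import torch
import torch.nn.functional as F

def encode(sentences, model, tokenizer, batch_size=32, device=None):
    """Encode a list of sentences into L2-normalized embeddings.

    A sketch wrapping the load/tokenize/pool steps above; batch_size
    and the automatic device choice are illustrative defaults.
    """
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()

    all_embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # BatchEncoding supports .to(device) to move all tensors at once.
        tokens = tokenizer(
            batch, padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            outputs = model(**tokens)
        # Mean pooling over non-padding tokens, as above.
        mask = tokens["attention_mask"].unsqueeze(-1).float()
        summed = torch.sum(outputs.last_hidden_state * mask, dim=1)
        counts = torch.clamp(mask.sum(dim=1), min=1e-9)
        embeddings = F.normalize(summed / counts, p=2, dim=1)
        all_embeddings.append(embeddings.cpu())
    return torch.cat(all_embeddings, dim=0)
```

Usage is then a single call, e.g. `embeddings = encode(["How do I embed text?"], model, tokenizer)`, which returns one normalized vector per input sentence.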
Zilliz Cloud is a managed vector database built on Milvus, making it perfect for building GenAI applications.