
How do embeddings handle rare words or objects?

Embeddings handle rare words or objects by using techniques that either break them into smaller components, leverage contextual information, or incorporate external data. For text, subword tokenization methods like WordPiece or Byte-Pair Encoding split rare words into familiar parts (e.g., “uncommon” becomes “un” + “common”), allowing models to reuse embeddings for common subwords. Character-level embeddings represent words as sequences of characters, enabling the model to handle unseen words by combining learned character patterns. For non-text objects, such as items in recommendation systems, embeddings can incorporate metadata (e.g., product categories) or use transfer learning from related domains to infer meaningful representations even with sparse data.
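As a rough illustration of the subword approach, the sketch below assumes the Hugging Face transformers package and the bert-base-uncased checkpoint (neither is prescribed above; they are just convenient stand-ins). It shows a WordPiece tokenizer splitting a rare word into known pieces and one simple way to pool those subword embeddings into a single vector for the word.

```python
# Sketch: how a WordPiece tokenizer breaks a rare word into known subwords.
# Assumes the Hugging Face `transformers` package is installed and the
# `bert-base-uncased` checkpoint can be downloaded.
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

rare_word = "uncommonness"
pieces = tokenizer.tokenize(rare_word)
print(pieces)  # e.g. ['uncommon', '##ness'] -- the exact split depends on the vocabulary

# One simple way to get a single vector for the rare word: mean-pool the
# contextual embeddings of its subword pieces.
inputs = tokenizer(rare_word, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
# Drop the [CLS] and [SEP] positions, then average the subword vectors.
word_vector = hidden[0, 1:-1].mean(dim=0)
print(word_vector.shape)  # torch.Size([768])
```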

A concrete example in NLP is BERT’s use of WordPiece tokenization. If a rare word like “zebra” isn’t in the vocabulary, it might be split into subwords such as “ze” and “##bra” (the “##” marks a word-internal piece), whose embeddings are combined. Similarly, FastText represents a word as the sum of its character n-grams (the 3-grams of “blogging” include “blo”, “log”, “ogg”, “ggi”, “gin”, and “ing”), which works even for misspelled or rare terms. In recommendation systems, a rarely purchased item like a niche gadget might have its embedding initialized using features like category (“electronics”) or brand, allowing the model to infer similarities to other gadgets despite limited interaction data. Character-level approaches are particularly useful for domain-specific terms (e.g., medical jargon) or names, where even rare sequences like “Xyzzy” can be built from common characters.
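The FastText idea can be sketched in a few lines. This is not the FastText implementation: the n-gram vectors below are random stand-ins (in the real model they are learned during training), and the dimension and seed are arbitrary. The point is that summing character n-grams yields a vector for any word, and a misspelling stays close to the correct form because most n-grams are shared.

```python
# Sketch: FastText-style embedding of a rare or misspelled word by summing
# character n-gram vectors. N-gram vectors are random stand-ins here;
# real FastText learns them during training.
import numpy as np

DIM = 8           # embedding dimension (illustrative)
rng = np.random.default_rng(0)
ngram_table = {}  # lazily populated n-gram -> vector lookup

def char_ngrams(word, n=3):
    """Character 3-grams with FastText-style boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def embed(word):
    """Sum the vectors of the word's character n-grams."""
    vec = np.zeros(DIM)
    for gram in char_ngrams(word):
        if gram not in ngram_table:
            ngram_table[gram] = rng.normal(size=DIM)
        vec += ngram_table[gram]
    return vec

# A misspelled word still gets a vector, and it tends to stay close to the
# correctly spelled form because most n-grams are shared.
a, b = embed("blogging"), embed("bloging")
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(char_ngrams("blogging"))  # ['<bl', 'blo', 'log', 'ogg', 'ggi', 'gin', 'ing', 'ng>']
print(round(cos, 2))            # typically high similarity despite the misspelling
```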

Challenges include balancing vocabulary size with computational efficiency—adding too many subwords increases memory usage, while too few lead to more unknown tokens. For objects, reliance on metadata assumes such data is available and relevant. Out-of-vocabulary (OOV) words not covered by subword rules may still require fallback strategies, like hashing or default UNK embeddings, which sacrifice specificity. Developers must choose methods based on their data: subword tokenization suits languages with morphological complexity, while character-level models excel with highly variable or noisy text. For non-text objects, combining embeddings with metadata often provides the most flexibility, but requires careful feature engineering to ensure meaningful representations.
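For the hashing fallback mentioned above, one common pattern is to map each OOV token to one of a fixed number of shared bucket embeddings instead of a single UNK vector, which retains a little specificity at low memory cost. The sketch below is illustrative: the bucket count, toy vocabulary, and random vectors are assumptions made for the example, not recommendations.

```python
# Sketch: a hashed fallback for out-of-vocabulary (OOV) tokens. Each OOV token
# is hashed into one of a fixed number of bucket embeddings instead of sharing
# a single UNK vector.
import hashlib
import numpy as np

NUM_BUCKETS = 1024   # illustrative bucket count
DIM = 16             # illustrative embedding dimension
rng = np.random.default_rng(42)

vocab = {"the": 0, "cat": 1, "sat": 2}               # known tokens (toy vocabulary)
vocab_vectors = rng.normal(size=(len(vocab), DIM))   # learned in a real model
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM)) # shared OOV buckets

def lookup(token):
    """Return the token's vector, falling back to a hashed bucket when OOV."""
    if token in vocab:
        return vocab_vectors[vocab[token]]
    # Stable hash so the same OOV token always maps to the same bucket.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return bucket_vectors[bucket]

print(lookup("cat")[:3])    # known word -> its own vector
print(lookup("Xyzzy")[:3])  # OOV word -> deterministic bucket vector
```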
