How do embeddings handle rare or unseen data?

Embeddings handle rare or unseen data by relying on generalization techniques and structural patterns learned during training. When a model encounters a rare term or an out-of-vocabulary (OOV) item, it often approximates the representation using similarities to known data. For example, in word embeddings like Word2Vec or GloVe, rare words might be assigned vectors based on their subword components, or initialized randomly and then adjusted during training. If a word like “antidisestablishmentarianism” appears infrequently, its embedding might be influenced by common prefixes (“anti-”) or suffixes (“-ism”) observed in other words. This allows the model to infer meaning even with limited examples.
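As a toy sketch of that idea (the vectors and helper function below are invented for illustration, not taken from any real library), a rare word's vector can be approximated by averaging the embeddings of known words that share its affixes:

```python
import numpy as np

# Hypothetical 3-d embeddings for a few known vocabulary words.
known_vectors = {
    "antibody":  np.array([0.9, 0.1, 0.0]),
    "antiviral": np.array([0.8, 0.2, 0.1]),
    "realism":   np.array([0.1, 0.9, 0.2]),
    "optimism":  np.array([0.2, 0.8, 0.3]),
}

def approximate_rare(word, affixes, vectors):
    """Average the vectors of known words sharing any of the given affixes."""
    matches = [v for w, v in vectors.items()
               if any(w.startswith(a.strip("-")) or w.endswith(a.strip("-"))
                      for a in affixes)]
    return np.mean(matches, axis=0) if matches else None

rare_vec = approximate_rare("antidisestablishmentarianism",
                            ["anti-", "-ism"], known_vectors)
# All four known words share an affix, so rare_vec is their mean:
# [0.5, 0.5, 0.15]
```

Real systems learn these relationships jointly rather than averaging after the fact, but the intuition is the same: shared structure lets a rare word inherit meaning from its neighbors.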

For unseen data, modern approaches like FastText or transformer-based models (e.g., BERT) use subword tokenization. FastText breaks words into character n-grams, so even an unfamiliar word like “blockchainify” can be represented as a combination of smaller units (“blo”, “loc”, “ock”, “chain”, etc.). Similarly, BERT uses WordPiece tokenization, splitting terms into known subwords (e.g., “unseen” becomes “un” + “##seen”, where “##” marks a word-internal piece). These methods enable embeddings to construct representations for new or rare terms by leveraging their structural parts. This is especially useful in languages with complex morphology or domains like biomedical text, where technical terms are often rare but built from reusable components.
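The n-gram decomposition itself is easy to sketch. The function below is a simplified, hypothetical stand-in for FastText's subword extraction: it pads a word with boundary markers and slides windows of several lengths over it.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return FastText-style character n-grams; '<' and '>' mark the
    word's start and end so prefixes and suffixes stay distinct."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

grams = char_ngrams("blockchainify")
# The unseen word's vector is then the sum (or average) of the vectors
# learned for these subword units during training.
```

Note how “<bl” captures the prefix and “ify>” the suffix, while interior units like “chain” are shared with other words the model has seen, which is what lets the unseen word land near related vocabulary.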

Contextual embeddings address rare data by dynamically adjusting representations based on surrounding text. For instance, a transformer model might infer the meaning of a rare word like “quokka” from its context in a sentence: “The quokka, a small marsupial, smiled at tourists.” Here, the words “small,” “marsupial,” and “tourists” provide clues, allowing the model to assign an embedding that aligns with known animals. In recommendation systems, collaborative filtering embeddings can handle new items by associating them with similar user interactions or metadata (e.g., a new movie tagged “sci-fi” gets placed near existing sci-fi films). While not perfect, these strategies allow embeddings to approximate useful representations without requiring exhaustive training data.
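For the cold-start recommendation case, a minimal sketch (the item names, tags, and 2-d vectors below are invented for illustration) is to place a brand-new item at the average of the learned vectors of catalog items sharing its metadata tags:

```python
import numpy as np

# Hypothetical learned 2-d embeddings for existing catalog items.
item_vectors = {
    "alien_contact":  np.array([0.9, 0.1]),
    "mars_colony":    np.array([0.8, 0.2]),
    "romcom_weekend": np.array([0.1, 0.9]),
}
item_tags = {
    "alien_contact":  {"sci-fi"},
    "mars_colony":    {"sci-fi"},
    "romcom_weekend": {"romance"},
}

def cold_start_vector(new_item_tags, vectors, tags):
    """Average the vectors of items sharing at least one tag."""
    peers = [vectors[i] for i, t in tags.items() if t & new_item_tags]
    return np.mean(peers, axis=0) if peers else None

# A new movie tagged "sci-fi" lands near the existing sci-fi cluster:
new_movie_vec = cold_start_vector({"sci-fi"}, item_vectors, item_tags)
# → [0.85, 0.15], the mean of the two sci-fi items
```

Production recommenders typically feed metadata through a learned model rather than averaging directly, but the effect is the same: side information anchors an item that has no interaction history yet.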
