
How do LLMs handle out-of-vocabulary words?

Large language models (LLMs) handle out-of-vocabulary (OOV) words by breaking them into smaller, known components called subword tokens. Instead of relying on a fixed vocabulary of whole words, models like GPT and BERT use techniques such as Byte-Pair Encoding (BPE) or SentencePiece to split unfamiliar words into parts. For example, a technical term like “quantumteleportation” might be divided into “quantum” and “teleportation” if both subwords are in the tokenizer’s vocabulary. If even those subwords are unknown, the tokenizer splits further into smaller units like “qu”, “ant”, “um”, and so on, down to individual characters or bytes if necessary. This approach allows the model to process words it hasn’t seen before by approximating their meaning through recognizable fragments.
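
To see this in action, here is a minimal sketch assuming the Hugging Face `transformers` package is installed (this library and the GPT-2 checkpoint are illustrative choices, not something prescribed by the answer above). It loads GPT-2’s BPE tokenizer and shows how an unfamiliar compound word is broken into smaller, known subword pieces:

```python
# Minimal sketch, assuming the Hugging Face "transformers" package is available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses byte-level BPE

# A term unlikely to exist as a single entry in the vocabulary
tokens = tokenizer.tokenize("quantumteleportation")
print(tokens)
# Prints a list of subword fragments; the exact split depends on the learned
# merge rules, e.g. something like ['quant', 'um', 'te', 'leport', 'ation'].

# Every fragment maps to an ID the model has an embedding for, so no input
# is ever truly "out of vocabulary": BPE can always fall back to bytes.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```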

When OOV words can’t be split into meaningful subwords, LLMs rely on context to infer their purpose. For instance, if a sentence contains a new slang term like “yeet” in “He yeeted the ball across the field,” the model analyzes surrounding words (“ball,” “across,” “field”) to guess that “yeet” relates to throwing. Similarly, domain-specific terms like a new programming library name (e.g., “PyTorchLightning2023”) might be parsed using adjacent keywords like “import” or “neural network.” However, this contextual inference isn’t foolproof. Ambiguous OOV words, especially those with no clear subword clues, might lead to incorrect interpretations. For example, a made-up medical term like “neurofloxazine” could be misinterpreted as a drug or a condition depending on sentence structure.
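
To make the contextual-inference point concrete, the sketch below (using the same assumed Hugging Face tokenizer as above) shows that a slang word is reduced to subword fragments while its context words survive as whole tokens, which is what the model leans on when interpreting the sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "yeeted" is unlikely to be a single vocabulary entry, but the surrounding
# words are, so the model must infer the slang term's role from them.
print(tokenizer.tokenize("He yeeted the ball across the field"))
# Expect common words as whole tokens and "yeeted" as fragments, roughly
# ['He', 'Ġye', 'eted', 'Ġthe', 'Ġball', 'Ġacross', 'Ġthe', 'Ġfield'];
# "Ġ" marks a leading space in GPT-2's byte-level BPE, and the exact
# fragments depend on the tokenizer's learned merges.
```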

Developers can mitigate OOV issues by preprocessing text or fine-tuning models. Preprocessing steps like spell-checking, normalizing casing, or expanding abbreviations (e.g., converting “LLM” to “large language model”) reduce OOV occurrences. For domain-specific applications, retraining the model’s tokenizer on specialized data (e.g., medical journals or code repositories) helps it recognize technical terms as coherent tokens. If an OOV word is critical (e.g., a brand name in a chatbot), explicitly adding it to the tokenizer’s vocabulary and fine-tuning the model ensures it is handled consistently, as sketched below. Testing with real-world examples, such as user-generated text with typos or niche jargon, is essential to identify and address gaps in OOV handling for specific use cases.
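
Here is a hedged sketch of that last mitigation, again assuming the Hugging Face `transformers` API and reusing the made-up names from the examples above: register critical domain terms as new tokens and resize the model’s embedding matrix before fine-tuning.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register terms that must be handled as single, stable tokens
new_tokens = ["neurofloxazine", "PyTorchLightning2023"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding table to match the enlarged vocabulary; the new rows
# start out untrained and only become meaningful after fine-tuning on text
# that actually uses these terms.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("The patient was prescribed neurofloxazine."))
# After add_tokens, "neurofloxazine" comes back as one token rather than
# a string of subword fragments.
```

Adding tokens this way is cheap, but the new embeddings are only as good as the fine-tuning data behind them; retraining the tokenizer on a domain corpus is the heavier alternative when many specialized terms are involved.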
