N-grams in information retrieval (IR) are contiguous sequences of n items (words, characters, or tokens) extracted from text to improve how documents and queries are processed. By breaking text into overlapping or adjacent chunks, n-grams help capture context and relationships between terms that single words (unigrams) might miss. For example, a search query for “machine learning” treated as a bigram (two-word sequence) ensures the system looks for the exact phrase, rather than treating “machine” and “learning” as separate, unrelated terms. This approach enhances precision by preserving semantic meaning and reducing ambiguity in queries and documents.
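To make the idea concrete, here is a minimal Python sketch of word-level n-gram extraction from a short query. The `word_ngrams` helper and its sample output are purely illustrative and not tied to any particular search library.

```python
def word_ngrams(text, n=2):
    """Return the contiguous n-word sequences (n-grams) in `text`."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("machine learning improves search", n=1))
# ['machine', 'learning', 'improves', 'search']
print(word_ngrams("machine learning improves search", n=2))
# ['machine learning', 'learning improves', 'improves search']
```

Treating "machine learning" as a single bigram term means a phrase query can be matched directly against the index instead of being split into two independent words.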
In practice, n-grams are used during indexing and query processing. When building an inverted index, IR systems may tokenize text into n-grams of varying lengths (e.g., unigrams, bigrams, trigrams) to support flexible matching. For instance, a document containing "New York City" could be split into the bigrams ["New York", "York City"] and the trigram ["New York City"], allowing a search for "York City" to match the bigram directly. N-grams also handle partial or misspelled queries when applied at the character level. For example, a typo like "aple" yields the character trigrams ["apl", "ple"], which share "ple" with "apple" (trigrams ["app", "ppl", "ple"]) and can therefore still match in a fuzzy search. This is particularly useful in autocomplete systems and spell-checking features.
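The following sketch shows one way to implement this character-level matching: it computes trigram sets for two words and scores their overlap with a Jaccard ratio. The `char_ngrams` and `trigram_similarity` helpers, and the choice of Jaccard similarity, are assumptions for illustration rather than the behavior of any specific fuzzy-search engine.

```python
def char_ngrams(word, n=3):
    """Return the set of contiguous n-character substrings of `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def trigram_similarity(a, b):
    """Jaccard overlap of two words' trigram sets (0.0 to 1.0)."""
    ta, tb = char_ngrams(a), char_ngrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(char_ngrams("aple"))    # {'apl', 'ple'}
print(char_ngrams("apple"))   # {'app', 'ppl', 'ple'}
print(round(trigram_similarity("aple", "apple"), 2))  # 0.25, via shared 'ple'
```

A fuzzy-search or spell-check layer could rank candidate terms by this kind of overlap score and surface "apple" even though the user typed "aple".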
However, n-grams come with trade-offs. Larger n values (e.g., trigrams) increase index size and computational overhead, because far more unique terms must be stored. For example, indexing 10,000 documents with bigrams rather than unigrams alone can more than double the number of unique index terms, since most two-word combinations occur in only a handful of documents. Additionally, not all n-grams are meaningful: phrases like "the and" or "is of" add noise. Developers often mitigate this by combining n-grams with filters (e.g., removing stopwords) or by using hybrid approaches (e.g., mixing unigrams with selective bigrams), as sketched below. Despite these challenges, n-grams remain a straightforward, effective way to balance specificity and flexibility in IR systems, especially when precise phrase matching or error tolerance is required.
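Here is a minimal sketch of such a hybrid approach: an inverted index that stores all non-stopword unigrams plus only those bigrams in which neither word is a stopword. The stopword list, the `index_document` helper, and the in-memory index structure are simplified assumptions for illustration, not a production design.

```python
from collections import defaultdict

STOPWORDS = {"the", "and", "is", "of", "a", "in"}

def index_document(doc_id, text, index):
    """Add unigram and filtered bigram postings for one document."""
    tokens = text.lower().split()
    # Unigrams: index every non-stopword term.
    for tok in tokens:
        if tok not in STOPWORDS:
            index[tok].add(doc_id)
    # Selective bigrams: keep only phrases free of stopwords.
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in STOPWORDS and w2 not in STOPWORDS:
            index[f"{w1} {w2}"].add(doc_id)

index = defaultdict(set)
index_document(1, "New York City is the home of machine learning meetups", index)
print(index["new york"])          # {1}
print(index["machine learning"])  # {1}
print("is the" in index)          # False: the stopword bigram was filtered out
```

Filtering at indexing time keeps useful phrase terms like "machine learning" while dropping noisy combinations such as "is the", which helps contain index growth.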