An inverted index is a data structure used in information retrieval (IR) to efficiently map terms (like words or phrases) to the documents where they appear. Unlike a forward index, which lists the terms contained in each document, an inverted index flips this relationship. It acts like a dictionary where each entry is a term, and the value is a list of document identifiers (IDs) and often additional details like term positions or frequencies. For example, if a document with ID 1 contains the words “apple” and “banana,” and another document with ID 2 has “apple” and “orange,” the inverted index would map “apple” to [1, 2], “banana” to [1], and “orange” to [2]. This structure allows search engines to quickly locate all documents containing specific terms without scanning every document in a collection.
To build an inverted index, documents are first parsed into individual terms through processes like tokenization (splitting text into words), normalization (lowercasing, removing punctuation), and stemming (reducing words to their root form, like “running” to “run”). Stop words (common words like “the” or “and”) are often excluded to save space. Each processed term is then recorded alongside the document ID and its position within the document. When a user submits a query, the search engine breaks it into terms, looks them up in the inverted index, retrieves the corresponding document lists, and combines the results. For instance, a query for “apple AND banana” would intersect the lists for “apple” ([1, 2]) and “banana” ([1]) to return document 1.
The primary advantage of an inverted index is speed. By precomputing term-to-document mappings, it avoids scanning every document during a search, which is critical for large datasets like web pages or document databases. It also supports ranking algorithms like TF-IDF or BM25, which use term frequencies and document statistics to score relevance. Developers often optimize inverted indexes using compression (to reduce memory usage) or distributed storage (for scalability in systems like Elasticsearch). While building and maintaining an inverted index requires upfront processing, the trade-off is justified by the dramatic improvement in query performance, making it a foundational component of modern search engines, databases, and IR systems.
Zilliz Cloud is a managed vector database built on Milvus perfect for building GenAI applications.
Try FreeLike the article? Spread the word