A sparse vector in information retrieval (IR) is a data structure used to represent documents or queries where most dimensions (or positions) have a value of zero. In IR, text data is typically converted into numerical vectors to enable mathematical operations like similarity comparisons. Each dimension in the vector corresponds to a unique term (word) from a predefined vocabulary. For example, if a vocabulary contains 10,000 terms, a document’s vector will have 10,000 dimensions. However, since most documents contain only a small subset of these terms, the majority of the vector’s values will be zero. This sparsity makes storage and computation efficient compared to dense vectors, where most values are non-zero.
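To make this concrete, here is a minimal sketch (with a hypothetical six-term vocabulary and toy document) contrasting a dense vector, which reserves a slot for every vocabulary term, with a sparse representation that keeps only the non-zero entries:

```python
# Toy vocabulary and document; in a real system the vocabulary would have
# thousands of terms and the document would be tokenized text.
vocabulary = ["apple", "cat", "dog", "fish", "house", "tree"]
term_to_index = {term: i for i, term in enumerate(vocabulary)}

document = ["cat", "dog", "cat"]

# Dense representation: one counter per vocabulary term, mostly zeros.
dense = [0] * len(vocabulary)
for term in document:
    dense[term_to_index[term]] += 1
print(dense)   # [0, 2, 1, 0, 0, 0]

# Sparse representation: store only the indices with non-zero counts.
sparse = {i: value for i, value in enumerate(dense) if value != 0}
print(sparse)  # {1: 2, 2: 1}
```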
Sparse vectors are commonly used in models like bag-of-words or TF-IDF (Term Frequency-Inverse Document Frequency). For instance, consider a document containing the words “cat,” “dog,” and “fish” drawn from a vocabulary of 1,000 terms. Its sparse vector might have non-zero values only at the positions corresponding to these three words (e.g., [0, 0, 3, 0, 2, …, 0, 1]), where the non-zero numbers represent term frequencies. The remaining 997 positions stay zero. Storing this as a sparse vector avoids allocating memory for all 1,000 dimensions, reducing memory usage and speeding up operations like dot products (used to calculate similarity scores).
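The speedup of the dot product comes from iterating only over stored entries. The sketch below (the dict layout and example indices are illustrative, not from the article) computes a dot product between two sparse term-frequency vectors whose cost depends on the number of non-zero entries, not on the vocabulary size:

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict and look up matching indices in the larger one.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0) for index, value in a.items())

# Hypothetical indices in a 1,000-term vocabulary:
doc = {2: 3, 4: 2, 999: 1}     # e.g. "cat" x3, "dog" x2, "fish" x1
query = {2: 1, 4: 1}           # a query mentioning "cat" and "dog"
print(sparse_dot(doc, query))  # 5
```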
In practice, sparse vectors are implemented using data structures like dictionaries or hash maps, where only the non-zero values and their indices are stored. For example, a Python dictionary might map the term “cat” to its TF-IDF weight, skipping all terms not present in the document. This efficiency is critical in large-scale systems like search engines, where processing millions of documents over high-dimensional vocabularies would be infeasible with dense representations. However, sparse vectors do not capture semantic relationships between terms (e.g., “cat” and “kitten” are treated as unrelated), a limitation addressed by dense embeddings (e.g., word2vec or BERT). Despite this, sparse vectors remain widely used for tasks like keyword search and ranking due to their simplicity and interpretability.
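As one possible way to produce such weights in practice, the sketch below uses scikit-learn's TfidfVectorizer (an assumption; the article does not name a specific library). Its output is a SciPy sparse matrix that stores only the non-zero TF-IDF weights for each document, which can then be mapped back to terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat chased the dog",
    "the fish swam past the cat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # SciPy sparse matrix, shape (2, vocab_size)

print(tfidf.shape)  # (2, number_of_vocabulary_terms)
print(tfidf.nnz)    # number of stored non-zero weights across both documents

# Map the non-zero weights of the first document back to their terms,
# much like the term -> weight dictionary described above.
terms = vectorizer.get_feature_names_out()
row = tfidf.getrow(0).tocoo()
print({terms[j]: round(w, 3) for j, w in zip(row.col, row.data)})
```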
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.