
What is TF-IDF, and how is it calculated?

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It combines two metrics: Term Frequency (TF), which measures how often a term appears in a document, and Inverse Document Frequency (IDF), which penalizes terms that appear frequently across many documents. The TF-IDF score is the product of these two values, highlighting words that are distinctive to a specific document. This method is widely used in information retrieval, text mining, and search engines to rank relevance.
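In symbols, for a term t, a document d, and a corpus D, the score is the product of the two components. The exact log base and smoothing vary by implementation; the variant below matches the calculation walked through next:

```latex
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D),
\qquad
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1}
```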

To calculate TF-IDF, start with Term Frequency (TF). This is typically the count of a term in a document divided by the total number of terms in that document. For example, if the word “code” appears 5 times in a 100-word document, the TF for “code” is 5/100 = 0.05. This normalization avoids bias toward longer documents. Next, compute Inverse Document Frequency (IDF) using the formula: log(total documents / (number of documents containing the term + 1)), where log is the natural logarithm. The “+1” is a smoothing term that prevents division by zero when a term appears in no documents. If “code” appears in 10 out of 1,000 documents, the IDF is log(1000/11) ≈ 4.51. Finally, multiply TF and IDF to get the TF-IDF score (0.05 × 4.51 ≈ 0.225). Higher scores indicate terms that are both frequent within a document and rare across the corpus.
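The arithmetic is easy to verify in a few lines of Python. This sketch reuses the hypothetical numbers from the example above (the word “code”, a 100-word document, a 1,000-document corpus); the function names are illustrative, not from any particular library:

```python
import math

def tf(term_count: int, doc_length: int) -> float:
    """Term frequency: raw count normalized by document length."""
    return term_count / doc_length

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency with +1 smoothing, natural log."""
    return math.log(total_docs / (docs_with_term + 1))

tf_score = tf(5, 100)        # 0.05
idf_score = idf(1000, 10)    # log(1000 / 11) ≈ 4.51
print(tf_score * idf_score)  # ≈ 0.225
```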

TF-IDF is practical for converting text into numerical features. For instance, in a search engine, documents with higher TF-IDF scores for a query term are ranked higher. Developers often use libraries like scikit-learn, whose TfidfVectorizer automates these calculations and generates a document-term matrix where rows represent documents, columns represent terms, and each cell holds a TF-IDF score. While TF-IDF doesn’t capture semantic meaning (unlike modern embeddings), it remains foundational for tasks like keyword extraction, document clustering, and text classification. Its simplicity, efficiency, and interpretability make it a staple in preprocessing pipelines for NLP applications.
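Here is a minimal sketch using scikit-learn’s TfidfVectorizer on a small made-up corpus. Note that scikit-learn applies its own IDF smoothing and L2 normalization by default, so the weights differ slightly from the hand calculation above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus; each string is one document.
docs = [
    "the code compiles and the tests pass",
    "the tests fail when the code changes",
    "documentation describes the code base",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Rows are documents, columns are terms, cells are TF-IDF weights.
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(3))
```

Terms that appear in every document (like “code” here) receive low weights, while terms unique to a single document score highest, which is exactly the behavior the formula is designed to produce.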
