TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It combines two metrics: Term Frequency (TF), which measures how often a term appears in a document, and Inverse Document Frequency (IDF), which penalizes terms that appear frequently across many documents. The TF-IDF score is the product of these two values, highlighting words that are distinctive to a specific document. This method is widely used in information retrieval, text mining, and search engines to rank relevance.
To calculate TF-IDF, start with Term Frequency (TF). This is typically the count of a term in a document divided by the total number of terms in that document. For example, if the word “code” appears 5 times in a 100-word document, the TF for “code” is 5/100 = 0.05. This normalizes the frequency to avoid bias toward longer documents. Next, compute Inverse Document Frequency (IDF) using the formula: log(total documents / (number of documents containing the term + 1)), where log is the natural logarithm in this example. The “+1” prevents division by zero for a term that appears in no document. If “code” appears in 10 out of 1,000 documents, the IDF is log(1000/11) ≈ 4.5. Finally, multiply TF and IDF to get the TF-IDF score (0.05 * 4.5 ≈ 0.225). Higher scores indicate terms that are both frequent in a document and rare in the corpus.
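The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the natural logarithm and the "+1" variant of IDF described in the text; the function names and the 100-word token list are hypothetical and exist only for illustration.

```python
import math

def term_frequency(term, tokens):
    """Count of `term` divided by the total number of tokens in the document."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(total_docs, docs_with_term):
    """Natural log of total_docs / (docs_with_term + 1)."""
    return math.log(total_docs / (docs_with_term + 1))

# Hypothetical 100-word document in which "code" appears 5 times.
tokens = ["code"] * 5 + ["filler"] * 95

tf = term_frequency("code", tokens)          # 5 / 100
idf = inverse_document_frequency(1000, 10)   # ln(1000 / 11)
tfidf = tf * idf

print(f"TF={tf:.2f} IDF={idf:.2f} TF-IDF={tfidf:.3f}")
```

Other IDF variants (a different log base, or different smoothing) would give different absolute scores, though the relative ordering of terms is usually what matters.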
TF-IDF is practical for converting text into numerical features. For instance, in a search engine, documents with higher TF-IDF scores for a query term are ranked higher. Developers often use a library such as scikit-learn, whose TfidfVectorizer class automates these calculations and generates a document-term matrix where rows represent documents, columns represent terms, and each cell holds the TF-IDF score of that term in that document. While TF-IDF doesn’t capture semantic meaning (unlike modern embeddings), it remains foundational for tasks like keyword extraction, document clustering, and text classification. Its simplicity, efficiency, and interpretability make it a staple in preprocessing pipelines for NLP applications.