What is TF-IDF, and how is it used in full-text search?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to measure the importance of a word in a document relative to a collection of documents (a corpus). It combines two metrics: Term Frequency (TF), which counts how often a term appears in a document, and Inverse Document Frequency (IDF), which penalizes terms that appear frequently across many documents. The product of TF and IDF gives a score that highlights terms more unique to a specific document. For example, if the word “blockchain” appears 10 times in a document but rarely in others, it will have a high TF-IDF score, signaling its relevance to that document.
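To make the definition concrete, here is a minimal sketch in plain Python (the function names `tf`, `idf`, and `tf_idf` are illustrative, not from any particular library), assuming a base-10 logarithm and pre-tokenized documents:

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: log of (total documents / documents containing the term).
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    # TF-IDF is the product of the two factors
    return tf(term, doc_tokens) * idf(term, corpus_tokens)
```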

Calculation and Application in Search

To compute TF-IDF, first calculate TF as the number of times a term appears in a document divided by the total number of terms in that document. IDF is the logarithm of the total number of documents divided by the number of documents containing the term. For instance, if a corpus has 1,000 documents and “database” appears in 100 of them, IDF is log(1000/100) = 1 (using a base-10 logarithm). If a document contains “database” 5 times among 100 total terms, TF is 5/100 = 0.05, so the TF-IDF score is 0.05 * 1 = 0.05. In full-text search, this score helps rank documents by relevance. When a user searches for “database optimization,” the engine computes TF-IDF for each term in each document, sums the scores, and returns the documents with the highest totals.
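As a quick check of the arithmetic above (again assuming a base-10 logarithm):

```python
import math

total_docs = 1000
docs_with_term = 100       # documents containing "database"
term_count_in_doc = 5      # occurrences of "database" in one document
total_terms_in_doc = 100   # total terms in that document

tf = term_count_in_doc / total_terms_in_doc    # 5 / 100 = 0.05
idf = math.log10(total_docs / docs_with_term)  # log10(1000 / 100) = 1.0
print(tf * idf)                                # 0.05
```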

Example and Practical Use

Consider a corpus with three documents:

  1. “Database systems use indexing for optimization.”
  2. “Machine learning optimization requires large datasets.”
  3. “Indexing speeds up database queries.”

A search for “database optimization” would tokenize the query into ["database", "optimization"]. In Document 1, “database” appears once (TF = 1/6 ≈ 0.17) and “optimization” once (TF ≈ 0.17). Since “database” appears in 2 of the 3 documents, its IDF is log(3/2) ≈ 0.18; “optimization” also appears in 2 documents, so its IDF is likewise ≈ 0.18. The TF-IDF score for Document 1 is therefore (0.17 * 0.18) + (0.17 * 0.18) ≈ 0.06. Documents where both terms have higher TF-IDF scores (like Document 1) rank higher. This method ensures that terms common within a document but rare across the corpus drive relevance, improving search accuracy.
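The ranking for this toy corpus can be reproduced with a short script. This is only a sketch of the scoring described above, using a naive lowercase/strip-punctuation tokenizer and base-10 logarithms; real systems typically add refinements such as stemming and stop-word handling:

```python
import math
import string

docs = [
    "Database systems use indexing for optimization.",
    "Machine learning optimization requires large datasets.",
    "Indexing speeds up database queries.",
]
query = "database optimization"

def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace (no stemming or stop words)
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

corpus = [tokenize(d) for d in docs]

def idf(term):
    # Log of (total documents / documents containing the term); 0 if the term is absent
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing) if containing else 0.0

def score(query_terms, doc_tokens):
    # Sum the TF-IDF scores of the query terms for one document
    return sum((doc_tokens.count(t) / len(doc_tokens)) * idf(t) for t in query_terms)

query_terms = tokenize(query)
ranked = sorted(range(len(docs)), key=lambda i: score(query_terms, corpus[i]), reverse=True)
for i in ranked:
    print(f"{score(query_terms, corpus[i]):.3f}  {docs[i]}")
```

Document 1 scores highest (about 0.06), matching the hand calculation above.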
