TFIDF
TFIDF stands for term frequency-inverse document frequency.
The TFIDF weight is used in text mining and information retrieval (IR). It measures how important a word is to a document within a collection of documents.
With a simple technique such as a frequency table of the terms in a document, we first remove stop words and punctuation and stem each word to its root. The importance of a word is then measured by its frequency: the higher the frequency, the more important the word.
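With TFIDF, the only text pre-processing needed is removing punctuation and lower-casing the words. We do not have to worry about stop words, because the IDF part of the score (defined below) pushes the weight of words that appear in almost every document toward zero.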
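As a rough illustration of that pre-processing step, here is a minimal Python sketch. The function name preprocess is made up for this example, and it assumes plain English text where stripping ASCII punctuation and splitting on whitespace is good enough.

```python
import string

def preprocess(text):
    """Lower-case the text, strip ASCII punctuation, and split on whitespace.

    Hypothetical helper for illustration only; real text may need
    Unicode-aware handling.
    """
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

tokens = preprocess("The cat sat on the mat. The CAT purred!")
print(tokens)
```

Note that stop words such as "the" are deliberately left in the token list.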
TFIDF is the product of the TF and IDF scores of the term.
TF = (number of times the term appears in the document) / (total number of words in the document)
IDF = ln(total number of documents / number of documents the term appears in)
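To make the arithmetic concrete, the two formulas can be worked through on a tiny made-up collection. This is only a sketch: the three sentences and the function names tf, idf and tfidf are assumptions for illustration, and the documents are taken to be already lower-cased and free of punctuation.

```python
import math

# Toy collection (invented for this example), already cleaned.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Share of the document's words that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Number of documents that contain the term at least once.
    n_containing = sum(1 for tokens in corpus if term in tokens)
    # Assumes the term occurs in at least one document;
    # otherwise this would divide by zero.
    return math.log(len(corpus) / n_containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("the", "cat", "mat"):
    print(term, round(tfidf(term, tokenized[0], tokenized), 4))
```

In the first document, "the" gets a score of exactly 0 because it appears in all three documents and ln(3/3) = 0. "mat" appears in only one document, so it scores higher than "cat", which appears in two.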
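The higher the TFIDF score, the more distinctive the term is for that document: it occurs often in the document but in few other documents of the collection. A low score means the term is either rare in the document or common across the collection.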
TFIDF, or a variant of it, has long been used by search engines as one of the factors for ranking content.
The whole idea is to weight down the terms that are frequent across the collection while scaling up the rare ones.
Coming up next is how to implement TFIDF!