NLP Intro: What is tf-idf and how to use it?
Text analysis often involves comparing multiple texts or even whole collections of documents. This can require both vertical analysis (a deep dive into a single text) and horizontal analysis (identifying common or differing patterns across documents).
A common way to make such comparisons is through a statistic called tf-idf. The name sounds really cryptic, but it is just a handy abbreviation of two commonly analysed metrics in text analysis:
- tf is term frequency: how often a term appears in a text, usually expressed as the term's count divided by the text's total length (a term can be not just an individual word but also a phrase)
- idf is inverse document frequency: a weight that highlights which terms are unique to specific documents. Words that appear across all analysed documents get a lower weight (they are frequent, but also common across the board), while words confined to only a few documents get a higher weight. In practice it is computed as the natural logarithm of the total number of documents divided by the number of documents containing the term.
So, for example, if you take a set of 100 cookbooks, words like “chop”, “boil”, and “roast” will be very frequent across all of them, while “savoiardi”, “cacio e pepe”, and “arancini” are likely to be prevalent only in the subset of books on Italian cuisine, which sets those books apart.
Calculating tf-idf comes down to multiplying the two frequencies together. The product highlights the terms used frequently in a document while correcting for how common they are across all the documents analysed.
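To make the arithmetic concrete, here is a minimal sketch in base R on a tiny invented corpus, using the standard formulation that tidytext's bind_tf_idf() also applies (tf = the term's count divided by the document's length; idf = the natural log of the total number of documents divided by the number of documents containing the term). The documents and words here are made up for illustration:

# A tiny invented corpus: two "documents" as vectors of words
docs <- list(
  general = c("chop", "boil", "roast", "chop"),
  italian = c("chop", "arancini", "savoiardi", "arancini")
)

# Term frequency: the word's count in one document, normalised by document length
tf <- function(word, doc) sum(doc == word) / length(doc)

# Inverse document frequency: ln(total documents / documents containing the word)
idf <- function(word, docs) {
  log(length(docs) / sum(sapply(docs, function(doc) word %in% doc)))
}

# tf-idf is simply the product of the two
tf_idf <- function(word, doc, docs) tf(word, doc) * idf(word, docs)

tf_idf("chop", docs$italian, docs)     # 0: "chop" appears in every document
tf_idf("arancini", docs$italian, docs) # ~0.35: frequent here, absent elsewhere

In the tidytext example further down, bind_tf_idf(word, title, n) does exactly this, just vectorised over a tidy data frame of word counts.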
Depending on your goals and the depth of your analysis, this approach has some drawbacks (explained in detail here), but it is a useful and common technique to master.
A quick example of putting tf-idf into practice
As in previous parts of this series, I am using Project Gutenberg texts to analyse the works of some of the most prominent literary masters of horror, with the help of the tidytext R package and its companion textbook.
For better or worse, H.P. Lovecraft has a very distinctive style as a fiction writer. One can only read the word “Cyclopean” so many times without noticing a pattern. So I decided to use tf-idf to quickly tally which words are typical of specific works.
# Let's download Lovecraft books from Project Gutenberg
library(gutenbergr)
library(tidytext)
library(dplyr)
library(ggplot2)

hpdownload <- gutenberg_download(c(30637, 31469, 50133, 68236, 68283, 68547, 68553),
                                 meta_fields = "title")

# Unnesting tokens and counting words per title
hp_words <- hpdownload %>%
  unnest_tokens(word, text) %>%
  count(title, word, sort = TRUE)
# Computing tf-idf and fixing the order of titles for the plot
plot_hp <- hp_words %>%
  bind_tf_idf(word, title, n) %>%
  mutate(title = factor(title, levels = c("Writings in the United Amateur, 1915-1922",
                                          "The Dunwich Horror",
                                          "The Shunned House",
                                          "The call of Cthulhu",
                                          "The colour out of space",
                                          "The festival")))
# Plotting in ggplot
plot_hp %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word, fill = title)) +
  geom_col(show.legend = FALSE) +
  labs(x = "tf-idf", y = NULL) +
  facet_wrap(~title, ncol = 2, scales = "free")
Mining the texts for terms unique to specific books consistently surfaces character names. If that’s not informative for your analysis, you can add them to your list of stopwords, as shown in the sketch below. Alternatively, you can lean into it and use tf-idf to track recurring characters in fiction or figures of importance in non-fiction work.
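As a minimal sketch of that filtering step, the snippet below drops a hand-made list of character names with dplyr's anti_join() before recomputing tf-idf. The three names in custom_stopwords are just illustrative placeholders; substitute whichever names dominate your own plots:

# Hypothetical list of character names to treat as custom stopwords
# (unnest_tokens() lowercases by default, so the names are lowercase too)
custom_stopwords <- tibble(word = c("armitage", "wilbur", "whateley"))

# Remove them from the word counts, then recompute tf-idf as before
hp_words %>%
  anti_join(custom_stopwords, by = "word") %>%
  bind_tf_idf(word, title, n)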