NLP Intro: Exploring word frequency with the tidytext R package

Konstantina Slaveykova · Published in DataDotScience · Nov 28, 2023 · 4 min read
Image: Susan Q Yin, Unsplash

In the previous article in the series, we introduced the term text mining and looked at the benefits of tidy datasets. Now let’s take a quick look at the tidytext R package and what it can offer (see the free online textbook Text Mining with R, written by the package’s creators, Julia Silge and David Robinson).

The text mining process has a set of steps that you need to go through, regardless of the specific research question you are trying to address with your analysis.

1. Turning unstructured data (text) into structured data that can be easily analysed

  • unnest_tokens() is a function in the tidytext package that both tokenizes your text and transforms it into a tidy data structure in one go. It takes two key arguments: the name of the new output column you are creating for the tokens, and the name of the input column with the text that needs to be tokenized. unnest_tokens() converts all words to lowercase, strips punctuation and retains all other columns in your original dataset.
  • tidytext uses tibbles, a type of data frame with some benefits for tidy analysis, like enhanced printing and not converting text strings to factors.
library(tidytext)
library(dplyr)    # for the %>% pipe and filter()
library(stringr)  # for str_detect()

# This is mock code, so the dataset names are chosen to help you get the logic
tidydataset <- originaldataset %>%
  unnest_tokens(word, text)

# You can test the code with your own dataset(s) or download preprocessed book datasets
# from Project Gutenberg
library(gutenbergr)

# To search for a specific author, you can use the following code:
your_author <- gutenberg_works() %>%
  filter(str_detect(author, 'Insert Name'))
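The author comparison later in this article assumes three tokenized datasets (tidy_poe, tidy_hp and tidy_shelley) that are not built on the page. As a minimal sketch of how you could create them (the tidy_author() helper is my own illustrative wrapper, not part of tidytext or gutenbergr):

# Uses the tidytext, dplyr, stringr and gutenbergr libraries loaded above.
# gutenberg_download() accepts the data frame returned by gutenberg_works(),
# using its gutenberg_id column to fetch the full texts.
tidy_author <- function(author_pattern) {
  gutenberg_works() %>%
    filter(str_detect(author, author_pattern)) %>%
    gutenberg_download(meta_fields = "title") %>%
    unnest_tokens(word, text)
}

# Project Gutenberg metadata lists authors as "Last, First"
tidy_poe <- tidy_author("Poe, Edgar Allan")
tidy_hp <- tidy_author("Lovecraft")
tidy_shelley <- tidy_author("Shelley, Mary")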

2. Removing stop words

  • The most commonly occurring words in most texts (like the, of, to, and, etc.) are typically irrelevant to text analysis, so a common first step in data cleaning is to remove them altogether.
  • You can do that easily in tidytext with anti_join() and the stop_words dataset, which ships with the tidytext package and already contains a comprehensive list of common stop words. Depending on your needs, you can also add a dataset with your own preferred stop words.
# You can use the tidytext stop_words dataset
data(stop_words)

# Use anti_join() to drop every token that appears in the stop word list
tidydataset <- tidydataset %>%
  anti_join(stop_words, by = "word")
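If you need to extend the list, a minimal sketch (the extra words and the my_stop_words name are purely illustrative) is to row-bind a custom tibble onto stop_words before the anti_join:

library(tibble)

# Add corpus-specific stop words to the built-in list
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("chapter", "thee", "thou"), lexicon = "custom")
)

tidydataset <- tidydataset %>%
  anti_join(my_stop_words, by = "word")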

3. Calculating & visualising word frequencies

  • In the early steps of text analysis, it is really common (and useful) to look at word frequency: how often each word appears in your dataset.
# Use count() to quickly tally the most frequently used words
tidydataset %>%
  count(word, sort = TRUE)
  • In the example below, I compared the works of early masters of horror and science fiction: Edgar Allan Poe, H.P. Lovecraft and Mary Shelley (or at least those available through Project Gutenberg). The bar plots show the highest-frequency words for each author; a sketch of how to build such a plot follows this list.
  • Word frequencies also let you make quick comparisons and plots that visualise the linguistic similarities and differences between different bodies of text and/or authors.
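A minimal sketch of one such bar plot for a single author (assuming the tidy_poe dataset built earlier, with stop words already removed via anti_join()):

library(ggplot2)

# Plot the 10 most frequent words for one author
tidy_poe %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(word = reorder(word, n)) %>%  # order the bars by count
  ggplot(aes(n, word)) +
  geom_col() +
  labs(x = "Word count", y = NULL, title = "Edgar Allan Poe")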
# Calculating the word frequency for each author
library(tidyr)  # for pivot_wider() / pivot_longer()

frequency <- bind_rows(mutate(tidy_shelley, author = "Mary Shelley"),
                       mutate(tidy_poe, author = "Edgar Allan Poe"),
                       mutate(tidy_hp, author = "H.P.Lovecraft")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(c(`Mary Shelley`, `H.P.Lovecraft`),
               names_to = "author", values_to = "proportion")


# Plotting with ggplot
library(ggplot2)
library(scales)  # for percent_format()

plot <- ggplot(frequency, aes(x = proportion, y = `Edgar Allan Poe`,
                              color = abs(`Edgar Allan Poe` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Edgar Allan Poe", x = NULL)

plot
  • In the plots below, dots close to the diagonal line represent words that appear with similar frequency in the works of both authors (Poe on the y-axis, and Lovecraft or Shelley on the x-axis in each panel). Words towards the upper left are more typical of Poe; words towards the lower right are more typical of Lovecraft/Shelley.
  • A correlation test can be used to further quantify how similar or different the datasets are based on the derived word frequencies.
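A minimal sketch using the frequency data frame built above (subset the rows to compare one author against Poe at a time):

# How strongly do Lovecraft's word proportions correlate with Poe's?
cor.test(data = frequency[frequency$author == "H.P.Lovecraft", ],
  ~ proportion + `Edgar Allan Poe`)

A higher correlation coefficient suggests that the two authors use their shared vocabulary at more similar relative frequencies.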

Konstantina Slaveykova
DataDotScience

Perpetually curious, always learning | Analyst & certified Software Carpentry instructor | Based in Wellington, NZ