NLP Intro: What is the tidy approach to text mining?

Konstantina Slaveykova
Published in DataDotScience
Nov 28, 2023
Cover detail: Text Mining with R by Julia Silge and David Robinson

Data scientists Julia Silge and David Robinson created the tidytext R package to make life easier for everyone interested in analysing and visualising unstructured text data.

Their book Text Mining with R is highly recommended, and if a printed book feels redundant as a programming manual, the authors have also made the content freely available online.

You can go to their website for a deep dive into the topic. This article is an introduction that explains key terms and concepts to beginners who are interested in text analysis.

What is tidy data?

Popular R packages like tidyr and others in the tidyverse can make the expression “tidy data” sound like complicated technical jargon. However, its core principles are very straightforward: to work efficiently with data, you need it in a tidy, properly organised format.

New Zealand-born statistician Hadley Wickham highlighted the importance of best practices in data tidying in a 2014 paper for the Journal of Statistical Software. He defined tidy datasets as “easy to manipulate, model and visualise” but also (crucially for data wranglers) having “a specific structure”, i.e.:

  1. each variable is a column
  2. each observation is a row
  3. each type of observational unit is a table

This encourages consistency in data structure and fosters efficiency in data cleaning and the use of data analysis tools. As Wickham points out, “tidy datasets are all alike but every messy dataset is messy in its own way.”
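
To make the three rules concrete, here is a minimal sketch in plain Python (not the tidyr package itself). The table, its column names and the helper tidy_rows are all invented for illustration. In the messy version the variable "year" is hidden in the column names; in the tidy version each row is a single observation.

```python
# Hypothetical "messy" table: the year variable is hidden in the column names.
messy = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def tidy_rows(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row holds exactly one observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_column:
                continue  # the identifier is repeated on every output row instead
            tidy.append(
                {id_column: row[id_column], variable_name: key, value_name: value}
            )
    return tidy


# Each variable (country, year, cases) is now a column; each observation is a row.
tidy = tidy_rows(messy, "country", "year", "cases")
for row in tidy:
    print(row)
```

Two wide rows become four tidy rows, one per country and year. tidyr provides a ready-made function for this kind of reshaping in R.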

What is the tidy text format?

The three rules above were written for structured data, so how do they apply to unstructured text? As was pointed out in this article on tokenization, tokenizing your text (splitting it into meaningful units such as words) is the step that allows you to analyse and visualise it.

In the tidy text format, your tokenized text is represented as a table with one token (text unit) per row. This sounds self-evident until you realise that in many analyses text is stored as a string (a sequence of characters), a corpus (raw strings annotated with metadata and other details) or a document-term matrix (where each row corresponds to a document, each column to a term, and each cell holds the frequency of that term in that document).
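
The difference between these formats is easier to see side by side. The sketch below uses plain Python rather than tidytext. The two documents and the deliberately simple tokenize rule are assumptions made for illustration. In R, tidytext's unnest_tokens() function is the usual way to get a one-token-per-row table.

```python
import re
from collections import Counter

# Hypothetical two-document collection, invented for illustration.
documents = {
    "doc1": "Tidy data is easy to model.",
    "doc2": "Messy data is messy in its own way.",
}


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


# Tidy text: one row per token, each row remembering its source document.
tidy_tokens = [
    {"document": doc_id, "word": word}
    for doc_id, text in documents.items()
    for word in tokenize(text)
]

# Document-term matrix: one row per document, one column per term,
# and the term's frequency in that document in each cell.
vocabulary = sorted({row["word"] for row in tidy_tokens})
counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
dtm = {doc_id: [counts[doc_id][term] for term in vocabulary] for doc_id in documents}

for row in tidy_tokens[:3]:
    print(row)
print(vocabulary)
print(dtm)
```

The tidy table is long and narrow (one row per token), while the document-term matrix is short and wide (one row per document, mostly zeros). Both hold the same counts, but the tidy version can be filtered, grouped and joined like any other table.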

How the tidy approach and tidyverse tools apply to the different stages of text analysis: flowchart from Text Mining with R by Julia Silge and David Robinson

Text is usually first put in a data frame, from which it can easily become a tidy dataset. Once this happens, we can wrangle, explore and analyse it with one or multiple popular tools from the tidyverse (the umbrella term for popular packages like dplyr, tidyr, ggplot2, etc.).
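
As a rough illustration of what "wrangle and explore" can mean once text is tidy, the sketch below filters out stop words and counts the remaining tokens. It is written in plain Python with invented tokens and a tiny, assumed stop-word list. In R, the same steps are typically done with dplyr verbs such as anti_join() and count().

```python
from collections import Counter

# Hypothetical tidy tokens (one token per row), invented for illustration.
tidy_tokens = [
    {"document": "doc1", "word": w} for w in ["the", "cat", "sat", "on", "the", "mat"]
] + [
    {"document": "doc2", "word": w} for w in ["the", "cat", "chased", "the", "dog"]
]

# Assumed, tiny stop-word list; real lists are much longer.
stop_words = {"the", "on"}

# Filter: drop every row whose word is a stop word.
content_rows = [row for row in tidy_tokens if row["word"] not in stop_words]

# Count: how often each remaining word occurs across all documents.
word_counts = Counter(row["word"] for row in content_rows)

for word, n in word_counts.most_common():
    print(word, n)
```

Because every row is a single token, filtering and counting are ordinary table operations and need no text-specific machinery.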

Text mining & the benefits of applying the tidy approach

Turning unstructured data into structured data that can be analysed is the first stage of text mining. The next stage is using this structured data to identify meaningful patterns.

  • text mining is a type of data mining: making sense of vast amounts of data by applying different techniques to detect patterns that help you gain insight. It is also called knowledge discovery in data (KDD).
  • it might take a bit of time upfront, but once you get your text data into a tidy format, subsequent work and text mining become easier.
  • the tidy one-token-per-row format is useful for all kinds of analysis, including one of the most common needs in text analysis: extracting sentiment, also known as opinion mining in NLP (natural language processing). Sentiment analysis is widely used in both academic research and applied commercial contexts (e.g. in media analytics, consumer intelligence, etc.).
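
One simple way to extract sentiment from tidy text is to look up each token in a sentiment lexicon and sum the scores per document. The sketch below is plain Python. The reviews and the four-word lexicon are invented for illustration and are not a real sentiment lexicon. It shows only the general idea, not a production method.

```python
from collections import defaultdict

# Hypothetical tidy tokens from two short product reviews.
tidy_tokens = [
    {"document": "review1", "word": w} for w in ["great", "phone", "love", "it"]
] + [
    {"document": "review2", "word": w}
    for w in ["terrible", "battery", "great", "screen", "awful", "support"]
]

# Assumed toy lexicon: +1 for positive words, -1 for negative words.
lexicon = {"great": 1, "love": 1, "terrible": -1, "awful": -1}

# Look up each token and add its score to the total for its document.
# Words missing from the lexicon count as neutral (score 0).
scores = defaultdict(int)
for row in tidy_tokens:
    scores[row["document"]] += lexicon.get(row["word"], 0)

for document, score in scores.items():
    print(document, score)
```

A positive total suggests a mostly positive document and a negative total a mostly negative one. This word-by-word approach ignores context such as negation ("not great"), which is one of its known limitations.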

Konstantina Slaveykova
DataDotScience

Perpetually curious, always learning | Analyst & certified Software Carpentry instructor | Based in Wellington, NZ