NLP Intro: what are N-grams and how to use them?

Konstantina Slaveykova
Published in DataDotScience · 5 min read · Oct 25, 2023

NLP stands for Natural Language Processing*. This is a branch of computer science focusing on the analysis of written or spoken language.

In computational linguistics, NLP can serve as a research tool and/or as a driving force behind machine learning/AI efforts to build programs that translate, edit or interact through human language. It is also central to technologies like OCR (Optical Character Recognition): the handy thing that can turn scans of printed pages into a workable (hopefully!) Word document or PDF file.

*The term shares the acronym NLP with neuro-linguistic programming: a dubious therapeutic approach with no support from rigorous science.

What are N-grams?

In order to analyse large (or small: you need to start somewhere!) bodies of text, you first need to break them down into smaller chunks. This process is called tokenization.

In NLP, the term token is used to mean the smallest unit of text you are trying to analyse. Usually, that would be a single word, but it could be a phrase or a pair of words. Depending on the needs of your analysis, you can specify how many words make up a single token.

An n-gram (or ngram) is a contiguous sequence of n words by which you tokenise the text. N is a stand-in for a number, so you can replace it with the value you need. If you want to use single words, you will be working with unigrams (1-grams); two words give you bigrams (2-grams), three give you trigrams (3-grams), and so on. You get the drift.
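Before reaching for a library, the core idea is easy to sketch in plain Python (a made-up helper for illustration; the article later compares R and Python as NLP options):

```python
def ngrams(words, n):
    """Return all contiguous n-grams (as tuples) from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the quick brown fox jumps over the lazy dog".split()
print(ngrams(words, 2)[:3])
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Setting n = 1 gives unigrams, n = 3 trigrams, and so on; real tokenisers also handle punctuation, casing and whitespace, which this sketch ignores.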

For example, in R & tidyverse this would look like:

# adding the relevant libraries
library(tidyverse)
library(tidytext)

# example data: a data frame with a character column called 'text'
text <- tibble(text = c("the quick brown fox", "jumps over the lazy dog"))

text_bigrams <- text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# in n = 2, replace 2 with the number of words you are after
text_bigrams

Why N-grams?

Analysing single words within a body of text has many limitations. Think about the way a word’s meaning (e.g. colossal) can be dramatically changed by the context added by the word(s) before or after it (colossal failure).

Bigrams are used quite frequently in NLP, but as mentioned above you can specify any value of n. N-grams are also very useful for examining correlations between words in a text (and for visualising them).
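As a rough illustration of why word pairs carry more signal than single words, here is a plain-Python sketch (the sample sentence is invented for the example) that counts bigram frequencies with the standard library:

```python
from collections import Counter

sample = ("a colossal failure and a colossal success are not the same "
          "and a colossal failure can teach more than a small success")
words = sample.split()

# pair each word with its right-hand neighbour, then count each pair
bigram_counts = Counter(zip(words, words[1:]))
print(bigram_counts[("colossal", "failure")])  # the pair occurs twice
```

Counts like these are the starting point for the word-correlation and network visualisations mentioned above.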

Want to explore n-grams? Try the Google Books Ngram Viewer

Google is everywhere, but apart from its main search engine it offers a range of cool research resources. Google Books, for example, is a comprehensive virtual book catalogue, and one of its two main arms, the Library Project, aims to make it “possible for users to search on Google through millions of books written in many different languages, including books that are rare, out of print, or generally unavailable outside of the library system”. It currently covers over 40 million books (!) in more than 500 languages.

Such an impressive database is a treasure trove for analysing text content. One tool on offer is the Google Books Ngram Viewer. It lets you search for phrases of varying length and see how often they occur in the covered corpus of books, across the available languages, for the selected time period (currently 1800 to 2019). You can then use the Search in Google Books option below the chart to explore specific books that contain the phrase in a given period.

Advanced search

  • You can use a wildcard (*) to search for any phrase containing a specific word. For example, * linguistics will show you the most commonly occurring phrases and, by proxy, the change in popularity of different fields of linguistics.
  • You can make your searches case-sensitive or check for different grammatical inflections by adding _INF (for example, book_INF will show you variations like book, booked, booking, etc.).
  • To specify which part of speech you mean, or to let Ngram Viewer show you the most commonly used ones, enter the keyword with _* at the end, e.g. meme_*.
  • Have a look at more Advanced search operators here.

Natural Language Processing: R or Python?

If you are done exploring (the Ngram Viewer can be a bit of a rabbit hole, be warned!) you might want to take things a step further and perform an analysis of an entire book or multiple books from the corpus. Have a look at this tutorial on how to export text data from Ngram Viewer.

Depending on why you would like to use NLP, there are many options available. If you already have a background in either R or Python, the choice is probably obvious, but it is different if you are starting with a research question and looking for the right tool/programming language you need to learn to address it.

Both R & Python are open-source options supported by active and friendly online communities of researchers. I strongly recommend you look for local Software Carpentry workshops to start from scratch and kickstart your skills in either language.

As a rule of thumb, both are great tools for research and data analysis, but Python is the more general-purpose option: it can be used both for research/data analysis and for building production applications.

There are a number of NLP libraries and packages in both R and Python that can be used for cleaning, analysing and visualising language data. Researching the available options and answering some key questions about your goals could help you decide which one would work best for your use case.

Have fun researching!
