Weekend Hacks: n-grams

Ethan Willis
Programming Tid-”bits”
5 min read · Nov 10, 2014


A quick intro to N-Grams using Python

What are n-grams?

Simply put, n-grams are the successive sets of n words from a given text. What do I mean by successive? Well, n-grams are “selected” from the text n words at a time, and consecutive n-grams overlap by n-1 words. As with most things, this is more easily explained with an example.

For our purposes we will use the following sentence as our text: “The quick brown fox jumped over the lazy dog”

1-gram list (aka unigrams)
([“The”], [“quick”], [“brown”], [“fox”], [“jumped”], [“over”], [“the”], [“lazy”], [“dog”])

2-gram list (aka bigrams)
([“The”, “quick”], [“quick”, “brown”], [“brown”, “fox”], [“fox”, “jumped”], [“jumped”, “over”], [“over”, “the”], [“the”, “lazy”], [“lazy”, “dog”])

As can be seen from the example above, n-grams are generated in order by constructing lists of n words, starting from the first word in the text (index 0) and continuing until we have an n-gram that contains the last word in the text. It’s best to think of this process as sliding a window “n words wide” along your text, collecting one n-gram at each position.
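The sliding-window idea can be sketched in a couple of lines of Python (a minimal illustration of the concept, not the full program discussed later):

```python
# Slide a window n words wide along the tokens to collect n-grams.
text = "The quick brown fox jumped over the lazy dog"
tokens = text.split()

n = 2
ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
print(ngrams[0])   # ['The', 'quick']
print(ngrams[-1])  # ['lazy', 'dog']
```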

What are n-grams used for?

N-grams are used in the creation of n-gram language models. The basic idea is that in natural language the next word that will be used is dependent upon the previous words. The statistical properties of n-grams can be exploited to generate probabilities for what the “next word” after a given set of words will be. This ability to give a probability for what the next word will be is the n-gram language model.
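To make this concrete, here is a quick sketch that turns bigram counts into rough next-word probabilities. The tiny corpus is made up purely for illustration:

```python
from collections import Counter

# A tiny made-up corpus, just to show the counting idea.
tokens = "the cat sat on the mat the cat ran".split()
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each word follows "the".
following = Counter(b for a, b in bigrams if a == "the")
total = sum(following.values())
for word, count in following.items():
    print(f"P({word} | the) = {count / total:.2f}")
```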

Finding n-grams using Python

There are many existing Python libraries that can be used to generate the list of n-grams for a corpus (text). However, I think it will be much more useful to go through the process of implementing a function that finds n-grams from scratch.

Whenever I begin writing any program I always think back to the IPO model of programming as a familiar tool. IPO stands for “Input, Process, Output.” This is the underlying mechanism for any program: first you must get input, then you must process it, and finally you output the results of your processing.

IPO for finding n-grams in Python

The following are our basic steps for finding n-grams using the IPO model.

  1. Input: Take as input our text/corpora, and a number n.
  2. Process:
    a.) Tokenization: Split our text on spaces so that we have each individual word in the text as a token.
    b.) Windowing: Tokenization leaves us with a list of tokens: token 0, token 1, token 2, and so on. We want to find every “window” of size n on this list. Starting with token 0, we take it along with the next n-1 tokens and store this as the first window, then slide forward one token at a time until we reach the window that contains the last token in the text. We store each “window” in a list.
  3. Output: We print our list of windows.

Python implementation

https://gist.github.com/sn3twork/1d775bc3206df0de76f9

Explanation of Python Code

This code is fairly straightforward and adheres to the IPO model. It defines three functions: ngramsProgram(), findNGrams(), and outputNGrams().

  1. Our Input function is ngramsProgram()
    This function has the following parameters:
    corpora: This is our text that we are getting our n-grams from. (e.g. “The quick brown fox jumps over the lazy dog.”)
    n: This is the “n” we are using. It could be 1 for a unigram, 2 for a bigram, 3 for a trigram, 4 for a 4-gram, etc.
    Description: This function simply takes the corpora and size n as input and passes them to our “processing function”. Then it takes the results from the processing function and passes them to our “output function.”
  2. Our Process function is findNGrams()
    This function has the following parameters:
    corpora: This is our text that we are getting our n-grams from. (e.g. “The quick brown fox jumps over the lazy dog.”)
    n: This is the “n” we are using. It could be 1 for a unigram, 2 for a bigram, 3 for a trigram, 4 for a 4-gram, etc.
    Description: This function tokenizes our corpora, which effectively gives us all of the unigrams. Then it finds all windows of size n and stores each “window” in a list; these windows of size n correspond to our n-grams. Once all of the n-grams (windows) have been found, the list of n-grams is returned.
  3. Our Output function is outputNGrams()
    This function has the following parameters:
    ngrams: This is the list of n-grams of size n that were found in our corpora.
    Description: This function takes a list of ngrams and for each ngram in the list it “prints” the ngram to the screen.
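In case the gist does not render inline, here is a minimal sketch of what the three functions could look like. This is my reconstruction from the descriptions above, not the gist’s exact code:

```python
def findNGrams(corpora, n):
    # Process: tokenize on spaces, then slide a window of size n
    # across the token list, collecting each window.
    tokens = corpora.split()
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def outputNGrams(ngrams):
    # Output: print each n-gram on its own line.
    for ngram in ngrams:
        print(ngram)

def ngramsProgram(corpora, n):
    # Input: hand the corpora and n to the processing function,
    # then pass its result to the output function.
    outputNGrams(findNGrams(corpora, n))

ngramsProgram("The quick brown fox jumps over the lazy dog.", 2)
```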

Further Enhancements

Further enhancements can be made to the input, processing, and output portions of this program. Here are some enhancements that could be made to each:

Input function: Instead of hardcoding our corpora, as is done on line 31 of our Python program, we could open a text file, or a directory of text files, stored on our hard drive and use these as input.
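A sketch of file-based input (a temporary file stands in for a real corpus file here, so the example is self-contained; in practice you would open your own file):

```python
import os
import tempfile

# Write a small sample corpus to a temporary file, then read it back,
# to show file-based input instead of a hardcoded string.
sample = "The quick brown fox jumped over the lazy dog"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

with open(path, encoding="utf-8") as f:
    corpora = f.read()

os.remove(path)
print(corpora.split()[:3])  # ['The', 'quick', 'brown']
```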

Process function: We could enhance our “processing” of n-grams to look for only certain n-grams. For example, say we were interested in finding/generating some rumors: we’d be interested in bigrams such as “they said”, “he said”, or “she said”. We could also be interested in how many such n-grams exist and keep a running tally of how many appear in our corpora.
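A sketch of such a filtering-and-tallying enhancement (the mini-corpus is made up for illustration):

```python
from collections import Counter

corpora = "he said hello and then she said goodbye but they said nothing"
tokens = corpora.split()
bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Keep a running tally of only the bigrams we care about.
targets = {"he said", "she said", "they said"}
tally = Counter(b for b in bigrams if b in targets)
print(tally)
```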

Output function: Printing to a terminal/computer screen is useful when developing software, but not very useful when trying to share our results. One enhancement would be to use Python’s file I/O utilities to save our n-grams. For instance, we could save them as CSV files to be shared with our colleagues, who could then open up our results in Excel or LibreOffice Calc.
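A sketch of CSV output using Python’s built-in csv module. An in-memory buffer stands in for a real file here (a hypothetical “ngrams.csv” opened with open() would work the same way):

```python
import csv
import io

ngrams = [["The", "quick"], ["quick", "brown"], ["brown", "fox"]]

# Write each n-gram as one CSV row; swap the buffer for a real
# file handle to save the results to disk.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(ngrams)
print(buf.getvalue())
```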
