NLP Terminology and Pre-Processing

Even though most state-of-the-art NLP systems are built on deep learning, it’s important to understand the basics before introducing machine learning. I’ll be following along with the Speech and Language Processing course by Dan Jurafsky at Stanford and I’ll document what I’ve learned along the way. This will come from a combination of reading textbooks and slides, watching videos and doing hands-on coding projects.

Before statistical analysis came along, linguists devoted their lives to studying the theory of language. Domain knowledge of language theory is therefore a great foundation, so it’s worth familiarizing yourself with areas like pragmatics, semantics, syntax and morphology.

In this blog I’ll cover the different pre-processing techniques for text data, like stemming, lemmatization, stop word removal and tokenization, using NLTK. And finally I’ll wrap it all up with a project!

Pre-Processing

Text data is usually very messy. Rarely will our data come in the form of a corpus that’s ready to be fed into a model. You’ll usually have to clean it up first by doing things like stemming, lemmatization, removing stop words, tokenization, etc.

I’ll be using NLTK to clean up text data.

I watched the five video lectures in the course that cover basic text processing.

Word Tokenization

Every NLP task needs to segment/tokenize words in running text, normalize word formats and segment sentences in running text.

Definitions

  • Lemma: words that share the same stem, part of speech and rough word sense; ‘cat’ and ‘cats’ have the same lemma. ‘They’ and ‘their’ also share a lemma but are different wordforms.
  • Wordform: the full inflected surface form. So ‘cat’ and ‘cats’ are different wordforms.
  • Type: an element of the vocabulary.
  • Token: an instance of that type in running text.

So in the following sentence:

they lay back on the San Francisco grass and looked at the stars and their

There are 15 tokens if you consider ‘San’ and ‘Francisco’ two different tokens, or 14 if you consider them as one token.

There are 13 types because ‘the’ and ‘and’ each occur twice. Or 12 if you consider ‘San Francisco’ as one word. Or 11 if you consider ‘they’ and ‘their’ as the same type; which choice is right will depend on our goal.
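To make the counts concrete, here’s a minimal sketch in Python, using plain whitespace tokenization so that ‘San’ and ‘Francisco’ count as separate tokens:

sentence = "they lay back on the San Francisco grass and looked at the stars and their"
tokens = sentence.split()   # whitespace tokenization: 'San' and 'Francisco' stay separate
print(len(tokens))          # 15 tokens
print(len(set(tokens)))     # 13 types ('the' and 'and' each occur twice)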

Creating a Standard for Tokenization

You might run into some common issues during tokenization, like apostrophes, dashes, acronyms with periods, etc. So you’ll have to create a standard for handling those issues, and how you decide to handle them will depend on your goal. Here are some examples (a short sketch of how NLTK’s default tokenizer treats a few of them follows the list):

  • Finland’s capital → Finland or Finlands or Finland’s ?
  • what’re, I’m, isn’t → what are, I am, is not ?
  • Hewlett-Packard → Hewlett Packard ?
  • state-of-the-art → state of the art ?
  • San Francisco → one token or two?
  • m.p.h., PhD. → ??
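As a point of reference, here’s a small sketch of how NLTK’s word_tokenize (a Treebank-style tokenizer) handles a few of these cases; your own standard may well differ:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)   # tokenizer models (resource name may vary by NLTK version)

print(word_tokenize("Finland's capital"))    # ['Finland', "'s", 'capital']
print(word_tokenize("what're, I'm, isn't"))  # contractions are split: 'what', "'re", ..., 'is', "n't"
print(word_tokenize("Hewlett-Packard"))      # kept as one hyphenated token
print(word_tokenize("San Francisco"))        # two tokens: ['San', 'Francisco']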

Normalizing and Stemming

Once we have tokenized (a.k.a. segmented) our text, we need to normalize and stem the tokens.

Case Folding

For Information Retrieval purposes we would want to, for example, match U.S.A. and USA. We can implicitly define equivalence classes of terms by deleting the periods in a term.

In general, because users tend to type lower case queries, we reduce all letters to lower case for Information Retrieval, except for capital letters that occur mid-sentence, like General Motors. For Sentiment Analysis, Machine Translation and Information Extraction, case is very important because US and us have different meanings.
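Here’s a rough, hypothetical sketch of that kind of normalization policy (the helper name and rules below are my own illustration, not from the lecture):

def normalize_for_ir(token, sentence_initial=False):
    # Delete periods so that U.S.A. and USA fall into the same equivalence class.
    token = token.replace(".", "")
    # Lower-case everything except capitalized words that occur mid-sentence
    # (e.g. General Motors), which we keep as-is.
    if sentence_initial or token.islower():
        return token.lower()
    return token

print(normalize_for_ir("U.S.A."))                      # USA
print(normalize_for_ir("The", sentence_initial=True))  # the
print(normalize_for_ir("Motors"))                      # Motors (mid-sentence capital kept)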

Lemmatization

In lemmatization we reduce variant forms to one base form. In general, the task of lemmatization is to find the correct dictionary headword form. For example, we convert:

  • am, are, is → be
  • car, cars, car’s, cars’ → car

So a phrase like the boy's cars are different colors gets lemmatized to the boy car be different color.
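Here’s a minimal sketch of this with NLTK’s WordNetLemmatizer (note that you have to tell it the part of speech; more on that later):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("am", pos="v"))    # be
print(lemmatizer.lemmatize("are", pos="v"))   # be
print(lemmatizer.lemmatize("is", pos="v"))    # be
print(lemmatizer.lemmatize("cars", pos="n"))  # car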

Morphology

Morphemes are the smallest meaningful units that make up words. There are two types of morphemes:

  1. Stem: the core meaning-bearing unit.
  2. Affix: a piece that adheres to a stem. Affixes usually have grammatical functions. For example, in the word meaningful, meaning is the stem and ful is the affix. And in the word units, unit is the stem and s is the affix.

Stemming

Stemming is the task of reducing terms to their stems by crudely chopping off affixes. Stemming is a simplified version of lemmatization: we just keep the beginning of a word and chop off the suffix. For example, automate(s), automatic and automation are all reduced to automat.

This sentence:

for example compressed and compression are both accepted as equivalent to compress.

After stemming it becomes:

for exampl compress and compress ar both accept as equival to compress

We lost the e in example (it became exampl), compressed and compression both became compress, accepted became accept and equivalent became equival.
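You can reproduce something very close to this with NLTK’s PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in sentence.split()))
# for exampl compress and compress ar both accept as equival to compress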

Porter’s algorithm

Porter’s algorithm is the most common English stemmer for simple stemming. It’s an iterative series of simple “replace” rules. I’ll walk through some parts of the algorithm:

Step 1a

It takes strings ending in sses and replaces the ending with ss (e.g. caresses becomes caress).

It replaces ies with i (e.g. ponies becomes poni).

The rules operate in order so if there is an ss left, it will remain ss (e.g. caress remains caress).

But if there is any other s left, it gets removed (e.g. cats becomes cat). Here’s the order of the logic:

sses → ss    caresses → caress
ies  → i     ponies   → poni
ss   → ss    caress   → caress
s    → ∅     cats     → cat
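As a sketch, Step 1a can be written as a handful of ordered suffix rules where only the first matching rule fires (a simplification of the real algorithm):

def porter_step_1a(word):
    # Rules are ordered; apply only the first one that matches.
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # ies  -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word           # ss   -> ss   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # s    -> null (cats -> cat)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))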

Step 1b

If a word matches the pattern (*v*)ing, i.e. there’s at least one vowel before the ing, then we remove the ing (e.g. walking becomes walk, but note that sing remains sing because there’s no vowel before the ing). The same vowel condition applies for ed as well. The logic is:

(*v*)ing → ∅    walking   → walk
                sing      → sing
(*v*)ed  → ∅    plastered → plaster

I’ll jump straight into summarizing the logic for steps 2 and 3 below.

Step 2 (for long stems)

ational → ate    relational → relate
izer    → ize    digitizer  → digitize
ator    → ate    operator   → operate

The rules get even more complicated as you get to longer stems.

Step 3 (for longer stems)

al   → ∅    revival    → reviv
able → ∅    adjustable → adjust
ate  → ∅    activate   → activ

and the algorithm just keeps going in a similar fashion.

Sentence Segmentation and Decision Trees

How can we tell when a sentence ends and a new one begins? Exclamation marks (!) and question marks (?) are great because they are relatively unambiguous. Periods (.) can mark sentence boundaries, but they are very ambiguous because they can also appear in abbreviations and decimal numbers like Inc. or 5.3. To solve this “period” problem we need to build a binary classifier that looks at a period and outputs a yes/no decision: “end of sentence” or “not end of sentence”.

To make this classifier we could use hand-written rules, regular expressions, or a machine-learning classifier. The simplest kind of classifier for this is a decision tree. A decision tree is a simple if/then procedure that asks a question and branches based on the answer to the question.

Here’s a simple decision tree that decides whether a word is an end-of-sentence or not.

There are more sophisticated decision tree features we could add. For example:

  • We could look at the word with the period and check whether it’s uppercase, lowercase, all caps or a number, because a word that’s all caps is likely to be an abbreviation.
  • We could look at the case of the word that comes right after the period. Because if the next word after the period starts with a capital letter then it’s likely to be a period that ends the sentence.
  • We can look at numeric features such as the length of a word with the period. For example, abbreviations and acronyms tend to be relatively short.
  • For a more sophisticated assessment, we could consider the scenario where we have a corpus and already know where the sentence boundaries are. We could then calculate how often this word occurs at the end of a sentence.
  • We could also look at the probability that a word after a period starts a sentence. For example, the word ‘The’ appearing after a period tends to start a sentence.

In general, building the structure of a decision tree is difficult to do by hand. It can be hand-built for very simple features or simple domains. For numerical features it can be especially hard to pick thresholds. And that’s why we generally use machine learning. Machine learning will learn the structure from a training corpus.
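Here’s a hand-written, rule-based sketch of such a classifier (the abbreviation list and the length threshold are illustrative assumptions, not from the lecture):

COMMON_ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "ph.d.", "m.p.h."}

def ends_sentence(word_with_period, next_word):
    # Known abbreviations rarely end a sentence.
    if word_with_period.lower() in COMMON_ABBREVIATIONS:
        return False
    # Short all-caps words with internal periods look like acronyms (e.g. U.S.A.).
    letters = word_with_period.replace(".", "")
    if letters.isupper() and len(letters) <= 4:
        return False
    # A capitalized next word suggests a new sentence is starting.
    return next_word[:1].isupper()

print(ends_sentence("Inc.", "the"))     # False (abbreviation)
print(ends_sentence("stars.", "They"))  # True  (capitalized next word)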

NLTK Hands-On Tutorial

NLTK is one of the most popular Python packages for Natural Language Processing. I will do several hands-on examples that demonstrate how to use NLTK. Here’s the GitHub repository where my various examples are stored. I cover the following topics:

1. Text Analysis using nltk.text (view code)

I extract interesting data from a given text. I’ll highlight just a few subjects from this lesson, with a short sketch after the list…

  • Concordance: viewing every occurrence of a given word, together with some context.
  • Collocations: A sequence of words that occur together unusually often, like red wine.
  • Dispersion plot: I create a plot that shows the location of a word in the text (how many words from the beginning it appears).
  • Plotting the Frequency Distribution: I plot the 20 most common tokens.
  • Common Contexts: shows when a list of words share the same surrounding words. I use the Reuters Corpus from NLTK’s corpora and can see, for example, that August and June occur in several of the same contexts, such as last August when and last June when.
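A minimal sketch of these nltk.text features on the Reuters corpus (the word choices here are just examples):

import nltk
from nltk.corpus import reuters
from nltk.text import Text

nltk.download("reuters", quiet=True)

text = Text(reuters.words())
text.concordance("grain", lines=5)        # every occurrence of "grain" with surrounding context
text.collocations()                       # word pairs that occur together unusually often
text.common_contexts(["August", "June"])  # shared surrounding words, e.g. last _ when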

2. Creating N-Grams for Language Classification (view code)

In this lesson I derive n-grams from text and categorize which language it belongs to. I’ll highlight some of the subjects that I cover:

  • To demonstrate an understanding of the process, I dive deep into how to create a quad-gram and flatten it into its proper format. I keep the top 300 most frequent quad-grams because, starting at a rank of around 300, the n-gram frequency profile starts becoming specific to the topic. By the way, if you were wondering how to choose a value for “n”, I touch on that in this Jupyter Notebook as well!
  • I derive bi-grams, tri-grams and quad-grams (for multiple values of n: n = 1, 2, 3, 4) using NLTK’s everygrams utility (see the sketch after this list).
  • I categorize French and English text using NLTK’s guess_language module.
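A small sketch of deriving the n-grams with nltk.util.everygrams (the sample sentence is arbitrary):

from nltk.util import everygrams

tokens = "the cow jumps over the moon".split()

# Uni-grams through quad-grams in one call (n = 1, 2, 3, 4).
grams = list(everygrams(tokens, min_len=1, max_len=4))
quad_grams = [g for g in grams if len(g) == 4]
print(quad_grams)
# [('the', 'cow', 'jumps', 'over'), ('cow', 'jumps', 'over', 'the'), ('jumps', 'over', 'the', 'moon')]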

3. Using Stop Words to Detect Language (view code)

Stop words are common words that add little additional meaning to text, such as ‘a’, ‘the’, etc. We can use stop words as a simple way to detect what language a text is written in. In this segment I cover:

  • Exploring corpora: I explore NLTK’s stopwords corpus, which contains stop word lists for various languages. Corpus readers have a variety of ways to read data from a corpus, like .words(), .raw() and .sents().
  • Classification: I compute a language score based on which stop words are used. I loop through the stop word lists of all the languages and check how many stop words from each language our tokenized text contains. The text is then classified as the language for which it contains the most stop words (see the sketch after this list).
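Here’s a condensed sketch of that idea using NLTK’s stopwords corpus (not the exact code from the notebook):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

nltk.download("stopwords", quiet=True)

def detect_language(text):
    # Score each language by how many of its stop words appear in the text.
    tokens = {t.lower() for t in wordpunct_tokenize(text)}
    scores = {lang: len(tokens & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(scores, key=scores.get)

print(detect_language("Je ne sais pas ce que je veux dire"))  # french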

4. Identifying Language Using Word Bigrams (view code)

In this section I use bigrams, at the sentence level, to identify language. Here bigrams are simply sets of 2 co-occurring words within a given sentence. For example, for the sentence “The cow jumps over the moon” the bigrams would be:

the cow
cow jumps
jumps over
over the
the moon
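With NLTK this is a one-liner:

from nltk import bigrams

tokens = "The cow jumps over the moon".lower().split()
print(list(bigrams(tokens)))
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]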

In this section I demonstrate a few key things:

  • Step-by-step code demos: I walk through the process step by step before putting it all together and performing language classification. Along the way I create a tokenizer, build unigrams and bigrams, and plot the frequency distribution.
  • Probability: I create a probability equation and give an in-depth breakdown explaining the math and the rationale.
  • Classify: There are 3 training files. Each file contains meeting minutes from the same meeting but transcribed in different languages: English, French and Italian. Given a bigram and a language I calculate the probability that the bigram is of that language. Using the test data my classifier is able to predict the language with 98% accuracy.
  • Confusion Matrix: I create a confusion matrix to visually evaluate the classifier’s guessed languages versus the actual languages.

5. Stemming, Lemmatization and Bigrams (view code)

I work through examples on stemming, lemmatization and bigrams using NLTK. In this section I cover:

  • Explore the Reuters corpus: The Reuters Corpus contains 10,788 news documents from the Reuters financial newswire service, totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”. This split is for training and testing algorithms that automatically detect the topic of a document.
  • Word bigrams from the Reuters corpus: Remove stopwords and lowercase before creating bigrams. Create a frequency distribution and plot the 25 most common bigrams.
  • Stemming: Stemming is the process of reducing inflected (or derived) words to their word stem. It usually refers to a crude heuristic process that chops off the ends of words. It reduces the size of the corpus that our model will work with (especially helpful when working with large datasets). Additionally, we are helping our model by explicitly correlating words that have similar meanings; this is beneficial in case our model is not able to figure out which words are related by using context alone. I code examples of a few stemming algorithms: Porter, Snowball and Lancaster; and I discuss the differences between them.
  • Lemmatization: Lemmatization aims to achieve a similar base “stem” for a word, but it derives the proper dictionary root word, not just a truncated version of the word. I note the key differences and touch on how to determine whether you should perform stemming or lemmatization. Lemmatizing is slower than stemming because it uses WordNet (a large English dictionary) whereas stemming just follows predefined steps. And when lemmatizing you have to define the part-of-speech (POS) tag: noun, verb, adjective or adverb. My code demonstrates lemmatization based on each word’s part of speech using the nltk.stem WordNetLemmatizer() and the WordNet NLTK corpus reader (see the sketch after this list).
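A quick sketch contrasting the three stemmers and the WordNet lemmatizer (the example words are mine, and the exact outputs can differ slightly between NLTK versions):

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

word = "generously"
print(PorterStemmer().stem(word))             # gener
print(SnowballStemmer("english").stem(word))  # generous
print(LancasterStemmer().stem(word))          # gen (Lancaster is the most aggressive)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # good -- the POS tag matters
print(lemmatizer.lemmatize("cars", pos="n"))    # car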

6. Finding Unusual Words in a Given Language (view code)

This section demos how to find “unusual” words within a text by using the NLTK Wordlist corpus and simply comparing the difference between the words in my text and the words in the corpus. The NLTK Wordlist corpus is just a newline-delimited list of dictionary words; it’s a standard file found on Unix operating systems and used by some spell checkers. We can use the Words Corpus to find unusual or misspelled words in our text.
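A condensed sketch of the idea (not the exact notebook code):

import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)

def unusual_words(tokens):
    # Anything in the text that isn't in the Words Corpus counts as "unusual".
    text_vocab = {w.lower() for w in tokens if w.isalpha()}
    english_vocab = {w.lower() for w in words.words()}
    return sorted(text_vocab - english_vocab)

print(unusual_words("the quick brown fox jumpd ovr the lazy dog".split()))
# ['jumpd', 'ovr']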

7. Creating a Part-of-Speech Tagger (view code)

Part-of-Speech tagging (or POS tagging for short) is labelling each word with its appropriate part of speech, such as Noun, Verb, Adjective, Adverb, Pronoun, etc. These word classes (also known as lexical categories) are useful categories for many language processing tasks. In this notebook I train a classifier to determine which suffixes are most informative for POS tagging. Here are highlights from this section:

  • The Brown corpus: I train the POS tagger using the Brown corpus because it consists of about 1 million words of American English that were painstakingly tagged with part-of-speech markers.
  • Feature Extraction: Before training a classifier, we must first agree on what features to use. I use the 2-letter suffix and the 3-letter suffix. The 2-letter suffix is a great indicator of past-tense verbs ending in “-ed”, and the 3-letter suffix helps recognize the present participle ending in “-ing”. I’d like to note that we could do better by also looking at the word itself, the word before and the word after. However, for the scope of this project we’ll move forward with just the suffixes.
  • Decision Tree Classifier: I do an 80/20 split and train a Decision Tree Classifier from NLTK, resulting in 62% accuracy, which is not great. To improve the classifier, we could work with tagged sentences instead of tagged words and add more contextual features, as mentioned before, like the word itself, the word before and the word after, as well as the previous tag!
  • Pseudocode: Decision trees are fairly easy to interpret, and NLTK makes it easy to print out the decision tree’s steps as pseudocode. I explain the tagger’s pseudocode step-by-step, down to 4 levels (see the sketch after this list).
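A scaled-down sketch of the approach (training on a small slice of the news category so it runs quickly; the full notebook uses more data):

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

def pos_features(word):
    # Only the 2- and 3-letter suffixes, e.g. "-ed" for past tense, "-ing" for the present participle.
    return {"suffix(2)": word[-2:].lower(), "suffix(3)": word[-3:].lower()}

tagged = brown.tagged_words(categories="news")[:10000]   # small slice to keep training fast
featuresets = [(pos_features(w), tag) for (w, tag) in tagged]
cutoff = int(len(featuresets) * 0.8)                      # 80/20 split
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.pseudocode(depth=4))                     # the learned tree as if/then pseudocode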

8. Part-of-Speech and Meaning (view code)

In this section I explore and code for the following topics:

  • WordNet: WordNet is a large English dictionary in the form of a semantic graph, a.k.a. a semantic network. In NLTK, WordNet is just another corpus reader. It groups English words into sets of synonyms called synsets. A synset is a set of synonyms that share a common meaning. I also explore hypernyms, hyponyms and antonyms. A hypernym is a word with a broad meaning constituting a category into which words with more specific meanings fall (e.g. color is a hypernym of red). A hyponym is a word of more specific meaning than a general or superordinate term applicable to it (e.g. spoon is a hyponym of cutlery).
  • Similarity: I use the Wu-Palmer score to determine how similar two word senses are. This score is based on the distance between the words in the semantic graph.
  • Chunking: I create a Noun Phrase chunker using NLTK’s RegexpParser. I define the regex “rules” that tell the parser where chunking should happen.
  • Named Entity Recognition: I perform named entity recognition on a sentence using NLTK’s ne_chunk. The goal of entity recognition is to detect entities such as Person, Location, Time, etc. The most commonly used types of named entities are: Organization, Person, Location, Date, Time, Money, Percent, Facility and Geo-political entities such as city, state and country (see the sketch after this list).
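A short sketch touching on synsets, Wu-Palmer similarity and ne_chunk (the resource names passed to nltk.download may vary slightly between NLTK versions):

import nltk
from nltk.corpus import wordnet as wn

for resource in ["wordnet", "punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words"]:
    nltk.download(resource, quiet=True)

# Synsets, hypernyms and Wu-Palmer similarity
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.hypernyms())            # broader categories such as canine.n.02
print(dog.wup_similarity(cat))    # ~0.86: close together in the semantic graph

# Named entity recognition on a POS-tagged sentence
sentence = "Mark works at Google in San Francisco"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)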

9. Eminem, Akon & NLP — PROJECT (view code)

In this last (but not least) part of the NLTK Hands-On Tutorial I use NLTK to perform stemming, lemmatization, tokenization and stop word removal on a dataset where “Akon explains how Eminem treats recording music like a nine-to-five job”. I gather the data by scraping it from Genius.com. I pre-process the text, stem it using the Porter stemmer, and write a mapping function to convert NLTK’s Treebank tags to WordNet’s part-of-speech constants so that I can use the WordNet lemmatizer.
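A common way to write such a mapping looks roughly like this (a sketch, not necessarily identical to the notebook’s version):

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def treebank_to_wordnet(tag):
    # Map a Penn Treebank tag prefix to a WordNet POS constant; default to noun.
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("treats", treebank_to_wordnet("VBZ")))  # treat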
