Building a Stemmer

An Intro to Text Normalization: make it all look alike!

Tiago Duque
Analytics Vidhya
11 min read · Feb 10, 2020


It is time to talk about stems.

Stems are the main body or stalk of a plant or shrub, typically rising above ground but occasionally subterranean. Well, that's what Google says, and it is right!

But here we’re going to talk about Word Stems. If you’re coming from my previous articles, you know that this is an optional step in the NLP Preprocessing Pipeline.

In this article we'll implement the Porter Stemmer, probably the most famous stemming algorithm out there, created by Martin Porter in 1979 (yes, it is old!). Even though this algorithm is "old", it handles the task well in most cases and will help us understand the problems involved in Text Normalization.

As usual, I'll provide an overall explanation and motivation section and then we'll get our hands into coding. But before that, I have to point out where you can find more information about the Porter Stemmer: the best site out there is Martin's own site at tartarus.org (https://tartarus.org/martin/PorterStemmer/). Even if you feel it is too 'oldschool' to be good (it doesn't even have CSS or JS!!!), you can find all the important information there (even for other stemmers).

Why Stemming?

To understand why, let us think back to 1979, the year when Martin Porter wrote his algorithm: the world is adapting to the first personal computers, and most researchers are going back and forth with huge stacks of punched cards or, if they are at a very wealthy university/lab, a 5.25-inch, 160 KB floppy disk to plug into a mainframe unit.

This would be the place where you would be researching computers if you were in the 70's.

So that you know, a raw .txt file with the Universal Declaration of Human Rights occupies about 10 KB, and a simple Python file with instructions to do tokenization takes another 6 KB or so (Python did not exist back then, and most languages, like FORTRAN, were far more verbose, in exchange for lower compiler complexity). By applying a simple tokenizer, you would have about 10% of your non-volatile memory occupied. Now, imagine the volatile (RAM) memory? I'll give you a hand: the TI-99/4, one of the fastest PCs at the time, had exactly 16 KB of video RAM (and just 256 bytes of CPU RAM).

The TI-99/4: this was the best PC you could have in 1979. It had 16 KB of video RAM.

The facts above show that in the 1970's/early 80's you couldn't simply store a huge list of words with all their variations to do some NLP (mostly Information Extraction). Also, your computing power did not allow you to perform heavy matrix operations to work with vectors and then extract information from text data (no passing huge sets of sequence pairs, mr. seq2seq). This means that if you wanted to research or do NLP, you had to do some Text Normalization. That's a term we haven't seen yet in this series, but it will become very common from now on.

Text Normalization

Yes, Normalization is related to the idea of Normal distribution.

To Normalize is to make something "Normal", or closer to the common distribution. Therefore, normalizing a text means reducing variations (especially style variations) as much as possible. One way to achieve this is by removing word suffixes, which is the objective of Stemming. However, this is the point where we have to add a word of caution.

Word of Caution: Stemming is COMPLETELY language specific. The tasks we've done so far (tokenizing and sentencizing) are more general, and text/sentence structure is similar in most western languages. However, word formation is much more tied to cultural formation, and hence to each language. For example, English words are usually composed of a root and some affixes (suffixes or prefixes), but German words can be composed of more than one totally distinct term (and hence stemming one of them would kill the phrase), das ist wunderbar, nicht wahr!? (that is wonderful, isn't it!?)

A tool provided for English language teachers to help explain how a word is formed from morphemes in English.

So, to conclude, we can use stemming for Text Normalization by removing morpheme affixes (fancy words, right?) and attempting to reach the root word (like reducing criminology, crimes, and criminal down to crime). This results in a direct reduction (around 30%) of our vocabulary.

However, keep in mind that this technique kills any future attempt at further analysis regarding syntax or semantics. For example, applying the rules may erase any differentiation between "relate" and "relativity", which can carry completely distinct meanings depending on the context.

Implementing the Porter Stemmer

To prepare the ground for our Stemmer, let us define the folder structure and some OOP classes.

This is a follow up from a previous tutorial, so if you don’t want to code anything and want to be able to follow the same structure, you can get the code so far by accessing this commit in Github.

As for the directory structure, we just create the stemming.py file inside the preprocessing folder. All the coding for this tutorial goes there. A test_stemming.py file is there for testing purposes and will be accessible in the commit at the end of this article.

In our stemming.py, we first create an AbstractStemmer class with a simple stem method. The idea is to have a common interface for other stemmer implementations. And then, we create the PorterStemmer class inheriting from this AbstractStemmer.
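To make that concrete, here is a minimal sketch of what stemming.py could look like at this point (the class names follow the article; the exact method signature is my assumption):

from abc import ABC, abstractmethod


class AbstractStemmer(ABC):
    """Common interface for all stemmer implementations."""

    @abstractmethod
    def stem(self, word):
        """Reduce a single word to its stem."""
        raise NotImplementedError


class PorterStemmer(AbstractStemmer):
    """Porter Stemmer; the steps are filled in throughout this article."""

    def stem(self, word):
        raise NotImplementedError  # implemented at the end of the article

The snippets in the following sections are meant to be read as methods added to this PorterStemmer class.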

The Logic behind Porter Stemmer

To achieve the purpose of reducing a word to its stem, Porter takes note of how words are formed in general: words are made of sequences of letters. These letters can be either consonants or vowels (and, in English, there's the weird special case of 'y', which can be a vowel or a consonant depending on the preceding letter). Then, there's a special notation:

  • c → letter is a consonant (bcdfghjklmnpqrstvwxz)
  • v → letter is a vowel (aeiou)

But there are diphthongs (two vowels in one syllable), triphthongs (three vowels), hiatuses (two adjacent vowels in separate syllables) and digraphs (two consonants), which are orthographic constructs used to render with Latin letters the sounds we use to speak. To handle these "units" of vowels/consonants, Porter establishes C (any sequence of one or more consonants) and V (any sequence of one or more vowels).

  • C → ccc…
  • V → vvv…

With this, Porter summarizes the construction of ANY word in english:

  1. CVCV…C
  2. CVCV…V
  3. VCVC…C
  4. VCVC…V

Which can be summarized to:

  • [C]VC…[V]

Where square brackets denote that the leading C and the trailing V are optional. Yeah, you can test it: this describes any and all words in English.

  • Monkey → C[M]V[o]C[nk]V[e]C[y]
  • Apparatus → V[A]C[pp]V[a]C[r]V[a]C[t]V[u]C[s]
  • eye → V[e]C[y]V[e]

Let us put that in coding terms. First, we divide an input word into its groups. For that, we need to differentiate between consonants and vowels, which we do by making an 'in' search in a string of vowels for each letter (and appending to a list whenever we complete a group, that is, find a different type of letter). We have two methods: one to divide and one to compare letters:
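A possible implementation of those two helpers, written as methods of PorterStemmer (the names _is_consonant and _divide_into_groups are my own):

    # Class attribute used by the letter check below.
    VOWELS = 'aeiou'

    def _is_consonant(self, word, i):
        """Return True if the letter at position i behaves as a consonant."""
        letter = word[i].lower()
        if letter in self.VOWELS:
            return False
        if letter == 'y':
            # 'y' is a vowel when it follows a consonant, a consonant otherwise.
            return i == 0 or not self._is_consonant(word, i - 1)
        return True

    def _divide_into_groups(self, word):
        """Split a word into maximal runs of consonants or vowels."""
        if not word:
            return []
        groups = []
        current = word[0]
        current_is_consonant = self._is_consonant(word, 0)
        for i in range(1, len(word)):
            if self._is_consonant(word, i) == current_is_consonant:
                current += word[i]
            else:
                groups.append(current)
                current = word[i]
                current_is_consonant = not current_is_consonant
        groups.append(current)
        return groups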

Let us check it:

Word monkey is divided in the following groups: ['m', 'o', 'nk', 'e', 'y'] 
Word apparatus is divided in the following groups: ['a', 'pp', 'a', 'r', 'a', 't', 'u', 's']
Word eye is divided in the following groups: ['e', 'y', 'e']

Okay. Now, a method to classify groups into “C” or “V” and a method to encode a word into groups:
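One way to write those two methods, continuing inside PorterStemmer (again, the names are illustrative):

    def _classify_group(self, word, start_index):
        """Label the group starting at start_index as 'C' or 'V'."""
        return 'C' if self._is_consonant(word, start_index) else 'V'

    def _encode_word(self, word):
        """Encode a word as its sequence of group labels, e.g. ['C', 'V', 'C']."""
        labels = []
        index = 0
        for group in self._divide_into_groups(word):
            labels.append(self._classify_group(word, index))
            index += len(group)
        return labels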

Checking:

Word monkey has the following groups: ['C', 'V', 'C', 'V', 'C'] 
Word apparatus has the following groups: ['V', 'C', 'V', 'C', 'V', 'C', 'V', 'C']
Word eye has the following groups: ['V', 'C', 'V']

Now, there's a measure called "m" that denotes the number of VC pairs between the beginning of the word (after the optional consonant group) and its end (before the optional vowel group). This is how it looks:

[C](VC){m}[V]

Examples:

  • Tree, by: m = 0
  • Trouble, oats, trees, ivy: m = 1
  • Troubles, private, oaten: m = 2

And so on…

Let us implement a method to determine the value of m:
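A sketch of that method, counting the VC pairs in the encoded word:

    def _measure(self, word):
        """Return m, the number of VC pairs in [C](VC){m}[V]."""
        labels = self._encode_word(word)
        m = 0
        for i in range(len(labels) - 1):
            if labels[i] == 'V' and labels[i + 1] == 'C':
                m += 1
        return m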

Testing:

Word Tree is of size m: 0 
Word by is of size m: 0
Word Trouble is of size m: 1
Word oats is of size m: 1
Word trees is of size m: 1
Word ivy is of size m: 1
Word Troubles is of size m: 2
Word private is of size m: 2
Word oaten is of size m: 2

Great!

Understanding the division of the words is important because Porter uses the value of m as a condition to apply the rules. In fact, the rules are defined as such:

  • (condition) word_termination (or S1) → new_termination (or S2)

Ex.: replacement, under the rule [(m > 1) EMENT → (nothing)], becomes replac.

Other important conditions are defined as such:

  • *S — stem ends with S (or other letters, such as L, T…)
  • *v* — stem contains a vowel.
  • *d — stem ends with a double consonant of any type.
  • *o — stem ends with cvc (consonant followed by vowel followed by consonant) where second consonant is not W, X or Y (see, weird y again!).

Let us implement each of them:
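Here is how the four conditions could look (the chk_* names follow the article's description; the exact signatures are my assumption):

    def chk_LT(self, stem, lt):
        """*S (or *L, *T, ...): the stem ends with one of the letters in lt."""
        return stem[-1].lower() in lt.lower()

    def chk_v(self, stem):
        """*v*: the stem contains at least one vowel."""
        return any(not self._is_consonant(stem, i) for i in range(len(stem)))

    def chk_d(self, stem):
        """*d: the stem ends with a double consonant (e.g. -tt, -ss)."""
        return (len(stem) >= 2 and stem[-1] == stem[-2]
                and self._is_consonant(stem, len(stem) - 1))

    def chk_o(self, stem):
        """*o: the stem ends in cvc and the final consonant is not w, x or y."""
        if len(stem) < 3:
            return False
        return (self._is_consonant(stem, len(stem) - 3)
                and not self._is_consonant(stem, len(stem) - 2)
                and self._is_consonant(stem, len(stem) - 1)
                and stem[-1].lower() not in 'wxy')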

So chk_LT checks whether the end of a stem is one of the letters provided in the lt string; chk_v checks whether the stem contains a vowel; chk_d checks whether there's a double-consonant ending; chk_o checks whether the stem ends with cvc (where the final consonant is not W, X or Y).

With these implemented, we can now go forward with the stemming steps themselves. There are 5 of them with distinct sub-steps, so, bear with me a while more!

Step 1: designed to deal with plurals and past participles. There are three sub-steps:

Step 1a

SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
Step 1b

(m>0) EED -> EE feed -> feed
agreed -> agree
(*v*) ED -> plastered -> plaster
bled -> bled
(*v*) ING -> motoring -> motor
sing -> sing

Here there's a condition: if the second or third rule of 1b fires, another (sub)sub-step has to be applied. This step will add letters back. Why? To be able to generalize better in further steps.

(if step 1b is true)
AT -> ATE conflat(ed) -> conflate
BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size
(*d and not (*L or *S or *Z))
-> single letter
hopp(ing) -> hop
tann(ed) -> tan
fall(ing) -> fall
hiss(ing) -> hiss
fizz(ed) -> fizz
(m=1 and *o) -> E fail(ing) -> fail
fil(ing) -> file

And still step 1 isn’t done. There’s one last sub step related to it:

Step 1c

(*v*) Y -> I happy -> happi
sky -> sky

Let’s merge it all in one method:
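Here is a condensed sketch of Step 1 following the rule tables above (my own arrangement of the conditions, not necessarily the article's exact code):

    def step1(self, word):
        """Step 1: plurals, past participles and gerunds."""
        # Step 1a
        if word.endswith('sses'):
            word = word[:-2]
        elif word.endswith('ies'):
            word = word[:-2]
        elif word.endswith('ss'):
            pass
        elif word.endswith('s'):
            word = word[:-1]

        # Step 1b
        fired = False
        if word.endswith('eed'):
            if self._measure(word[:-3]) > 0:
                word = word[:-1]
        elif word.endswith('ed') and self.chk_v(word[:-2]):
            word = word[:-2]
            fired = True
        elif word.endswith('ing') and self.chk_v(word[:-3]):
            word = word[:-3]
            fired = True

        # Step 1b cleanup, only when the 'ed'/'ing' rules fired.
        if fired:
            if word.endswith(('at', 'bl', 'iz')):
                word += 'e'
            elif self.chk_d(word) and not self.chk_LT(word, 'lsz'):
                word = word[:-1]
            elif self._measure(word) == 1 and self.chk_o(word):
                word += 'e'

        # Step 1c: terminal y -> i when the stem contains a vowel.
        if word.endswith('y') and self.chk_v(word[:-1]):
            word = word[:-1] + 'i'
        return word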

Phew! That's a lot of code, isn't it? Well, I just followed Porter's rules and that's it. Remember that this code is the result of long research into how to "normalize" words, and that involves, for example, knowing word classes and terminations in English. But this first step clearly deals with plurals (by removing 'ies', 'sses' and 's') and past participles (by removing 'eed', 'ing' and 'ed'), aside from some minor tweaks.

Next is Step 2, related to several common terminations:

Step 2

(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive

In code, we pass a little list of tuples of “termination” → “substitution”. Also, we create a new method (we’ll do the same for each step).
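A sketch of that idea (the tuple list mirrors the table above; the structure of the method is my assumption):

    STEP2_SUFFIXES = [
        ('ational', 'ate'), ('tional', 'tion'), ('enci', 'ence'),
        ('anci', 'ance'), ('izer', 'ize'), ('abli', 'able'), ('alli', 'al'),
        ('entli', 'ent'), ('eli', 'e'), ('ousli', 'ous'), ('ization', 'ize'),
        ('ation', 'ate'), ('ator', 'ate'), ('alism', 'al'),
        ('iveness', 'ive'), ('fulness', 'ful'), ('ousness', 'ous'),
        ('aliti', 'al'), ('iviti', 'ive'),
    ]

    def step2(self, word):
        """Step 2: rewrite common terminations when the stem has m > 0."""
        for suffix, replacement in self.STEP2_SUFFIXES:
            if word.endswith(suffix):
                stem = word[:-len(suffix)]
                if self._measure(stem) > 0:
                    return stem + replacement
                return word
        return word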

More straightforward, right? Just simple substitution. Notice now how important the concept of {m} is! One note here: Porter suggests that the algorithm could be made faster with a switch on the penultimate letter of the word. I chose not to do that for the sake of simplicity.

Time for Step 3:

Step 3

(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good

Another simple code block:
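Step 3 in the same style (an empty replacement means the suffix is simply dropped):

    STEP3_SUFFIXES = [
        ('icate', 'ic'), ('ative', ''), ('alize', 'al'), ('iciti', 'ic'),
        ('ical', 'ic'), ('ful', ''), ('ness', ''),
    ]

    def step3(self, word):
        """Step 3: reduce -icate, -ative, -alize, -iciti, -ical, -ful, -ness."""
        for suffix, replacement in self.STEP3_SUFFIXES:
            if word.endswith(suffix):
                stem = word[:-len(suffix)]
                if self._measure(stem) > 0:
                    return stem + replacement
                return word
        return word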

And step 4, for removal of suffixes (now we see that stemming doesn’t care about readability):

Step 4

(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler
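In code, Step 4 can follow the same pattern, with the extra *S/*T condition for the ION suffix (again, a sketch rather than the article's exact code):

    STEP4_SUFFIXES = [
        'al', 'ance', 'ence', 'er', 'ic', 'able', 'ible', 'ant', 'ement',
        'ment', 'ent', 'ion', 'ou', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize',
    ]

    def step4(self, word):
        """Step 4: strip the remaining suffixes when the stem has m > 1."""
        for suffix in self.STEP4_SUFFIXES:
            if word.endswith(suffix):
                stem = word[:-len(suffix)]
                if self._measure(stem) > 1:
                    if suffix == 'ion' and not self.chk_LT(stem, 'st'):
                        return word
                    return stem
                return word
        return word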

And finally, the last, 5th Step!

Step 5a

(m>1) E -> probate -> probat
rate -> rate
(m=1 and not *o) E -> cease -> ceas

Step 5b

(m > 1 and *d and *L) -> single letter
controll -> control
roll -> roll

Here I've made a couple of changes to the code. Somehow I was having problems with probate and other examples, which was causing the creation of many overly short stems. To solve that, I added a size constraint: the second rule also checks for the stem to be larger than 4 in length before applying:
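A sketch of Step 5 with that guard in place (exactly where the article applies the length check is an assumption on my part):

    def step5(self, word):
        """Step 5: drop a final 'e' and collapse a final double 'l'."""
        # Step 5a
        if word.endswith('e'):
            stem = word[:-1]
            m = self._measure(stem)
            if m > 1:
                word = stem
            elif m == 1 and not self.chk_o(stem) and len(word) > 4:
                # The len(word) > 4 guard is the size constraint described above.
                word = stem
        # Step 5b: (m > 1 and *d and *L) -> single letter, e.g. controll -> control.
        if (self._measure(word) > 1 and self.chk_d(word)
                and self.chk_LT(word, 'l')):
            word = word[:-1]
        return word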

That's it! Well, for sure there were many rules and lines of code. But the idea itself is not so complex: you study some words and try to find common terminations, then apply a chain of rules and try to reduce the words as much as possible without killing the root word behind them.

Just to wrap up with the code, we have to implement that method we created in the AbstractStemmer class. And there we’ll join all the steps:
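One way to write it is simply to chain the five steps (lowercasing first and leaving very short words untouched, a common tweak in reference implementations):

    def stem(self, word):
        """Reduce a single word to its stem by applying the five Porter steps."""
        word = word.lower()
        if len(word) <= 2:
            return word
        for step in (self.step1, self.step2, self.step3, self.step4, self.step5):
            word = step(word)
        return word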

Now we can play with it!
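For instance, something along these lines counts distinct words before and after stemming (the file path and the whitespace tokenization are placeholders; the article uses the tokenizer built earlier in the series, so the exact numbers below come from that setup):

stemmer = PorterStemmer()

with open('udhr.txt', encoding='utf-8') as f:   # hypothetical path to the text
    words = f.read().lower().split()            # naive whitespace tokenization

distinct = set(words)
stemmed = {stemmer.stem(word) for word in distinct}
print(f'The number of distinct words in the Universal Declaration of Human '
      f'Rights is: {len(distinct)}. After stemming, the number is: {len(stemmed)}')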

Result:

The number of distinct words in the Universal Declaration of Human Rights is: 551. After stemming, the number is: 476

75 fewer words. That's some economy. If we take stopwords and numbers out as well, the percentage may increase. However, as mentioned, this is not fully precise and far from perfect. We'll only be able to get better results (without losing too much information) if we move on to Lemmatization. However, that requires another step to be done first: POS tagging.

Time to get really messed up. Next up, let us talk about POS tagging (and play with some machine learning)!

Btw, Here’s the project so far (this link will take you to the commit after the Stemmer is implemented):

And, if you notice something that has to change or have any suggestion regarding the code, feel free to leave a comment or make a pull request to the GitHub project (just keep it simple, ok)!
