Building a Part of Speech Tagger

PoS Tagging — what, when, why and how.

Tiago Duque
Analytics Vidhya
12 min read · Feb 21, 2020


Time to dive a little deeper into grammar.

In this article, continuing the series on NLP, we’ll understand and create a Part of Speech (PoS) Tagger. The idea is to be able to extract “hidden” information from our text and also to enable the future use of Lemmatization, a text normalization tool that depends on PoS tags to work correctly.

In this article, we’ll touch on some more advanced topics, such as Machine Learning algorithms and some concepts of grammar and syntax. However, I’ll try to keep it understandable as promised, so don’t worry if you don’t know what a Supervised Machine Learning Model is, or if you have doubts about what a treebank is, since I’ll try to make it all as clear and simple as possible.

What are Parts of Speech?

To start, let us think a little about sentence composition. Have you ever stopped to think about how we structure phrases? They are not random choices of words — you actually follow a structure when reasoning to compose your phrase.

Of course, we follow cultural conventions learned from childhood, which may vary a little depending on region or background (you might have noticed, for example, that I use a somewhat ‘weird’ style in my phrasing — that’s because even though I’ve read and learned some English, Portuguese is still my mother language and the language that I think in).

Grammars! The word always reminds me of bulky ancient books.

However, inside one language, there are commonly accepted rules about what is “correct” and what is not. For example, in English, adjectives are more commonly positioned before the noun (red flower, bright candle, colorless green ideas); verbs are words that denote actions and have to exist in a phrase for it to be a phrase…

These rules are related to syntax, which according to Wikipedia “is the set of rules, principles, and processes that govern the structure of sentences”. Now, if you’re wondering, a Grammar is a superset of syntax (Grammar = syntax + phonology + morphology…), containing “all types of important rules” of a written language.

syntax […] is the set of rules, principles, and processes that govern the structure of sentences (sentence structure) in a given language, usually including word order. — Wikipedia

To better depict these rules, words were defined as belonging to classes according to the role they assume in the phrase. These roles are what we call “parts of speech”. Now, the number of distinct roles may vary from school to school; however, there are eight classes (controversies!!) that are generally accepted (for English). In alphabetical order:

A neat set of posters from this site. Nice to learn this way, right?
  • Adjective: beautiful, green, awesome…
  • Adpositions (preposition/postposition): to, with, in…
  • Adverb: hardly, above, soon…
  • Articles: a, the, an…
  • Conjunction: and, but, yet…
  • Noun: cat, ape, apple…
  • Pronoun: it, he, you…
  • Verb (including auxiliary): to be, working, stood…

In the case of NLP, it is also common to consider some other classes, such as determiners, numerals and punctuation. Also, there can be deeper variations (or subclasses) of these main classes, such as Proper Nouns and even classes to aggregate auxiliary information such as verb tense (is it in the past, or present? — VBP, VB).

In present-day NLP, there are two tagsets most commonly used to classify the PoS of a word: the Universal Dependencies tagset (simpler, used by spaCy) and the Penn Treebank tagset (more detailed, used by NLTK).
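
To see the two side by side, here is a small sketch using NLTK: its pos_tag() returns Penn Treebank tags, and map_tag() converts them to the older Universal tagset, which UD’s PoS labels largely build on (so treat the correspondence as close, not exact):

import nltk
from nltk.tag import map_tag

# One-time resource downloads: tokenizer, tagger model and the tag mappings.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
for word, ptb_tag in nltk.pos_tag(tokens):  # pos_tag returns Penn Treebank tags
    print(word, ptb_tag, map_tag("en-ptb", "universal", ptb_tag))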

Ultimately, what PoS Tagging means is assigning the correct PoS tag to each word in a sentence. Today, it is more commonly done using automated methods. Let us first understand how useful it is; then we can discuss how it can be done.

When and Why to use PoS Tagging?

So you want to know what the qualities of a product are, according to a review? One way to do it is to extract all the adjectives from the review. You could also use these words to evaluate the sentiment of the review. This is one of the applications of PoS Tagging.
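
As a quick illustration (a sketch assuming spaCy and its small English model, en_core_web_sm, are installed), grabbing the adjectives from a review takes only a few lines:

import spacy

nlp = spacy.load("en_core_web_sm")
review = "The battery life is great, but the screen is dim and the case feels cheap."
# token.pos_ holds the Universal Dependencies PoS tag
adjectives = [token.text for token in nlp(review) if token.pos_ == "ADJ"]
print(adjectives)  # expected: ['great', 'dim', 'cheap'] (model output may vary)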

Also, as mentioned, the PoS of a word is important to properly obtain the word’s lemma, which is the canonical form of a word (in English, this mostly happens by removing tense and degree variation).

For example, what is the canonical form of “living”? “to live” or “living”? It depends semantically on the context and, syntactically, on the PoS of “living”. If “living” is an adjective (like in “living being” or “living room”), the base form is “living”. If it is a noun (“he does it for a living”), it is also “living”. But if it is a verb (“he has been living here”), it is “to live”. This is an example of a situation where PoS matters.

Acquiring canonical form of “living” is one example of why PoS is important.
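
You can check this with NLTK’s WordNetLemmatizer, which takes the PoS as a parameter:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("living", pos="v"))  # live   (verb)
print(lemmatizer.lemmatize("living", pos="n"))  # living (noun)
print(lemmatizer.lemmatize("living", pos="a"))  # living (adjective)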

Considering these uses, you would then use PoS Tagging when there’s a need to normalize text in a more intelligent manner (the above example would not be distinctly normalized using a Stemmer) or to extract information based on the words’ PoS tags.

Another use is to make some hand-made rules for semantic relation extraction, such as attempting to find actor (Noun or Proper Noun), action (Verb) and modifiers (Adjectives or Adverbs) based on PoS tags.
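
Here is a minimal sketch of such a hand-made rule (the function name and the adjacency heuristic are mine, purely for illustration):

def actor_action_pairs(tagged_sentence):
    """Find (actor, action) pairs as adjacent (noun, verb) PoS tags.
    tagged_sentence: list of (word, penn_treebank_tag) tuples."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if t1.startswith("NN") and t2.startswith("VB"):
            pairs.append((w1, w2))
    return pairs

print(actor_action_pairs([("Peter", "NNP"), ("eats", "VBZ"), ("cabbages", "NNS")]))
# [('Peter', 'eats')]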

So, if there are many situations where PoS Tagging is useful, how can it be done?

How to get better grades in English class… Syntax… How to… Grammar… Bzzzzzz!

How?

Now it is time to understand how to do it.

If you’re coming from the stemming article and have no experience in the area, you might be frightened by the idea of creating a huge set of rules to decide whether a word is this or that PoS. I understand you. When doing my masters, I was scared even to think about how a PoS Tagger would work, only because I had to recall skills from secondary school that I was never too good at.

Let us shake off this fear: today, to do basic PoS Tagging (and by basic I mean 96% accuracy) you don’t need a PhD in linguistics or to be a computer whiz. But before seeing how to do it, let us understand all the ways it can be done.

There are four main methods to do PoS Tagging (read more here):

1. Manual Tagging: This means having people versed in syntax rules apply a tag to each and every word in a phrase.

  • This is the time-consuming, old-school, non-automated method. Reminds you of homework? Yeah… But it is also the basis for the third and fourth ways.

2. Rule-Based Tagging: The first automated way to do tagging. It consists of a series of rules (e.g.: if the preceding word is an article and the succeeding word is a noun, then the current word is an adjective…). It has to be done by a specialist and can easily get complicated (far more complicated than the Stemmer we built).

  • The title of “most famous and widely used rule-based tagger” is usually attributed to E. Brill’s tagger.

3. Stochastic/Probabilistic Methods: Automated ways to assign a PoS to a word based on the probability that the word belongs to a particular tag, or based on the probability of a word having a tag given the sequence of preceding/succeeding words. These are the preferred, most used and most successful methods so far. They are also the simplest to implement (given that you already have pre-annotated samples — a corpus).

  • Among these methods, two types can be distinguished: Discriminative Probabilistic Classifiers (examples are Logistic Regression, SVMs and Conditional Random Fields — CRFs) and Generative Probabilistic Classifiers (examples are Naive Bayes and Hidden Markov Models — HMMs).

4. Deep Learning Methods: Methods that use deep learning techniques to infer PoS tags. So far, these methods have not proven superior to Stochastic/Probabilistic methods in PoS tagging — they are, at most, at the same level of accuracy — at the cost of more complexity/training time.

Today, some consider PoS Tagging a solved problem. Some closed-context cases achieve 99% accuracy for the tags, and the state of the art on the Penn Treebank has remained above a 97.6 f1-score since 2002, according to the ACL (Association for Computational Linguistics) state-of-the-art records.

Some consider PoS Tagging a solved problem!

These results are thanks to the further development of Stochastic/Probabilistic Methods, which mostly rely on supervised machine learning techniques (providing “correctly” labeled sentences to teach the machine to label new sentences).

So, how will we do it? I’ll try to offer the most common and simplest way to PoS tag. But to do that, I won’t be posting the code here. Instead, I’ll provide a Google Colab Notebook that you can clone to make your own PoS Taggers. Also, we get free resources for training! All the steps for downloading, training and exporting the model are explained there.

But I’ll make a short summary of the things that we’ll do here.

  1. First, download a corpus. A corpus is what we call a dataset in NLP. We’ll use the Penn Treebank sample from NLTK and a Universal Dependencies (UD) corpus. We’ll also see how to use the CoNLL-U format, the most common format for linguistically annotated corpora (the plural of corpus).
  2. Second, extract features from the words. We do that by getting the word termination, the preceding word, checking for hyphens, etc. This composes the feature set used to predict the PoS tag.
  3. Third, load and train a Machine Learning algorithm. We’ll use a Conditional Random Field (CRF) suite that is compatible with sklearn, the most used Machine Learning module in Python. (A condensed sketch of steps 1–3 appears right after this list.)
  4. Then we test the trained models, checking the f1-score (explained there) for each. We also provide a way to test the models in a more “practical” manner.
  5. Finally, we save the models to be able to use them in our algorithm.
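
For a taste of what the notebook does, here is a condensed sketch of steps 1–3 on the Penn Treebank sample (the feature names are illustrative; the notebook’s exact feature set differs):

import nltk
import sklearn_crfsuite

nltk.download("treebank", quiet=True)
tagged_sents = nltk.corpus.treebank.tagged_sents()  # [[(word, tag), ...], ...]

def word_features(sentence, i):
    """Turn the i-th word of a tagged sentence into a feature dict."""
    word = sentence[i][0]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],  # word termination
        "has_hyphen": "-" in word,
        "is_title": word.istitle(),
        "prev_word": sentence[i - 1][0].lower() if i > 0 else "<SOS>",
    }

X = [[word_features(s, i) for i in range(len(s))] for s in tagged_sents]
y = [[tag for _, tag in s] for s in tagged_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X[:1])[0])  # predicted tag sequence for the first sentence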

Enough chatting — Here’s the link:

If you’ve gone through the above notebook, you now have at hand a couple of pickled files to load into your tool. Let us start putting what we’ve got to work.

If you have not been following this series, here’s a heads-up: we’re creating an NLP module from scratch (find all the articles so far here). Since we’ll use some classes that we predefined earlier, you can download what we have so far here:

Following on, here’s the file structure after the new additions (there are a few, but worry not, we’ll go through them one by one):

I’m using Atom as a code editor, so we get some help here. It is integrated with Git, so anything green is completely new (the last commit is from exactly where we stopped in the last article) and everything yellow has seen some kind of change (just a couple of lines).

Let us first analyze our changes:

In the core/structures.py file, notice the diff (it shows what was added and what was removed):

Changes in structures.py

Aside from some minor string escaping changes, all I’ve done is insert three new attributes into the Token class: PoS, raw (the raw representation) and repr (which will hold the lemmatized/stemmed version of the token, if we apply any of those techniques). I also changed the get() method to return the repr value.
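
Roughly, the new shape of Token (a simplified sketch; the real class in core/structures.py has more to it):

class Token:
    def __init__(self, raw):
        self.raw = raw    # the original surface form of the token
        self.repr = raw   # the stemmed/lemmatized version, once a technique is applied
        self.PoS = None   # filled in by the tagger

    def get(self):
        return self.repr  # get() now returns the processed representation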

The changes in preprocessing/stemming.py are just related to import syntax. You can find the whole diff here.

Moving forward, let us discuss the additions.

First, since we’re using external modules, we have to ensure that our package will import them correctly. For that, we create a requirements.txt. For now, all we have in this file is:

sklearn-crfsuite==0.3.6

Also, do not forget to run pip install -r requirements.txt before testing!

Next, we have to load our models. I’ve defined a folder structure to host these and any future preloaded models that we might implement. This is done by creating preloaded/models/pos_tagging. There, we add the files generated in the Google Colab activity.

If you didn’t run the Colab notebook and need the files, here they are:

The following step is the crucial part of this article: creating the tagger classes and methods. That’s what’s in preprocessing/tagging.py. Let’s go through it step by step:

1. Imports and definitions — we need re(gex), pickle and os (for file system traversal). sklearn-crfsuite is imported implicitly when pickle loads our .sav files.

Tagging imports

2. Creating the Abstract Tagger and Wrapper — these were made to allow generalization. As long as we adhere to AbstractTagger, we can ensure that any tagger (deterministic, deep learning, probabilistic…) can do its thing with a simple tag() method. The TaggerWrapper serves as a way to let any type of machine learning model (sklearn, keras or anything else) be called the same way (through the predict() method).

Wrappers for generalization
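
In sketch form (the class names follow the description; the bodies are assumptions):

from abc import ABC, abstractmethod

class AbstractTagger(ABC):
    @abstractmethod
    def tag(self, sentence):
        """Assign a PoS tag to each token of the sentence."""

class TaggerWrapper:
    """Gives any ML model (sklearn, keras, ...) the same predict() interface."""
    def __init__(self, model):
        self._model = model

    def predict(self, features):
        return self._model.predict(features)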

3. Creating the Machine Learning Tagger (MLTagger) class — in it we hardcode the models directory and the available models (not ideal, but it works for now). I’ve used a dictionary notation to allow the TaggerWrapper to retrieve configuration options in the future. In the constructor, we pass the default model and a changeable option to force all tags to be of the UD tagset.

The highlight here is the loading of the model — it uses the dictionary to unpickle the file we got from Google Colab and load it into our wrapper. This allows a single interface for tagging.
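
A sketch of that loading logic, reusing the classes sketched above (the directory comes from the article; the dictionary layout and the file name are assumptions):

import os
import pickle

MODELS_DIR = "preloaded/models/pos_tagging"
AVAILABLE_MODELS = {"PTB_CRF": "crf_ptb.sav"}  # hypothetical entry

class MLTagger(AbstractTagger):
    def __init__(self, model="PTB_CRF", force_ud=False):
        self.force_ud = force_ud  # convert every tag to the UD tagset?
        path = os.path.join(MODELS_DIR, AVAILABLE_MODELS[model])
        with open(path, "rb") as f:
            # unpickle the CRF trained in Colab and hide it behind the wrapper
            self.tagger = TaggerWrapper(pickle.load(f))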

4. Creating the feature extraction method — we need a way to turn our tokens into features, so we copy the same method used to train the model — this way we ensure that the features will look the same and the predictions will be consistent with the training.

5. Creating a converter from the Penn Treebank tagset to the UD tagset — we do it for the sake of using the same tags as spaCy, for example. Just remember to turn on the conversion to UD tags by default in the constructor if you want it.
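
The converter is essentially a lookup table; an abbreviated sketch (only a handful of the mappings shown):

PTB_TO_UD = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP", "PRP": "PRON",
}

def ptb_to_ud(tag):
    return PTB_TO_UD.get(tag, "X")  # 'X' for anything not mapped here

print(ptb_to_ud("NNS"))  # NOUN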

6. Implementing our tag() method — finally! We’re doing what we came here to do! With everything we defined, we can do it very simply. We force any input to be made into a sentence, so we have a common way to address the tokens. Then we form a list of the tokens’ representations, generate the feature set for each and predict the PoS. The next step is to check whether each tag has to be converted or not. Finally, the PoS tags are loaded into the tokens of the original sentence, which is returned.
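
Put together, the method goes roughly like this (a sketch inside MLTagger; _sentencize() and _features() stand in for the module’s actual helpers):

    def tag(self, text):
        sentence = self._sentencize(text)  # force any input into a sentence
        words = [token.repr for token in sentence.tokens]
        features = [self._features(words, i) for i in range(len(words))]
        tags = self.tagger.predict([features])[0]  # one sequence in, one out
        for token, pos in zip(sentence.tokens, tags):
            # convert to UD only if the constructor asked for it
            token.PoS = ptb_to_ud(pos) if self.force_ud else pos
        return sentence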

Yay! We can do some PoS tagging!

To make that easier, I’ve made a modification that lets us easily probe our system: I’ve added an __init__.py in the root folder containing a standalone process() function.

It basically implements a crude configurable pipeline to run a Document through the steps we’ve implemented so far (including tagging). It looks like this:

Let’s run it:

$ python3
>>> import NLPTools
>>> doc = NLPTools.process("Peter is a funny person, he always eats cabbages with sugar.")
>>> for sentence in doc.sentences:
...     for token in sentence.tokens:
...         print("("+token.raw+", "+str(token.PoS)+")", end=" ")
...
out: (<SOS>, None) (pet, NNP) (i, VBZ) (a, DT) (funni, JJ) (person, NN) (,, ,) (he, PRP) (alwai, RB) (eat, VBZ) (cabbag, NNS) (with, IN) (sugar, NN) (<EOS>, None)

What happened? Well, we’re getting the results from the stemmer (it’s on by default in the pipeline). But we can change it:

>>> doc = NLPTools.process("Peter is a funny person, he always eats cabbages with sugar.", pipeline=['sentencize','pos'])
>>> for sentence in doc.sentences:
...     for token in sentence.tokens:
...         print("("+token.raw+", "+str(token.PoS)+")", end=" ")
...
out: (<SOS>, None) (Peter, NNP) (is, VBZ) (a, DT) (funny, JJ) (person, NN) (,, ,) (he, PRP) (always, RB) (eats, VBZ) (cabbages, NNS) (with, IN) (sugar, NN) (<EOS>, None)

Btw, VERY IMPORTANT: if you want PoS tagging to work, always do it before stemming! Otherwise, failure awaits (since our pipeline order is hardcoded, this won’t happen here, but the warning remains).

We’re done!

So, PoS tagging? Not as hard as it seems, right? With this done, we’ve surpassed the pinnacle of preprocessing difficulty (really!?!? Nah, just joking). From here on, it’s all downhill!

Next, we’ll look at lemmatization!

Here’s the project so far:

Don’t be afraid to leave a comment or open a pull request on the Git repository if you find room for improvement.

