Part of Speech Tagging with Python NLTK

Jessica Lu
Published in Strategio
Sep 9, 2022

Natural language processing (NLP) is a field concerned with making natural human language usable by computers. Python's NLTK (Natural Language Toolkit) is a library for preprocessing unstructured text into pieces that are more readily usable by computer programs.

Turns out, preprocessing in the world of Natural Language Processing is kinda a big deal.

First off, why do we need to preprocess the input data?

Imagine you are trying to run a program on a book, about as basic an example of human-readable text as there is. If you were to try running computations directly on the raw words of that book, well, let’s just say it would not be a nice experience. By preprocessing the text, you can split it into standardized units of meaning and possibly use them later with machine learning (ML will not be discussed today).

Next, let’s go over some key concepts in NLP pertaining to today’s topic.

  • Corpus — a collection or set of texts
  • Lexicon — similar to a dictionary, a lexicon is a specified list of words that contain semantic meaning
  • Part of speech — a tag that categorizes each word in a text depending on its placement and definition.

There are many ways to preprocess a corpus with NLTK. In this article I will be going over the part-of-speech tagger.

The first step is to split a corpus into its smallest units of meaning, usually defined by its lexicon. Then the part-of-speech tagger can assign a tag (a role) to each word so that the relationships between the words become clear. This is necessary later on for building and drawing semantic trees.
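To make the two steps concrete, here is a toy sketch in plain Python. It is not how NLTK works internally: the whitespace split, the tiny hand-written TOY_LEXICON, and the function names are all assumptions made purely for illustration, and unknown words simply fall back to NN.

```python
# Toy illustration of "tokenize, then tag". All names here are hypothetical.
TOY_LEXICON = {
    "the": "DT",
    "desk": "NN",
    "and": "CC",
    "i": "PRP",
    "could": "MD",
}


def toy_tokenize(text):
    """Step 1: split the text into units (here, naive whitespace splitting)."""
    return text.split()


def toy_tag(tokens):
    """Step 2: pair each token with a tag; unknown words default to NN."""
    return [(token, TOY_LEXICON.get(token.lower(), "NN")) for token in tokens]


tokens = toy_tokenize("I could move the desk")
tagged = toy_tag(tokens)
print(tagged)
```

Notice that "move" comes out as NN, which is wrong. A lookup table cannot handle words it has never seen, which is why a real tagger such as NLTK's relies on a trained model and the surrounding context instead.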

NLTK's default English tagger uses the Penn Treebank tag set, which has a whole list of POS tags that can be explored, but some basic ones are the following:

  • CC, coordinating conjunction — and, or
  • DT, determiner — the
  • EX, existential there — there is an apple
  • MD, modal — could, will
  • NN, noun, singular — desk
  • PRP, personal pronoun — I, he, she

Tutorial Section

Here, I will do a quick walkthrough on how to install NLTK and tag a text with its part-of-speech tagger.

The Environment

Note: make sure that you have Python 3 installed.

We are going to create a virtual environment using the Python venv module first. Open a terminal window and create a directory, postag, for your project.

mkdir postag

Navigate to the postag directory.

cd postag

Now use venv to create the virtual environment. This creates a folder named venv inside your postag directory that contains the environment.

python3 -m venv venv

Enter the virtual environment. The command below is for macOS and Linux; on Windows, run venv\Scripts\activate instead.

source venv/bin/activate

If done correctly, you should see (venv) at the beginning of your terminal prompt.

Now that we have set up our environment, let’s install the NLTK library.

pip3 install nltk
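As an optional sanity check, a few lines in a python3 session started from the same terminal can confirm that NLTK was installed and that Python is running from the virtual environment. The variable name in_venv is just illustrative.

```python
import sys

import nltk

# Inside a venv, sys.prefix points at the venv folder while
# sys.base_prefix points at the system Python installation.
in_venv = sys.prefix != sys.base_prefix

print("Running inside a virtual environment:", in_venv)
print("NLTK version:", nltk.__version__)
```

If the import fails, the most likely cause is that pip3 ran outside the activated environment.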

Once that is complete, create a file “pos_tagger.py” in the postag directory and open it with the IDE of your choice. I will be using Visual Studio Code.

touch pos_tagger.py
code pos_tagger.py

The Code

Running the script with python3 pos_tagger.py prints the following.

Parts of Speech: [('It', 'PRP'), ("'s", 'VBZ'), ('almost', 'RB'), ('time', 'NN'), ('for', 'IN'), ('the', 'DT'), ('blisk', 'NN'), ('tree', 'NN'), ('to', 'TO'), ('go', 'VB'), ('up', 'RP'), ('!', '.')]

Something to note: in the code, word_tokenize splits a text into its smallest units of semantic meaning. The first two tuples in the result list together make up the first word in the sentence, “It’s.” From a natural human language standpoint, our first instinct would be to consider “It’s” a single word. However, NLTK recognizes that it’s a contraction whose semantic meaning is actually “it” and “is,” which is reflected in the printed result: the ’s token is tagged VBZ, a verb in the third-person singular present.
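To see the difference for yourself, compare a plain whitespace split with Treebank-style tokenization. This sketch uses NLTK's TreebankWordTokenizer directly rather than word_tokenize, an assumption made so that it runs without downloading any models; word_tokenize applies similar rules for contractions.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "It's almost time for the blisk tree to go up!"

# Naive approach: split on whitespace only.
whitespace_tokens = text.split()

# Treebank-style rules separate contractions and punctuation.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

print(whitespace_tokens[0])   # the contraction stays in one piece
print(treebank_tokens[:2])    # the contraction is split into two tokens
```

The whitespace split also leaves the exclamation mark glued to the last word, while the Treebank-style tokenizer gives it a token of its own.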

Thank you for reading! Hopefully this walkthrough was helpful. If there are any questions let me know down below.
