Two minutes NLP — NLTK cheatsheet

Tokenization, Lemmatization, POS tagging, synonyms, and antonyms

Fabio Chiusano
NLPlanet
3 min read · Feb 1, 2022


Photo by Daria Nepriakhina on Unsplash

Hello fellow NLP enthusiasts! As my readers seem to particularly enjoy articles where I talk about the most popular NLP libraries, I am building little cheatsheets for each of them. Obviously I will continue to write about all the fundamentals of the discipline to make an overview as complete as possible, but you can expect a few more articles about libraries from here on out. Enjoy! 😄

NLTK (10.4k GitHub stars), a.k.a. the Natural Language Toolkit, is a suite of open-source Python modules, datasets, and tutorials supporting research and development in Natural Language Processing. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Installation, import, and model downloads

You can install NLTK using pip.

After importing NLTK in your Python code, some NLP functions require trained models or corpora that are not installed automatically with pip. You can download all of them with a single command, but if you want to save disk space you can also download only the resources your specific use case needs, as you’ll see in the later examples.

Counting Word Frequencies

NLTK provides a FreqDist class that makes it easy to count the occurrences of tokens in a text and plot them.
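A small sketch with a made-up toy text (any iterable of tokens works as input to FreqDist):

```python
from nltk import FreqDist

# Hypothetical toy text, pre-split into tokens
words = "dog cat dog car dog cat".split()

fdist = FreqDist(words)
print(fdist["dog"])          # 3
print(fdist.most_common(2))  # [('dog', 3), ('cat', 2)]

# fdist.plot() draws the frequency distribution (requires matplotlib)
```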

Plot of the frequency distribution of the words “dog”, “cat” and “car” of the previous example. Image by the author.

Stopwords

This is how you can download stopwords for your specific language and use them to filter the text you want to analyze.

Corpora

NLTK makes it easy to download several standard NLP corpora used to train and evaluate models.

In this example, we see how to use the Brown corpus. The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) is an electronic collection of text samples of American English, the first major structured corpus of varied genres.

Tokenization

For tokenization, NLTK asks you to download the Punkt sentence tokenizer. This tokenizer divides a text into a list of sentences, using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

Stemming and Lemmatization

NLTK makes it very easy to apply stemming and lemmatization: just choose one of the available stemmers or lemmatizers and call their stem or lemmatize methods.

POS Tagging

Part of Speech (POS) Tagging can be done by downloading a POS tagging model and using the pos_tag function.

Definitions, synonyms, and antonyms

Definitions, synonyms, and antonyms can be retrieved with NLTK leveraging knowledge bases such as WordNet.

Try all the code samples

You can try all these code samples in this publicly shared Colab!
