Two minutes NLP — NLTK cheatsheet
Tokenization, Lemmatization, POS tagging, synonyms, and antonyms
Hello fellow NLP enthusiasts! As my readers seem to particularly enjoy articles where I talk about the most popular NLP libraries, I am building little cheatsheets for each of them. Obviously I will continue to write about all the fundamentals of the discipline to make an overview as complete as possible, but you can expect a few more articles about libraries from here on out. Enjoy! 😄
NLTK (10.4k GitHub stars), a.k.a. the Natural Language Toolkit, is a suite of open-source Python modules, datasets, and tutorials supporting research and development in Natural Language Processing. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Installation, import, and model downloads
You can install NLTK using pip.
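For example, from a terminal:

```shell
# Install NLTK from PyPI
pip install nltk
```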
After importing NLTK in your Python code, some NLP functions require trained models or corpora that are not installed automatically with pip. You can download all of them with a single command, but if you want to save disk space you can also download only the specific resources for your use case, as you’ll see in the later examples.
Counting Word Frequencies
NLTK provides a FreqDist class that makes it easy to count the occurrences of tokens in a text and plot them.
Stopwords
This is how you can download stopwords for your specific language and use them to filter the text you want to analyze.
Corpora
NLTK makes it easy to download standard NLP corpora used to train and evaluate models.
In this example, we see how to use the Brown corpus. The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) is an electronic collection of text samples of American English, the first major structured corpus of varied genres.
Tokenization
For tokenization, NLTK asks you to download the Punkt Sentence Tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.
Stemming and Lemmatization
NLTK makes it very easy to apply stemming and lemmatization: just choose one of the available stemmers or lemmatizers and call their stem or lemmatize methods.
POS Tagging
Part of Speech (POS) Tagging can be done by downloading a POS tagging model and using the pos_tag function.
Definitions, synonyms, and antonyms
Definitions, synonyms, and antonyms can be retrieved with NLTK leveraging knowledge bases such as WordNet.
Try all the code samples
You can try all these code samples with this publicly-shared Colab!