Natural Language Processing

NLP Snippets in Python

Doing an NLP project? Bookmark this article.

Benedict Neo
bitgrit Data Science Publication


Do you find yourself googling the code for text preprocessing tasks such as removing punctuation and stopwords, expanding contractions, and stemming and lemmatizing?

Save the trouble and bookmark this article instead!

Interested in snippets for Pandas too? Check out 40 Useful Pandas Snippets

Code for this article → Deepnote

Table of Contents

1. Text Cleaning
Lower case
Remove whitespace, linebreaks, and tabs
Remove punctuation
Remove single characters
Remove HTML tags
Remove URL
Remove emojis
Remove stopwords
Convert emojis to words
Convert digits to words
Expand contractions
Lemmatization
Stemming
Correct spelling
Text Cleaning pipeline
2. Text Visualization
n-grams
Word cloud
t-SNE (t-distributed Stochastic Neighbor Embedding)
3. Text Search
4. Text Vectorization
Bag of words
TF-IDF
Word2Vec

Import Libraries

Let’s first go through some of the libraries used and what they do.

  • pyspellchecker — pure Python spell checking using the Levenshtein distance (imported as spellchecker)
  • contractions — Fixes contractions such as “you’re” to “you are”
  • num2words — Convert numbers to words. 42 → forty-two
  • emoji — Emoji for Python
  • textacy — Python library for NLP tasks built on spaCy
  • nltk — Leading platform for building Python programs to work with human language data
  • yellowbrick — Machine Learning Visualization
  • gensim — Topic Modelling for Humans; a fast library for training vector embeddings
  • tqdm — A Fast, Extensible Progress Bar for Python
  • termcolor — Color formatting for output in terminal / notebook

Text Cleaning

Text cleaning is the process of removing unwanted text from a text corpus. It is an essential step in the data preparation process for any NLP project.

Lower case

To lowercase text, we can use the str.lower() method.
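
For example:

```python
text = "The QUICK Brown Fox!"

print(text.lower())  # the quick brown fox!
```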

Remove whitespace, linebreaks, and tabs

The default separator for str.split() is any whitespace, so splitting and rejoining removes even linebreaks and tabs.
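
A minimal example:

```python
text = "some\ttext  with\nlinebreaks and\ttabs"

# split() with no arguments splits on any run of whitespace
clean = " ".join(text.split())
print(clean)  # some text with linebreaks and tabs
```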

Remove punctuation

We can use the str.maketrans() and str.translate() methods to remove punctuation.

maketrans creates a translation table and translate takes that mapping and returns a mapped copy of a string.

To remove punctuation, we pass string.punctuation, which is just a string of punctuation characters (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~), as the third argument, so each of those characters is mapped to None.
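
A minimal sketch:

```python
import string

text = "Hello, world! (This: is; a #test)"

# Map every punctuation character to None
clean = text.translate(str.maketrans("", "", string.punctuation))
print(clean)  # Hello world This is a test
```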

Remove single characters

To remove single characters, we can use re.sub with a regex that matches any lone alphabetic character and strips it out.
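
A sketch of one common pattern (note it only matches single characters surrounded by whitespace):

```python
import re

text = "it has a stray b character in it"

# Replace any lone alphabetic character with a single space
clean = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
print(clean)  # it has stray character in it
```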

Remove HTML tags

To remove HTML tags, we compile the pattern into a regular expression object with re.compile, then use its sub method to replace every tag we find with an empty string.
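
A minimal sketch (remove_html_tags is my name for the helper; the naive <.*?> pattern covers simple cases):

```python
import re

HTML_PATTERN = re.compile(r"<.*?>")  # non-greedy: match the shortest <...> span

def remove_html_tags(text):
    return HTML_PATTERN.sub("", text)

print(remove_html_tags("<p>Hello <b>world</b></p>"))  # Hello world
```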

Remove URL

We use a similar approach to the one for HTML tags, but with a regex that matches URLs.
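
A sketch with one common URL regex (it won't catch every URL form, and remove_urls is my name for the helper):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    return URL_PATTERN.sub("", text)

print(remove_urls("read this https://example.com now"))  # read this  now
```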

Remove emojis

To remove emojis, we match the Unicode ranges that cover emojis and emoticons and replace them with an empty string.
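
A sketch using a few common Unicode blocks (this list is not exhaustive):

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # flags (regional indicators)
    "]+"
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("wildfire 🔥😱 nearby"))  # wildfire  nearby
```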

Remove stopwords

Stopwords are common words that carry little meaning on their own.

Sometimes you want custom stopwords, which you can add with the set's update method. You might also want to remove some stopwords from the standard set given by NLTK; you can do that with -=.
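
A sketch with NLTK's English stopword list (the custom words added and removed below are just examples):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stop_words.update({"u", "im"})  # add custom stopwords
stop_words -= {"not", "no"}     # keep negations if they matter for your task

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("im not sure about the fire"))  # not sure fire
```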

Convert emojis to words

You might also want to capture the information in emojis; you can convert them to text using the demojize function from the emoji package.
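
For example:

```python
import emoji

print(emoji.demojize("Python is 🔥"))  # Python is :fire:
```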

Convert digits to words

If your pipeline removes digits from the raw text, converting them to words first is useful for keeping that information.
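
A sketch applying num2words to each run of digits (digits_to_words is my name for the helper):

```python
import re

from num2words import num2words

def digits_to_words(text):
    # Replace every run of digits with its word form
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(digits_to_words("42 tweets in 3 days"))  # forty-two tweets in three days
```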

Expand contractions

To expand contractions, we'll use the contractions Python package. It's as simple as calling the fix function.

Note: this function can get slow if you're applying it to thousands of sentences.
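
For example:

```python
import contractions

print(contractions.fix("you're right, it can't be"))  # you are right, it cannot be
```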

Lemmatization

Here we do lemmatization with POS (part-of-speech) tags so that the lemmatizer can differentiate between word types (verb, noun, adjective, etc.).

For each word, we first identify its type with nltk.pos_tag and pass that information to our lemmatizer.

You can also pass in a custom lemmatizer you want to use.
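
A sketch of such a function, mapping NLTK's Penn Treebank tags onto the WordNet POS constants (the helper names are mine):

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

# First letter of the Penn Treebank tag -> WordNet POS
POS_MAP = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}

def lemmatize(text, lemmatizer=WordNetLemmatizer()):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(
        lemmatizer.lemmatize(word, POS_MAP.get(tag[0], wordnet.NOUN))
        for word, tag in tagged
    )

print(lemmatize("the striped bats were hanging"))  # the striped bat be hang
```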

Stemming

Here is a function for stemming; similar to the above, we can pass in a custom stemmer.
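
A sketch with NLTK's PorterStemmer as the default:

```python
from nltk.stem import PorterStemmer

def stem(text, stemmer=PorterStemmer()):
    return " ".join(stemmer.stem(word) for word in text.split())

print(stem("the children are running quickly"))  # the children are run quickli
```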

Correct spelling

Spelling errors are a common issue in raw text; to fix them, we can use the correction function provided by pyspellchecker.
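
A sketch (in recent pyspellchecker versions, correction returns None when it finds no candidate, hence the or word fallback):

```python
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling(text):
    return " ".join(spell.correction(word) or word for word in text.split())

print(correct_spelling("a sentense with a speling error"))  # e.g. 'a sentence with a spelling error'
```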

Now let’s apply the above functions to a real-world data set!

Read data

We'll use the tweets dataset from the Kaggle competition Natural Language Processing with Disaster Tweets. Get the dataset here.

Let's look at the first five rows.
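
Something like the following, assuming the competition's train.csv has been downloaded to the working directory:

```python
import pandas as pd

df = pd.read_csv("train.csv")
df.head()
```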

Text Cleaning pipeline

We can take all the functions we have defined earlier and put them into this pipeline, which is a list of functions to use.

In the prepare function, we loop over the text cleaning functions and apply them to the text.

Note that this is a sequential process; for example, there would be no point in expanding contractions after removing punctuation, since the apostrophes would already be gone.

To view the progress of the prepare function, I'm using the progress_apply method from tqdm's pandas integration.
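
A sketch of the pipeline, reusing the cleaning functions defined above (the text column comes from the competition data; the exact list and order of steps is up to you):

```python
import contractions
from tqdm import tqdm

tqdm.pandas()  # adds .progress_apply() to pandas objects

# Order matters: expand contractions before stripping punctuation and stopwords
pipeline = [
    str.lower,
    remove_html_tags,
    remove_urls,
    contractions.fix,
    remove_emojis,
    remove_stopwords,
]

def prepare(text, pipeline=pipeline):
    for transform in pipeline:
        text = transform(text)
    return text

df["clean_text"] = df["text"].progress_apply(prepare)
```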

Let’s see an example of what our clean text looks like.

Text Visualization

Now that our text has been cleaned, it's time to extract some insights from it through visualization.

n-grams

n-grams are contiguous sequences of n words; counting the most common ones is a simple way to analyze the contexts in which words appear.

Here we have a function to get the top n-grams (10 by default) from text.

By default, it gets the top 10 most common unigrams.
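
A sketch of such a function using nltk.ngrams and a Counter (the name top_ngrams is mine):

```python
from collections import Counter

import nltk

def top_ngrams(texts, ngrams=1, top=10):
    # Count every n-gram across all documents and return the most common ones
    counts = Counter()
    for text in texts:
        grams = nltk.ngrams(text.split(), ngrams)
        counts.update(" ".join(gram) for gram in grams)
    return counts.most_common(top)

top_ngrams(df["clean_text"])  # top 10 unigrams
```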

Since this is a disaster tweets dataset, we see words related to disasters, such as "emergency" and "police".

We can get bigrams by setting ngrams = 2.

In bigrams, it's even more evident, with phrases like "burning buildings" and "California wildfire".

To plot the top words, we can use Yellowbrick's FreqDistVisualizer.

To show other n-grams, I've written the function below.
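
A sketch of both, pairing CountVectorizer's ngram_range with FreqDistVisualizer (on older scikit-learn versions, use get_feature_names() instead of get_feature_names_out()):

```python
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

def plot_top_ngrams(texts, ngrams=1, top=20):
    # Vectorize with the requested n-gram size, then plot term frequencies
    vectorizer = CountVectorizer(ngram_range=(ngrams, ngrams))
    docs = vectorizer.fit_transform(texts)
    viz = FreqDistVisualizer(features=vectorizer.get_feature_names_out(), n=top)
    viz.fit(docs)
    viz.show()

plot_top_ngrams(df["clean_text"], ngrams=2)  # top 20 bigrams
```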

Word cloud

Word clouds are another useful visualization technique to see all the common words in one graph.
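
A minimal sketch with the wordcloud package:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(width=800, height=400, background_color="white").generate(
    " ".join(df["clean_text"])
)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```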

For some reason, û is the most common token. This is probably because wordcloud doesn't filter out non-ASCII characters when tokenizing the text.

Check out the word cloud I created from my LinkedIn chat messages in this article.

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE is a tool for visualizing high-dimensional data.

Here we use TSNEVisualizer from Yellowbrick, a package for visualizing machine learning models.
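
A sketch: vectorize the documents, then fit the visualizer (passing labels as the second argument to fit colors the points):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

docs = TfidfVectorizer().fit_transform(df["clean_text"])

tsne = TSNEVisualizer()
tsne.fit(docs)  # e.g. tsne.fit(docs, df["target"]) to color by label
tsne.show()
```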

Here’s the output for all the tweets we have after cleaning them.

It appears we have a large cluster at the center and other smaller clusters of points.

You could investigate this further by labeling each document with the disaster target, the location, or the keywords, to see whether the clusters line up with those labels.

Text Search

Sometimes you want to investigate the context of the text from the corpus you have.

With the textacy library, you can!

Previously we saw the character û; let's see where it appears in the text.

Now let’s try searching the keywords love and hate.

Since the keyword is treated as a regex, leave a space on both sides of the string so you match the whole word only.
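
A sketch with textacy's keyword-in-context helper (kwic is my wrapper name; in older textacy versions this functionality lives in textacy.text_utils as KWIC):

```python
from textacy import extract

def kwic(texts, keyword, window=40):
    # Print each match with the text surrounding it
    for text in texts:
        for before, match, after in extract.keyword_in_context(
            text, keyword, window_width=window
        ):
            print(f"{before:>{window}} [{match}] {after}")

kwic(df["clean_text"], " love ")  # note the surrounding spaces
```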

Text Vectorization

Text vectorization is the process of converting text to numbers. Here are three common techniques.

Bag of words

To implement bag of words, we can use the CountVectorizer class from scikit-learn.
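
A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
X = bow.fit_transform(df["clean_text"])

print(X.shape)  # (number of documents, size of vocabulary)
print(bow.get_feature_names_out()[:10])  # first few tokens in the vocabulary
```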

TF-IDF

A better approach that weights words by their importance is TF-IDF, and scikit-learn provides it via TfidfVectorizer.
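
Same interface, different weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["clean_text"])
# Each row is now a vector of TF-IDF weights rather than raw counts
```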

Word2Vec

To create word embeddings, we can use the popular Word2Vec algorithm, which Gensim provides.

With a word2vec model trained, we can find the words most similar to a given keyword across the entire corpus.
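
A sketch with Gensim 4's API ("fire" is just an example query; any word in the trained vocabulary works):

```python
from gensim.models import Word2Vec

sentences = [text.split() for text in df["clean_text"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

model.wv.most_similar("fire", topn=5)  # words closest to 'fire' in this corpus
```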

That’s all for the snippets.

Am I missing other snippets in this article? Share them in the comments!

Thanks for reading!


Like my writing? Join Medium with my referral link. You’ll be supporting me directly 🤗

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
