How to perform Data Visualization for an NLP project using WordClouds

Michael Orlando
8 min read · Sep 28, 2022


Photo by Pero Kalimero on Unsplash

In this 6-part series, I’ll explain my process of using Natural Language Processing and Machine Learning to classify the genres of screenplays.

For more information, check out my repo.

Part 1: Business Objective

Part 2: Data Collection

Part 3: Data Wrangling

Part 4: Exploratory Data Analysis (you are here)

Part 5: Model Building (not posted yet)

Part 6: Model Deployment (not posted yet)

Welcome, data science and movie enthusiasts of Medium. This is part 4 of my 6-part series where we use NLP and Machine Learning to build a multi-label classification model to label the genres of a movie screenplay.

If you have not checked out Parts 1, 2, & 3 of the series, where I discuss how to use BeautifulSoup to scrape film screenplays and how to use The Movie Database API to label our data, the links are above.

Part 4: Exploratory Data Analysis — Using the word cloud package

In our dataset consisting of film screenplays, there are 18 genres. They are Crime, Romance, Animation, SciFi, Fantasy, History, Action, Drama, War, Thriller, Mystery, Documentary, Horror, Family, Adventure, Music, Comedy, and Western.

Each screenplay may have a different combination of each genre.

The purpose of visualizing each genre with word clouds is to understand the common words in each genre. Word counts are the features our model will train on, so the better we understand the words that appear in each genre, the better our model will be.

Steps We’ll Take:

  1. Import Necessary Packages
  2. Loading in Data
  3. Visualizing Genres
  4. Visualizing Genres with text cleaning
  5. Visualizing Genres with text cleaning and text normalization
  6. Visualizing Genres with text cleaning, normalization, and optimal stop word list

For the source code, check out EDA_pt1 and EDA_pt2 in my repo.

  1. Importing Necessary Packages
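A minimal set of imports covering everything used in this walkthrough might look like this (assuming pandas, nltk, wordcloud, and matplotlib are installed):

import re
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer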

The TextPreprocessing class can be found here to download. However, I’ll be writing the functions of the class in this article.

Check out this article for more information on creating custom transformers. Himanshu does a great job of explaining how and why to create your classes for ML pipelines.

2. Loading in Data

#loading in dataset
data = pd.read_csv("data/cleaned_data.csv", index_col=[0])

This data was collected and wrangled in the previous tutorials, so check those out first if you want to use the same dataset.

3. Visualizing Genres

When using NLP and machine learning, the meat of building a great model derives from our text preprocessing efforts. This includes cleaning our text using regex, testing different text normalization techniques such as stemming and lemmatization, and creating a solid stop words list.

In this project, we’ll be doing all three, but before we do so, I want to show you what happens if we don’t, by visualizing a genre without any preprocessing.

First, let’s separate our genres into different data frames.
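Here, genre_lst is the list of the 18 genre column names. A quick sketch of that list (the exact column spellings in your dataframe may differ):

# the 18 genre columns in our dataframe
genre_lst = ['Crime', 'Romance', 'Animation', 'SciFi', 'Fantasy', 'History',
             'Action', 'Drama', 'War', 'Thriller', 'Mystery', 'Documentary',
             'Horror', 'Family', 'Adventure', 'Music', 'Comedy', 'Western']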

# initializing an empty dict to store dataframes
genre_df = {}
for genre in genre_lst:
    # creates a key and a value containing only that genre's rows
    genre_df[genre] = data[data[genre] == 1]

Looping through genre_lst creates a key for each genre; the corresponding value is a dataframe containing only the rows where that genre column equals 1.

For example, the value for the Comedy key is a pandas dataframe containing only the screenplays labeled as comedies.

Now, to use the wordcloud package, we're first going to join the text from every comedy row into one string and save it to a variable.

comedy_text = " ".join(genre_df['Comedy']['text'])

Let's print the first 1,000 characters of our string variable.

print(comedy_text[:1000])

Now let’s take a look at the size of our variable:

print(len(comedy_text))
# 84708731

Now let's generate a word cloud from this raw, unprocessed text.
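A minimal sketch using the wordcloud package; the show_wordcloud helper name and figure settings are my own choices:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(text, stopwords=None):
    # generate and display a word cloud for the given text
    cloud = WordCloud(width=800, height=400, background_color="white", stopwords=stopwords).generate(text)
    plt.figure(figsize=(12, 6))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

show_wordcloud(comedy_text)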

In the resulting cloud, there's no indication that these are common words used in comedy movies.

4. Visualizing Genres with text cleaning

To improve the previous word cloud, we’re going to clean our text.

The function takes one parameter: the text we want to clean. The first line replaces '\r', '\n', and '\\' with spaces and then splits the text into a list of words. The second line uses the regex package to replace all special characters in each word with spaces. The third line rejoins the words, keeping only elements that are not numbers and are longer than one character.
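A rough reconstruction of that cleaning function based on the description above (the exact regex and escape handling in the original may differ):

import re

def clean_text(text):
    # replace escape sequences with spaces and split the text into a list of words
    words = text.replace("\r", " ").replace("\n", " ").replace("\\", " ").split()
    # replace special characters in each word with spaces
    words = [re.sub(r"[^A-Za-z0-9]", " ", word).strip() for word in words]
    # rejoin only the elements that are not numbers and are longer than one character
    return " ".join(word for word in words if not word.isdigit() and len(word) > 1)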

To see the difference cleaning text makes, let’s look at the first 1000 characters of Knocked Up before cleaning:

Now after cleaning:

There’s a big difference, which will help our model’s performance down the road.

Note that cleaning text isn’t as systematic as other data science techniques. It depends on what dataset you’re working with. For example, I wouldn’t use this same function to clean the text of tweets or Facebook posts. I would modify it to work with how those tweets or posts are written.

For more information, check out this article. Kashish does a great job discussing the different approaches to cleaning text in different datasets.

Now to visualize:
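A minimal sketch, reusing the clean_text and show_wordcloud helpers sketched above:

show_wordcloud(clean_text(comedy_text))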

This is much more legible; however, there is still no indication that this cloud represents the most common words of a comedy screenplay.

5. Visualizing Genres with text cleaning and text normalization

Text normalization is the process of translating a word from its inflected form to its base form. This will help us condense our text and illustrate a better word cloud.

Examples: Flowers (inflected) -> Flower (base), Running (inflected) -> Run (base), Smarter (inflected) -> Smart (base)

To learn more about inflected and base forms in English, check out this article.

In NLP and Data Science, we use either stemming or lemmatizing to perform text normalization.

Stemming is the process of reducing words to their root form. For example, if we have change, changing, & changed in our corpus, then all these words would be reduced to chang.

The second text normalization technique is called lemmatization. Lemmatization is the process of reducing words to their base form or in other words, their lemma form. For example, if we have change, changing, & changed in our corpus, then all these words would be reduced to change.

Stemming is computationally faster than lemmatizing, but lemmatizing is more accurate. In our project, we're going to use lemmatization. If you'd like to learn more about text normalization, its mathematics, etc., check out this article. Diego does a great job of explaining the pros, cons, differences, and algorithms behind the two techniques.

Lowercasing all your text is good practice in NLP because Python does not treat differently cased words as the same value. For example, Python treats "All" and "all" as different values.

Also, notice that we used the basic stop word list from the NLTK package. This is also good practice because that list is made up of common words such as "I", "was", "you", etc. Our feature for the model is the count of words in each genre, so by eliminating these common words, we'll have a more differentiated count of words for our target classes.

The next step is tokenizing the words in our text. Tokenizing helps our models learn the context and sequence of words. Then we perform lemmatization on each token in our corpus.
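Putting those steps together, here's a hedged sketch of the normalization step using NLTK (the normalize_text name is my own):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize_text(text):
    # lowercase, tokenize, drop stop words, and lemmatize each token
    # note: lemmatize() treats words as nouns by default; pos="v" is needed to reduce verbs like "looking" to "look"
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(tokens)

show_wordcloud(normalize_text(clean_text(comedy_text)))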

Wordcloud after text normalization:

There's not a clear difference between the word cloud before lemmatizing and after lemmatizing. However, if you look at the second word cloud, there's LOOK in the center and LOOKING on the far right. In the third cloud, there's only LOOK in the center and no LOOKING on the right. This is because our lemmatization efforts reduced LOOKING to LOOK, which ultimately creates a more accurate count of words. It's visually a small difference, but trust me, it will yield better results when we create the count feature for our model in the next tutorial.

6. Visualizing Genres with text cleaning, normalization, and optimal stop word list

Look at these two graphs:

As you can see, the top 15 words by count in both Comedy and Crime screenplays are very similar, and it's safe to assume the same holds for all 18 of our genres. Therefore, to eliminate the words shared between genres, we're going to create a stop word list from the intersection of the 500 most common words in each genre.
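Here's one way such a comparison could be produced, reusing the helpers above (the plot styling is an assumption):

from nltk import FreqDist
import matplotlib.pyplot as plt

# compare the top 15 words in two genres side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, genre in zip(axes, ['Comedy', 'Crime']):
    genre_text = normalize_text(clean_text(" ".join(genre_df[genre]['text'])))
    words, counts = zip(*FreqDist(genre_text.split()).most_common(15))
    ax.bar(words, counts)
    ax.set_title(f"Top 15 words in {genre} screenplays")
    ax.tick_params(axis="x", labelrotation=45)
plt.tight_layout()
plt.show()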

We'll use the FreqDist class from NLTK to create a dict of word counts by genre.

In this code, we find the count of words by genre by using the FreqDist class from the nltk package. We store the count of words in a Python dictionary where the keys are the genres and the values are the FreqDist objects.
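A sketch of that dictionary-building step (the genre_freq name is my own, reusing the clean_text and normalize_text helpers from above):

from nltk import FreqDist

# store a FreqDist of word counts for each genre
genre_freq = {}
for genre in genre_lst:
    genre_text = normalize_text(clean_text(" ".join(genre_df[genre]['text'])))
    genre_freq[genre] = FreqDist(genre_text.split())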

Now let's find the intersection of the word frequency counts for all 18 genres.

This function extracts the most common words in a FreqDist object. In our case, we'll take the top 500 words for each genre.
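A minimal sketch of such a function (the top_words name is my own):

def top_words(freq_dist, n=500):
    # return the n most common words in a FreqDist as a set
    return set(word for word, count in freq_dist.most_common(n))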

This code loops through all the genres and extracts the top 500 words. In the stops variable, we store the intersection of the top words from each genre.
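And a sketch of that loop and intersection (variable names are my own):

# gather the top 500 words for each genre, then keep only the words common to every genre
top_by_genre = [top_words(genre_freq[genre], 500) for genre in genre_lst]
stops = set.intersection(*top_by_genre)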

This is our stop word list:

In the next tutorial, I’ll show how we remove these words from our training data before performing our feature engineering. But for now, check out some of our optimized word clouds!
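As a rough sketch of how those optimized clouds can be drawn, the custom stop word list can be passed straight to the word cloud helper (again, the helper and settings are my own choices):

comedy_final = normalize_text(clean_text(" ".join(genre_df['Comedy']['text'])))
show_wordcloud(comedy_final, stopwords=stops)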
