Cloudy With a Chance of Words: Part 1

A How-To For Creating Basic Word Clouds to Enhance Your Text-Based Visualizations

Payson Chadrow
The Startup
7 min read · Aug 13, 2020


Image by author

When working with text data and NLP projects, word frequency is often a useful feature to identify and explore. However, creating good visuals is often difficult because you don’t have many options beyond bar charts. Let’s face it: bar charts get old and boring quickly! This is where word clouds come into play. In this blog, you’ll learn how to spice up the visualizations in your next project using word clouds.

Up until my most recent project, I actually didn’t know a word cloud library existed in Python, but I assure you it does, and it has some amazing features!

The full WordCloud library and documentation can be found here for those interested.

TLDR

Part 1 of this blog will walk you through obtaining the appropriate libraries, the basic parameters and functions of the wordcloud library, and how to create a generic word cloud. Part 2 will build on this and walk you through creating custom masks for word clouds and other unique visual options.

Getting Started With WordCloud

Before we can start making visuals, we’ll need to make sure we have the libraries we need to create our word clouds. You’ll need the following libraries:

  • numpy
  • matplotlib
  • PIL
  • wordcloud
  • nltk (This is only necessary for the purpose of this blog and as a source of sample text to create word clouds from)

All of these libraries can be pip installed if you’re unable to import them. For my specific project, I used Google Colab, which required a slightly different approach to install wordcloud. Google Colab users can use the following command to install it:

!pip install git+https://github.com/amueller/word_cloud.git#egg=wordcloud

That last part is important for Colab because it tells pip the package’s name so that it can be properly imported.

Once we have all of our needed libraries installed, we can use the following set of import statements:
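A set along these lines should cover everything used in this post (numpy, PIL, and ImageColorGenerator aren’t strictly needed until Part 2’s image masks):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk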

We’re now ready to create some word clouds!

Generic Word Clouds

To start, let’s explore generic word clouds. For those who want to follow along, we’ll use some corpora from the nltk library.

First off, we’ll need to acquire our text. I’ll note here that there are two forms of text that WordCloud can use to generate a visual. The first, and the one we’ll mainly use, is a plain string. The second is a dictionary of words and their frequencies as key-value pairs.

If you’re following along, or want to attempt this using other sample text from nltk, you can use the following code to acquire our text samples:
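One way to do this, assuming nltk is installed, is to download the Gutenberg corpus and list its file IDs:

import nltk

nltk.download('gutenberg')        # only needed once
from nltk.corpus import gutenberg

print(gutenberg.fileids())        # list the available texts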

This shows a list of the different authors and texts we have to choose from within nltk’s gutenberg files.

Feel free to attempt creating word clouds from any of the above options. The one that we’ll continue with in these examples, however, will be Moby Dick.

To gather our sample text as a single string you can use the following command:
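Something like gutenberg.raw() works here; the variable name moby_dick_text is just my choice:

# Grab the full text of Moby Dick as one long string
moby_dick_text = gutenberg.raw('melville-moby_dick.txt')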

Now that we have our text, let’s take a look at how to turn it into a word cloud. In the code block below, we instantiate a WordCloud object and use it to generate a cloud from the text we pass in. Once the cloud is generated, we want to display it without the unnecessary x and y axes.
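A minimal sketch of that code block, using the moby_dick_text string from above, looks something like this:

# Instantiate a WordCloud object and generate a cloud from our text
wordcloud = WordCloud().generate(moby_dick_text)

# Display the cloud without the axes
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()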

Look at that! We made a word cloud!

Now personally, I’m not a fan of the black background and it seems a little small, so let’s change that with some simple parameters.
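The background_color, width, and height parameters handle both; the exact dimensions below are just a starting point rather than the settings I originally used:

# White background and a larger canvas
wordcloud = WordCloud(background_color='white', width=800, height=400).generate(moby_dick_text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()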

Now we’re talking! Although, there do seem to be some strange things showing up in our generic word cloud, don’t there?

Parameters and Language Processing

Looking at the cloud above, we notice that some words seem to be paired:

  • the whale
  • the ship
  • the sea
  • the captain
  • White Whale

So on and so forth. Our word cloud is still showing word frequencies; however, one of WordCloud’s parameters, collocations, defaults to True. When it’s enabled, the cloud also counts pairs of words and their frequencies. In some instances this can definitely be useful, but in this one I think we’ll get better results without it.
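Turning it off is just a matter of passing collocations=False, keeping the same display code as before:

# Count only single-word frequencies, not word pairs
wordcloud = WordCloud(background_color='white', width=800, height=400,
                      collocations=False).generate(moby_dick_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()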

Notice the difference?

A keen eye may recognize that the word ‘the’ no longer appears in our word cloud. This is because ‘the’ is recognized as a stop-word and excluded from the cloud even though it appears quite frequently in the text.

You may be wondering where stop-words came into play, and that is one of the really cool features of the wordcloud library. The library comes with its own list of stop-words that it uses by default. It actually applies quite a few NLP practices by default, which makes creating the clouds that much easier while remaining adjustable for the more experienced NLP practitioner. Some of these additional NLP parameters are:

  • regexp — an optional parameter; if left blank, the default r"\w[\w']+" is used. A custom regex string can be passed in here.
  • normalize_plurals — default = True; for words that appear both with and without a trailing ‘s’, the ‘s’ is removed and the plural is counted as an occurrence of its singular version.

In our original import statement we imported STOPWORDS from the wordcloud library. You can print this to see the entire list of words that are excluded by default; it currently contains 192 of the most common stop-words. You can add to this list if you have additional words you want excluded, or supply your own stop-words entirely if you prefer. Note that the stop-words must be passed in as a set, not a list.
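As a hypothetical example (whale and ship are my picks here, not words from the default list), adding your own stop-words might look like this:

# Start from the default set and add a couple of our own words
custom_stopwords = set(STOPWORDS)
custom_stopwords.update(['whale', 'ship'])

wordcloud = WordCloud(background_color='white', width=800, height=400,
                      collocations=False,
                      stopwords=custom_stopwords).generate(moby_dick_text)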

What a difference!

One last thing we’ll talk about before moving on to making fun and unique word clouds is “relative scaling”.

Relative scaling is what’s used to determine the size of the word based upon its frequency. By default, relative scaling is set to 0.5, which is essentially the equivalent of saying that a word that occurs twice as often as another word will be 50% larger.

Relative scaling can be set to any number between 0 and 1. At 0, only a word’s rank matters, so how much more frequent one word is than another is ignored, while at 1, a word that occurs twice as often will be twice as large. In some cases this can be useful to better highlight differences in frequency. However, it doesn’t always look very good and can affect how well a word cloud fits a mask, which we’ll talk about later.
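Passing it in is a one-parameter change; a sketch assuming the same settings as before:

# Make font size directly proportional to word frequency
wordcloud = WordCloud(background_color='white', width=800, height=400,
                      collocations=False, stopwords=custom_stopwords,
                      relative_scaling=1).generate(moby_dick_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()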

In this case, using a relative scaling of 1 actually doesn’t look too bad! We’ll soon see how this translates to using it with an image mask.

Saving Your Word Cloud

Once you have your word cloud the way you want it, you’ll probably want to save it. To do so, you can run the following code which will save the current state of your WordCloud object.
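WordCloud objects have a to_file method for this; the filename below is just an example:

# Save the rendered cloud as a PNG in the current working directory
wordcloud.to_file('moby_dick_wordcloud.png')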

Keep in mind this will save the image to your local folder and if you have a specific location in mind, you will need to add in the appropriate path.

Other Parameters Worth Playing With

We looked at the key parameters for making word clouds, but there are many more that are worth looking into and toying with. These parameters are fairly self-explanatory and can be used to further tweak your clouds (see the example after this list):

  • prefer_horizontal — (float) If set to 1, all words will appear horizontally, while lower values increase the frequency of vertical words. default = 0.9
  • min_font_size — (int) Smallest font size to be used. default = 4
  • max_words — (int) default = 200
  • min_word_length — (int) Minimum number of letters required in a word to be in the cloud. default = 0
  • include_numbers — (bool) default = False
  • repeat — (bool) Determines if words/phrases will be repeated until max_words or min_font_size is reached. (Can be used to create word clouds from a single word) default = False
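To give a sense of how these combine, here’s one hypothetical configuration; the values are arbitrary rather than recommendations:

wordcloud = WordCloud(background_color='white', width=800, height=400,
                      collocations=False,
                      prefer_horizontal=1.0,   # horizontal words only
                      max_words=100,           # cap the cloud at 100 words
                      min_word_length=3).generate(moby_dick_text)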

Unique and Custom Word Clouds

Because this blog turned out much longer than I had initially planned, I’ll cover using image masks to create custom word clouds, creating your own image masks from any image, and applying an image’s colors to your cloud in a soon-to-follow Part 2 of this blog.
