Visualizing “Gone with the Wind” Book Text With Word Cloud!

Tahsin Mayeesha
Learning Machine Learning

After the “Gone With The Wind” book dataset came out on Kaggle, I had to work on it as a historical-fiction junkie. I’ve already made a cool word cloud from the book text in my kernel, “Frankly My Dear, I Just Want a Word Cloud”.

In this post I’ll give a quick overview of how to use Andreas Mueller’s wordcloud package on Kaggle to generate visualizations.

Word Cloud

A word cloud is a very popular visualization that sizes words according to their relative frequency or importance in a text. The default rectangular layout can be quite boring, though, so we can add a mask image to change the shape.
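To make "relative frequency" concrete, here is a minimal sketch that counts words and scales each count by the most common one. The function name `relative_frequencies`, the simple tokenizing regex, and the choice to normalize by the top count are all assumptions for illustration, not necessarily what any particular library does internally.

```python
import re
from collections import Counter


def relative_frequencies(text, top_n=5):
    """Return the top_n words mapped to count / count of the most common word."""
    # Lowercase and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = counts.most_common(top_n)
    max_count = top[0][1]
    # The most frequent word gets 1.0; everything else is a fraction of it.
    return {word: count / max_count for word, count in top}
```

A word with a score of 1.0 would be drawn largest, and a word with 0.5 appears half as often as the top word.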

On Kaggle, we can’t download mask images directly from the web, so I had to make a zip file containing the mask images I wanted to try out and upload it to Kaggle as a dataset. After that I was able to try different samples until I found one I liked.

The kernel had to combine multiple datasets to generate one word cloud! I hope they will let us just get the images from the web in the future.

The mask image I liked most is given below. Choosing a good mask image is the key to creating a cool visualization. In the beginning I chose images with a filled background instead of a white one, and the results were mediocre: instead of a shape I got a square image despite using masks.
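The square results make sense once you know that wordcloud only treats pure white (255) pixels in the mask as off limits, so a filled background leaves the whole rectangle open for words. Below is a rough sketch for checking a mask before using it. The helper names `blocked_fraction` and `whiten_background` and the threshold of 240 are my own assumptions, and the second helper only rescues backgrounds that are nearly white already.

```python
import numpy as np


def blocked_fraction(mask):
    """Share of pixels that are pure white, i.e. closed to words."""
    if mask.ndim == 3:
        # Colour image: a pixel is white only if R, G and B are all 255.
        white = np.all(mask[:, :, :3] == 255, axis=-1)
    else:
        white = mask == 255
    return float(white.mean())


def whiten_background(mask, threshold=240):
    """Return a copy with near-white pixels pushed to pure white (assumed threshold)."""
    out = mask.copy()
    if out.ndim == 3:
        near_white = np.all(out[:, :, :3] >= threshold, axis=-1)
        out[near_white, :3] = 255
    else:
        out[out >= threshold] = 255
    return out
```

If `blocked_fraction` comes back close to 0 for a mask, the word cloud will most likely fill the whole rectangle, which matches what I saw with the filled backgrounds.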

Code

Mueller’s package makes generating word clouds extremely easy. The code is already self-explanatory, but I’ll add a brief overview.

max_words is the maximum number of most-frequent words to include in the visualization, mask is the NumPy array containing the mask image, which I loaded with Pillow, and STOPWORDS is the set of common words we don’t want to include because they do not provide any insight.

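We can also control the height and the width of the result, but doing so didn’t help me in this visualization: when a mask is supplied, wordcloud takes the canvas size from the mask and ignores width and height. I’ll probably try some sentiment analysis soon.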

Tahsin Mayeesha
Learning Machine Learning

Deep Learning Engineer. New grad, CSE. GSoC 19 participant @ TensorFlow. Previous GSoC 18 @ Berkman Klein Center of Internet and Society. Kaggler, fast.ai internat