Programmatically creating a wordcloud of fleetx, using just our blog posts

Text Corpus Visualization Tool in Python

Piyush Kumar
fleetx engineering
6 min read · Aug 7, 2020


Recently, I needed an image for our blog and wanted it to have some wow effect, or at least be a better fit than anything typical we’d been using. After pondering ideas for a while, a word cloud flashed in my mind. 💡
Usually, you would just need a long string of text to generate one, but I thought of parsing our entire blog archive to see if anything interesting popped out, and to get a holistic view of the keywords our blog uses in its entirety. So, I took this on as a weekend fun project.

PS: Images matter a lot in marketing. Give them quality! 👀

Getting your hands dirty:

Our blog is hosted on Ghost, which lets us export all the posts and settings into a single, glorious JSON file. And Python ships with a built-in json package for parsing JSON data. Our stage is set. 🤞

For other popular platforms like WordPress, Blogger, Substack, etc., the export could be one or many XML files, so you might need to switch packages and do the groundwork in Python accordingly.
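For instance, here’s a rough sketch of pulling post bodies out of a WordPress (WXR) export with Python’s built-in XML parser. The filename is hypothetical and tag names vary by platform, so treat it as a starting point only.

import xml.etree.ElementTree as ET

# WXR is RSS-based: each post lives in an <item>, with the body under
# the namespaced <content:encoded> tag.
ns = {'content': 'http://purl.org/rss/1.0/modules/content/'}
tree = ET.parse('wordpress-export.xml')  # hypothetical filename
posts = [item.findtext('content:encoded', namespaces=ns)
         for item in tree.getroot().iter('item')]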

Before you read that JSON into Python, you should get an idea of how it’s structured: what you need to read, what you need to filter out, and so on. For that, use a JSON processor to pretty-print your file. I used jqplay.org, and it helped me figure out where my posts are located ➡
data['db'][0]['data']['posts']
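If you’d rather poke around in Python itself, a quick sketch like this reveals the same structure:

import json

with open('fleetx.ghost.2020-07-28-20-18-49.json', encoding='utf-8') as file:
    data = json.load(file)

# Top-level tables in the export, then the fields of the first post
print(data['db'][0]['data'].keys())
print(data['db'][0]['data']['posts'][0].keys())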

Next, you’d like to call pd.json_normalize() to convert your data into a flat table and save it as a dataframe.

👉 Note: You should have an up-to-date version of pandas installed for pd.json_normalize() to work; older versions exposed it under a different name (pandas.io.json.json_normalize).
Also, keep the encoding as UTF-8, as otherwise you’re likely to run into UnicodeDecodeErrors. (We have these bad guys: ‘\xa0’, ‘\n’, ‘\t’, etc.)

import pandas as pd
import json

# Load the Ghost export (UTF-8 to avoid UnicodeDecodeErrors)
with open('fleetx.ghost.2020-07-28-20-18-49.json', encoding='utf-8') as file:
    data = json.load(file)

# Flatten the nested posts into a dataframe
posts_df = pd.json_normalize(data['db'][0]['data']['posts'])
posts_df.head()
Posts Dataframe

Looking at the dataframe, you can see that Ghost keeps three formats of each post we created — mobiledoc (a simple and fast renderer that works without an HTML parser), HTML, and plaintext — plus a range of other attributes. I chose to work with the plaintext version, as it would require the least cleaning.

The Cleaning Job:

  • Drop missing values (any blank post you might have) so they don’t handicap your analysis when you chart things later. We had one blog post in drafts with nothing in it. 🤷‍♂️
  • The plaintext of the posts had almost every unwanted character possible, from spacing and tabs (\n, \xa0, \t) to 14 punctuation marks (dot, comma, semicolon, colon, dash, hyphen, etc.) and even bullet points. Replace all of them with whitespace.
  • Next, I split up the words in each blog post under the plaintext column and joined the resulting lists from each cell into one really long list of words. This yielded about 34,000 words; we have around 45 published posts of roughly 700 words each, plus a few more in drafts, so this works out to 45*700 = 31,500 words. Consistent! 🤜
# Drop the empty draft post
posts_df.dropna(subset=['plaintext'], axis=0, inplace=True)

# Replace unwanted whitespace and punctuation characters with spaces
for char in ['\n', '\xa0', '\t', '.', '·', '•', ',', '-', ':', '/', '*']:
    posts_df.plaintext = posts_df.plaintext.str.replace(char, ' ', regex=False)

# Split each post into words, then collect everything into one long list
posts_df.plaintext = posts_df.plaintext.apply(lambda x: x.split())
words_list = []
for i in range(posts_df.shape[0]):
    words_list.extend(posts_df.iloc[i].plaintext)
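A quick sanity check on the size of that list (your number will differ, of course):

print(len(words_list))  # ~34,000 for our ~45 posts of ~700 words each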

If you’re eager for results now, you can run collections.Counter on that words_list to get the frequency of each word and an idea of how your wordcloud might look.

import collections

word_freq = collections.Counter(words_list)
word_freq.most_common(200)

Any guesses on what could be the most used word for a blog? 🤞
If you said ‘the’, you’re right. For really long texts, the article ‘the’ is going to take precedence over any other word. And it’s not just ‘the’: there were several other prepositions, pronouns, conjunctions, and action verbs at the top of the frequency list. We certainly don’t need them, and to remove them, we must first define them. Fortunately, the wordcloud library that we will use to generate the wordcloud comes with default stopwords of its own, but it’s rather conservative, with only 192 words. So, let’s head over to the Natural Language Processing (NLP) libraries that do heavy text processing and are dedicated to such tasks. 🔎

  • Natural Language Toolkit (NLTK): It has 179 stopwords — even fewer than the wordcloud stopwords collection. Don’t give it the evil eye for this reason alone; this is the leading NLP library in Python.
  • Gensim: It has 337 stopwords in its collection.
  • Scikit-learn: It also has a stopword collection, of 318 words.
  • And there is spaCy: It has 326 stopwords.
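If you want to verify these counts yourself, here’s a rough sketch; it assumes all four libraries (plus wordcloud) are installed, that NLTK’s stopwords corpus has been downloaded once via nltk.download('stopwords'), and that the exact numbers may drift between versions.

from wordcloud import STOPWORDS as wordcloud_stopwords
from nltk.corpus import stopwords as nltk_stopwords
from gensim.parsing.preprocessing import STOPWORDS as gensim_stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import spacy

print('wordcloud:', len(wordcloud_stopwords))
print('NLTK:', len(nltk_stopwords.words('english')))
print('Gensim:', len(gensim_stopwords))
print('scikit-learn:', len(ENGLISH_STOP_WORDS))
print('spaCy:', len(spacy.load('en_core_web_sm').Defaults.stop_words))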

I went ahead with spaCy; you can choose your own based on your preferences.

But…. 😓

This wasn’t enough! There were still words that wouldn’t look good from a marketing standpoint, and we hadn’t done the best cleaning possible either. So I put them in a text file (each word on a new line), then read it and joined it with spaCy’s stopwords list.

Instructions on setting up spaCy.

import spacy

nlp = spacy.load('en_core_web_sm')
spacy_stopwords = nlp.Defaults.stop_words

# Read custom stopwords (one per line) and merge them with spaCy's set
with open("more stopwords.txt") as file:
    more_stopwords = {line.rstrip() for line in file}
final_stopwords = spacy_stopwords | more_stopwords

Setting up the design shop:

Now that we have our re-engineered stopwords list ready, we’re good to invoke the magic maker ➡ the WordCloud function.
Install the wordcloud library with pip via Jupyter/CLI/Conda.

pip install wordcloud

import matplotlib.pyplot as plt
import wordcloud

# Instantiate the wordcloud object
wc = wordcloud.WordCloud(background_color='white', max_words=300,
                         stopwords=final_stopwords, collocations=False,
                         max_font_size=40, random_state=42)

# Generate word cloud (lowercased so stopwords match)
wc = wc.generate(" ".join(words_list).lower())

# Show word cloud
plt.figure(figsize=(20, 15))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save the wordcloud
wc.to_file('wordcloud.png');

Much of the above code block should be self-explanatory for Python users, but let’s do a brief round of introductions:

  • background_color : background of your wordcloud; black and white are most common.
  • max_words : the maximum number of words to show in the wordcloud; the default is 200.
  • stopwords : the set of stopwords to be eliminated from the wordcloud.
  • collocations : whether to include collocations (bigrams) of two words; the default is True.

What are Bigrams?

These are sequences of two adjacent words. Take a look at the example below.

Bigrams of a sentence
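In plain Python, a minimal sketch (the sentence is made up):

sentence = "fleet management software saves fuel"
words = sentence.split()
# Pair each word with its right-hand neighbour to form bigrams
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('fleet', 'management'), ('management', 'software'),
#  ('software', 'saves'), ('saves', 'fuel')]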

Note: Pass all the text to the wordcloud generator in lowercase, as all stopwords are defined in lowercase. It won’t eliminate uppercase stopwords.

Alright, so the output is like this:

Wordcloud of a fleet-industry blog’s data

For a company doing fleet management, it’s spot on! The keyword ‘fleet management’ carries far more weight than anything else.

However, the above image misses the very element all this is about: the vehicle. Fortunately, the wordcloud library lets you mask the wordcloud onto an image of your choice. So, let’s do that.

  • Choose a vector image of your choice. I picked mine from Vecteezy.
    You’ll also need to import the Pillow and NumPy libraries this time to read the image and convert it into a NumPy array.
  • Below is the commented code block to generate the masked wordcloud, much of which is the same as before.
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

# Read your image and convert it to a NumPy array.
truck_mask = np.array(Image.open("Truck2.png"))

# Instantiate the word cloud object.
wc = wordcloud.WordCloud(background_color='white', max_words=500,
                         stopwords=final_stopwords, mask=truck_mask, scale=3,
                         width=640, height=480, collocations=False,
                         contour_width=5, contour_color='steelblue')

# Generate word cloud
wc = wc.generate(" ".join(words_list).lower())

# Show word cloud
plt.figure(figsize=(18, 12))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save the masked wordcloud
wc.to_file('masked_wordcloud.png');

Here’s the output:

Voila! We produced our wordcloud programmatically! 🚚💨

Thank you for reading this far! 🙌

