Introduction to Data Science: Custom Twitter Word Clouds

Sep 11, 2018 · 5 min read
Python Generated Word Cloud

I first got into data science almost by coincidence, through a side project. My friend and I came from computer science backgrounds, but we were fascinated by how much attention the presidential candidates' tweets were getting, so we stumbled into Twitter APIs, the difficulty of cleaning text, and data visualization — except we didn't know how to make visualizations without the drag-and-drop functionality of Tableau. Revisiting the project, I realized that making a word cloud is a fun introductory tutorial for anyone wanting to see the entire process of collecting, cleaning, and visualizing data in pure Python. So with this code, you too can create a word cloud in the shape of your favorite logo or image!

Step 1: Import the necessary packages

import numpy as np
import matplotlib.pyplot as plt
import re
from twython import Twython
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
from IPython.display import Image as im

Step 2: Collect the data

The first step is to gain access to Twitter's API by registering for a developer account and then creating a new app. Once the app has been created, head over to "Keys and Access Tokens" to get the API Key and API Secret.

Figure 1: API Key and Secret for Twitter

Once the application has been created and approved, our next step is to connect to the Twitter API with the Twython package in order to extract the raw tweets from our chosen user handle.

#Connect to Twitter (substitute your own credentials - never publish real keys)
APP_KEY = "YOUR_API_KEY"
APP_SECRET = "YOUR_API_SECRET"
twitter = Twython(APP_KEY, APP_SECRET)

From there, we can call the timeline function on a user's handle. Note that Twitter limits how far back one can go on a user's timeline — 3,200 tweets at the time of writing for the free tier. While not as simple as the other parts of creating the word cloud, the following block of code tracks the oldest tweet id seen so far and uses it as a cursor to fetch the next batch of 200 tweets.

#Get timeline in batches of 200 tweets
user_timeline = twitter.get_user_timeline(screen_name='Nike', count=200)
#cursor: one below the oldest id we've seen so far
last_id = user_timeline[-1]['id'] - 1
for i in range(16):
    batch = twitter.get_user_timeline(screen_name='Nike', count=200, max_id=last_id)
    user_timeline.extend(batch)
    last_id = user_timeline[-1]['id'] - 1
Figure 2: Our Output in Python

While this output looks confusing at first, there are fortunately resources that help break down the JSON tweet format.

Figure 3: Full Tweet JSON Format (Source: Raffi Krikorian)

The above image, although created in 2010, breaks down the JSON Tweet format and is an invaluable source for almost any Twitter project — such as time series, natural language processing, and geospatial projects.
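Even without the diagram, you can poke at the structure directly. The snippet below uses a hypothetical sample dictionary containing just a few of the many fields a real tweet object carries, to show where the "text" field lives:

```python
# A single entry returned by get_user_timeline() is a plain dictionary.
# This sample is made up and includes only a handful of the fields
# a real tweet object contains.
sample_tweet = {
    'id': 1039582931231232001,
    'created_at': 'Tue Sep 11 15:04:05 +0000 2018',
    'text': 'Just do it. https://t.co/abc123',
    'user': {'screen_name': 'Nike'},
}

print(sorted(sample_tweet.keys()))  # top-level keys tell us where to look
print(sample_tweet['text'])         # the field we extract in the next step
```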

Using Figure 3 as our guide, we can see that the “text” field of each tweet contains the information we need. Furthermore, it appears that our timeline data is a list of dictionaries, so we can iterate through the list and grab each value with the “text” key.

#Extract text fields from tweets
raw_tweets = []
for tweets in user_timeline:
    raw_tweets.append(tweets['text'])
Figure 4: Raw Tweet Fields

Step 3: Clean all the data

At this point, we have our raw data — complete with URL links, special characters, emojis, and extra white space. Unfortunately, these kinds of punctuation and tokens take the "word" out of a word cloud, so we need to deal with them on a case-by-case basis.

#Create a string form of our list of text
raw_string = ''.join(raw_tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
Figure 5: Words without special characters
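To see what each substitution does, here is a hypothetical raw tweet pushed through the same three regexes; the link, the escaped unicode code, and the punctuation drop out one pass at a time:

```python
import re

# Made-up raw tweet text; '\\u2714' stands in for an escaped emoji code.
raw_string = 'Just do it \\u2714 #JustDoIt!! https://t.co/abc123'

no_links = re.sub(r'http\S+', '', raw_string)                  # strip URLs
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)      # strip escape codes
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)  # letters/spaces only

print(no_special_characters)  # 'Just do it  JustDoIt '
```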

Beyond removing special characters, we can also see filler words like "and", "the", and "but", which carry little significance given how frequently they appear in every sentence. We can get rid of these so-called "stop words" by keeping only the words that are not in the predefined STOPWORDS set of the wordcloud package. Additionally, we can remove any straggler letters that might have remained by checking each word's length and keeping it only if it is long enough.

words = no_special_characters.split(" ")
words = [w for w in words if len(w) > 2] # ignore a, an, be, ...
words = [w.lower() for w in words]
words = [w for w in words if w not in STOPWORDS]
Figure 6: Clean words without fillers

Step 4: Visualize the data

Pick any of your favorite images to use as a background stencil! (Pro tip: this works best with larger images that have a white background.)

Figure 7: Nike Stencil
mask = np.array(Image.open('/Users/shsu/Downloads/nike.png'))

Create the word cloud instance and generate the cloud (note that generate() expects the text as a single string):

wc = WordCloud(background_color="white", max_words=2000, mask=mask)
clean_string = ','.join(words)
wc.generate(clean_string)

And finally show off our new image:

f = plt.figure(figsize=(50,50))
f.add_subplot(1,2, 1)
plt.imshow(mask, interpolation='bilinear')
plt.title('Original Stencil', size=40)
plt.axis('off')
f.add_subplot(1,2, 2)
plt.imshow(wc, interpolation='bilinear')
plt.title('Twitter Generated Cloud', size=40)
plt.axis('off')
plt.show()
Figure 8: Stencil versus Cloud

Ta-da! We've now got our own word cloud image made from Twitter data with 100% pure Python code! You're one project closer to Veni, Vidi, Vici-ing the data science world!

Thank you for taking the time to read my tutorial and feel free to leave a comment or connect on LinkedIn as I’ll be posting more tutorials on data mining and data science.


  1. Twitter App
  2. Twython documentation
  3. Andreas Mueller Word Cloud Mask

Stephen Hsu

Written by

Freelance Data Scientist, Machine Learning Enthusiast, I 💖 APIs
