I first got into data science almost by coincidence, through a side project. My friend and I came from computer science backgrounds, but we were fascinated by how much attention the presidential candidates' tweets were getting, and so we stumbled into Twitter APIs, the difficulty of cleaning text, and data visualization, which we had no idea how to do without the drag-and-drop functionality of Tableau. Revisiting the project, I realized that making a word cloud is a fun introductory tutorial for anyone who wants to see the entire process of collecting, cleaning, and visualizing data in pure Python. With this code, you too can create a word cloud in the shape of your favorite logo or image!
Step 1: Import the necessary packages
import re
import numpy as np
import matplotlib.pyplot as plt
from twython import Twython
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
from IPython.display import Image as im
Step 2: Collect the data
The first step is to gain access to Twitter’s API: register as a developer at https://dev.twitter.com/ and then create a new app at https://apps.twitter.com/. Once the app has been created, head over to “Keys and Access Tokens” to get the API Key and API Secret.
Once the application has been created and approved, the next step is to connect to the Twitter API with the Twython package and extract the raw Tweets from our chosen user handle.
#Connect to Twitter
APP_KEY = "YOUR_API_KEY"        # replace with your own API Key
APP_SECRET = "YOUR_API_SECRET"  # replace with your own API Secret
twitter = Twython(APP_KEY, APP_SECRET)
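Never commit real credentials to a shared script. One safer pattern (a minimal sketch, assuming the environment variable names `TWITTER_APP_KEY` and `TWITTER_APP_SECRET`, which are my own choice, not Twitter's) is to read them from the environment:

```python
import os

def load_twitter_keys():
    """Read the API credentials from environment variables.

    TWITTER_APP_KEY / TWITTER_APP_SECRET are assumed names; set them
    in your shell before running the script.
    """
    app_key = os.environ.get("TWITTER_APP_KEY")
    app_secret = os.environ.get("TWITTER_APP_SECRET")
    if not app_key or not app_secret:
        raise RuntimeError("Set TWITTER_APP_KEY and TWITTER_APP_SECRET first")
    return app_key, app_secret
```

You would then call `Twython(*load_twitter_keys())` instead of hardcoding the strings.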
From there, we can call the function on a user timeline. Note that Twitter limits how far back one can go on a user’s timeline, which is 3,200 tweets at the time of writing for the free tier. While not as simple as the other parts of creating the word cloud, the following block of code grabs the most recent tweet id and uses it as a cursor (via max_id) to fetch each successive batch of 200 older tweets.
user_timeline = twitter.get_user_timeline(screen_name='Nike', count=1)  # get most recent tweet
last_id = user_timeline[0]['id'] - 1
for i in range(16):
    batch = twitter.get_user_timeline(screen_name='Nike', count=200, max_id=last_id)
    user_timeline.extend(batch)
    last_id = batch[-1]['id'] - 1
While this output is confusing at first glance, there are fortunately resources that break down the JSON Tweet format.
The image above, although created in 2010, maps out the JSON Tweet format and is an invaluable resource for almost any Twitter project, whether time series, natural language processing, or geospatial work.
Using the map above as our guide, we can see that the “text” field of each tweet contains the information we need. Furthermore, since our timeline data is a list of dictionaries, we can iterate through the list and grab each value with the “text” key.
#Extract text fields from tweets
raw_tweets = [tweet['text'] for tweet in user_timeline]
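As a self-contained illustration of that pattern (the sample tweets below are made up, and a real Tweet object carries dozens of additional fields such as user, entities, and created_at), pulling the text out of a list of tweet dictionaries looks like:

```python
# A heavily simplified stand-in for a timeline: a list of dicts,
# each with a "text" key holding the Tweet body.
sample_timeline = [
    {"id": 1, "text": "Just do it."},
    {"id": 2, "text": "New release dropping Friday."},
]

# Grab each tweet's text with the "text" key.
texts = [tweet["text"] for tweet in sample_timeline]
```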
Step 3: Clean all the data
At this point, we have our raw data, complete with URL links, special characters, emojis, and extra white space. Unfortunately, this kind of clutter takes the “word” out of a word cloud, so we need to deal with each case individually.
#Create a string form of our list of text
raw_string = ' '.join(raw_tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
Beyond removing special characters, we can see filler words like “and”, “the”, and “but”, which carry little significance given how frequently they appear in any sentence. We can get rid of these so-called “stop words” by keeping only the words that do not appear in the wordcloud package’s predefined STOPWORDS set. Additionally, we can remove any straggler letters that might have remained by checking the length of each word and keeping it only if it exceeds a certain length.
words = no_special_characters.split(" ")
words = [w for w in words if len(w) > 2] # ignore a, an, be, ...
words = [w.lower() for w in words]
words = [w for w in words if w not in STOPWORDS]
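Before visualizing, it can help to sanity-check the cleaned list by counting the most frequent words with the standard library. This is a quick diagnostic of my own, not part of the original pipeline:

```python
from collections import Counter

def top_words(words, n=5):
    """Return the n most frequent words with their counts,
    most common first."""
    return Counter(words).most_common(n)

# Example with a toy word list:
# top_words(["shoe", "run", "shoe"], 2) gives [("shoe", 2), ("run", 1)]
```

If a meaningless token dominates the counts, it is worth adding another cleaning step before generating the cloud.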
Step 4: Visualize the data
Pick any of your favorite images to use as a background! (Pro tip: this works best with larger images that have a white background.)
mask = np.array(Image.open('/Users/shsu/Downloads/nike.png'))
Create the word cloud instance and generate the cloud (note that the documentation requires the text to be a single string):
wc = WordCloud(background_color="white", max_words=2000, mask=mask)
clean_string = ','.join(words)
wc.generate(clean_string)
And finally show off our new image:
f = plt.figure(figsize=(50, 50))
f.add_subplot(1, 2, 1)
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.title('Original Stencil', size=40)
f.add_subplot(1, 2, 2)
plt.imshow(wc, interpolation='bilinear')
plt.title('Twitter Generated Cloud', size=40)
plt.show()
Ta-da! We’ve now got our own word cloud image made from Twitter data with 100% pure Python code! You’re one project closer to Veni, Vidi, Vici-ing the data science world!
Thank you for taking the time to read my tutorial and feel free to leave a comment or connect on LinkedIn as I’ll be posting more tutorials on data mining and data science.