Tracking tweets using word clouds

gnegrelli
6 min read · Oct 6, 2021


So, a while back I had an idea to develop a service in Python to analyze tweets from public figures. The main idea was to fetch those tweets, clean up the data by removing stopwords, and present the core words (and concepts) of the posts.

To visualize the concepts present in the tweets, I decided to use word clouds. A word cloud is a simple way to depict the keywords of a text, where the importance of each word is shown through its font size or color. An example of a word cloud is shown below.

Example of word cloud

The service would also allow checking the topics of deleted tweets (as long as they had previously been stored in the database), giving us insight into the main topics of the posts these people delete.

Finally, I also thought about providing this visualization for a given date range. This would allow us to check how the topics evolved over time.

Access to the Twitter API

The first thing I needed was access to the Twitter API, so I could easily fetch the tweets from a list of profiles. In order to use their API, you must first create a Twitter profile (in case you live under a rock and don't have one already) and then apply for a developer account here.

With the developer account enabled, you can now access information about users, their tweets and their interactions through the Twitter API endpoints. To do so, just register your project and app in the developer portal. This will generate the keys, secrets and tokens needed for accessing the API programmatically.

It is important to say that all the data accessible through the API is public and already available on Twitter's website. The API only allows us to easily access, gather and work with this data.

Modelling data

Let’s start building the application by modelling the data and creating its tables. For now, I’m only interested in storing data from specific users and their tweets. Thus, the only models I ended up creating are TwitterUser and Tweet.

Since I’m using Django to create the back-end of this application, the models can easily be created in the app’s models.py file, as shown in the gist below.

Models of Twitter Users and their Tweets
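The gist itself isn't reproduced here, but a minimal sketch of what these two models might look like, built from the fields described below (field types and lengths are my own assumptions), would be:

```python
from django.db import models


class TwitterUser(models.Model):
    twitter_id = models.BigIntegerField(unique=True)   # id assigned by Twitter
    username = models.CharField(max_length=15)          # handle, e.g. gnegrelli_
    profile_name = models.CharField(max_length=50)      # display name, e.g. Gabriel Negrelli
    followers = models.PositiveIntegerField(default=0)
    verified = models.BooleanField(default=False)
    joined_at = models.DateTimeField()
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.username


class Tweet(models.Model):
    tweet_id = models.BigIntegerField(unique=True)       # id assigned by Twitter
    user = models.ForeignKey(TwitterUser, on_delete=models.CASCADE, related_name='tweets')
    content = models.TextField()
    tweeted_at = models.DateTimeField()
    deleted = models.BooleanField(default=False)
    likes = models.PositiveIntegerField(default=0)
    retweets = models.PositiveIntegerField(default=0)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
```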

The fields of the user model are:

  • twitter_id: id of user created by Twitter;
  • username: Twitter username (name on Twitter handle, e.g. @gnegrelli_ );
  • profile_name: name of user shown on profile (e.g. Gabriel Negrelli);
  • followers: number of followers;
  • verified: whether or not the account is verified;
  • joined_at: date and time user joined twitter;
  • created_at: date and time the instance was created on the database; and
  • updated_at: date and time of the last data update.

The fields of the tweet model are:

  • tweet_id: id of tweet created by Twitter;
  • user: tweet owner;
  • content: tweet content;
  • tweeted_at: date and time tweet was posted;
  • deleted: whether or not tweet was deleted by user;
  • likes: number of likes tweet received;
  • retweets: number of times tweet was retweeted by others;
  • created_at: date and time the instance was created on the database; and
  • updated_at: date and time of the last data update.

The Tweet model also has a tokens property, which is responsible for tokenizing and cleaning the tweet content, leaving only the relevant words present in the tweet.
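As a rough sketch, the property could look something like this, using nltk's plain TweetTokenizer and stop-word list for now (both get improved later in the article; remember to download the stopwords corpus once with nltk.download('stopwords')):

```python
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Added inside the Tweet model:
@property
def tokens(self):
    """Tokenize the tweet content and drop common Portuguese stopwords."""
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)
    stop_words = set(stopwords.words('portuguese'))
    return [token for token in tokenizer.tokenize(self.content) if token not in stop_words]
```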

Populating tables

Alright, with the models created, I am now able to gather the data from Twitter. Since I’ll be connecting to the API all the time, it is better to create a method that handles the connector creation.

Method to create TwitterAPI connector
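The method isn't shown here; assuming the TwitterAPI package is used and the credentials generated in the developer portal are kept in Django settings (the setting names below are my own), it could look roughly like this:

```python
from django.conf import settings
from TwitterAPI import TwitterAPI


def get_connector():
    """Create a TwitterAPI connector using the keys, secrets and tokens from the developer portal."""
    return TwitterAPI(
        settings.TWITTER_CONSUMER_KEY,
        settings.TWITTER_CONSUMER_SECRET,
        settings.TWITTER_ACCESS_TOKEN_KEY,
        settings.TWITTER_ACCESS_TOKEN_SECRET,
    )
```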

In order to add user information and their tweets to the database, I also developed two methods that access the Twitter API and fetch and store this data. It is important to notice that both methods described below use the connector above to communicate with Twitter.

The add_twitter_user method receives as input a list of strings containing the Twitter handles (without the @ symbol) of the users we would like to monitor and populates the database with their information. To gather the tweets of one of those users, we can simply provide the TwitterUser instance to get_user_tweets and it will fetch and store their tweets.

The params kwarg on the second method allows us to configure the Twitter request. The main purpose of that is to be able to retrieve more tweets per call (by default, the Twitter API only returns 10 tweets) and to ignore retweets (for which the Twitter API returns truncated content). For more information about the request parameters and how they can be used, please refer to this link.
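The two methods aren't reproduced here either. A rough sketch of how they might look, assuming the v1.1 users/lookup and statuses/user_timeline endpoints and my own field mapping, is:

```python
from datetime import datetime

# Twitter v1.1 timestamps look like "Wed Oct 06 20:19:24 +0000 2021"
TWITTER_DATE_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def add_twitter_user(handles):
    """Fetch profile data for a list of handles (without @) and store it as TwitterUser rows."""
    api = get_connector()
    response = api.request('users/lookup', {'screen_name': ','.join(handles)})
    for data in response:
        TwitterUser.objects.update_or_create(
            twitter_id=data['id'],
            defaults={
                'username': data['screen_name'],
                'profile_name': data['name'],
                'followers': data['followers_count'],
                'verified': data['verified'],
                'joined_at': datetime.strptime(data['created_at'], TWITTER_DATE_FORMAT),
            },
        )


def get_user_tweets(user, params=None):
    """Fetch tweets of a TwitterUser and store them as Tweet rows.

    `params` lets the caller tweak the request, e.g. {'count': 200, 'include_rts': 'false'}.
    """
    api = get_connector()
    request_params = {'user_id': user.twitter_id, **(params or {})}
    response = api.request('statuses/user_timeline', request_params)
    for data in response:
        Tweet.objects.update_or_create(
            tweet_id=data['id'],
            user=user,
            defaults={
                'content': data['text'],
                'tweeted_at': datetime.strptime(data['created_at'], TWITTER_DATE_FORMAT),
                'likes': data['favorite_count'],
                'retweets': data['retweet_count'],
            },
        )
```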

Improving text tokenization and cleansing

So, I’ve just now realized that when a tweet has media attached to it, the Twitter API returns the attachment as a link. This link usually starts with https://t.co. In order to correctly tokenize tweets with attached media, our tokenizer should remove these hyperlinks from the generated list of tokens.

Since the tokenizer we are using does not do that, we can simply create our own tokenizer based on the one from nltk. The CustomTokenizer will then inherit from nltk’s TweetTokenizer and extend its tokenize method. This new class is shown below.
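The class isn't included in the extract, but a minimal version of it could be:

```python
import re

from nltk.tokenize import TweetTokenizer

# Matches the shortened links (http(s)://t.co/...) that Twitter uses for attached media
MEDIA_RE = re.compile(r'https?://t\.co/\S*')


class CustomTokenizer(TweetTokenizer):

    def tokenize(self, text):
        """Tokenize the text as TweetTokenizer does, then drop media links."""
        tokens = super().tokenize(text)
        return [token for token in tokens if not MEDIA_RE.match(token)]
```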

The regex pattern MEDIA_RE matches any token starting with either http://t.co or https://t.co, thus removing them from the list of tokens.

With the tokenization problem solved, we can now focus on improving token cleansing. Due to the grammatical structure of language, texts contain many tokens that don’t add much meaning to the message, such as articles, prepositions and punctuation. These tokens are also extremely frequent and end up polluting the data. For that reason, these tokens (also referred to as stopwords) must be removed from the list of tokens before any analysis.

The nltk package already has built-in lists of stopwords for most of its supported languages. However, having them doesn’t mean they are any good. For Portuguese (the language of most tweets I’m analysing), nltk’s list is a good start, but it is far from having good coverage. Because of that, I will extend it to remove other undesired words from the tweets.

Extending list of stopwords
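The gist isn't reproduced here; a sketch of what the extension might look like (the extra words below are illustrative examples, not the exact list I used) is:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # make sure the corpus is available

# Start from nltk's Portuguese list and add tokens that are common in tweets
# but carry little meaning, including punctuation
STOPWORDS = set(stopwords.words('portuguese'))
STOPWORDS |= {'pra', 'pro', 'q', 'vc', 'né', 'tá', 'rt', '...', '!', '?', ',', '.'}
```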

Note: it is always a good practice to check if both the tokenizer and the list of stopwords available are well suited for your application. If not, you can tailor them to improve the results or build your own from scratch.

So, after tokenizing and cleaning every tweet made by our users, we are now able to create a wordcloud from their content. To do so, I created a method that receives a TwitterUser as input and displays the wordcloud for their tweets. This method uses two packages to render the wordcloud: matplotlib and wordcloud, as shown below.
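The method isn't shown in full here; a sketch of it, reusing the model names from above (the function name is my own), could be:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def show_user_wordcloud(user):
    """Render a word cloud from all stored tweets of the given TwitterUser."""
    tokens = []
    for tweet in user.tweets.all():  # related_name defined on the Tweet model
        tokens.extend(tweet.tokens)

    # background_color=None together with mode='RGBA' gives a transparent background
    cloud = WordCloud(background_color=None, mode='RGBA').generate(' '.join(tokens))

    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
```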

At first, the method gathers all tweets stored on our database that were tweeted by the user. It then proceeds to create a list with all the tokens contained in those tweets. Finally, a WordCloud object is created and displayed. An example made from a few tweets of @emicida is shown below.

Example of wordcloud created

By default, the wordclouds created by the package have a dark background. In order to have a transparent background, I have passed the arguments background_color and mode to the class constructor. To further customize your wordcloud, have a look at the additional options in the package documentation.
