How I Mined Tweets to Learn More about A Zambian TV Show I Wasn’t Watching

Mbuyu Makayi
Developer Circle Lusaka
5 min read · Aug 24, 2018

A few months back, after being a fellow in DataHack4FI, I decided to dip into data science. So I got myself a course on Udemy and started learning the tools of the trade for what has been described as the new sexiest job.

Like most developers, I couldn’t just sit through the tutorial and repeat what was being taught. I needed a pet project to cement the ideas I had learnt. So I naturally took to looking at datasets on Kaggle, but none had what I was looking for. I wanted data with what I could call a life (I know, corny, right?). That’s when the idea of mining Twitter’s API hit me. I could get Tweets on my timeline and see what I could find.

It so happened that at that time, there was a show on Zambezi Magic called Tizibika that everyone was tweeting about. This presented the perfect subject for a pet project on Data Science.

Fetching Data from Twitter

To get the Tweets, I used a Python library called TwitterScraper, which has a method called query_tweets. To this method I supply the hashtag or word and the date range in which the tweets were made. I then stored the data in an Excel file.
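Here is a minimal sketch of the fetching step, not necessarily the exact code used for the project. It assumes the `query_tweets(query, begindate=..., enddate=...)` signature and the tweet attribute names (`user`, `timestamp`, `text`, `likes`, `retweets`) from the TwitterScraper releases of that period. The helper names, the date range and the CSV output (used here in place of Excel so that only the standard library is needed) are illustrative choices.

```python
import csv
import datetime as dt

# Columns kept for the analysis; attribute names are an assumption
# based on twitterscraper's Tweet objects at the time.
FIELDS = ["user", "timestamp", "text", "likes", "retweets"]


def fetch_tweets(query, begin, end):
    """Scrape tweets matching `query` between two dates (needs network access)."""
    # Imported lazily so the helpers below work without the library installed.
    from twitterscraper import query_tweets
    return query_tweets(query, begindate=begin, enddate=end)


def tweets_to_rows(tweets):
    """Turn tweet objects into plain dicts holding only the fields we need."""
    return [{field: getattr(tweet, field, None) for field in FIELDS}
            for tweet in tweets]


def save_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Example usage (hypothetical date range, requires twitterscraper):
# tweets = fetch_tweets("Tizibika", dt.date(2018, 1, 1), dt.date(2018, 3, 1))
# save_rows(tweets_to_rows(tweets), "tizibika_tweets.csv")
```

Keeping the scraping call separate from the conversion makes it easy to re-run the analysis on saved data without hitting Twitter again.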

This is the resulting dataframe

Cleaning Data

After loading the data, I noticed that the timestamp wasn’t in the right format. I needed to separate it into date and time, and then extract the hour. This was useful for plotting the hours when people were most actively tweeting about the show.
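The split can be sketched with the standard library as follows. The timestamp format (`2018-02-04 20:15:00`) and the `add_date_and_hour` helper are assumptions for illustration; with a pandas dataframe, the `.dt` accessor would do the same job.

```python
from datetime import datetime


def add_date_and_hour(rows, fmt="%Y-%m-%d %H:%M:%S"):
    """Return copies of the rows with extra 'date' and 'hour' keys."""
    cleaned = []
    for row in rows:
        stamp = datetime.strptime(row["timestamp"], fmt)
        # Keep the original fields and add the two derived columns.
        cleaned.append({**row,
                        "date": stamp.date().isoformat(),
                        "hour": stamp.hour})
    return cleaned
```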

The hour column is now available

Analysing The Tweets

When did the Show air?

The first thing I did was to find out how many tweets were sent on each day.

The result was this line chart, which shows the dates when tweets containing the word or hashtag Tizibika started appearing on Twitter. Judging by the spike in activity, you can also see that the show premiered between February 4th and 11th.
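A minimal sketch of the daily count, assuming each row already carries a `date` string in ISO format (`YYYY-MM-DD`). The `tweets_per_day` name is illustrative.

```python
from collections import Counter


def tweets_per_day(rows):
    """Return (date, tweet_count) pairs sorted chronologically."""
    counts = Counter(row["date"] for row in rows)
    # ISO date strings sort in calendar order, so a plain sort is enough.
    return sorted(counts.items())
```

The resulting pairs can be passed straight to any plotting library to draw the line chart.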

Line plot of Twitter activity by date

What time does the Show air?

The second thing I wanted to find out was what time of day the show aired. From the plot, you can see that most tweets were sent around 20:00 GMT (22:00 local time), which suggests that this is when the show aired.
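The busiest hour and its local-time equivalent can be worked out like this. It is a sketch that assumes the `hour` values are in GMT and that local time is Central Africa Time (UTC+2), which matches the 20:00 to 22:00 conversion above. The function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Central Africa Time, a fixed two-hour offset from GMT.
CAT = timezone(timedelta(hours=2), "CAT")


def peak_hour(hours):
    """Return the (hour, tweet_count) pair with the most tweets."""
    return Counter(hours).most_common(1)[0]


def gmt_hour_to_local(hour, tz=CAT):
    """Convert an hour of the day from GMT to the given time zone."""
    # The date is arbitrary because the offset is fixed all year round.
    moment = datetime(2018, 2, 4, hour, tzinfo=timezone.utc)
    return moment.astimezone(tz).hour
```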

Who’s the most influential Twitter user on the topic?

After finding out the dates and time the show airs, the next step was to find out which people had the most influence in Tweets about the show. From the bar chart generated, you can see that tozy_b is the person with the most likes. After these results, I was curious to see who tozy_b was, and I found out that she is one of the lead actors on the show (data doesn’t lie).

Bar graph showing Tweeps with the most likes
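A ranking like this boils down to summing a metric per user and sorting. The sketch below assumes each row has a `user` key plus numeric `likes` and `retweets` values. `top_users` is a hypothetical helper, not the project's actual code.

```python
from collections import defaultdict


def top_users(rows, metric="likes", n=10):
    """Return the n users with the highest total for the chosen metric."""
    totals = defaultdict(int)
    for row in rows:
        # int() also copes with values read back from a file as strings.
        totals[row["user"]] += int(row[metric])
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]
```

Because the metric is a parameter, calling `top_users(rows, metric="retweets")` gives the same ranking for retweets.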

Who was the Most Retweeted User?

To answer this question, I simply changed the code I had written to find the user with the most likes.

Most retweeted Users

From the plot, you can see that Arushapot was the most retweeted user. Rightly so, because she happens to be the director of the show (told you data doesn’t lie).

Natural Language Processing with NLTK

What words were being used to describe Tizibika?

The next part of my little tweet data analysis was to understand what people were actually saying about the show. To do this, I had to use a natural language processing library called NLTK. The first step was to remove all the punctuation marks, stopwords (words like I, me, if and and) and other things like Twitter hashtags.

After cleaning the data, I counted the occurrences of every word and plotted them in a bar graph. From this you can deduce that most viewers loved or liked the show and that the show was airing on Zambezi Magic.
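The cleaning and counting steps can be sketched as below. To keep the example runnable without downloading any corpus, it uses a tiny hand-written stopword set as a stand-in; in practice NLTK's `stopwords.words("english")` list would take its place. The regular expressions and function names are illustrative assumptions.

```python
import re
from collections import Counter

# Small stand-in for NLTK's English stopword list.
STOPWORDS = {"i", "me", "if", "and", "the", "a", "is", "it",
             "on", "to", "of", "this", "was"}


def tokenize(text):
    """Lowercase a tweet and return its words without links, tags or stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"[#@]\w+", " ", text)       # drop hashtags and mentions
    words = re.findall(r"[a-z]+", text)        # keeps letters only, so punctuation goes
    return [word for word in words if word not in STOPWORDS]


def word_counts(tweets):
    """Count how often each remaining word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts
```

Calling `word_counts(texts).most_common(20)` then gives the words and frequencies for the bar graph.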

Conclusion

I started off with little knowledge of the show, but from the analysis I gained more information about it. It was satisfying to be able to deduce so much from what seemed to be regular Tweets. This is another example of how Data Science can be used to gain more information on so many subjects, and why everyone should consider a career in it or at least learn the basics of it like I did.

I wish I could have gone further with the project by adding sentiment analysis, but life happened and it has been six months since I worked on it. So if you are interested in the project, you can check out the repo here and contribute to it.

And thanks to my friend Daniel Schmitt who shared some of his code with me.

Mbuyu Makayi
Developer Circle Lusaka

Software Engineer at Chipper Cash | ex-Facebook Developer Circle: Lusaka Community Lead | F1 Fan