‘We Rate Dogs’: Twitter Data Analysis

Udacity Data Analyst ND Data Wrangling Project Write Up.

Sakina Fakhruddin
Women Data Greenhorns
7 min read · Jan 29, 2019


Source: Freedom Crossroads

For the data wrangling project on the Udacity Data Analyst Nanodegree, learners get the chance to go through the whole data analysis process, from collecting the data to cleaning and analyzing it, and finally visualizing the trends it contains. The data comes from the Twitter account ‘WeRateDogs’, a humorous account that gives most dogs a rating above 10 (out of 10). But first,

What is Data Wrangling?

Data Wrangling is the process of transforming and mapping data from “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

Collecting Data

Data had to be collected from three data sources:

  • A file at hand that was available as is in the resources tab of the Udacity Nanodegree classroom. It contained the major chunk of the data: tweets from the WeRateDogs account from 2015 to 2017.
  • A file that had to be programmatically downloaded from the Udacity servers, containing the results of a machine learning algorithm run on the images from the WeRateDogs account. I downloaded this file using the Python library requests (a minimal sketch follows this list).
How to download a file with the given URL
  • The third source was the Twitter API itself, queried through Tweepy using the tweet IDs found in the file at hand. Tweepy is an easy-to-use Python library that connects to a Twitter account using API keys and access tokens. Once authenticated, one can easily scrape tweets off Twitter. To get started, follow the documentation; a rough sketch also appears below.
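A minimal sketch of that programmatic download with requests might look like the following (the URL here is a placeholder, not the actual Udacity link):

```python
import requests

# Placeholder URL; the real link to the image-predictions file
# is given in the Udacity classroom.
url = 'https://example.com/image-predictions.tsv'

response = requests.get(url)
response.raise_for_status()  # fail loudly if the download did not succeed

# Save the raw bytes locally so pandas can read the file later
with open('image_predictions.tsv', 'wb') as f:
    f.write(response.content)
```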

For those who are new to web scraping, here is a very good blog post written by a fellow Bertelsmann Scholar @Barbara Stempień.
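For context, authenticating with Tweepy (the 3.x API) and fetching tweets by ID looks roughly like this; the keys are placeholders and the handling of the results is only a sketch:

```python
import tweepy

# Keys and tokens come from a Twitter developer account (placeholders here)
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

# wait_on_rate_limit pauses automatically whenever Twitter's rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = [892420643555336193]  # in practice, the IDs from the file at hand

for tweet_id in tweet_ids:
    try:
        # tweet_mode='extended' returns the full, untruncated tweet
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        # keep e.g. tweet.retweet_count and tweet.favorite_count
    except tweepy.TweepError:
        # some tweets have been deleted and can no longer be fetched
        pass
```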

Overall, I had three files with ample data to analyze.

Cleaning Data

Of course, with three different data sources, there were bound to be problems across the three files. The task at hand was to find and clean at least 8 data quality and tidiness issues. I managed to find and clean 12.

Heat map plotting the Null Values in the main data file.

My approach to finding these data issues was to first pull up basic information about the three data sets. Then I did a visual assessment, i.e. I simply looked at specific columns of the data and spotted issues, such as a lot of missing values in a few columns, which were then validated via programmatic assessment, for example with the info() function.
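As a rough illustration (the file name is assumed, not necessarily the exact one from the classroom), the programmatic and visual checks looked something like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

archive = pd.read_csv('twitter-archive-enhanced.csv')  # the file at hand

# Programmatic assessment: data types and non-null counts per column
archive.info()

# Visual assessment of missing data: heat map of null values per column
sns.heatmap(archive.isnull(), cbar=False)
plt.title('Null values in the main data file')
plt.show()
```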

The issues I found ranged from major ones, such as missing values, bad data types and bad data within the entries, to smaller ones, such as the dog stages being spread over four different columns when they fit into one, and the three data files having different numbers of records, including duplicates and retweets that had to be removed.

The final list of issues found can be divided into two main types:

  1. Quality Issues:- These stem from dirty data, i.e. data with problems in its content. Common data quality issues include missing data, invalid data, inaccurate data, and inconsistent data.
  2. Tidiness Issues:- These stem from the structure of the data, also referred to as messy data. A good guide to tidy and untidy data is found here.
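As an example of a tidiness fix, the four dog-stage columns mentioned earlier can be collapsed into a single column with pandas melt(). This is only a sketch on toy data, not the exact cleaning code from the project:

```python
import pandas as pd

# Toy frame: one variable (the dog stage) spread across four columns
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':    ['doggo', 'None', 'None'],
    'floofer':  ['None', 'None', 'None'],
    'pupper':   ['None', 'pupper', 'None'],
    'puppo':    ['None', 'None', 'None'],
})

# Melt the four columns into one 'stage' column and drop the 'None' rows
melted = df.melt(id_vars='tweet_id',
                 value_vars=['doggo', 'floofer', 'pupper', 'puppo'],
                 value_name='stage')
stages = melted[melted.stage != 'None'][['tweet_id', 'stage']]

# Merge back so every tweet keeps exactly one row, NaN where no stage was given
tidy = df[['tweet_id']].merge(stages, on='tweet_id', how='left')
```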

Descriptive Analysis

Once I had completed my assessment of the data and fixed all the issues within my data set, I was left with two clean and tidy data sets, one on tweet data and one on tweet images. I had a lot of fun doing analysis on this clean data.

My approach to analysis was to ask a question and then try to answer the question with the data that I had.

The burning question on my mind was how well did the model perform?

A describe() on the model predictions gave the following result:

Running describe on the model predictions. Tweet_id and img_num are to be ignored.

Oddly, the maximum value of p1_conf is 1.0. This means the model was 100% sure of one of its predictions. The following is that record, where the prediction carrying the 1.0 probability is ‘jigsaw puzzle’.

Row with a 1.0 confidence level
Source: WeRateDogs Twitter

So, to evaluate whether this was the correct prediction, I checked the URL and was led to the image shown above.

The model seems to have completely overlooked the dog in the picture and only considered the jigsaw puzzle, which makes up most of the image. A few things to note here: I did not build the model, I do not know whether any optimization techniques were run on it, and I do not even know whether it was treated as a single-class or multi-class problem. This is simply an analysis of the data provided. Based on this information, though, the model can be vastly improved.

To validate this conclusion, a few other rows of interest, i.e. the ones with a prediction of ‘not a dog’, were checked. Out of 2075 entries, there are 832 entries where the image is likely not a dog. However, within these there were many false negatives, such as the following, which got the prediction ‘shopping cart’:

Source: WeRateDogs Twitter.
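For reference, these rows can be pulled out with simple boolean filters; a sketch, assuming the image-predictions file has the columns p1, p1_conf, p1_dog and jpg_url:

```python
import pandas as pd

predictions = pd.read_csv('image_predictions.tsv', sep='\t')

# Images whose top prediction is not a dog breed
not_dogs = predictions[~predictions.p1_dog]
print(len(not_dogs))
print(not_dogs[['p1', 'p1_conf', 'jpg_url']].head())

# The record the model was 100% confident about
print(predictions[predictions.p1_conf == 1.0][['p1', 'p1_conf', 'jpg_url']])
```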

Obviously, then, there are major problems with the model that have to be fixed to give better results: the dog in the picture above fills most of the frame, yet the prediction is still ‘not a dog’. The next question:

What are the most common dog names?

This was easy. The dog names had been corrected during the data cleaning stage: ‘An’, ‘a’ and ‘the’ were all in the list of names and had to be cleaned out before any analysis could be done.

Top ten names for dogs, apart from None.

The resulting list shows the top ten names people have given their dogs. ‘None’ appears wherever no name was found.
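Tallying the names after that cleaning is a one-liner in pandas; this sketch assumes a name column in which the stray articles have already been removed and missing names are stored as ‘None’:

```python
# 'archive' is the cleaned tweet data frame from earlier
# Top ten dog names, ignoring the 'None' placeholder
name_counts = archive[archive.name != 'None'].name.value_counts()
print(name_counts.head(10))
```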

Visual Analysis

The last requirement of the project was to create visualizations from the data. A similar approach to the analysis part was used, i.e. a question was asked in the hope of answering it with a visualization.

Research Question 1:

How did the retweet count and favourite count improve over time?

Did it increase as the popularity of the account increased?

Scatter Plot of retweets and favourites over time.

Trend Findings:

In the beginning, the favourite counts and the retweet counts are at a similar level, but the number of tweets per unit of time is higher. As 2016 and 2017 progress, the tweet frequency drops (seen in the sparser blue and red dots), while the favourite and retweet counts climb higher and higher.

Another trend is that favourite counts increase drastically, going up to 10,000 for a few tweets, while retweet counts stay below 5,000 for the entire duration.
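For reference, a plot along those lines could be produced with matplotlib roughly as follows; the column names and colours are assumptions, not the exact code behind the figure above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# 'archive' is the cleaned tweet data; the columns below are assumed
archive['timestamp'] = pd.to_datetime(archive['timestamp'])

plt.figure(figsize=(10, 5))
plt.scatter(archive['timestamp'], archive['retweet_count'],
            s=10, alpha=0.5, color='blue', label='retweet count')
plt.scatter(archive['timestamp'], archive['favorite_count'],
            s=10, alpha=0.5, color='red', label='favourite count')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Retweets and favourites over time')
plt.legend()
plt.show()
```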

Research Question 2:

What types of dogs appear in the tweets?

Top ten dog types predicted.

Findings:

The most common prediction is the golden retriever, which seems to be a popular choice for a pet dog, followed by the Pembroke and the Labrador retriever. It is a mix of small and big dogs!
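A sketch of how the counts behind this chart could be computed and plotted, again assuming the p1 and p1_dog columns of the image-predictions file:

```python
import matplotlib.pyplot as plt

# Keep only predictions that are actual dog breeds, then count the top ten
top_breeds = predictions[predictions.p1_dog].p1.value_counts().head(10)

top_breeds.plot(kind='barh')
plt.xlabel('Number of images')
plt.title('Top ten dog types predicted')
plt.gca().invert_yaxis()  # most common breed at the top
plt.show()
```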

Conclusion

Since this was primarily a data wrangling project, there were a few major takeaways:

  • New functions within the pandas library, such as melt() and pivot(), which help combine and de-clutter the data.
  • The importance of visually checking the data. For a lot of people, manually checking or looking at data is the task they struggle with. However, it was only when I looked at the actual data, and not an abstract version of it, that I found some of its faults. It also helped me become an expert on my data set, so that by the time I moved on to the analysis I knew it inside out, something expert data analysts highly recommend.
  • Usually, finding duplicates is as easy as using the pandas duplicated() function. However, for this data that did not work! I found duplicates through the URLs, as they were the only unique identifier I had: each tweet is supposed to have a unique URL, so any two records with the same URL point to the same underlying tweet, and one of them had to be removed (see the sketch after this list).
  • How important it is to run an initial assessment to find any and all issues with the data set. This assessment helps in planning and understanding the data, and optimizes the cleaning process.
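A sketch of that URL-based de-duplication, assuming the archive stores each tweet’s link in an expanded_urls column:

```python
# Whole-row duplicated() found nothing, so de-duplicate on the tweet URL,
# the only identifier guaranteed to be unique per tweet
print(archive.expanded_urls.duplicated().sum())  # how many repeated URLs
archive = archive.drop_duplicates(subset='expanded_urls', keep='first')
```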

The data wrangling project was one of the most fun projects that I have done to date. It is always fun to sit down with a data set and try to gauge what it is telling you. Along with the fun, there was also quite a bit of learning in this project. It is available on my GitHub, and I would love to hear what you think of it!
