Chisom Nnamani
Jun 24, 2022 · 7 min read


Twitter Data Analysis — “WeRateDogs”

A Blog about Everything Data

Source: WeRateDogs Twitter

“In God we trust; all others must bring data.” — W. Edwards Deming

A data-driven data wrangler knows the importance of working with the right, complete data. Before any analysis can begin, a good data wrangler needs to gather data from the right sources and put it into a form that is actually useful.

With that said, “I like to think of data as the new soil. Get in and get your hands dirty.” — David McCandless.

And that absolutely is the whole point of my Data Wrangling project at Udacity.

As a Udacity Scholar, the Data Analyst Nanodegree program gave me the amazing opportunity to work through the full data wrangling process. The project brings together everything I learned: gathering data from a variety of sources in a variety of formats, assessing the data's quality and tidiness, cleaning it, and showcasing my wrangling efforts through analysis and visualizations.

I wrangled and analyzed the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs, adding a humorous comment about each one. The ratings almost always have a numerator greater than 10, even though the denominator is always 10.

Source: WeRateDogs Twitter

The dogs that get a rating greater than 10? They're good dogs, Brent!

Since the focus of this project is data wrangling, what exactly is data wrangling?

Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data into the desired format for analysts to use for prompt decision-making.

I first imported the Python libraries and packages used throughout the analysis.
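A minimal import cell for this kind of workflow might look like the following; the exact set depends on the notebook, but these cover gathering, wrangling, and plotting:

```python
# Core stack for gathering, wrangling, and visualizing (assumed set)
import json

import numpy as np
import pandas as pd
import requests
import tweepy
import matplotlib.pyplot as plt
```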

Gathering the Data

The data for this project came in three different formats:

  1. Twitter Archive File from WeRateDogs: WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.
Importing the already provided first dataset

Udacity provided this file as a CSV to be downloaded manually.
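Loading that CSV is a single pandas call; a minimal sketch, assuming the filename Udacity distributes:

```python
# Read the WeRateDogs Twitter archive into a DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head()
```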

2. Image Prediction File: The images in the WeRateDogs Twitter archive (that is, the first dataset above) were run through a neural network that can classify dog breeds, and the predictions were stored in this file. The file was hosted on Udacity's servers in TSV format and was downloaded programmatically from its URL using the Python Requests library.

Requests is a versatile HTTP library for Python. One of its many uses is downloading or opening a file from the web given its URL.

Using requests to download the second dataset and then importing it
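A sketch of that step; the URL here is the Udacity-hosted location of the predictions file and may change:

```python
# Download the image predictions TSV from Udacity's server
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
       '599fd2ad_image-predictions/image-predictions.tsv')
response = requests.get(url)
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)

# TSV means tab-separated, so tell pandas to split on tabs
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
```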

3. JSON File from Twitter API: Using the tweet IDs in the WeRateDogs Twitter archive, I queried the Twitter API for each tweet's JSON data with Python's Tweepy library and stored each tweet's entire set of JSON data in a file called tweet_json.txt, one tweet per line. I then read the .txt file line by line into a pandas DataFrame.

Tweepy is an open-source Python package that gives you a very convenient way to access the Twitter API from Python. More details are available online about setting up a Twitter app and accessing the Twitter API with Python.

Authenticating the API
Querying Twitter’s API for the JSON data of each tweet ID in the Twitter archive
Extracting missing columns like retweet count and favorite count and then importing it
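Condensed, those three steps look roughly like this. The credentials are placeholders, and the calls follow Tweepy's 3.x API (v4 renames TweepError to tweepy.errors.TweepyException):

```python
# Authenticate with the Twitter API (placeholder credentials)
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Query each tweet ID and write its full JSON to one line of tweet_json.txt
with open('tweet_json.txt', 'w') as outfile:
    for tweet_id in twitter_archive['tweet_id']:
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError:
            pass  # some tweets have been deleted since the archive was made

# Read the file back line by line, keeping only the missing columns
rows = []
with open('tweet_json.txt') as infile:
    for line in infile:
        data = json.loads(line)
        rows.append({'tweet_id': data['id'],
                     'retweet_count': data['retweet_count'],
                     'favorite_count': data['favorite_count']})
tweet_counts = pd.DataFrame(rows)
```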

That gave me the retweet count, favorite count, etc. for each tweet.

Assessing the Data

With the three datasets gathered, I assessed them for quality and tidiness issues.

Quality: Low-quality data is commonly referred to as dirty data. Dirty data has issues with its content. The Data Quality Dimensions are Completeness, Validity, Accuracy, and Consistency.

Tidiness: Untidy data is commonly referred to as “messy” data. Messy data has issues with its structure. Tidy data is where:

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.
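This dataset has a memorable example of the first rule being broken: assuming the standard layout of the Udacity archive, the dog stage is spread across four columns (doggo, floofer, pupper, puppo), one per value of what is really a single variable. A minimal sketch of collapsing them:

```python
# The archive encodes one variable (dog stage) as four columns, each
# holding either the stage name or the string 'None' (assumed layout)
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

def extract_stage(row):
    """Collect whichever stage names are set for this tweet."""
    stages = [s for s in stage_cols if row[s] == s]
    return ','.join(stages) if stages else np.nan

twitter_archive['dog_stage'] = twitter_archive[stage_cols].apply(extract_stage, axis=1)
twitter_archive = twitter_archive.drop(columns=stage_cols)
```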

After visually assessing the data in DataFrames and in Excel spreadsheets, and programmatically assessing the three DataFrames individually, I found 10 quality issues and 4 tidiness issues in all, each documented in my Python notebook.

Cleaning the Data

I then cleaned every issue I had documented during the assessing stage. It's worth noting that cleaning the data doesn't mean changing what the data says; it means improving its quality and tidiness so that it can actually be worked with.

I cleaned the data using a three-step process: Define, Code, and Test.

  • Define: convert the issues found during assessment into concrete cleaning tasks.
  • Code: translate each cleaning task into code and run it.
  • Test: run checks to confirm the cleaning worked, as in the sketch below.
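As a small end-to-end sketch of the loop, here is one plausible issue, assuming the archive's timestamp column arrives as plain strings:

```python
# Define: the timestamp column is stored as strings; convert it to datetime

# Code
twitter_archive['timestamp'] = pd.to_datetime(twitter_archive['timestamp'])

# Test: the column should now have a datetime dtype
assert pd.api.types.is_datetime64_any_dtype(twitter_archive['timestamp'])
```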

Storing the Data

After cleaning the data, I combined the three cleaned datasets on their common attribute, the tweet ID, and then saved the master dataset to a CSV file named `twitter_archive_master.csv`.
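A minimal sketch of that merge, assuming each cleaned DataFrame still carries its tweet_id column:

```python
# Join the three cleaned datasets on their shared tweet_id key
df_master = twitter_archive.merge(image_predictions, on='tweet_id', how='inner')
df_master = df_master.merge(tweet_counts, on='tweet_id', how='inner')

# Persist the combined master dataset for analysis
df_master.to_csv('twitter_archive_master.csv', index=False)
```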

Analysis and Visualizations

The data has been gathered, assessed, cleaned, and is now ready for analysis. In order to get insights from the data, I asked it some questions.

  1. Which image number occurs most often as each tweet's most confident image prediction?

The graph above and the statistical distribution show that the image number that most often carries a tweet's most confident prediction is 1.
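The count behind the graph is a one-liner, assuming the predictions file's img_num column (which image in the tweet the most confident prediction refers to):

```python
# How often does each image number carry the most confident prediction?
img_counts = df_master['img_num'].value_counts()
print(img_counts)

# Bar chart of the distribution
img_counts.plot(kind='bar', title='Image number of most confident prediction')
plt.xlabel('Image number')
plt.ylabel('Number of tweets')
plt.show()
```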

2. What is the most popular dog stage according to the neural network's image predictions?

The distribution of dog stages shows that ‘pupper’ (a small doggo, usually younger) is the most popular stage, followed by ‘doggo’ and ‘puppo’.

Let me tell you what these dog stages mean.

  • Pupper: A pupper is a small doggo, usually younger. Can be equally if not more mature than some doggos.
  • Doggo: A doggo is a big pupper, usually older. It appears to have its life in order. Probably understands taxes and whatnot.
  • Puppo: A puppo is a transitional phase between pupper and doggo. Easily understood as the dog’s equivalent of a teenager.

So the analysis tells us that, across all the image predictions, most dogs were in the pupper stage. This could be because a young, immature dog is usually cuter than an adult dog, which is why many people buy, adopt, and own them. It should also be noted that a large share of the dog stage column in the master dataset is missing, so the distribution may not reflect the full truth.
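The distribution itself comes straight from value_counts on the combined dog_stage column (named as in the tidiness sketch earlier):

```python
# Tally the dog stages recorded in the master dataset (many rows have none)
print(df_master['dog_stage'].value_counts())
```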

3. Does retweet count positively correlate with favorite count?

We see a linear relationship between the two variables. This doesn't mean that an increase in retweet_count causes an increase in favorite_count, but when you compare the two, there is a strong positive linear correlation between retweet_count and favorite_count.
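A quick way to check, assuming the merged columns are named retweet_count and favorite_count:

```python
# Pearson correlation between retweets and favorites
print(df_master['retweet_count'].corr(df_master['favorite_count']))

# Scatter plot to eyeball the linear relationship
df_master.plot(kind='scatter', x='retweet_count', y='favorite_count', alpha=0.3)
plt.title('Retweet count vs. favorite count')
plt.show()
```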

Conclusion

Further analysis and visualization could be carried out on this dataset, but because data wrangling is the major focus of this project, most of the time was spent on that part.

For my project submission at Udacity, I submitted two reports along with my Jupyter notebook. The first, `wrangle_report`, briefly describes my wrangling efforts and is framed as an internal document. The second, `act_report`, communicates the insights and displays the visualizations produced from the wrangled data; it is framed as an external document, like a blog post or magazine article.

This project was an interesting and exciting one for me. It has sharpened my skills as a data wrangler. I was able to use Python libraries to wrangle real data, and I can't wait to see the future projects I'll be working on.

You can view my project along with my reports on my GitHub repository.

You can also connect with me on LinkedIn and Twitter. :)

Thanks for reading!
