The Data Cleaning Challenge: A Twitter Data Analysis Project

Okonkwo Chukwuebuka Malcom
9 min read · Apr 3, 2023


Finally I can tick this off my to-do list!! This project introduced me to the world of Social Media Analytics.

I have been meaning to analyze data scraped from Twitter. Not just any dataset from Twitter, but one that I scraped myself.

The last Twitter data analysis project I worked on was not scraped by me, but it is a project I am proud of. You can find it here.

This project is very exciting for me because I am friends with the organizers of the #DataCleaningChallenge, so I made it a personal goal to scrape the tweets relating to the hashtag and do the project justice by analyzing the tweets and visualizing my insights. This project was done using Python and Power BI.

The Challenge

The #DataCleaningChallenge is a Twitter event aimed at promoting best practices in data cleaning. The challenge encourages participants to share their experiences, tips, and tricks in data cleaning by working on a dirty dataset. It also served as a medium for enthusiasts to get a feel of what data cleaning is all about while receiving mentorship from the organizers. The two organizers are Chinonso Promise (Data Analyst) and Victor Somadina (Data Analyst).

The online event launched on the 9th of March 2023 via a Twitter Space. Alongside the organizers, two speakers, Wofai Eyong (Data Analyst) and Daniel Bamidele (GIS Analyst), assisted with the opening and the briefing about the challenge.

Project Overview

This project is a Python-based analysis of tweets related to the #DataCleaningChallenge. It aims to provide insights from the data gathered during the challenge: how people perceive data cleaning, the most talked-about tools (which hint at the tools the participants used), and strategies for making the next challenge even bigger (watch out for the next one in April).

The project consists of four main parts:

  1. Twitter Scraping
  2. Data Cleaning and Preprocessing
  3. Exploratory Data Analysis
  4. Data Visualization

Twitter Scraping

For scraping the tweets, snscrape was my alternative since Tweepy has many restrictions at the moment. The good thing about snscrape is that you do not need API keys, and it offers a lot more flexibility.

The Data Cleaning Challenge commenced on March 9, 2023, so I scraped tweets for the entire month of March just to know whether the hashtag was in use before that day.

Using snscrape, a total of 922 tweets were returned from 502 different users. I collected data containing the tweet ID, username, tweet content, timestamp of the tweet, and more. I then saved it as a CSV file.
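To illustrate the process, here is a minimal sketch of how such a scrape could look with snscrape's Python module. The query string, the field names, and the output file name are illustrative assumptions, not my exact script.

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Search the hashtag over the whole of March 2023 (illustrative query)
query = "#DataCleaningChallenge since:2023-03-01 until:2023-04-01"

rows = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    rows.append({
        "id": tweet.id,
        "username": tweet.user.username,
        "content": tweet.rawContent,  # tweet.content in older snscrape versions
        "timestamp": tweet.date,
        "likes": tweet.likeCount,
        "retweets": tweet.retweetCount,
        "language": tweet.lang,
    })

tweets = pd.DataFrame(rows)
tweets.to_csv("data_cleaning_challenge_tweets.csv", index=False)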

The scraped data can be found Here.

Overview of the scraped data

Data Cleaning and Preprocessing

The main aim of this section is to identify and fix quality and structure issues in the data, transforming the scraped data into a more usable format for analysis. Read more about data cleaning here

  • I started inspecting my data by checking for duplicated tweets, then for null values and data type issues, before examining the content of each column.
  • The language column is supposed to contain a two-letter language code, with the exception of und, which means “undefined”. I found two three-letter codes in the column, qht and qme. The reasons they exist are detailed in this post… click here. I cleaned the column by replacing them with undefined (und) using the function below.
# Define a function that can be used to clean the language column
def replace_long_rows_with_und(tweets, max_length):
    # Iterate over the rows of the DataFrame
    for index, row in tweets.iterrows():
        # Check if the length of the language value exceeds max_length
        if len(row["language"]) > max_length:
            # If the language is not already "und", replace it with "und"
            if row["language"] != "und":
                tweets.at[index, "language"] = "und"
    return tweets

# Call the replace_long_rows_with_und function with max_length=2
tweets = replace_long_rows_with_und(tweets, 2)
  • I also looked into the content column, and it contained tags similar to HTML tags, from which I extracted the text inside the tags.
  • I also extracted the date (yyyy/mm/dd) from the timestamp column. Both steps are sketched below.
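Here is a hedged sketch of those last two steps. The column names ("content" and "timestamp") and the tag pattern are assumptions based on the description above; the exact code is in the repository.

import pandas as pd

# Strip HTML-like tags, keeping only the text inside them, e.g.
# '<a href="...">Twitter Web App</a>' -> 'Twitter Web App'
tweets["content"] = tweets["content"].str.replace(r"<[^>]+>", "", regex=True)

# Extract the date (yyyy/mm/dd) from the timestamp column
tweets["date"] = pd.to_datetime(tweets["timestamp"]).dt.date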

The full data cleaning steps and procedures are documented on my GitHub repository.

Exploratory Data Analysis

Before proceeding with my exploratory data analysis, I set out to get answers to a few questions of mine:

  1. What is the total number of users who tweeted about the Data Cleaning Challenge, and what are the total likes and total retweets?

Finding the total number of unique users who tweeted, the total likes, and the total retweets helps us understand the reach and engagement of tweets related to a certain hashtag. It can provide insights into the popularity of the topic, the level of engagement of the audience, and the potential impact of the tweets.

From the analysis, there were over 900 tweets from over 500 users using the hashtag #DataCleaningChallenge. A total of 4,783 retweets and 16,865 likes were recorded. I also went further to find the users with the highest numbers of likes and retweets.

Users' tweets with the most likes and retweets

This insight helps with identifying potential collaborators or partners for future challenges, if the person is an active member of the data community.
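As a hedged sketch, and assuming the column names from the scraping step above, these figures could be computed like so:

# Overall reach and engagement of the hashtag
total_users = tweets["username"].nunique()   # unique users
total_likes = tweets["likes"].sum()          # total likes
total_retweets = tweets["retweets"].sum()    # total retweets

# Tweets that drew the most likes and retweets
top_by_likes = tweets.nlargest(5, "likes")[["username", "likes"]]
top_by_retweets = tweets.nlargest(5, "retweets")[["username", "retweets"]]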

2. I would love to observe the Twitter activity per day. What was the peak period?

The online event was launched on the 9th of March 2023 via a Twitter Space that took place in the evening. Observing the Twitter activity gives a good idea of how users interacted with the challenge.

Tweets made on each day

Tweet volume rose sharply after the launch date. The peak period ran from the 10th to the 14th of March 2023, and the highest number of tweets came on the 11th of March 2023, with 357 tweets.
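A minimal sketch of the daily count, assuming the date column created during cleaning:

# Number of tweets per day and the peak day
daily_counts = tweets.groupby("date").size()
peak_day = daily_counts.idxmax()   # 2023-03-11 in this dataset
peak_count = daily_counts.max()    # 357 tweets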

3. What is the language distribution of the audience?

Analyzing the language distribution of the tweets tells us the most commonly used languages among the audience, which can help us understand its geographic and cultural diversity. Knowing the language of your audience helps you communicate with them better.

I segregated the languages into English and others, and the chart above shows the percentage split. Over 90% (871) of the tweets are in English. Other languages include Tagalog 1.4% (13), Indian 1.0% (9), French 0.3% (3), Arabic 0.1% (1), Italian 0.1% (1), German 0.1% (1), Haitian 0.1% (1) and Estonian 0.1% (1). The remaining 2.3% (21) are undefined.
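A quick sketch of how this distribution can be computed, using the cleaned language column:

# Language distribution as counts and percentages
lang_counts = tweets["language"].value_counts()
lang_pct = (lang_counts / len(tweets) * 100).round(1)

# Share of tweets in English
is_english = tweets["language"].eq("en")
print(f"English: {is_english.sum()} tweets ({is_english.mean():.1%})")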

4. Which Data Analytics tool was mentioned the most?

For the data cleaning challenge, participants were allowed to use popular data analytics tools such as Power BI, Python, Excel, SQL, R and Tableau.

The number of times data cleaning tools are mentioned in tweets can provide valuable insights into which tools are popular among Twitter users, which tools they prefer for carrying out data cleaning tasks, and which tools are trending in the data cleaning community.

N.B: You can estimate the frequency of usage of a tool from the number of times it was mentioned. However, keep in mind that the number of times a tool is mentioned may not directly correlate with how frequently it is used, as tweets may only mention the tool in passing or as part of a larger conversation.
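As an illustration, counting mentions could look like the hedged sketch below. The word-boundary matching is one way to avoid false positives (such as a lone "R" inside another word), not necessarily how I implemented it.

# Count how many tweets mention each tool (case-insensitive)
tools = ["Power BI", "Python", "Excel", "SQL", "R", "Tableau"]
mentions = {
    tool: tweets["content"].str.contains(rf"\b{tool}\b", case=False, na=False).sum()
    for tool in tools
}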

Sentiment Analysis

According to Amazon, sentiment analysis is the process of analyzing digital text to determine whether the emotional tone of the message is positive, negative, or neutral.

This analysis tells us how people perceived the data cleaning challenge. Positive sentiment could indicate that users found the challenge engaging, informative or useful, while negative sentiment may suggest that users did not enjoy the challenge or had difficulty with it.
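As a minimal sketch, a simple polarity-based classifier could look like the snippet below. TextBlob is an assumption here; the method I actually used is documented in my GitHub repository.

from textblob import TextBlob

# Classify each tweet as positive, negative, or neutral by polarity score
def classify_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

tweets["sentiment"] = tweets["content"].apply(classify_sentiment)
print(tweets["sentiment"].value_counts())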

My Exploratory Data Analysis is fully documented on my GitHub.

Data Visualization

I exported the cleaned file, along with some other files containing insights from the data, as CSV files. I then loaded them into Power BI to build my report.

You can interact with the dashboard HERE

Recommendations

Before I make recommendations, I would like to congratulate the Organizers and the speakers for a job well done. On the 11th of March 2023, #DataCleaningChallenge was one of the trending topics in Nigeria.

Some recommendations I can make are:

  • For wider reach and more online presence, I recommend that the organizers encourage participants to write about their experience, the things they learnt, and the struggles they faced while participating in subsequent challenges. This gives the organizers a chance to help them and to improve the sentiment among participants.
  • Looking at the most talked-about tools, we can see that users' preferences tend towards Excel, SQL and Python for data cleaning. For subsequent data cleaning challenges, I recommend holding teaching sessions on how to use these three tools effectively and take participants' skillsets to the next level. This should be done before the commencement of subsequent challenges so that participants are not stuck when the challenges begin.
  • Concerning the tools that are least frequently talked about, it could be that users find them difficult to use. To help individuals get familiar with more data analytics tools, the organizers could fix sessions to train all the participants on one tool at a time, perhaps Power BI since it has similarities to Excel.
  • I analyzed the top mentioned Twitter handles and the most common hashtags. For future challenges, and to foster collaboration, I would recommend the organizers collaborate with anyone from the names below. This could help widen the reach and bring in more users to partake in the challenge.
  • Lastly, I recommend that a broadcasting strategy be put in place to make this challenge known to more users. For example, a flier could be made to make the challenge look more official. You could even ask data influencers to spread the word, and encourage those who participated in the previous challenge to spread the word as well.

Generally, I would say the challenge was a success and did well in terms of social media outreach, but there is room for improvement.

For subsequent challenges, I recommend setting a target for the number of users tweeting about the challenge. A total of 502 Twitter users were discovered for this challenge; 1,000 could be the target for the next one.

Thank you for reading, and feel free to comment, share, and correct me on any aspect of the work. I would also love feedback.

Feel free to reach out to me on LinkedIn and on Twitter.
