Cereal Project Part 2: Transforming Data (and Challenges)

Ridwan O
2 min read · Jan 15, 2023

Now that I had a good amount of data to work with, it was time to transform it into something I could use to build the dashboard and answer the questions posed in the business problem.

Data Values, Members, and Features

These were the important columns provided by the Twitter-scraping Python library:

  • created on date
  • tweet id
  • tweet text
  • tweet language (I only kept track of the tweets in English)
  • number of likes

I also added a column to keep track of the cereal names.
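Here is a minimal sketch of that step with pandas. The raw field names (`date`, `id`, `content`, `lang`, `likeCount`), the sample records, and the `build_frame` helper are all made up for illustration; the real names depend on the scraping library.

```python
import pandas as pd

# Hypothetical raw records, shaped like what a scraping library might return.
raw_tweets = [
    {"date": "2023-01-02", "id": 101, "content": "Froot Loops for dinner again",
     "lang": "en", "likeCount": 4, "replyCount": 1},
    {"date": "2023-01-03", "id": 102, "content": "Los Froot Loops son lo mejor",
     "lang": "es", "likeCount": 2, "replyCount": 0},
]

# Assumed mapping from raw field names to the column names used downstream.
COLUMNS = {
    "date": "created_on",
    "id": "tweet_id",
    "content": "text",
    "lang": "language",
    "likeCount": "likes",
}


def build_frame(records, cereal_name):
    """Keep the columns of interest, keep English tweets, tag the cereal."""
    df = pd.DataFrame.from_records(records)
    df = df[list(COLUMNS)].rename(columns=COLUMNS)
    df = df[df["language"] == "en"].copy()
    df["created_on"] = pd.to_datetime(df["created_on"])
    df["cereal"] = cereal_name  # the extra column for the cereal name
    return df.reset_index(drop=True)


tweets = build_frame(raw_tweets, "Froot Loops")
```

Tagging the cereal name at this point means tweets for different cereals can be concatenated into one table later without losing track of which search they came from.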

Sentiment Data

Based on the text of each tweet, I needed to determine whether it was positive, negative, or neutral. As I pulled the tweet data, I ran each tweet's text through a sentiment analysis function that I found in a tutorial. (I won't go through how the sentiment was calculated, because that tutorial explains it in detail.) This gave me the sentiment value that I would later use to classify each tweet as positive, negative, or neutral.
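The last step, turning a sentiment value into a label, can be sketched roughly like this. It assumes the value is a polarity score between -1 and 1, which is what many sentiment libraries return; the `label_sentiment` name and the `threshold` parameter are my own for illustration, not something from the tutorial.

```python
def label_sentiment(score: float, threshold: float = 0.0) -> str:
    """Map a polarity score (assumed range -1 to 1) to a sentiment label.

    Scores above `threshold` are positive, scores below `-threshold`
    are negative, and everything in between is neutral.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Example scores; with the default threshold only an exact 0.0 is neutral.
labels = [label_sentiment(s) for s in (0.6, 0.0, -0.25)]
```

Raising `threshold` (say, to 0.05) widens the neutral band, which can help when many tweets score only slightly above or below zero.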

Testing the Data

After a botched interview last year, I realized I hadn't really learned or done much data testing while studying Data Engineering, and decided it was time to learn. I did a ton of research on Great Expectations, but realized it may have been overkill for this small project. I later came across the Python library Pandera. It is a small (but effective) data validation library, which is exactly what I needed. I read through the docs and set it up in my project. Once it was set up, I tested the data I was ingesting and made corrections in code as necessary (dropping rows with empty tweets, mismatches in tweet language, etc.) to address any errors or problems in the data.

Reflections

I was pretty upset that I hadn't spent time thinking about data validation before. Wrong data (not necessarily informationally wrong, but with issues in data types, empty values, duplicates, etc.) can really mess up your final product or deliverable. It's important to validate your data!

Table of Contents

  1. Cereal Project Overview
  2. Cereal Project Part 1: Extracting Data (and Challenges)
  3. Cereal Project Part 2: Transforming Data (and Challenges)
  4. Cereal Project Part 3: Loading the Data
  5. Cereal Project Part 4: Dashboard
  6. Challenges Along the Way
