Not So Sweet Candy Data (Untidy Cleaning/EDA)

Published in

Spring 2019 — Information Expositions

10 min read · Feb 24, 2019

Working with untidy data sets is never a simple or enjoyable task, but it certainly tests your skills as a data scientist. In this blog post I walk through retrieving an untidy data set, cleaning and manipulating it, and reshaping it into tidy data. Tidy data is essential for conducting EDA and other information science processes. Without it, it is very hard to structure exploration effectively, produce models, or support discoveries with plotted trends. It took me a bit of searching to find untidy data that was manageable to clean within the timeframe of this assignment. Luckily I stumbled upon this atrocious candy survey, launched by a university in California that publicly released the collected data.

Originally I wanted to explore which ages like certain candies and what the most popular candy per state was. After digging into my data I spent more time cleaning than anticipated, so I explored narrower questions, such as what California participants’ favorite candy is compared to Colorado’s. To have a bit more fun I looked into some wacky questions around users who answered “Which color is the dress, Blue/Black or White/Gold?” I was interested in seeing how the favorite candies of the two groups differ. To finish it off I explored who enjoys raisins and who is going out trick-or-treating. I looked into raisins because they were the only healthy option on the candy survey and most respondents answered that they despised them.

EDA Questions:

What is California participants’ favorite candy, compared to Colorado’s?
How do favorite candies differ between people who saw the dress as Blue/Black vs. White/Gold?
Who enjoys raisins?
Who is going trick-or-treating?

(Untidy) Original Candy Hierarchy Data Set

Tidying the Data

The problems with the raw data?

At first glance the Candy Hierarchy survey raw data looked semi-manageable to clean up, but once I started exploring I saw how untidy it really was. The data contains a variety of inconsistencies, poor organization, missing values, inappropriate column labels, joke candy entries, and non-serious participant responses. Over the course of my cleaning process I was able to handle the missing, irrelevant, and mislabeled data.

Steps I took to clean it up?

To begin tidying up this sloppy candy data I first looked into removing the bad data. I determined there were enough submissions that removing respondents with NaN values using dropna would make the data frame more manageable to explore and analyze. I thought about filling the NaN values with the mean of the surrounding values, but it was unnecessary to retain all the rows with missing data in order to continue the EDA. After dropping all the NaN values I reviewed the 119 columns to pinpoint any issues. The last 11 columns of the data were additional data on digital apps like Daily Dish, Science, ESPN, and Yahoo. Although plotting where participants touched the app would be interesting, the time frame of this project meant I had to focus mainly on the candy survey questions. Since I wanted to focus on the candy data, I split the data frame into two new sections: a “Candy” section retaining all the candy survey questions and an “Other” section with the off-topic questions. Splitting the data this way let me explore further inconsistencies and untidy issues in the part I cared about.
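The two steps above, dropping incomplete rows and splitting off the trailing off-topic columns, can be sketched in pandas roughly as follows. This is a minimal sketch on a toy frame, not the original notebook code: every column name except Q4: COUNTRY is an assumption, and the cutoff of 2 trailing columns stands in for the 11 mentioned above.

```python
import pandas as pd

# Toy stand-in for the raw survey. Column names other than "Q4: COUNTRY"
# are assumptions made up for this illustration.
raw = pd.DataFrame({
    "Q4: COUNTRY": ["USA", "usa", None, "Canada"],
    "Q6 | 100 Grand Bar": ["JOY", "MEH", "JOY", None],
    "Q6 | Candy Corn": ["DESPAIR", "MEH", "JOY", "MEH"],
    "Q12: MEDIA [ESPN]": [1.0, 0.0, 1.0, 1.0],
    "Click Coordinates (x, y)": ["(10, 20)", "(5, 5)", "(1, 2)", "(3, 4)"],
})

# Drop every respondent (row) that has at least one missing answer.
complete = raw.dropna()

# Split by position: everything except the trailing off-topic columns
# goes to "candy", the trailing columns go to "other".
N_OTHER = 2  # hypothetical; the post describes 11 such columns
candy = complete.iloc[:, :-N_OTHER]
other = complete.iloc[:, -N_OTHER:]

print(candy.shape, other.shape)
```

Splitting by position only works if the off-topic columns really are the last ones in the file; selecting them by name would be the safer choice if the column order is not guaranteed.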

Other Data Section (Unused) | Candy Data Section (Used)

After reviewing all the new candy column labels I noticed that the survey included joke entries that were not real candy options, apparently to troll the participants. What caught my eye about these survey questions were the silly names. These inappropriate, unreal candy entries clearly needed to be removed in order to keep the data consistent and the responses legitimate. To remove these unwanted columns I used another .drop on things like ‘Bonkers (the board game)’, ‘Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)’, ‘Any full-sized candy bar’, ‘Candy that is clearly just the stuff given out for free at restaurants’, ‘Cash, or other forms of legal tender’, ‘Chardonnay’, ‘Chick-o-Sticks (we don’t know what that is)’, ‘Creepy Religious comics/Chick Tracts’, ‘Dental paraphernalia’, ‘Generic Brand Acetaminophen’, ‘Glow sticks’, ‘Gum from baseball cards’, ‘Healthy Fruit’, ‘Hugs (actual physical hugs)’, ‘Jolly Rancher (bad flavor)’, ‘Jolly Ranchers (good flavor)’, ‘JoyJoy (Mit Iodine!)’, ‘Senior Mints’, ‘Kale smoothie’, ‘Minibags of chips’, ‘Real Housewives of Orange County Season 9 Blue-Ray’, ‘Sandwich-sized bags filled with BooBerry Crunch’, ‘Spotted Dick’, ‘Those odd marshmallow circus peanut things’, ‘Vials of pure high fructose corn syrup, for main-lining into your vein’, ‘Vicodin’, ‘White Bread’, and ‘Whole Wheat anything’. As you can see, the joke entries were abundant within the dataset and had to be tidied away before any successful EDA.
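Dropping a list of unwanted columns like this might look roughly like the sketch below. It runs on a toy frame with only a few of the names above, and it assumes the columns are labeled with the bare item names; the real headers may carry a question prefix, in which case the list would need to match those exact labels.

```python
import pandas as pd

# Toy candy section; the bare item names as column labels are an assumption.
candy = pd.DataFrame({
    "100 Grand Bar": ["JOY", "MEH"],
    "Chardonnay": ["JOY", "DESPAIR"],
    "Kale smoothie": ["DESPAIR", "DESPAIR"],
    "Candy Corn": ["MEH", "DESPAIR"],
})

# A short excerpt of the joke entries listed above.
joke_items = ["Chardonnay", "Kale smoothie", "White Bread"]

# errors="ignore" skips names that are not present (here "White Bread")
# instead of raising a KeyError.
candy_clean = candy.drop(columns=joke_items, errors="ignore")

print(list(candy_clean.columns))
```

Keeping the names in a single list makes it easy to review what was removed and to rerun the cleaning if the list changes.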

Following the removal of the joke survey questions I browsed the new data frame and instantly noticed the inconsistent responses submitted to the “Q4: COUNTRY” and “Q5: STATE, PROVINCE, COUNTY, ETC” columns. Many respondents typed in all caps and others in all lowercase, which skewed the analysis because the same answer was counted as several different values. “USA” was the proper way to respond, whereas other users chose the “OTHER” option so they could provide troll answers like “Merica”, “us”, and “the 1 and only u s of a”. I wanted to focus on cleaning and combining all the variations of USA so I could explore this country’s candy preferences properly, without other distracting, unnecessary data and users.

Since the main candy rating only allowed users to pick from DESPAIR, MEH, and JOY, those answers were appropriately consistent. It appeared that only the Age, Country, and State questions allowed users to type a custom response, which made the data messy and in need of tidying. After pinpointing the massive number of inconsistent state abbreviations and country spellings, I realized that cleaning and combining similar submissions would take longer than I had time for. When I approached my peers about this issue, they suggested that I could reduce my data by looking only at the respondents who took the survey seriously, meaning they followed the instructions by choosing “USA” instead of “OTHER” and used only numeric values for Age. I took this advice since I was only looking at the USA results to begin with. I used the pandas str.contains method to pull all the “USA” respondents who answered appropriately into a new data frame, assuming that they took the rest of the survey seriously. It was important to keep cleaning and consolidating the Candy Hierarchy data before launching the EDA, since retaining the troll submissions would only skew the answers to the real questions I’m trying to explore.
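A filter along these lines could be sketched as follows. This is an illustrative version on toy data, not the original code: the column name Q3: AGE is an assumption, and pd.to_numeric is one possible way to keep only numeric ages.

```python
import pandas as pd

# Toy respondents; "Q3: AGE" is an assumed column name.
candy = pd.DataFrame({
    "Q3: AGE": ["34", "old enough", "50", "27"],
    "Q4: COUNTRY": ["USA", "USA", "Merica", "USA"],
})

# True where the country answer contains the literal string "USA".
# na=False treats missing answers as non-matches.
is_usa = candy["Q4: COUNTRY"].str.contains("USA", regex=False, na=False)

# errors="coerce" turns non-numeric ages such as "old enough" into NaN.
age = pd.to_numeric(candy["Q3: AGE"], errors="coerce")

# Keep only USA respondents with a numeric age, and store the age as int.
serious = candy[is_usa & age.notna()].copy()
serious["Q3: AGE"] = age.loc[serious.index].astype(int)

print(serious)
```

Note that a substring match is case-sensitive and would also keep answers such as “USA!!!”, so comparing for exact equality with “USA” is the stricter alternative.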

Now that I had removed the ridiculous survey question columns and the missing values, separated out the unneeded digital application data, and narrowed my participants to the serious USA respondents, I was ready to edit the new Candy data frame to make it more presentable. To do this I focused on cleaning up the column names and then organizing separate data frames to compare specific groups of users in my final EDA process. Originally some column names contained unneeded question numbers, while others were bulky, distracting, and full of irrelevant details. I renamed the column headers more concisely, making the data easier and quicker to navigate. Although I didn’t have time to rename all 120 columns, I did rename the key columns I was performing EDA on, which are discussed in the rest of this blog post.
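Renaming only a handful of key columns can be done with a mapping, roughly as sketched here. Apart from Q5: STATE, PROVINCE, COUNTY, ETC, the original header names and all of the new short names are assumptions chosen for illustration.

```python
import pandas as pd

# Empty toy frame that only carries headers; most names are assumed.
df = pd.DataFrame(columns=[
    "Q1: GOING OUT?",
    "Q3: AGE",
    "Q5: STATE, PROVINCE, COUNTY, ETC",
    "Q6 | 100 Grand Bar",
])

# Map only the columns that need shorter names; hypothetical new labels.
rename_map = {
    "Q1: GOING OUT?": "going_out",
    "Q3: AGE": "age",
    "Q5: STATE, PROVINCE, COUNTY, ETC": "state",
}

# Columns missing from the mapping keep their original label.
df = df.rename(columns=rename_map)

print(list(df.columns))
```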

Data Interviewing and EDA Processes.

The findings once it was clean?

To begin my EDA process I used my new clean and concise Candy Hierarchy data frame to look at USA participants and explore their favorite and least favorite candies. To start, I created data frames that each focused on a particular group from the United States, which allowed me to uncover trends within specific user groups. I created clean data frames separating male respondents from female, California residents from Colorado’s, people who saw the dress as blue/black from those who saw white/gold, and participants who enjoyed raisins.
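Building these group-specific frames is a matter of boolean filtering. The sketch below shows the idea on toy data; the short column names and the exact answer strings are assumptions, not the values from the real survey.

```python
import pandas as pd

# Toy USA respondents with assumed column names and answer strings.
usa = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Other"],
    "state": ["California", "California", "Colorado", "Colorado"],
    "dress": ["White and gold", "Blue and black",
              "White and gold", "White and gold"],
})

# One frame per group, each selected with a boolean mask.
males = usa[usa["gender"] == "Male"]
females = usa[usa["gender"] == "Female"]
california = usa[usa["state"] == "California"]
colorado = usa[usa["state"] == "Colorado"]
white_gold = usa[usa["dress"] == "White and gold"]
blue_black = usa[usa["dress"] == "Blue and black"]

print(len(males), len(females), len(california), len(colorado))
```

A groupby on the same columns would give the per-group summaries in one step, but separate frames are convenient when each group is inspected on its own.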

I started my EDA by looking at male versus female trends and differences. I used the describe function, including the object and category columns of the data frame, to gather insights. This exposes the total submission count, the number of unique responses, the most common response, and the frequency of that top response for each column. In this candy survey there were 258 male participants and only 132 females, which is why it was appropriate to separate and explore both sides. From the chart I can conclude that the typical male respondent is from California, doesn’t go trick-or-treating, and is 50 years old (the most common age). Males enjoy the rich chocolate candies like 100 Grand Bars, Cadbury Creme Eggs, and Caramellos, and dislike cheap, compressed, sugar-based candies like Mary Janes and Candy Corn. On the other side, the typical female respondent is also from California and doesn’t go out trick-or-treating, and the most common age was 48, 2 years younger than for the male participants. Surprisingly, the candy preferences of females and males are almost identical, apart from the female ‘despair’ ranking on Black Jacks compared to the males’ ‘meh’ response. I began to think that the large share of California users might be skewing the results, so that they show what California people like more than any difference between genders. If I had had the time to randomly sample respondents so that each state contributed the same number of users, this comparison could have been even more effective.
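The describe call on non-numeric columns returns exactly those four rows: count, unique, top, and freq. Here is a minimal sketch on toy data, with assumed column names; dtype=object is set explicitly so the include filter behaves the same across pandas versions.

```python
import pandas as pd

# Toy male respondents; column names are assumptions.
males = pd.DataFrame({
    "going_out": ["No", "No", "Yes"],
    "100 Grand Bar": ["JOY", "JOY", "MEH"],
}, dtype=object)

# For object/category columns, describe() reports count, unique,
# top (the most common value) and freq (how often top occurs).
summary = males.describe(include=["object", "category"])

print(summary)
```

Since top only reports the single most common answer, a column where two answers are nearly tied can look more one-sided than it is; value_counts on that column shows the full split.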

Since I had a clear idea of the California users and their candy preferences, I wanted to explore Colorado’s results. Colorado had an underwhelming number of participants, with 6 males and 1 labeled other, but that was enough to generate the trend data. Colorado participants were among the younger crowd, with a most common age of 37 compared to California’s 42. There were differences between the candy preferences of California’s users and Colorado’s. Colorado’s users enjoy 100 Grand Bars, whereas California residents think they are meh. Other results are comparable, but it was interesting to see that the overall majority of people enjoy Cadbury Creme Eggs and Caramellos. I believe this is a consistent trend because those candies are among the tastier, richer chocolates. Getting into the more bizarre comparisons of this EDA, I started to focus on users who saw different colors in the same dress. Among both the Colorado and the California survey participants, the majority saw the dress as White and Gold.

By separating the users who saw the dress as one color from those who saw the other, I was able to dig deeper into whether there were any interesting similarities or differences in candy preference. Out of all the users, 1081 saw the dress as White and Gold and 635 say it’s Blue and Black, so about 63% of these respondents (1081 of 1716) think the dress is White and Gold. Surprisingly there were no candy preference differences between the two groups. Whether this is coincidence or not, it was interesting that there was such a lack of differentiation. I thought that maybe different age ranges see different colors in the dress, so I plotted the users’ ages against which color dress they saw. The plot shows that the younger users aged 14–30 saw blue and black, whereas those aged 57–68 saw white and gold. Since there are inconsistencies across the other age results, it is hard to conclude that certain age groups are bound to see the dress as one specific color, but the findings are interesting.
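One way to check an age pattern like this without a plot is to bin the ages and cross-tabulate them against the dress answer. The sketch below uses toy data; the column names, the answer strings, and the bin edges (loosely based on the 14–30 and 57–68 ranges mentioned above) are all assumptions.

```python
import pandas as pd

# Toy respondents; names, answers, and ages are made up for illustration.
usa = pd.DataFrame({
    "age": [16, 25, 29, 40, 60, 65],
    "dress": ["Blue and black", "Blue and black", "White and gold",
              "White and gold", "White and gold", "White and gold"],
})

# Bin ages into (0, 30], (30, 56], (56, 120]; the edges are an assumption.
bins = [0, 30, 56, 120]
labels = ["30 and under", "31-56", "57 and over"]
usa["age_group"] = pd.cut(usa["age"], bins=bins, labels=labels)

# Share of each dress answer within every age group (rows sum to 1).
table = pd.crosstab(usa["age_group"], usa["dress"], normalize="index")

print(table)
```

Normalizing by row makes age groups of different sizes comparable, which matters when some groups have only a handful of respondents.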

Over the course of this data cleaning and EDA, the trend of despair with Box O Raisins was very prevalent. I wanted to discover who did enjoy raisins, so I gathered all the users who labeled raisins as JOY to uncover who those people were. My first assumption was that users who label raisins as JOY are probably healthier eaters or have allergies. The data frame below does support the idea of healthy participants, because the ones who enjoyed the raisins labeled the chocolate and other sugar-filled candies as meh or despair. It was interesting to see that among all the participants, 16 dislike candy but enjoy raisins. From these results it seems likely that a person who really enjoys raisins doesn’t enjoy snacking on candy. Another random finding to wrap this up: the people who like raisins most also tend to see the dress as White and Gold.
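Finding respondents who rate raisins as JOY but no other candy as JOY could be sketched like this. The toy frame and its column labels are assumptions, and counting zero other JOY ratings is just one possible reading of “dislike candy”.

```python
import pandas as pd

# Toy ratings; column labels are assumptions for illustration.
usa = pd.DataFrame({
    "Box O Raisins": ["JOY", "JOY", "DESPAIR"],
    "100 Grand Bar": ["MEH", "JOY", "JOY"],
    "Candy Corn": ["DESPAIR", "MEH", "MEH"],
})

# Respondents who rated raisins as JOY.
raisin_fans = usa[usa["Box O Raisins"] == "JOY"]

# Count how many of the remaining candies each fan also rated JOY.
other_candy = raisin_fans.drop(columns="Box O Raisins")
joy_counts = other_candy.eq("JOY").sum(axis=1)

# Raisin fans with no JOY rating for any other candy.
no_other_joy = raisin_fans[joy_counts == 0]

print(len(raisin_fans), len(no_other_joy))
```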

To conclude my EDA and have a bit of fun, I was curious who was going trick-or-treating. It was shocking to see that the majority of users don’t plan to go out on Halloween. Even among the younger participants, few planned to go out. Is trick-or-treating dying out? I think safety may be becoming a large concern among families, which could be why both the younger generation and the older generation with kids won’t be taking part in trick-or-treating activities.


Written by Steven Rothaus