Slaying the Beast of a Messy Dataset

For my most recent class assignment I was tasked with finding a messy dataset and cleaning it up. How scary could that be? I went to Google, searched “datasets that need to be cleaned,” and was intrigued by a dataset released by the United States Department of Agriculture. The Quarterly Food-at-Home Price Database (QFAHPD) “provides food price data to support research on the economic determinants of food consumption, diet quality, and health outcomes.” The data seemed interesting, and I had never analyzed a dataset like it before, so I decided to check it out.

The attractiveness of the data quickly disappeared when I saw its shape. It was shocking, scattered, and scary. I was looking forward to a quick and easy assignment, but I came face-to-face with a monstrous task. The data was the stuff of nightmares: four different Excel files with 62 sheets in all. Each of the four files covered a different subset of food (fats, beverages, and prepared foods; fruits and vegetables; grains and dairy; meats, nuts, and eggs), with each of the 62 sheets holding a subgroup of one of those four main food groups. To make matters worse, there were almost 53,000 unique rows of data across the four files. The data was like a colossal monster with many heads, disorderly and enraged, and I was responsible for taming the beast.

To begin my extensive fight to clean up the monstrous data, I loaded each of the four Excel files into a Jupyter Notebook and read each of the 62 Excel sheets into it. To tame the disorder of sheets, I created a new column that tied each row to the food subgroup (the sheet) it came from. I then concatenated the 62 food-group subsets into four dataframes, mirroring the structure of the original Excel files. To add another level of order, I created a second column that bound each row to one of the four main food groups. At this point I had four loaded and properly labeled dataframes in the Notebook, but the data was still scattered and scary to work with. To transform it into a usable form, I had to combine the many heads of the monster into one large head; that way it was more controllable and I could focus all of my attention on one place. One final concatenation produced a towering, frightening dataframe with 52,831 rows of data across 12 columns.
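In case it helps, here is a minimal sketch of that loading-and-concatenating step, assuming hypothetical file names and relying on pandas’ ability to read every sheet of a workbook at once (the real QFAHPD files and column layouts may differ slightly):

```python
import pandas as pd

# Hypothetical file names for the four QFAHPD workbooks
files = {
    "fats_beverages_prepared": "qfahpd_fats_beverages_prepared.xlsx",
    "fruits_vegetables": "qfahpd_fruits_vegetables.xlsx",
    "grains_dairy": "qfahpd_grains_dairy.xlsx",
    "meats_nuts_eggs": "qfahpd_meats_nuts_eggs.xlsx",
}

frames = []
for main_group, path in files.items():
    # sheet_name=None reads every sheet in the workbook into a dict of dataframes
    sheets = pd.read_excel(path, sheet_name=None)
    for sheet_name, sheet_df in sheets.items():
        sheet_df["food_subgroup"] = sheet_name    # tag each row with its sheet (subgroup)
        sheet_df["main_food_group"] = main_group  # tag each row with its file (main group)
        frames.append(sheet_df)

# Collapse all 62 tagged sheets into a single dataframe
qfahpd = pd.concat(frames, ignore_index=True)
print(qfahpd.shape)  # expect roughly (52831, 12)
```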

The monster may now have had only one head, but that head was a chaotic concoction of strings, integers, and floats. I had created the qualitative columns while concatenating, but much of the original quantitative data was senseless without context because it was numeric code for qualitative values. To generate harmony inside the beast, I replaced those coded values with the names they stood for, so the dataframe was meaningful just by looking at it rather than having to refer back to the code tables in the original Excel files.
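As a rough illustration, the recoding looked something like the following; the column names and code tables here are placeholders, since the real code-to-name mappings live in the QFAHPD documentation:

```python
# Placeholder code tables -- the full mappings come from the QFAHPD docs
division_names = {1: "New England", 2: "Middle Atlantic", 3: "East North Central"}
region_names = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}

# Replace numeric codes with descriptive names so the dataframe reads on its own,
# without flipping back to the original Excel files. (With a partial mapping,
# .map() would leave unmatched codes as NaN, so the real tables must be complete.)
qfahpd["division"] = qfahpd["division"].map(division_names)
qfahpd["region"] = qfahpd["region"].map(region_names)
```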

THE BEAST WAS NOW SLAIN! I had conquered the challenge of finding a messy dataset and cleaning it up. I had transformed the menacing four-file, 62-sheet, disorderly beast into one harmonious dataframe that was easy to understand. The journey was long, rigorous, and filled with obstacles, but in the end I was successful.

After performing significant transformations on a dataset, it is important to run through the EDA checklist to make sure the data transferred accurately and there aren’t any noticeable errors. The first steps were to check the packaging and look at the top and bottom of the data to confirm it was what I expected: there should be 52,831 rows and 12 columns, and the first and last markets should have data for every year from 2004 to 2010 and all four quarters of each year.
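A quick packaging check in pandas might look like the sketch below; column names such as marketgroup, year, and quarter are assumptions on my part:

```python
# Check the packaging: dimensions and column types
print(qfahpd.shape)  # should be (52831, 12)
qfahpd.info()

# Look at the top and the bottom of the data
print(qfahpd.head())
print(qfahpd.tail())

# Spot-check the first market: every year 2004-2010 should show four quarters
first_market = qfahpd["marketgroup"].iloc[0]
subset = qfahpd[qfahpd["marketgroup"] == first_market]
print(subset.groupby("year")["quarter"].nunique())
```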

The next step is to check the ‘n’s to confirm that the information in the dataframe was complete. The ‘n’s accurately reflected the various features of the dataframe, including that the data was collected from 35 marketplaces across the US. The data was then grouped by quarter, year, division of the US, region of the US, main food group, and subgroup of that main food group.
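Here is a small sketch of that ‘n’ check, again with assumed column names:

```python
# Check the n's: how many distinct values does each grouping variable take?
group_cols = ["marketgroup", "year", "quarter", "division", "region",
              "main_food_group", "food_subgroup"]
for col in group_cols:
    print(col, qfahpd[col].nunique())

# Expect 35 markets, 7 years (2004-2010), 4 quarters,
# 4 main food groups, and 62 food subgroups.
```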

Now that my data was in a usable format, I could answer the question I originally had about the data: is it getting more or less expensive to eat healthy over time? To start answering it, I had to divide the food groups into healthy and unhealthy. This was a fairly subjective process: foods described as non-alcoholic, diet, water, unsweetened, fresh/frozen, whole grain, low fat, or raw were labeled healthy, while foods with the descriptors fats, sugar, caloric, desserts, sweet, canned, premade, packaged, juice, non-whole grain, whole fat, or processed were labeled unhealthy.
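A hedged sketch of how that labeling could be done with simple keyword matching follows; the subgroup names and the health_label column are my own placeholders:

```python
healthy_terms = ["non-alcoholic", "diet", "water", "unsweetened",
                 "fresh/frozen", "whole grain", "low fat", "raw"]
unhealthy_terms = ["fats", "sugar", "caloric", "desserts", "sweet", "canned",
                   "premade", "packaged", "juice", "non-whole grain",
                   "whole fat", "processed"]

def label_health(subgroup: str) -> str:
    """Tag a food subgroup as healthy or unhealthy based on its descriptors."""
    name = subgroup.lower()
    # Check the negated descriptors first so that e.g. "non-whole grain"
    # is not caught by the "whole grain" keyword.
    if any(term in name for term in ["non-whole grain", "whole fat"]):
        return "unhealthy"
    if any(term in name for term in healthy_terms):
        return "healthy"
    if any(term in name for term in unhealthy_terms):
        return "unhealthy"
    return "unclassified"

qfahpd["health_label"] = qfahpd["food_subgroup"].apply(label_health)
```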

Once the food groups were separated, I used a combination of groupby/aggregation functions, pivot tables, matplotlib, and Altair to find an answer to my question. I concluded that it is getting more expensive to eat healthy over time, and that the increase in price is occurring at a higher rate than inflation (2.26% per year). This finding is troublesome, although not shocking. It is a shame that prices are higher for fresh, wholesome foods than they are for junk food, because it has contributed to a rise in obesity and unhealthy habits among Americans. Policymakers and food retailers need to work together to change this trend and help improve the health of people in the US.

Healthy (left), Unhealthy (middle), Chart of both (right)
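For the curious, charts like those above could be produced with a pivot table and matplotlib along these lines; price, year, and health_label are assumed column names:

```python
import matplotlib.pyplot as plt

# Average price per year for each label, shaped with a pivot table
yearly = qfahpd.pivot_table(values="price", index="year",
                            columns="health_label", aggfunc="mean")

# One line per label shows how prices move over 2004-2010
yearly[["healthy", "unhealthy"]].plot(marker="o")
plt.ylabel("average price")
plt.title("Average food price by year, healthy vs. unhealthy")
plt.show()
```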
