Messy Data: Oh the Horror!

One thing I hear almost all the time, from professors, industry experts, and even my peers, is that most of a data scientist’s job consists of cleaning data. Some claim that cleaning and preparing data for analysis takes up over 80% of a data scientist’s time, while the “fun stuff” is, in reality, a very small portion of the job. In my limited experience, this mostly holds true, which makes you an even happier data scientist when you find a clean dataset that needs minimal work before you can start your analysis or feed the data into a machine learning algorithm.

I’d also heard that, just by virtue of the world being kind of messy overall, most of the data that comes out of it is also messy. To find a messy dataset, I went straight to the State of Colorado’s Data Portal. Not because Colorado’s public data is particularly messy, but because I knew that government agencies can have some of the messiest data out there. I ended up choosing a dataset detailing statistics about the state’s registered animal shelters for 2016, which I downloaded in CSV format from here. I wasn’t sure what to expect when I opened up the CSV, but I was hoping it would be messy, just not too messy.

The First Look

I opened the dataset in Excel first and was greeted with quite the horrific display. I didn’t know it yet, but this dataset was quite wide: 204 columns and 257 rows. Having layers, like an onion, is often considered a good trait in a person. While that may be true for people, it is definitely not true for datasets. As I peeled back these “layers,” they kept exposing more awful aspects of the data. The first three rows of the dataset were relatively unassuming. The next three rows, however, turned out not to be additional shelters at all, but final counts for the data and notes about data quality. After looking through the rest of the dataset, it seemed these were the only rows like this, so I dropped them. Another less-than-ideal aspect of this dataset was a “Footnotes” column that used “**” instead of “True” to indicate the presence of suspected error.
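For reference, the loading step and the row drop looked roughly like this (a sketch: the file name and the exact positions of the summary rows are assumptions based on the description above):

import pandas as pd
import numpy as np  #np.nan is used further down

#Load the raw CSV (file name assumed)
shelter_df = pd.read_csv('Animal_Shelter_Statistics_2016.csv')

#Rows 4-6 of the sheet held statewide totals and data-quality notes rather than
#individual shelters; with the header row consumed, they sit at index 3-5
shelter_df = shelter_df.drop(index=[3, 4, 5])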

First rows of the data as seen after loading it into pandas.

I would also later find out that there were many other columns like the “Adult Dogs” column (for cats, reptiles, etc.) that served one purpose: to have one value in row 3. That value? As seen above, it is simply the name of the column so that the counts to the right can be identified as being for a given animal. So, these columns also got dropped.

drop_cols = ['Adult Dogs','Adult Cats', 'Juvenile Dogs', 'Juvenile Cats','Birds', 'Small Mammal', 'Reptiles', 'Rabbits', 'Other']
shelter_df = shelter_df.drop(drop_cols, axis=1)

I also dropped any column that contained the word “Notes” as these columns were not going to be very useful to me.

drop_cols_notes = [x for x in list(shelter_df.columns) if 'Notes' in x]
shelter_df = shelter_df.drop(drop_cols_notes, axis=1)

Things were looking much better at this point, but there was still a big problem that needed to be addressed. You might have noticed in the above screenshot that there is a mix of blank values and NaNs. Most analysis packages aren’t going to like that, so I changed all the blank values to NaNs:

#fill in blank (whitespace-only) values with NaNs
#from https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas
shelter_df = shelter_df.replace(r'^\s*$', np.nan, regex=True)

Now I was getting somewhere. Overall, I ended up with something that looked like this:

Not too awful, but there are still a lot of NaNs. I tried to deal with these in two ways. One way was to leave them in and try to work around them. To save us both time, I’m not going to get into how I worked around the NaNs, as it ended up being a dead end. The method I actually ended up using was filling these NaNs with zeroes. I justified this by assuming that not all animal shelters take in all types of animals, and that if a shelter did not report on a certain metric, it most likely did not have any animals or events that fit that metric. Therefore, simply filling in these NaNs makes the analysis much easier.

Dealing with NaNs: fill them with 0s

Before I jumped straight into filling all of the NaN values with 0s, I wanted to loop back to that weird “Footnotes” column. According to the description in the dataset, this column denotes whether or not there is a suspected problem with a shelter’s numbers after review by a statistics review board. By dropping any row that has this flag set, I can ensure that I’m working with data that has been (presumably) validated.
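One wrinkle before running the drop: as noted earlier, the column actually uses “**” rather than a literal True, so a conversion along these lines (a sketch; the exact marker handling is my assumption) would need to come first for the comparison below to work:

#Assumed conversion: "**" in the "Footnotes:" column marks a suspected problem,
#so turn the marker into a proper boolean flag
shelter_df['Footnotes:'] = shelter_df['Footnotes:'] == '**'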

#Drop any problem statistics
shelter_df_no_footnotes = shelter_df.drop(shelter_df[shelter_df['Footnotes:'] == True].index)
#Fill with 0s
shelter_df_fillna = shelter_df_no_footnotes.fillna(0)

After dropping those rows, I went from the full 257 rows in this dataset to 197 rows. Another note before I get any further into the analysis: this data is so sparse at the individual shelter level that all of the analysis going forward will be at the county level.
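A quick sanity check on those counts (a sketch using the frames from the snippets above):

#How many shelters remain after dropping the flagged rows?
print(len(shelter_df), len(shelter_df_no_footnotes))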

Tidy it up!

To use packages such as seaborn to do visualization as well as make the data easier to work with overall, I wanted to “tidy” it up (make it longer than it was wide). Pandas’ melt function works well for this:

shelter_df_fillna_melted = shelter_df_fillna.melt(id_vars=['PACFA License Number:', 'County'])

Those id_vars keep the data associated with individual shelters and their county for the time being.
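A quick peek at the melted frame (a sketch) shows the new long layout, with one row per shelter/variable pair:

#Inspect the reshaped ("long") data
print(shelter_df_fillna_melted.head())
print(shelter_df_fillna_melted.shape)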

As you can see, the data is now stored in the “variable” and “value” columns. Since I was most interested in the actual numerical data, I filtered out data like zip code and Footnotes so that I could focus on the statistics:

#Filter the melted data
shelter_df_fillna_melted_filtered = shelter_df_fillna_melted.query("variable != ['Footnotes:','Facility Name:','Zip Code','County']")
#Since we're dealing with all numerical data, let's convert the "value" column to all float type values
shelter_df_fillna_melted_filtered['value'] = shelter_df_fillna_melted_filtered['value'].astype(float)

Before I got too much into aggregating, I asked some simple questions of the data:

#On average, how many dogs were adopted at every shelter in 2016?
np.mean(shelter_df_fillna_melted_filtered.query("variable == 'Adult Dogs: Adoption'")['value'])
132.31958762886597
#What about cats?
np.mean(shelter_df_fillna_melted_filtered.query("variable == 'Adult Cats: Adoption'")['value'])
78.50515463917526

On average, it looks like more dogs get adopted than cats in Colorado. How does this compare to the counts at the beginning of the year?

#On average, how many dogs did shelters start the year off with?
np.mean(shelter_df_fillna_melted_filtered.query("variable == 'Adult Dogs: Beginning Count '")['value'])
11.81958762886598
#And cats?
np.mean(shelter_df_fillna_melted_filtered.query("variable == 'Adult Cats: Beginning Count'")['value'])
9.087628865979381

What does the relationship between adoptions and beginning counts look like?

cats_beginning_vs_adoption = pd.pivot_table(shelter_df_fillna_melted_filtered.query("variable == ['Adult Cats: Beginning Count','Adult Cats: Adoption']"), values='value', index='PACFA License Number:', columns='variable').reset_index()
sb.lmplot(data=cats_beginning_vs_adoption, x='Adult Cats: Beginning Count', y='Adult Cats: Adoption')

As you can see, beginning count and adoption are generally positively related.
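To put a number on that relationship, the correlation can be computed straight from the pivoted frame (a quick sketch; no particular value is assumed):

#Pearson correlation between beginning cat counts and cat adoptions
corr = cats_beginning_vs_adoption['Adult Cats: Beginning Count'].corr(
    cats_beginning_vs_adoption['Adult Cats: Adoption'])
print(f'Correlation between beginning count and adoptions: {corr:.2f}')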

County Aggregating

Now I’ve got a dataset that I can use to do some county-level aggregating. To do this, I did a group-by aggregation by county to get the county level statistics:

shelter_df_fillna_gb = shelter_df_fillna_melted_filtered.groupby(['County','variable']).agg({'value':'sum'})
shelter_df_fillna_gb_indexReset = shelter_df_fillna_gb.reset_index()

I have to be careful with this aggregated dataset: I aggregated with a sum, and some of the variables in the dataset are themselves averages. I can’t sum those average values and expect accurate results (I’ll come back to this below). But, to show what is possible when aggregating by county, I can make plots like these:

#Don't need 'County' at the end of every county name
shelter_df_fillna_gb_indexReset['County'] = shelter_df_fillna_gb_indexReset['County'].str.replace(' County', '', regex=False)
#Plot the data
plt.figure(figsize=(10,6))
sb.barplot(data=shelter_df_fillna_gb_indexReset[shelter_df_fillna_gb_indexReset['variable'] == 'Adult Dogs: Adoption'],x='County',y='value')
plt.xticks(rotation=90)
plt.ylabel('Adopted Dogs')

Unsurprisingly, the Denver metro area and Colorado Springs account for most of the dog adoptions in the state.
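Circling back to the earlier caution about averaged variables: one way to keep them out of a summed aggregation, and to double-check the observation above, might look like this (a sketch; the assumption that average-based variables contain the word “Average” in their names is mine, not something verified against the dataset):

#Keep only count-style variables before summing by county
count_vars = shelter_df_fillna_melted_filtered[
    ~shelter_df_fillna_melted_filtered['variable'].str.contains('Average')]
county_sums = count_vars.groupby(['County', 'variable'], as_index=False)['value'].sum()

#Top five counties by adult dog adoptions, to back up the observation above
dog_adoptions = county_sums[county_sums['variable'] == 'Adult Dogs: Adoption']
print(dog_adoptions.nlargest(5, 'value')[['County', 'value']])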

Wrapping it up

It takes a long time to get from a messy dataset like this one to succinct metrics and visualizations, but it’s worth it. Now that this data is cleaned and prepped, there are many other relationships and plots that are waiting to be uncovered.
