Learning to Clean House

Matthew McDermott
Published in Data & Verse · 7 min read · Feb 18, 2020

It seems an impossible task to describe the first two weeks of the General Assembly data science intensive boot camp. Every day from 9 a.m. to 5 p.m. we covered huge ground in Python programming, statistics, and probability, and then launched right into practical work through labs and projects. It felt as if we were expected to complete the work as if we already had the skills to do it. The proverbial cart had gotten ahead of the horse.

There were four people in the class with me. Sean, Kevin, and Rytas, like Brendan, our instructor, were securely in their twenties, with backgrounds in engineering, statistics, and economics. Chris was in his mid-thirties, with a business degree. Me, I remember the 1970s and have a B.A. and M.A. in art and architectural history. I knew that acquiring a large volume of new material might be more difficult in middle age, but it was an abstract caution. Data science hadn’t hit me over the head yet.

Before I reminisce, I have to say that the Jupyter notebooks alone that you take away from the course are worth a healthy sum. I have never doubted the quality of instruction at GA.

When I look back on studying Italian in college, I think that I was slow in picking up syntax. What is syntax? According to the Oxford English Dictionary, it is “the arrangement of words and phrases to create well-formed sentences in a language.” So when we started boot camp and I wrote code like a narcoleptic Sicilian grandpa, it wasn’t all the burden of middle age: I have a brain that needs to soak up new knowledge for a long time before things start clicking into place. The only issue is that patience and time are not plentiful in a 12-week boot camp schedule.

Here is an accurate list of what we covered before Christmas break.

· Intro to Python

· Python functions

· Python control flow

· Python list comprehensions

· Python — Pandas

· Intro to probability

· Discrete distributions

· Continuous distributions

· EDA

· Principles of data visualization

· Intro to statistical inference and confidence intervals

· Intro to object-oriented programming

· Intro to ethics in data science

I felt like I was drowning, but I knew we had 10 days off for Christmas break. My strategy was to absorb what I could and hold on for Christmas. My parents would help with the kids and I would study morning, noon, and night. And that’s what I did.

I got the feeling that Brendan, our classroom instructor, thought I wasn’t serious about my studies or trying hard enough (maybe both). He is a great teacher. A natural, I would say. He was always right there with us throughout each day. His explanations weren’t just lifted from a book. You could tell that he digested the material, loved it, even, and strove to offer explanations that fit each student’s strengths and weaknesses.

Project One

It was when he and I reviewed my Project One work over Zoom during Christmas break that I felt Brendan began to understand that I was working hard, and that my journey through the course was going to be different from the other students’. This was very helpful to me.

Project One helped me to understand why the instructors said that cleaning data is currently 2/3 of a data scientist’s job. Data cleaning is frustrating but in a weird way I like it … once it is done. This is a skill I would like to continue developing after boot camp.

The first step in a data science project for me is to open a new Jupyter notebook. We work in the Atom IDE, too, but Jupyter notebooks are the emphasized development environment for us. They are web apps that run in the browser and let you create and share documents containing live code, equations, visualizations, and text: an entire project, basically.

Project One — Cleaning Data

Next, I import the code libraries I need to do the work. These are libraries I use for almost everything I do.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

NumPy is for numerical operations, seaborn and matplotlib are for data visualization, and pandas is for data manipulation. Now we want to see the data. I like to see all the columns and the first 10 rows.
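One thing the snippets below assume is that the data has already been read into a DataFrame called df. A minimal sketch of that step, with a hypothetical CSV filename, plus the pandas option I set so that every column shows up in previews:

pd.set_option('display.max_columns', None)   # show every column when previewing
df = pd.read_csv('project_one_data.csv')     # hypothetical filename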

df.head(10)

Then, I want to see the data types of each column (object/string, float, integer), so I can get an idea of what type transformations might be needed.

df.dtypes

I will then want to get more introductory information on the dataset.

df.describe() gives summary statistics for the numeric columns (series) in the dataset.

Missing information is a key question at this point. The following command will tell me how many missing or empty values I have in each column.

df.isnull().sum()
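A related habit of mine, not from the course notes: the same check as a percentage, which makes it easier to judge whether a column is worth keeping at all.

df.isnull().mean() * 100   # percent of values missing in each column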

What will I do with these pesky null values? There are some basic options.

df.dropna(inplace=True)

This is the nuclear option. It deletes all rows that have a null value.

df.dropna(axis=1, inplace=True)

Another nuclear option. It deletes all columns that have a null value.

df.drop(columns=['column1'], inplace=True)

This drops a specific column. (Note that it is drop(), not dropna(), once you are naming the column yourself.)

df.drop(columns=['column1', 'column2', 'column42'], inplace=True)

This drops the three specified columns.

df.drop(df.index[2])

Drops a specific row.

df.loc[51, 'Column_name'] = np.nan

You can’t drop a single cell, but you can blank out a specific value at a specific row/column point by setting it to NaN, which dropna() or fillna() can then handle.

df.reset_index(drop=True)

After dropping values, you may want to reset the index (assign the result back, or pass inplace=True, since reset_index returns a new DataFrame by default).
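Putting a couple of these together, here is a sketch of the pattern (not project code): drop the incomplete rows, then renumber the index.

df = df.dropna()                  # drop every row that has a null value
df = df.reset_index(drop=True)    # renumber the rows 0, 1, 2, ... and discard the old index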

Since we have spent so much time on dropna(), here is the full syntax reference:

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

axis: possible values are {0 or 'index', 1 or 'columns'}, default 0. If 0, drop rows with null values. If 1, drop columns with missing values.

how: possible values are {'any', 'all'}, default 'any'. If 'any', drop the row/column if any of the values is null. If 'all', drop the row/column only if all the values are missing.

thresh: an integer setting the minimum number of non-null values a row/column must have to be kept (e.g., thresh=2 drops any row/column with fewer than 2 non-null values).

subset: the labels along the other axis to consider, e.g., which columns to check for nulls when dropping rows.

inplace: a boolean value. If True, the source DataFrame is changed.

Source: JournalDev
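As a quick illustration of how these arguments combine (the column names are hypothetical):

df.dropna(thresh=3, inplace=True)                        # keep only rows with at least 3 non-null values
df.dropna(subset=['column1', 'column2'], inplace=True)   # drop rows where either of these columns is null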

Another option when dealing with null values is to fill them with a value, rather than deleting a row, column, or value. For this, pandas gives us the fillna() method.

df['column_name'].fillna('Not applicable')

This option fills the nulls with a static string value.

Another option is to replace nulls with zero:

df['column_name'].fillna(0)

A common practice is also to replace nulls with the mean of the column:

df['column_name'].fillna(df['column_name'].mean())
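One catch with all three: fillna() returns a new series unless you pass inplace=True or assign the result back, so the pattern I actually use looks like this (hypothetical column name):

df['column_name'] = df['column_name'].fillna(df['column_name'].mean())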

Project One — Transforming Types

Another area of data cleaning introduced to me during Project One is type transformation: changing columns, rows, and values from string/object to float or integer, or changing data from data frames to series (columns), lists, and arrays, and back. A warning: my first insight on this path was that if I am doing a lot of transforming, I could be on the wrong path and making things overly complex. It’s a syndrome I have grown familiar with.

The following are some general approaches to type transformation that came up in project one. I have abstracted them to be applied anywhere.

This simple function changes a column of string data to float:

def string_float(column):
    # Strip the percent sign from the string, then convert to float
    column_str = str(column)
    column_fix = column_str.replace('%', '')
    return float(column_fix)

The function also uses the replace() string method, which is very useful for data cleaning.

Once the function was written, I applied it to a specific column:

df['column_name'] = df['column_name'].apply(string_float)

One can transform from int to float (or vice-versa) with a similar process:

def int_to_float(column):
    # Going through str() is not strictly necessary, but mirrors the function above
    column_str = str(column)
    return float(column_str)

df['column_name'] = df['column_name'].apply(int_to_float)

pd.to_numeric() is another useful method for type transformation, as machine learning models require data to be in a numerical type.
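For whole columns, here is a sketch of the two calls I reach for (hypothetical column names; errors='coerce' turns values that can’t be parsed into NaN instead of raising an error):

df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')   # strings to numbers, bad values become NaN
df['other_column'] = df['other_column'].astype(float)                   # numeric column to float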

During the data cleaning process, I like to frequently check the shape of my data.

.shape is a DataFrame attribute that helps me check that my cleaning/transforming is not changing the data in unwanted ways. In particular, I don’t want to end up with the wrong number of rows and columns; machine learning models will not work then.

If I’m getting errors and can’t progress, I check the types with .dtypes. Often the problem is as simple as my having forgotten to make the changes stick, either by passing inplace=True or by assigning the result back to the data frame. I continue to learn the hard way to check at each step and make sure that the code is doing what I want or think should happen.
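The checking habit boils down to a couple of one-liners between steps:

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column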

Here is another type of transformation. It may be awkward, but it worked. The flow is from column, to list, to array, to an array of z-scores, back to a list, and finally to a data frame built from that list.

from scipy import stats
import itertools

list1 = df['column_name'].tolist()                          # column to list
array1 = np.asarray(list1)                                  # list to array
zscore1 = stats.zscore(array1)                              # z-score for every value in the array
zscore_list = list(itertools.chain(zscore1))                # array back to a plain list
zscore_df = pd.DataFrame(zscore_list, columns=['zscore'])   # data frame built from the list
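A more direct route I could have taken, since stats.zscore accepts a pandas series and the result drops straight into a new column:

df['zscore'] = stats.zscore(df['column_name'])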

As a beginner, I get a lot of coding errors that don’t make sense to me. I have learned that chasing every one of them down can be a huge time drain. Checking for possible issues with nulls and types has helped me resolve a lot of them more quickly.

I can’t end a data cleaning section without mentioning the importance of naming your columns well. This can easily be done with the rename() method. It seems obvious, but you want to have good column titles so you can remember why this data is important as you go through the analytical process.
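For reference, the call looks like this (the old and new names here are made up):

df.rename(columns={'val_1': 'median_home_price'}, inplace=True)   # cryptic name to descriptive name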

Data scientists love data visualizations. After cleaning, this is often the next step, but one I will leave until a later post.


Matt is a data professional currently enrolled in General Assembly’s data science intensive boot camp. He is also a Dad who writes poetry and plays drums.