Beginner’s guide to data cleaning

You have the .csv file. Now what?

Anastasia Kaiser
The Startup
5 min read · Jun 12, 2020


At the very beginning of my journey into data science, I talked to professionals who already work in the field. They gave me a lot of tips and shared their stories, but one thing they all (I mean, every single one of them) said is that as a data scientist you won’t spend most of your time building fancy models; you will spend it cleaning your data. The dataset you are working with can be in different stages of ‘dirtiness’. If you are just practicing your skills on the Iris or Titanic datasets, they are pretty clean and ready to play with. But if you need to acquire the data first, via web scraping for instance, chances are you will need to spend more than a minute of your time before you start modeling.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('yourdatafile.csv')

First things first, we need to look at our data. Use simple commands, such as:

  • df.shape — gives you an idea of how many rows and columns there are in your table
  • df.columns — returns the column names
  • df.isna().sum() — shows how many missing values, or NaN’s, there are in each column
  • df.describe() — gives you basic statistics, such as mean, standard deviation, etc., for columns with numeric values
  • df.dtypes — tells you the data type of the values in each column. Note: what we call strings in vanilla Python, pandas calls ‘object’.
  • df.head() — shows you the first five rows of your table, so you can get an impression of what you’re dealing with. (df.tail() gives you the last five rows, in case you need them more than the first five)
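
In a notebook, running these one after another on the freshly loaded df gives you a quick first look:

df.shape          # (number of rows, number of columns)
df.columns        # column names
df.isna().sum()   # missing values per column
df.describe()     # basic statistics for numeric columns
df.dtypes         # data type of each column
df.head()         # first five rows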

Now that you are familiar with your dataset, you can start dealing with all the little problems. A good next step is to figure out how many missing values there are and what to do with them. If there are just a couple of rows with missing values, consider yourself lucky. You can probably drop them without messing with the big picture. Use df.dropna(), but read the documentation first; you may find exactly the option you are looking for. If there are a lot of missing values, it is time to get creative. There are a lot of ways to handle this problem, but the ultimate truth is the fact that you really need to know the context.
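
For instance, here is a minimal sketch of a few dropna options; pick the one that fits your situation (the column name below is made up):

# drop every row that contains any NaN at all
df = df.dropna()
# or be more surgical: drop rows only where one specific (hypothetical) column is missing
df = df.dropna(subset=['some_column'])
# or keep only rows that still have at least 10 non-missing values
df = df.dropna(thresh=10)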

One way is to replace the NaN’s with zeros. For example, say you are dealing with a health survey and the column with a ton of missing values is called ‘Years_Of_Smoking’. We can assume that non-smokers wouldn’t put anything in this column, so we can safely replace the missing values with zero.
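
A minimal sketch of that fill, using the survey column from the example above:

# non-smokers left 'Years_Of_Smoking' blank, so zero is a reasonable fill here
df['Years_Of_Smoking'] = df['Years_Of_Smoking'].fillna(0)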

Another practice (which in a lot of cases is considered a bad practice) is to replace the missing values with the mean. Again, you need to figure out your context before doing so. If you don’t know the height of a person, you can probably replace it with the mean. But I would say, look at the distribution of the non-missing values first: if it is heavily skewed, don’t use mean substitution. Remember, real-world data is almost never perfectly bell shaped.
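
Here is a small sketch of that check, assuming a hypothetical ‘Height’ column:

# look at the distribution of the non-missing values before filling anything
df['Height'].hist()           # eyeball the shape
print(df['Height'].skew())    # close to 0 means roughly symmetric

# only then consider mean substitution (the median is a safer choice for skewed data)
df['Height'] = df['Height'].fillna(df['Height'].mean())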

Other ways include getting very creative with replacing NaN’s by building mini regression models to predict those missing values. But what if you are missing strings? An interviewer once gave me a problem, asking me to handle the missing values in a table of bank transaction records. He asked what I would do with a missing address. I said that I would restore this data by using the merchant’s name: if we know that “The Coolest Coffee Shop” is located at 123 Main Street, we can definitely handle the problem if either the address or the merchant’s name is missing.
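
A rough sketch of that idea, with made-up ‘merchant’ and ‘address’ column names, could look like this:

# build a lookup of each merchant's known address from the rows that do have one...
address_lookup = (df.dropna(subset=['address'])
                    .drop_duplicates('merchant')
                    .set_index('merchant')['address'])

# ...and use it to fill in the rows where the address is missing
df['address'] = df['address'].fillna(df['merchant'].map(address_lookup))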

After deciding what to do with the missing values, you will need to make sure that all the data is in the correct data types. Dates can be stored as integers or objects, floats and integers can be mixed up, etc. Check out the pandas documentation on how to use astype, to_numeric, to_datetime and other super useful commands.
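
A small sketch of such conversions, with hypothetical column names:

df['price'] = pd.to_numeric(df['price'], errors='coerce')          # unparsable strings become NaN
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')  # parse dates
df['zip_code'] = df['zip_code'].astype(str)                         # keep zip codes as strings, not numbers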

Finally, there is a chance that some of the information is in the wrong column! It is not that big of a problem if the information can only take a limited number of values. Let me give you an example. When I was building a recommendation system for Sephora skincare, I needed something called user features: skin tone, skin type, eye color, and hair color. After days of scraping and cleaning the user table, I realized that every single word was in its own column, when instead I needed columns named ‘skin type’, ‘hair color’, etc., with the corresponding values.

[Image: my weird df, with every word in its own column]

Here’s how I handled the problem for skin tones:

skintones = ['Porcelain', 'Fair', 'Light', 'Medium', 'Tan', 'Olive', 'Deep', 'Dark', 'Ebony']
tonelist = []
for value in split[9]:
    if value in skintones:
        tonelist.append(value)
    else:
        tonelist.append('No data')

split['Skin_Tone'] = tonelist
split

Similarly, for eye colors:

eyecolors = ['Brown', 'Blue', 'Hazel', 'Green', 'Gray']
eyelist = []
for value in split[3]:
    if value in eyecolors:
        eyelist.append(value)
    else:
        eyelist.append('No data')

split['Eye_Color'] = eyelist
split

This is just an example of how you can get creative with your data cleaning when you are familiar with the context.
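
Since the two loops above differ only in the source column and the list of allowed values, they could also be folded into one small helper (just a sketch, not part of the original cleaning code):

def extract_feature(frame, source_col, valid_values):
    # keep the value if it belongs to the allowed list, otherwise mark it as missing
    return [value if value in valid_values else 'No data' for value in frame[source_col]]

split['Skin_Tone'] = extract_feature(split, 9, skintones)
split['Eye_Color'] = extract_feature(split, 3, eyecolors)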

When you no longer have missing values that will mess with your analysis and modeling, when every value is in the correct data type, when you are sure that every value is in the right column, and you know what your data looks like from a basic statistical point of view, it is a good time to start looking at the distributions of the values: plot some histograms and box plots, and find the most commonly used words (if you are planning some NLP). But that is already out of the scope of this discussion.
