Data Cleaning Checklist for Python and Pandas

Frank Flavell
7 min read · Sep 23, 2022

Data cleaning is sometimes referred to by the more dramatic and action-packed name “data wrangling,” which conjures up images of Data Scientists as gunslingers, trailblazing new frontiers and taming wild broncos. I’ve found that such fantasies are useful when coping with tedious and time-consuming tasks that, in reality, can cause an existential crisis.

Data cleaning, after data sourcing, is the second and often the most emotionally taxing step of the Data Science process, which is why it’s important to master the tricks of the trade so ‘data wrangling’ doesn’t wrangle you. Below is a checklist to help you get started.

This checklist is specifically for Data Scientists using Python and the Pandas package for DataFrames.

Import Data & Review

First things first, you need to import the relevant Python packages as well as your data.

import pandas as pd

Your dataset could be saved in many different file types. Use the following commands to import some of the most common file types into a Pandas DataFrame. This list isn’t exhaustive so, if you’re working with a different file type, you can simply use your favorite search engine or the Pandas documentation to find the right command.

df = pd.read_csv('file_name.csv')
df = pd.read_excel('file_name.xlsx')
df = pd.read_json('file_name.json')
df = pd.DataFrame.from_dict(data_dict)  # data_dict is an existing Python dictionary
df = pd.read_stata('file_name.dta')
df = pd.read_sql_table('table_name', 'postgresql:///db_name')

Of the file types listed above, .csv is probably the most common. Here are several options for importing and filtering the data.

# sets the first column (column 0) as the index of the DataFrame
df = pd.read_csv('Yelp_Reviews.csv', index_col=0)
# limits the import to a preview of n rows
df = pd.read_csv('Data.csv', nrows=100)
# you can also combine this with the skiprows parameter
df = pd.read_csv('Data.csv', skiprows=1, nrows=100)
# header specifies the row containing the column names; data is imported from that point on
df = pd.read_csv('Data.csv', header=1)
# encoding specifies how the text in the file is encoded
df = pd.read_csv('Data.csv', header=1, encoding='latin-1')
# selecting specific columns by position
df = pd.read_csv('Data.csv', usecols=[0, 1, 2, 5, 6], encoding='latin-1')

Preview the DataFrame

The commands above save your entire dataset into a single variable: df (which stands for DataFrame). A DataFrame is a collection object that organizes data into a tabular structure, a fancy way of saying a table with columns and rows. DataFrames come from the Pandas package, which provides a wide variety of methods and functions for viewing, analyzing and manipulating them to gain insights from your data. The first methods we’ll learn will help you preview your DataFrame.
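
To make the idea concrete, here is a minimal sketch of a DataFrame built from a plain dictionary (the column names and values are just placeholders):

# a tiny, hypothetical DataFrame: two columns, three rows
df_example = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cho'],
    'score': [4.5, 3.0, 5.0]
})
print(df_example)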

The purpose of the initial preview is to make sure your data imported correctly and to identify any header formatting errors.

# view the first 5 rows of the DataFrame
df.head()
# view the first 50 rows of the DataFrame
df.head(50)
# view the last 5 rows of the DataFrame
df.tail()
# Get the number of columns and rows
df.shape

You’ll quickly find that if you have many columns, you won’t be able to view all of them. The default view of a DataFrame shows the leftmost columns and the rightmost columns. Use the following code to view all columns and rows. But be careful — if you have a lot of data, then viewing it all at once could crash your computer.

# view all columns
pd.set_option('display.max_columns', None)
# view all rows
pd.set_option('display.max_rows', None)
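
If you only needed the expanded view temporarily, you can restore Pandas’ default display limits afterwards:

# restore the default display limits
pd.reset_option('display.max_columns')
pd.reset_option('display.max_rows')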

Header Formatting

Ideally, column names should be in lower case and there should be an underscore _ instead of a space between words. Using this standard format will save you a lot of trouble later on when you’re indexing the DataFrame to isolate specific cells and values.

Use the code below to quickly change names and reformat the header to follow the standard format. Note that if a column is named with a number instead of a word, you may need to pass the number itself, rather than a quoted string, as the old column name (see the sketch after the code below).

# rename columns
df = df.rename(columns={'Long column name' : 'column'})
# make column names lower case and replace spaces with _
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
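
As a sketch of the numeric-name case mentioned above (the column name 2020 is hypothetical), the number itself is the dictionary key rather than a quoted string:

# rename a column whose name is the integer 2020 rather than a string
df = df.rename(columns={2020: 'year_2020'})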

Deeper Review

Your data is imported and the header is reformatted. Now it’s time to get a deeper understanding of how your data needs to be refined. There are several key issues to consider when cleaning a dataset, and you will most likely encounter some or all of them in every project.

Null Values: Missing data is a very common problem with large datasets. Every empty cell will contain a “Null” or “NaN” placeholder value. You will need to deal with these missing values by either updating them or removing them from the dataset.

Data Types: Every cell in a column should be the same datatype, and columns should contain comparable datatypes so you can perform calculations across the DataFrame. This usually means making sure every column, besides key identifiers like name, is turned into a numerical integer or float datatype.

Categorical Data: Categories with string names need to be transformed into numerical values.

Outliers: Sometimes a result of data entry errors and sometimes a result of the real world, outliers can seriously skew your analysis. You must use a systematic strategy for dealing with outliers while also maintaining the integrity of your analysis.

We’re going to use a few standard Pandas methods to help us identify these issues.

# prints a report listing every column name, its count of non-null values and its datatype
df.info()
# prints the number of nulls per column
df.isnull().sum()
# number of nulls in one column
df['column'].isnull().sum()
# prints descriptive statistics, including the min and max values, the quartiles and the standard deviation. This is useful for identifying outliers.
df.describe()

One of the major differences between a DataFrame and an Excel spreadsheet is that DataFrames can handle big data in a way that spreadsheets can’t. That also means you need to develop a familiarity with your data without seeing every row of the dataset. Most datasets are simply too large for that approach, and it would take too much time.

Using the methods above will help you identify and isolate any glaring issues so you can clean them up.
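
A couple of other standard Pandas methods can help build that familiarity without scrolling through every row (the column name below is a placeholder):

# look at a random sample of rows instead of just the head
df.sample(5)
# count how often each value appears in a column
df['column'].value_counts()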

Dealing with Null Values

Missing data is disappointing because it feels like a missed opportunity. If the evidence is there to support it, you can salvage null values by transforming them into zeros or another value based on the other values in the column. But in most cases, it will be impossible to tell what the value should have been and you will need to remove the entire row. With large datasets, you should expect to remove rows with missing data so you can take advantage of the predictive power of each feature.

Here’s the code to either update or remove null values.

# fill nulls with zeros
df['column'] = df['column'].fillna(0)
# interpolate estimates the missing value based on the other values in the column
df['column'] = df['column'].interpolate()
# drop every row in the DataFrame that contains a null value
df = df.dropna()
# drop rows containing null values in a specific column
df.dropna(subset=['value_euro'], inplace=True)

Changing Data Types

Using the .info() method on your DataFrame, you can determine the datatype in each column, which is listed on the far right. You will often find several different datatypes, and the columns you plan to calculate with should be either “int64” or “float64”.

Use the code below to transform the datatype of a column.

df['column'] = df['column'].astype(int)
df['column'] = df['column'].astype(float)
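
If .astype() fails because a column contains stray non-numeric strings, one common fallback is pd.to_numeric with errors='coerce' (the column name here is a placeholder), which turns anything unparseable into a null you can then handle as described above:

# convert to numbers, turning unparseable values into NaN
df['column'] = pd.to_numeric(df['column'], errors='coerce')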

Dates

If a column contains string values that represent dates and/or time, then you can change the strings into a DateTime object, which is very useful for making calculations involving time.

# Test the update first
pd.to_datetime(df['date_column']).head()

# Apply changes and preview
df['date_column'] = pd.to_datetime(df['date_column'])
print(df['date_column'].dtype)
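
Once the column is a DateTime, the .dt accessor lets you pull out individual components for time-based calculations. A small sketch, assuming the conversion above succeeded:

# extract useful parts of the date into new columns
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month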

Dealing with Categorical Data

Sometimes columns contain string values representing the names of categories. Categorical data needs to be transformed into numerical data, meaning numbers that represent each category. This makes it possible to compare categories and to use this data for hypothesis testing and predictive modeling.

There are two approaches depending on your end goal. If you simply want to analyze the data through EDA, then you can just turn the values into numbers and keep them in the same column.

# get the unique values in the column
df['column'].unique()
# map the categories to numerical values
cats = {'column': {'': 0, 'cat1': 1, 'cat2': 2, 'cat3': 3, 'cat4': 4}}
# replace the old with the new
df.replace(cats, inplace=True)
# check results
df.head()

If you are building a predictive model, then each category needs its own column containing 0s if the observation does not belong to the category and 1s where it does belong to the category. This makes it possible for machine learning models to recognize the categorical data without weighting it more or less than other features.

# identify relevant columns
relevant_columns = ['col1', 'col2', 'col3']
# create dummy variable columns
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first=True, dtype=float)
# check result
dummy_dataframe.head()
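
The dummy columns live in their own DataFrame at this point, so you will usually want to join them back onto the original data and drop the raw categorical columns before modeling. A sketch using the same variable names as above:

# attach the dummy columns and drop the original categorical columns
df = pd.concat([df.drop(columns=relevant_columns), dummy_dataframe], axis=1)
# check the combined result
df.head()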

Outliers

It is important to deal with outliers in a systematic way instead of cherry-picking the observations you want to remove, which could make your analysis or model appear more effective than it actually is.

I recommend removing a percentage at the top and bottom based on your target variable.

# identify the upper quantile
upper_q = df['value'].quantile(.975)
# remove the upper quantile
df = df[df['value'] < upper_q]
# identify the lower quantile
lower_q = df['value'].quantile(.025)
# remove the lower quantile
df = df[df['value'] > lower_q]

These are the most common issues you will encounter when cleaning your data. But challenges abound, and you may not even identify all of the issues until you’re further along in your analysis.
