Data Cleaning in Python with NumPy and Pandas

Mike Flanagan
Jul 18, 2021 · 5 min read

The first utilities that an aspiring, Python-wielding data scientist must learn include NumPy and pandas. Both provide an assortment of tools for thorough analysis and deep insights, yet with so many built-in functions and attributes, it can be difficult to know which to use at the start.

Once the data scientist has good data in hand, they must first explore and manipulate the data to ensure that it is manageable for analysis, visualization, and machine learning. According to a poll conducted by Anaconda, 45% of a data scientist’s time is dedicated to data preparation tasks such as data cleaning. Your mileage may vary, but you should find that focusing on cleaning produces significant results, often more than clever feature engineering or tedious hyperparameter tuning will.

Below, I outline a few essential functions that can handle almost any data cleaning task. To get started, set up your notebook by importing the critical packages and reading in your data.

# import your packages
import numpy as np
import pandas as pd

# read in your data
df = pd.read_csv("data/dataset.csv")

Essential NumPy and Pandas Tools

First—Checking for Null Values

df.isna().sum()

Checking your data for null values is a priority in any initial EDA. Along with a handful of other pandas methods, such as .head()/.tail(), .describe(), and .info(), chaining .isna() with .sum() on your DataFrame is one of the first tools you should reach for. It reports the number of missing values in each column, which shows you where your attention should be directed first when cleaning.
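For instance, on a small toy DataFrame (column names invented for illustration), the pattern looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing values
df = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "price": [1.0, np.nan, 3.0, np.nan],
})

# .isna() marks each missing cell True; .sum() totals them per column
null_counts = df.isna().sum()
print(null_counts)
# name     1
# price    2
```

The result is itself a Series, so you can sort it or filter it to focus on the worst-affected columns.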

Manipulating DataTypes

df['date'] = pd.to_datetime(df.date)

Every pandas Series has a single, uniform dtype (more on this later). The dtype of each column of your DataFrame may be checked with df.info(). If you have a column that clearly contains dates, try coercing the data to an appropriate datetime format using pd.to_datetime().
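A minimal sketch of the conversion, using an invented 'date' column that was read in as plain strings:

```python
import pandas as pd

# Hypothetical column of date strings, stored as object dtype
df = pd.DataFrame({"date": ["2021-07-18", "2021-07-19"]})

df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # datetime64[ns]

# The .dt accessor now works for date arithmetic and extraction
print(df["date"].dt.year.tolist())  # [2021, 2021]
```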

Similarly, if you notice that a column contains numbers but reports a non-numeric object dtype, try coercing the column to an int64 or float64 dtype by using pd.to_numeric().
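A sketch with an invented 'price' column: the errors='coerce' option turns unparseable entries into NaN instead of raising, which is often what you want mid-cleaning.

```python
import pandas as pd

# Hypothetical numeric column stored as strings, with one bad entry
df = pd.DataFrame({"price": ["1.99", "4.50", "n/a"]})

# Unparseable values become NaN rather than raising an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].dtype)  # float64
```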

Correcting Conditional Values

np.where()

np.where() is the Swiss Army knife multitool for data cleaning. It can be used for practically any corrective action on a dataset. It takes three arguments:

  1. A condition (or array of conditions) identifying the values you would like to change.
  2. The value to use where the condition(s) are met.
  3. The value to use where the condition(s) are not met.
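The three arguments in action, on toy data invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [45, 82, 67, 91]})  # hypothetical data

# condition, value where True, value where False
df["result"] = np.where(df["score"] >= 60, "pass", "fail")
print(df["result"].tolist())  # ['fail', 'pass', 'pass', 'pass']
```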

Replacing null values? np.where() has you covered. One-hot encoding? You can do that with np.where(). Managing outliers, fixing data entry errors or making similar entries uniform? np.where() all day.
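A sketch of those three use cases on one invented DataFrame (column names and the outlier threshold are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", None, "LA", "NY"],   # hypothetical columns
    "sales": [100, 250, 9999, 120],
})

# Replacing null values
df["city"] = np.where(df["city"].isna(), "unknown", df["city"])

# One-hot-style encoding of a category
df["is_ny"] = np.where(df["city"] == "NY", 1, 0)

# Capping an outlier at a chosen threshold
df["sales"] = np.where(df["sales"] > 1000, 1000, df["sales"])
```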

Dropping Duplicates and Errors

df.drop(index=21500, inplace=True)  # Pass any index label corresponding to the row you would like to drop.

.drop() with inplace=True is particularly useful whenever you would like to remove an individual row. Use it whenever you identify a row of data that is clearly a duplicate or an erroneous data entry. Beware of dropping rows simply because you don’t like the content, or because it doesn’t fit your expectations: removing rows will bias your data.
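For fully duplicated rows specifically, pandas also offers .drop_duplicates(), which removes them in one call rather than index by index. A sketch on toy data:

```python
import pandas as pd

# Hypothetical DataFrame with one exact duplicate row
df = pd.DataFrame({"id": [1, 2, 2, 3], "val": ["a", "b", "b", "c"]})

# Remove exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Or drop a specific row by its index label
df = df.drop(index=0)
```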

Indexing Arrays using .apply() and lambda

Imagine a scenario in which a dataset possesses a column where the cell of each row contains an array of items—a collection separated by commas and surrounded by square brackets. You want to reference the items within the array. Unfortunately, doing so isn’t as straightforward as it would seem.


Typically in Python, you reference the items of a list or array by following the variable name with square brackets containing the index of the item you want. With pandas, however, DataFrame columns are pandas Series, which lock their contents to a single dtype. A column that contains text as well as numeric values is stored as object, and the same happens to a column whose cells merely look like Python lists of numbers. Pandas therefore evaluates a cell that appears to contain an array or list as an object (a string), and trying to index such an array or list in a DataFrame simply returns the character at that index of the string.
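A quick sketch of the gotcha, with toy data invented for illustration:

```python
import pandas as pd

# A column that *looks* like it holds lists, but was read in as strings
df = pd.DataFrame({"nutrition": ["[51.5, 0.0, 13.0]", "[173.4, 18.0, 0.0]"]})

cell = df["nutrition"].iloc[0]
print(type(cell))  # <class 'str'>
print(cell[0])     # '[' -- the first *character*, not the first number
```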

If you would like to circumvent this often frustrating feature (and it is a feature, for reasons relating to vectorization), you may manipulate the object column contents as you would with lists using an expression such as the following:

df['kcal'] = df.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[0])
df['fat'] = df.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[1])
df['sugar'] = df.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[2])

In this example, I have a dataset of recipes, with each row representing a unique recipe. The dataset has a ‘nutrition’ column, and each row holds an array of floating-point numbers: the first number in each row’s array is the recipe’s calorie count, the second its fat content, and so on. To separate each recipe’s nutritional statistics into their own columns, I split the string as above, creating an individual column for each nutritional statistic.
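As a sketch of an alternative approach (assuming the same kind of stringified ‘nutrition’ column, with invented values), the standard-library ast.literal_eval parses each string back into a real list of floats, which avoids hand-rolled slicing and splitting and yields numeric values directly:

```python
import ast
import pandas as pd

df = pd.DataFrame({"nutrition": ["[51.5, 0.0, 13.0]", "[173.4, 18.0, 0.0]"]})

# Parse each string into an actual Python list of floats
parsed = df["nutrition"].apply(ast.literal_eval)

df["kcal"] = parsed.apply(lambda x: x[0])
df["fat"] = parsed.apply(lambda x: x[1])
df["sugar"] = parsed.apply(lambda x: x[2])
```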

Final Thoughts


Organization is key

Breaking your notebook into sections that correspond to steps in the Cross-Industry Standard Process for Data Mining (CRISP-DM) allows for a progressive workflow and legibility for anyone viewing your notebook. Additionally, ensure that your cleaning steps run in the order you initially made the changes, so that you save yourself from Python errors in later data cleaning cells when you rerun your notebook after restarting your kernel.

Jupyter Notebook’s nbextensions are very useful for organization—I always work with ToC (Table of Contents) and Collapsible Headings on.

Be actionable

When faced with the dilemma of data that seem incorrect, think of what action makes the most sense, make the appropriate change, and write down your reasoning in the notebook. Typically, the most important thing is to make a reasonable decision and move on to continue cleaning and performing analysis, rather than getting stuck with decision paralysis, or spending hours looking for more info to support your reasoning.

Further Reading

Dean McGrath for Towards Data Science, Data Cleaning Using Python Pandas

Neelutiwari for Analytics Vidhya, Data Cleaning Using Pandas
