DATA CLEANING WITH PYTHON

Omobolaji Popoola
Machine Intelligence Team
Feb 17, 2019 · 5 min read

INTRODUCTION

Before working with any data, it is necessary to clean it. If you plan to pursue a career in this field, you should be able to handle messy data, which might include missing values, inconsistent values or outliers. A basic understanding of the NumPy and pandas libraries is required before you proceed to clean a dataset.

Data Cleaning

As the name implies, data cleaning is the process of identifying and correcting inaccurate records in a dataset, recognising unreliable or irrelevant parts of the data, and then repairing or removing them.

Data cleaning is one of the least talked about aspects of data science because it is not as glamorous as training a neural network or identifying images, yet those tasks depend on it. Without data cleaning, machine learning prediction models will not be as efficient and accurate as we want them to be.

We will discuss the following inconsistencies in data:

  • Missing data
  • Columns that need processing
  • Removing characters from strings

Missing Data

When and why is data missing?

When filling out surveys, people sometimes do not share all the required information, and some enter values with the wrong datatype. All of this results in missing data, which must be accounted for before the data is used to make predictions.

We will be looking at a dataset about students who wrote an entrance examination into different levels; their scores, schools of preference and other details are contained therein.

As usual, we will first import pandas and read the dataset…

import pandas as pd
data = pd.read_csv('Responses.csv')

Where necessary, you can convert non-standard representations of NaN (e.g. 'n/a', 'na', '--') to missing values. This is done by:

missing_values = ['n/a', 'na', '--']
data = pd.read_csv('Responses.csv', na_values=missing_values)
data.head()

Checking the number of missing values in the 'District' column

Looking at the District column, we want to check whether it has any null values, and we also want to know the number of null values in the column.

data['District'].isnull().values.any()  # Check whether there are any missing values
# Returns True
data['District'].isnull().sum()  # Count the missing values
# Returns 16
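
The same check extends naturally to the whole dataset. A quick sketch, assuming the same dataframe:

data.isnull().sum()        # Missing-value count for every column
data.isnull().sum().sum()  # Total number of missing values in the dataset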

Replacing all NaN values

We can choose to replace all NaN values with a value of our choice, let's say 14.

data['District'].fillna(14, inplace=True)  # Replace missing values
data['District']

Replacing a specific NaN value

We can also replace a value in a specific location, say, row index 3.

data.loc[3, 'District'] = 32
data
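
You are also not limited to one cell at a time. A minimal sketch using boolean indexing to fill NaNs only in rows that meet a condition; the 'Level' column here is hypothetical, used purely for illustration:

# Fill District NaNs only where a (hypothetical) 'Level' column equals 'JSS1'
mask = data['District'].isnull() & (data['Level'] == 'JSS1')
data.loc[mask, 'District'] = 14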

Replacing missing values with the median

We can also replace the NaN values with the median of the column; in this case the median is 3.5.

median = data['District'].median()
median  # Returns 3.5
data['District'].fillna(median, inplace=True)
data['District']
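
The median is just one choice of fill. A short sketch of two other common strategies, assuming 'District' is numeric and 'School Assigned' (a column from this dataset) is categorical:

data['District'].fillna(data['District'].mean(), inplace=True)  # Mean for numeric columns
data['School Assigned'].fillna(data['School Assigned'].mode()[0], inplace=True)  # Most frequent value for categorical columns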

Drop missing values

If you want to simply exclude the missing values, use the dropna method along with the axis argument. By default axis=0, which operates along rows: if any value within a row is NaN, the whole row is excluded.

# Drop any rows which have any NaNs
data.dropna()
# Drop columns that have any NaNs
data.dropna(axis=1)
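
dropna takes a few other useful arguments. A brief sketch, assuming the same dataframe (both calls return new dataframes rather than modifying data in place):

data.dropna(how='all')            # Drop a row only if every value in it is NaN
data.dropna(subset=['District'])  # Drop rows with NaN in the District column only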

Dropping columns that are missing more than 10% of their data

This is one interesting feature I recently learnt about. The parameter thresh=N requires that a column have at least N non-NaN values to survive. Here, we only keep a column if it has a value for at least 90% of the records before we consider it as a candidate feature for our model.

# Only drop columns which do not have at least 90% non-NaNs
data.dropna(thresh=int(data.shape[0] * .9), axis=1)
# Returns a dataframe with 117 rows and 8 columns
# Recall that the original data had 117 rows and 12 columns

So this means that 4 columns are each missing more than 10% of their values. Being that sparse, they are likely to have little or no impact on the outcome of our result.
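
If you want to see exactly which columns were dropped, a quick sketch comparing the column sets before and after:

kept = data.dropna(thresh=int(data.shape[0] * .9), axis=1)
dropped_columns = set(data.columns) - set(kept.columns)
print(dropped_columns)  # The four sparse columns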

Another way of doing the above is to manually scan through the columns and drop whichever ones will have little impact on the outcome of our result.

to_drop = ['District', 'School Preferences', 'School Assigned', 'Will you enroll there?']
data.drop(columns=to_drop, inplace=True)
#We will have the same result as the above
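
As a small safeguard, drop also accepts an errors argument, so a misspelt or already-removed column name will not raise a KeyError:

data.drop(columns=to_drop, inplace=True, errors='ignore')  # Silently skip any names not present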

Deleting some characters in strings

Assuming we want to work on a large dataset that has some strings we don't want to include in our model, we can use the function below to remove some characters from each string.

In this example (a different dataset from the one above), soupsubcategory is the only column with a datatype of 'object', so we select it with select_dtypes(['object']) and remove '/images' from each value in that column using a lambda function. Note that str.strip treats its argument as a set of characters to remove from both ends, not as an exact substring.

new_dataset = dataset.select_dtypes(['object'])
dataset[new_dataset.columns] = new_dataset.apply(lambda x: x.str.strip('/images'))
print(dataset)
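
Because str.strip works on a character set, a value like 'sample/image' could also lose characters it should keep. A safer sketch for removing an exact leading substring, using str.replace with an anchored regex (same column setup assumed):

new_dataset = dataset.select_dtypes(['object'])
# Remove the literal prefix '/images' only when it appears at the start of the string
dataset[new_dataset.columns] = new_dataset.apply(lambda x: x.str.replace(r'^/images', '', regex=True))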

There are several other functions and methods that we could apply to our data which are not covered in this article. You can learn more from this course.

Conclusion

There are tons of resources that can help you gain a deeper understanding of Python for data science. The above is a small fraction of what data science entails. After cleaning your data, you will have to visualize it (data visualization) before processing it and making predictions from the result.

Thanks for reading.
