Data cleansing and preparation for analysis with Python and Pandas
Any data usually (always?) contain errors. In order to do accurate analysis and build efficient ML models, data needs to be cleansed prior to digging into it. Python Pandas has powerfull tools for data cleansing and further preparation for modeling. Let’s see how that’s done…
Test dataset
We’re going to use Netflix titles dataset, which contains 8.8k
rows:
import pandas as pddf = pd.read_csv('https://raw.githubusercontent.com/practiceprobs/datasets/main/netflix-titles/netflix-titles.csv')
print(df.head())
Columns management
It’s a good idea to check available columns as a first step of data exploration:
print(df.columns)
Which gives us the list of all columns in the dataset:
Renaming columns
We can change names ofcolumns (e.g. for cases of unclear titles):
df.rename(columns = {'show_id': 'id'}, inplace=True)
Here, we rename show_id
column to id
and use inplace=True
to overwrite our existing dataframe object instead of creating new one.
Dropping columns
Removing unncecessary or unclear data is also important, and sometimes we need to drop an entire column:
df.drop(['description', 'listed_in'], axis=1, inplace=True)
Here we remove description
and listed_in
columns (axis=1
is required to remove a column instead of a row). If we need to extract only some columns out of many, we can use a shorter option:
df = df[['id', 'type']]
print(df)
Which will include only id
and type
columns in our final dataset:
Values management
We discover a lot of unclear or broken data during exploratory anaysis in many cases. Sometimes errors are obvious and can be easily fixed by manipulating values.
Updating and creating values
Values can be easily updated using combination of filtering and manipulating operators:
df.loc[df['type'] == 'Movie', 'type'] = 'M'
Here we’ve replaced all Movie
values of type
column with M
value. We can also use math operators as well as functions. Let’s create new column based on calculations from exsiting values:
df['century'] = df['release_year']/100+1
We’ve created new century
column based on release_year
column value and some math:
Replacing text
Text replacement can help fix grammar errors or reduce values variance:
df['title'] = df['title'].str.replace('World War', 'WW')
Here, we have replaced all World War
strings in title
columns with WW
. In order to find values containing certain [sub]strings:
print(df[df['title'].str.contains('WW')])
Which gives us the following:
Updating NaN values
Undefined values are better to overwrite with meaningful values where possible:
df['cast'] = df['cast'].fillna('Nobody')
Now our cast
column has Nobody
values instead of NaN
:
Removing/filtering rows
There are cases, when some rows are meaningless or too corrupted to use in analysis, so it’s better to get rid of them:
df.drop(df.loc[df['release_year'] < 1950].index, inplace=True)
Here we’ve removed all rows where release_year
value is less than 1950
. Shorter form for the same is also available:
df = df[df['release_year'] >= 1950]
So we’ve just filtered all “correct” values and overwritten our df
data frame object.
Managing data types
Sometimes we need to convert between different types. In order to check column types we can use dtypes
property of data frame:
print(df.dtypes)
Which shows types for all columns in our data frame. We find out that century
column is of float64
type:
So let’s convert that column to integer:
df['century'] = df['century'].astype('int')
And we can check dtypes
again to make sure column type has changed:
Managing duplicates
Finding and removing duplicate rows is one of the critical steps during data cleansing, as duplicates can dramatically damange modeling and analysis itself. Finding duplicates is as easy as:
print(df[df.duplicated()])
Which will output a set of rows that have duplicates (all columns have the same value) in the dataset. Single duplicate row has been found in our case:
We can also find duplicates based on specific columns (instead of all columns), e.g. to find duplicate title
column values:
print(df[df['title'].duplicated()])
In order to drop duplicate rows and leave only single unique row out of multiple:
df.drop_duplicates(inplace=True)
This will overwrite df
dataframe with only unique rows from original dataset.
Convert categorical columns to numeric
If we plan to use non numeric columns in ML models, we might want to encode those into numeric values. There are couple of ways to do it.
One hot encoding
One option is to use one hot encoding, which will create multiple binary columns with ones and zeros:
So we literaly take all unique values from a text column and convert that list to [additional] columns. Then we fill those columns with 1
where column name is the same as it’s [text] value in a row. And put 0
in all other cases. Those new columns are also called binary, because they can have one of two possible values — 1
or 0
.
Let’s encode type
column:
df = df.join(pd.get_dummies(df.type, prefix='Type'))
df.drop('type', axis=1, inplace=True)
Here, we have created binary columns prefixed with Type
from values in type
column. We’ve also added those new column to our dataset using join
and then dropped old type
column from dataset. We can see new columns now:
Converting text values to IDs
Another approach to get numeric values out of text is to just convert to IDs, generated for those (unique) values:
df['rating'] = pd.factorize( pd.Series(df['rating']) )[0]
Here, we’ve used factorize
to generate IDs based on rating
column values. Then we’ve written results to rating
column again, which is now numeric:
Summary
Pandas module has powerful tools to cleanse and prepare data for further analysis and modeling which may dramatically improve its accuracy. Cleansing usually includes the following:
- Meaningless columns removal using
df.drop()
; - Erroneous values updates using
df.loc[]
; - Picking correct data types for columns using
df['col'].astype('type')
; - Duplicates removal using
df.drop_duplicates()
; - Text to numeric values encoding (e.g. using one hot encoding).