Pandas data cleansing

Data cleansing and preparation for analysis with Python and Pandas

Denys Golotiuk
DataDenys

--

Any data usually (always?) contain errors. In order to do accurate analysis and build efficient ML models, data needs to be cleansed prior to digging into it. Python Pandas has powerfull tools for data cleansing and further preparation for modeling. Let’s see how that’s done…

Test dataset

We’re going to use Netflix titles dataset, which contains 8.8k rows:

import pandas as pddf = pd.read_csv('https://raw.githubusercontent.com/practiceprobs/datasets/main/netflix-titles/netflix-titles.csv')
print(df.head())

Columns management

It’s a good idea to check available columns as a first step of data exploration:

print(df.columns)

Which gives us the list of all columns in the dataset:

Renaming columns

We can change names ofcolumns (e.g. for cases of unclear titles):

df.rename(columns = {'show_id': 'id'}, inplace=True)

Here, we rename show_id column to id and use inplace=True to overwrite our existing dataframe object instead of creating new one.

Dropping columns

Removing unncecessary or unclear data is also important, and sometimes we need to drop an entire column:

df.drop(['description', 'listed_in'], axis=1, inplace=True)

Here we remove description and listed_in columns (axis=1 is required to remove a column instead of a row). If we need to extract only some columns out of many, we can use a shorter option:

df = df[['id', 'type']]
print(df)

Which will include only id and type columns in our final dataset:

Values management

We discover a lot of unclear or broken data during exploratory anaysis in many cases. Sometimes errors are obvious and can be easily fixed by manipulating values.

Updating and creating values

Values can be easily updated using combination of filtering and manipulating operators:

df.loc[df['type'] == 'Movie', 'type'] = 'M'

Here we’ve replaced all Movie values of type column with M value. We can also use math operators as well as functions. Let’s create new column based on calculations from exsiting values:

df['century'] = df['release_year']/100+1

We’ve created new century column based on release_year column value and some math:

Replacing text

Text replacement can help fix grammar errors or reduce values variance:

df['title'] = df['title'].str.replace('World War', 'WW')

Here, we have replaced all World War strings in title columns with WW. In order to find values containing certain [sub]strings:

print(df[df['title'].str.contains('WW')])

Which gives us the following:

Updating NaN values

Undefined values are better to overwrite with meaningful values where possible:

df['cast'] = df['cast'].fillna('Nobody')

Now our cast column has Nobody values instead of NaN:

Removing/filtering rows

There are cases, when some rows are meaningless or too corrupted to use in analysis, so it’s better to get rid of them:

df.drop(df.loc[df['release_year'] < 1950].index, inplace=True)

Here we’ve removed all rows where release_year value is less than 1950. Shorter form for the same is also available:

df = df[df['release_year'] >= 1950]

So we’ve just filtered all “correct” values and overwritten our df data frame object.

Managing data types

Sometimes we need to convert between different types. In order to check column types we can use dtypes property of data frame:

print(df.dtypes)

Which shows types for all columns in our data frame. We find out that century column is of float64 type:

So let’s convert that column to integer:

df['century'] = df['century'].astype('int')

And we can check dtypes again to make sure column type has changed:

Managing duplicates

Finding and removing duplicate rows is one of the critical steps during data cleansing, as duplicates can dramatically damange modeling and analysis itself. Finding duplicates is as easy as:

print(df[df.duplicated()])

Which will output a set of rows that have duplicates (all columns have the same value) in the dataset. Single duplicate row has been found in our case:

We can also find duplicates based on specific columns (instead of all columns), e.g. to find duplicate title column values:

print(df[df['title'].duplicated()])

In order to drop duplicate rows and leave only single unique row out of multiple:

df.drop_duplicates(inplace=True)

This will overwrite df dataframe with only unique rows from original dataset.

Convert categorical columns to numeric

If we plan to use non numeric columns in ML models, we might want to encode those into numeric values. There are couple of ways to do it.

One hot encoding

One option is to use one hot encoding, which will create multiple binary columns with ones and zeros:

So we literaly take all unique values from a text column and convert that list to [additional] columns. Then we fill those columns with 1 where column name is the same as it’s [text] value in a row. And put 0 in all other cases. Those new columns are also called binary, because they can have one of two possible values — 1 or 0.

Let’s encode type column:

df = df.join(pd.get_dummies(df.type, prefix='Type'))
df.drop('type', axis=1, inplace=True)

Here, we have created binary columns prefixed with Type from values in type column. We’ve also added those new column to our dataset using join and then dropped old type column from dataset. We can see new columns now:

Converting text values to IDs

Another approach to get numeric values out of text is to just convert to IDs, generated for those (unique) values:

df['rating'] = pd.factorize( pd.Series(df['rating']) )[0]

Here, we’ve used factorize to generate IDs based on rating column values. Then we’ve written results to rating column again, which is now numeric:

Summary

Pandas module has powerful tools to cleanse and prepare data for further analysis and modeling which may dramatically improve its accuracy. Cleansing usually includes the following:

  • Meaningless columns removal using df.drop();
  • Erroneous values updates using df.loc[];
  • Picking correct data types for columns using df['col'].astype('type');
  • Duplicates removal using df.drop_duplicates() ;
  • Text to numeric values encoding (e.g. using one hot encoding).

--

--