Data Cleaning and Preparation Functions in Python

Swapnil Lawand
Published in Analytics Vidhya
4 min read · Apr 26, 2020

Every aspiring data scientist knows the importance of data cleaning and preparation, so let's dive into some of the methods pandas and NumPy offer for cleaning data in Python.

First, import the two libraries:

import pandas as pd
import numpy as np

We will create temporary DataFrames and Series throughout this article for demonstration purposes.

1. Detecting missing data
string_data = pd.Series(['affd','fafd','dgbddb',np.nan])
string_data.isnull()
#output
0 False
1 False
2 False
3 True
dtype: bool

Calling the .isnull() method on a Series returns a boolean Series, with True marking the null values.
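A related trick (my addition, not part of the original walkthrough): since True counts as 1, chaining .sum() onto .isnull() gives a quick count of missing values per column. The DataFrame and variable names below are just illustrative.

```python
import pandas as pd
import numpy as np

# A small DataFrame with some missing values
df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 'x']})

# isnull() gives a boolean DataFrame; sum() counts the Trues per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

This is often the first thing to run on a new dataset, to see which columns need attention.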

2. Filtering out missing data

from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data_na = data.dropna()
data[data.notnull()]
# output
0 1.0
2 3.5
4 7.0
dtype: float64

You can use .dropna() on your Series, or filter with boolean indexing via .notnull(). Either way, the result is a Series containing only the non-null values.

For a DataFrame, dropna() by default drops every row that contains at least one NA value.

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])
data
data.dropna()
#output
data
Out[12]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
data.dropna()
Out[13]:
0 1 2 3
0 1 2 3.0 4.0
2 2 6 3.0 1.0

Passing how="all" drops only the rows in which all values are NA.

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA],[NA,NA,NA,NA]])
data
Out[17]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN
4 NaN NaN NaN NaN
data.dropna(how = "all")
Out[18]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN

To drop columns in the same way, pass axis=1:

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])
data
Out[22]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
data.dropna(axis = 1)
Out[23]:
0 1
0 1 2
1 4 6
2 2 6
3 8 9
data.dropna(axis = 1, how = "all")
Out[24]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
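Another option worth knowing (my addition, using the same DataFrame as above): dropna() accepts a subset argument, so you can drop rows based on NAs in particular columns only.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1, 2, 3, 4], [4, 6, 4, NA],
                     [2, 6, 3, 1], [8, 9, NA, NA]])

# Drop a row only if column 3 is NA; NAs elsewhere are ignored
cleaned = data.dropna(subset=[3])
print(cleaned)
```

Here rows 1 and 3 are dropped because their value in column 3 is NaN, while the NaN in column 2 of row 3 would not on its own have mattered.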

Suppose you want to keep only the rows containing at least a certain number of non-NA observations; the thresh argument does this.

df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
Out[28]:
0 1 2
0 0.926414 NaN NaN
1 0.159192 NaN NaN
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
df.dropna()
Out[29]:
0 1 2
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035

df.dropna(thresh = 2)
Out[30]:
0 1 2
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035

3. Filling missing values

Calling fillna() with a constant replaces every missing value with that constant:

df.fillna(0)
Out[31]:
0 1 2
0 0.926414 0.000000 0.000000
1 0.159192 0.000000 0.000000
2 -0.724809 0.000000 -1.417761
3 0.282927 0.000000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
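A common variant of this (my addition, not from the original article) is to fill each column's NaNs with that column's mean rather than a fixed constant, by passing the result of .mean() to fillna(). The DataFrame below is just a small illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# df.mean() is a Series of column means; fillna() matches it by column label
df_filled = df.fillna(df.mean())
print(df_filled)
```

Here the NaN in column 'x' becomes 2.0, the mean of 1.0 and 3.0.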

Calling fillna() with a dict lets you use a different fill value for each column:

df.fillna({1:0.5,2:0})
Out[32]:
0 1 2
0 0.926414 0.500000 0.000000
1 0.159192 0.500000 0.000000
2 -0.724809 0.500000 -1.417761
3 0.282927 0.500000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
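Besides constants and dicts, pandas can also fill forward, propagating the last valid observation into the gaps. A minimal sketch (my addition; the Series is illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NaN takes the last non-NaN value above it
filled = s.ffill()
print(filled)
```

Forward filling is especially handy for time series, where carrying the last observation forward is often the natural imputation.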

4. Removing duplicates

data = pd.DataFrame({'k1': ['one','two'] * 3 + ['two'],
'k2': [1,1,2,3,3,4,4]})
data
Out[40]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
data.duplicated() ## returns a boolean series
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
Out[42]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

You can also specify a subset of columns on which to detect duplicates:

data['v1'] = range(7)
data

Out[44]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
data.drop_duplicates(['k1'])
Out[45]:
k1 k2 v1
0 one 1 0
1 two 1 1
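One more knob (my addition, building on the same data): duplicated() and drop_duplicates() keep the first occurrence by default, but passing keep='last' keeps the last one instead.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Rows 5 and 6 are duplicates on (k1, k2); keep='last' retains row 6
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')
print(last_kept)
```

This matters when the rows carry extra columns (here v1) and you want the most recent version of each duplicated key.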

These are some of the functions you can use to clean and prepare data before moving on to further analysis. The coming parts will cover more, such as transforming your data using maps and lambda functions.

Any suggestions for improvement are welcome, and do like the article if it helped you in some way.
