Data Cleaning and Preparation Functions in Python

Swapnil Lawand
Published in Analytics Vidhya
4 min read · Apr 26, 2020

Every aspiring data scientist knows the importance of data cleaning and preparation, so let's dive into some of the methods pandas and NumPy offer for cleaning data in Python.

First, import the two libraries:

import pandas as pd
import numpy as np

We will create temporary DataFrames and Series throughout this article for demonstration purposes.

1. Detecting missing data
string_data = pd.Series(['affd','fafd','dgbddb',np.nan])
string_data.isnull()
#output
0 False
1 False
2 False
3 True
dtype: bool

Calling the .isnull() method on a Series returns a boolean Series, with True marking the null values.
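A related trick (my addition, not part of the original walkthrough): since True counts as 1, chaining .sum() onto .isnull() gives a quick count of missing values per column. The DataFrame and variable names below are just illustrative.

```python
import pandas as pd
import numpy as np

# A small DataFrame with some missing values
df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 'x']})

# isnull() gives a boolean DataFrame; sum() counts the Trues per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

This is often the first thing to run on a new dataset, to see which columns need attention.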

2. Filtering out missing data

from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data_na = data.dropna()
data[data.notnull()]
# output
0 1.0
2 3.5
4 7.0
dtype: float64

You can use .dropna() on your Series, or filter with boolean indexing via .notnull(). Either way, the result is a Series containing only the non-null values.

For a DataFrame, dropna() by default drops every row that contains at least one NA value.

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])
data
data.dropna()
#output
data
Out[12]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
data.dropna()
Out[13]:
0 1 2 3
0 1 2 3.0 4.0
2 2 6 3.0 1.0

Passing how="all" drops only the rows in which all values are NA.

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA],[NA,NA,NA,NA]])
data
Out[17]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN
4 NaN NaN NaN NaN
data.dropna(how = "all")
Out[18]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN

To drop columns in the same way, pass axis=1:

data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])
data
Out[22]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
data.dropna(axis = 1)
Out[23]:
0 1
0 1 2
1 4 6
2 2 6
3 8 9
data.dropna(axis = 1, how = "all")
Out[24]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
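Another option worth knowing (my addition, using the same DataFrame as above): dropna() accepts a subset argument, so you can drop rows based on NAs in particular columns only.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1, 2, 3, 4], [4, 6, 4, NA],
                     [2, 6, 3, 1], [8, 9, NA, NA]])

# Drop a row only if column 3 is NA; NAs elsewhere are ignored
cleaned = data.dropna(subset=[3])
print(cleaned)
```

Here rows 1 and 3 are dropped because their value in column 3 is NaN, while the NaN in column 2 of row 3 would not on its own have mattered.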

Suppose you want to keep only the rows containing at least a certain number of non-NA observations; the thresh argument does this.

df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
Out[28]:
0 1 2
0 0.926414 NaN NaN
1 0.159192 NaN NaN
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
df.dropna()
Out[29]:
0 1 2
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035

df.dropna(thresh = 2)
Out[30]:
0 1 2
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035

3. Filling missing values

Calling fillna() with a constant replaces every missing value with that constant:

df.fillna(0)
Out[31]:
0 1 2
0 0.926414 0.000000 0.000000
1 0.159192 0.000000 0.000000
2 -0.724809 0.000000 -1.417761
3 0.282927 0.000000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
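A common variant of this (my addition, not from the original article) is to fill each column's NaNs with that column's mean rather than a fixed constant, by passing the result of .mean() to fillna(). The DataFrame below is just a small illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# df.mean() is a Series of column means; fillna() matches it by column label
df_filled = df.fillna(df.mean())
print(df_filled)
```

Here the NaN in column 'x' becomes 2.0, the mean of 1.0 and 3.0.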

Calling fillna() with a dict lets you use a different fill value for each column:

df.fillna({1:0.5,2:0})
Out[32]:
0 1 2
0 0.926414 0.500000 0.000000
1 0.159192 0.500000 0.000000
2 -0.724809 0.500000 -1.417761
3 0.282927 0.500000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
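Besides constants and dicts, pandas can also fill forward, propagating the last valid observation into the gaps. A minimal sketch (my addition; the Series is illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NaN takes the last non-NaN value above it
filled = s.ffill()
print(filled)
```

Forward filling is especially handy for time series, where carrying the last observation forward is often the natural imputation.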

4. Removing duplicates

data = pd.DataFrame({'k1': ['one','two'] * 3 + ['two'],
'k2': [1,1,2,3,3,4,4]})
data
Out[40]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
data.duplicated() ## returns a boolean series
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
Out[42]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

You can also specify a subset of columns on which to detect duplicates:

data['v1'] = range(7)
data

Out[44]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
data.drop_duplicates(['k1'])
Out[45]:
k1 k2 v1
0 one 1 0
1 two 1 1
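One more knob (my addition, building on the same data): duplicated() and drop_duplicates() keep the first occurrence by default, but passing keep='last' keeps the last one instead.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Rows 5 and 6 are duplicates on (k1, k2); keep='last' retains row 6
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')
print(last_kept)
```

This matters when the rows carry extra columns (here v1) and you want the most recent version of each duplicated key.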

These are some of the functions you can use to clean and prepare data before moving on to further analysis. The coming parts will cover more, such as transforming your data using maps and lambda functions.

Any suggestions for improvement are welcome, and do like the article if it helped you in some way.
