Data Cleaning and Preparing functions in Python
As every aspiring data scientist is aware about the importance of data cleaning and preparation, let’s dive into some of the methods which we can use for data cleaning using pandas and numpy in python.
First of all you need to import the 2 libraries
import pandas as pd
import numpy as np
We will be creating temporary dataframes and series in this article for demonstration purpose.
- To handle missing data
string_data = pd.Series(['affd','fafd','dgbddb',np.nan])
string_data.isnull()#output
0 False
1 False
2 False
3 True
dtype: bool
Applying .isnull() function on the dataframe returns a series of bool items with null values as true.
2. Filtering out missing data
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])data_na = data.dropna()
data[data.notnull()]# output
0 1.0
2 3.5
4 7.0
dtype: float64
You can use .dropna() function on your dataframe or you can use .notull(). The result is a series with all non null values.
For dataframe dropna() by default drops all rows containing a NA value.
data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])
data
data.dropna()#output
data
Out[12]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaNdata.dropna()
Out[13]:
0 1 2 3
0 1 2 3.0 4.0
2 2 6 3.0 1.0
Passing how=”all” will only drop rows with all NAs.
data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA],[NA,NA,NA,NA]])data
Out[17]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN
4 NaN NaN NaN NaNdata.dropna(how = "all")
Out[18]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 4.0 6.0 4.0 NaN
2 2.0 6.0 3.0 1.0
3 8.0 9.0 NaN NaN
To drop columns in the same way pass axis = 1
data = pd.DataFrame([[1,2,3,4],[4,6,4,NA],
[2,6,3,1],[8,9,NA,NA]])data
Out[22]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaNdata.dropna(axis = 1)
Out[23]:
0 1
0 1 2
1 4 6
2 2 6
3 8 9data.dropna(axis = 1, how = "all")
Out[24]:
0 1 2 3
0 1 2 3.0 4.0
1 4 6 4.0 NaN
2 2 6 3.0 1.0
3 8 9 NaN NaN
Suppose you want to keep only the rows containing a certain number of observations.
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NAdf
Out[28]:
0 1 2
0 0.926414 NaN NaN
1 0.159192 NaN NaN
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035df.dropna()
Out[29]:
0 1 2
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
df.dropna(thresh = 2)
Out[30]:
0 1 2
2 -0.724809 NaN -1.417761
3 0.282927 NaN -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
3. Filling Missing values
calling fillna with a constant replaces missing value with that value
df.fillna(0)
Out[31]:
0 1 2
0 0.926414 0.000000 0.000000
1 0.159192 0.000000 0.000000
2 -0.724809 0.000000 -1.417761
3 0.282927 0.000000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
Calling fillna() with a dict you can use different fill value for each column
df.fillna({1:0.5,2:0})
Out[32]:
0 1 2
0 0.926414 0.500000 0.000000
1 0.159192 0.500000 0.000000
2 -0.724809 0.500000 -1.417761
3 0.282927 0.500000 -1.463186
4 0.170544 -0.490909 -0.974798
5 -0.770864 -0.087966 -0.125960
6 -1.874506 1.419397 -0.758035
4. Removing duplicates
data = pd.DataFrame({'k1':['one','two'] *3 + ['two'],
'k2' : [1,1,2,3,3,4,4]})data
Out[40]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4data.duplicated() ## returns a boolean series
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: booldata.drop_duplicates()
data.duplicated() ## returns a boolean series
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: booldata.drop_duplicates()
Out[42]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
You can also specify a subset to detect duplicates
data['v1'] = range(7)
data
Out[44]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6data.drop_duplicates(['k1'])
Out[45]:
k1 k2 v1
0 one 1 0
1 two 1 1
So, these are some of the functions which we can use for cleaning and preparing data before we go on to do further analysis on that. Will cover some more in the coming parts like transforming your data using maps and lambda functions.
Any suggestions for improvement are welcome and do like the article if it helped you in some or the other way.