Pandas — EDA: Smart Way to replace NaN
In this article, I have shown one small trick to replace Null Values / Missing values from the dataset. This is part of the EDA.
Many analysts use to either drop the NaN or replace all the NaN with variable mean or another statistical measurement. However, it’s not always the right method.
In this following example, I have chosen the dataset, where a few of the columns have NaN values, and I have shown how to replace NaN with mean. Wait for a second!! It’s not the whole mean of the column.
Here, I have selected only the BasePay variable to replace the NaN.
Now, what’s the total number of missing values in a particular feature is shown as follows. There are 609 NaN in the BasePay variable.
Now, one option could be to replace NaN with the mean of the variable, which is 66325.44. However, here we are showing the method to replace the NaN with mean, but with respect to a particular JobTitle.
There is a total of 2154 unique JobTitle ranging from General Manager to Cashier. Now, for example, If we replace a missing value for the General Manager’s BasePay with mean [feature mean] or Cashier’s BasePay with mean [feature mean]. It would be injustice and false.
Hence, in this method, we have shown the technique to replace missing values with mean, but with respect to a particular JobTitle. Meaning, General Manager’s mean will only apply to missing values where the JobTitle of the BasePay is General Manager.
And means for a particular JobTitle are as follows,
Here, in the following method, we have used groupby() and transform() functions to replace NaN.
sal[‘BasePay’] = sal[‘BasePay’].fillna(sal.groupby(“JobTitle”)[‘BasePay’].transform(‘mean’))
And now, after the replacement of NaN for BasePay, you can see the clear difference in the mean value, which is changes from 66325.44 to 66287.40
However, still there are 8 missing values! How?
These missing values are from the data, where the data is not provided in JobTitel (refer to the first screenshot)
Now, in this case, we are not replacing the missing values, but we are removing the entire rows.