Appropriate Ways to Treat Missing Values
As we know, most of the time in data-oriented industries goes into data preparation and data cleaning. In some cases, data professionals spend up to 90% of their time on data preparation. Dealing with missing data is one of the most difficult parts of the data preparation phase.
Missing values are often represented as NaNs, blanks, or zeros in the data.
If missing values are not handled properly, the results may lead to inaccurate inferences about the data and can produce biased estimates.
The Missing Values can be counted in each column with the command:
data.isnull().sum()
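As a minimal sketch, here is that command applied to a small hypothetical DataFrame (the column names "age" and "city" are invented for illustration):

```python
import numpy as np
import pandas as pd

# A toy DataFrame with a few deliberately missing entries.
data = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

# isnull() marks each missing cell; sum() counts them per column.
missing_counts = data.isnull().sum()
print(missing_counts)
```

Both `np.nan` in numeric columns and `None` in object columns are counted as missing by `isnull()`.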
How do Missing Values occur in Data?
Missing data can arise from a missing sequence, an incomplete feature, missing files, incomplete information, data entry errors, data collection problems, etc.
Types of Missing Values:
MCAR: Missing completely at random
The missing values are randomly distributed across all observations.
Example: A Blood Sample gets damaged in the Lab.
MAR: Missing at random
The missing values are not randomly distributed across all observations but are distributed within one or more sub-samples.
Example: A child does not appear for an examination because he is sick.
NMAR: Not missing at random
When data are missing not at random, the missingness is specifically related to what is missing.
Example: A person did not take an English proficiency test because of his poor English language skills.
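The three mechanisms above can be made concrete with a toy simulation. This is a sketch, not from the article: the columns "score" and "sick" are invented, and the thresholds are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "score": rng.normal(60, 10, n),   # exam scores
    "sick": rng.random(n) < 0.2,      # 20% of students happen to be sick
})

# MCAR: every value has the same chance of being lost,
# like a blood sample damaged at random in the lab.
mcar = df["score"].mask(rng.random(n) < 0.1)

# MAR: missingness depends on ANOTHER observed variable
# (sick students miss the exam), not on the score itself.
mar = df["score"].mask(df["sick"])

# NMAR: missingness depends on the missing value itself
# (weak students skip the test).
nmar = df["score"].mask(df["score"] < 50)
```

With MAR, the missingness can be explained by the observed `sick` column; with NMAR it cannot, which is what makes NMAR the hardest case to correct for.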
Here are the most common methods to deal with Missing data:
1. Discard Data-
This is one of the most intuitive and simple methods. If the number of missing values is small relative to a large dataset, the affected rows can simply be dropped using the following command:
df.dropna()
If a large number of observations are missing from a single variable, then that variable should be dropped instead.
This is not the recommended method, though, as it can lead to a significant decrease in sample size.
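A minimal sketch of both discard strategies, row-wise and column-wise; the column names and the 50% completeness threshold are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 7.0],  # mostly missing
})

# Drop any row that contains at least one missing value.
rows_kept = df.dropna()

# Drop columns that are less than 50% complete
# (thresh = minimum number of non-NA values required to keep a column).
cols_kept = df.dropna(axis=1, thresh=int(len(df) * 0.5))
```

Here only the fully complete row survives `dropna()`, and column `b` is removed because only one of its four values is present.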
2. Mean, Median, Mode Imputation-
Imputation is the act of replacing missing data with statistical estimates of the missing values.
The imputation method should be chosen after considering the distribution of the data. Mean imputation works well when the distribution is normal (Gaussian), while median imputation is preferable for skewed distributions (whether right- or left-skewed).
The choice of imputation also depends on the datatype of the column.
If the column is numerical, replace missing values with the mean in the case of a normal distribution, or with the median in the case of a skewed distribution. If the column is categorical, mode imputation is the appropriate method.
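These three rules can be sketched in pandas as follows; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "normal_col": [10.0, 12.0, np.nan, 11.0],  # roughly symmetric -> mean
    "skewed_col": [1.0, 2.0, 100.0, np.nan],   # right-skewed -> median
    "category":   ["a", "b", "b", None],       # categorical -> mode
})

# Numerical, roughly normal: fill with the column mean.
df["normal_col"] = df["normal_col"].fillna(df["normal_col"].mean())

# Numerical, skewed: the median resists the outlier (100.0) pulling the mean up.
df["skewed_col"] = df["skewed_col"].fillna(df["skewed_col"].median())

# Categorical: fill with the most frequent value (mode returns a Series,
# so take its first entry).
df["category"] = df["category"].fillna(df["category"].mode()[0])
```

Note how the skewed column illustrates the rule: its mean (about 34) is dragged up by the outlier, while the median (2.0) is a far more typical value to impute.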
3. K-Nearest Neighbour Imputation (KNN)-
This method uses the k-nearest neighbours algorithm to estimate and replace missing data. The k neighbours are chosen using some distance measure, and their average is used as the imputation estimate. One should try different values of k with different distance metrics to find the best match. The advantage of KNN is that it is simple to implement, but it suffers from the curse of dimensionality: it works well for a small number of variables but becomes computationally inefficient when the number of variables is large.
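A minimal sketch using scikit-learn's `KNNImputer` (assuming scikit-learn is installed; the data here is invented):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Four observations of two features, with one missing entry.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],  # this value will be imputed
    [3.0, 4.0],
    [2.5, 3.0],
])

# Each missing entry is replaced by the average of that feature over the
# k nearest rows, measured by a NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

In practice one would cross-validate over `n_neighbors` (and the `weights` parameter) rather than fixing k = 2 as done here.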
4. Regression Imputation
This approach replaces missing values with values predicted from a regression line.
Regression is a statistical method that models the relationship between a dependent variable and one or more independent variables. A simple linear regression is expressed as y = mx + b.
For Example:
Triceps skinfold thickness is one of the variables where we see some missing values. The missing values in this variable can be imputed by using the information in all the other variables as predictors.
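A minimal sketch of this idea: fit a linear model on the complete rows, then predict only the missing entries. The data is invented, "bmi" stands in for the other predictor variables, and the use of scikit-learn's `LinearRegression` is an assumption, not the article's prescribed tool.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "bmi":     [20.0, 25.0, 30.0, 35.0, 28.0],
    "triceps": [15.0, 22.0, 29.0, np.nan, 25.0],  # variable with a gap
})

# Fit y = m*x + b on the rows where triceps is observed.
known = df["triceps"].notna()
model = LinearRegression().fit(df.loc[known, ["bmi"]],
                               df.loc[known, "triceps"])

# Replace only the missing entries with the model's predictions.
df.loc[~known, "triceps"] = model.predict(df.loc[~known, ["bmi"]])
```

A caveat worth keeping in mind: plain regression imputation places every filled value exactly on the line, which understates the natural variability of the data.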
5. Filling Missing Values
One way to fill missing values is to replace NA with a scalar value, as below:
df.fillna(0)
Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward:
df.fillna(method='pad')
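Both fills can be sketched on a small Series. Note that newer pandas versions deprecate `fillna(method='pad')` in favour of the equivalent `ffill()` (and `bfill()` for the backward direction):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

zero_filled = s.fillna(0)  # replace NA with a scalar
forward = s.ffill()        # propagate last valid value forward ('pad')
backward = s.bfill()       # propagate next valid value backward
```

Forward fill is a natural choice for ordered data such as time series, where the last observed value is a reasonable stand-in until a new one arrives.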
Conclusion:
There are different approaches to dealing with missing values, and the right choice depends heavily on the nature of the data. The more attentively you treat the missing values, the better accuracy you can expect after training your model.