Missing values in Data Science
Almost every dataset contains missing data, and the problem should not be taken lightly: missing values are among the most important issues in data analysis. They are problematic because results computed from incomplete data can be misleading, and because no single best way of handling them exists.
In this article, a general description of missing data is given first, followed by some possible solutions.
What is missing data?
A missing value is a value that was not recorded in the dataset for some observation. Missing values were classified in 1976 by D. B. Rubin, who reasoned that every data point has some probability of being missing. The classes are as follows:
· Missing Completely at Random (MCAR)
· Missing at Random (MAR)
· Missing Not at Random (MNAR)
Missing Completely at Random (MCAR):
In this case, the missingness is unrelated to the data, whether observed or unobserved. If the probability of being missing is the same for all cases, then the data are missing completely at random. MCAR data, however, are highly unusual in practice.
For example, if some responses to a population survey are lost purely by accident, they are missing completely at random.
Missing at Random (MAR):
This is the case when the probability that a value is missing can be explained by another, observed variable, but not by the missing values themselves.
For instance, in a survey about depression levels in the two sexes, males are less likely than females to answer the questions on depression level. If the missingness depends on gender only, this case is MAR.
Missing Not at Random (MNAR):
This class is also called “Not Missing at Random (NMAR)”. Here, the values are missing for reasons that are not captured in the observed data, often reasons related to the missing values themselves. One such reason can be the refusal of respondents to answer. MNAR is the most complex case, because it is harder to handle than the others. There is no way to drop or impute these values without introducing bias into the dataset, which can change the results and mislead us in the future.
A good illustration of MNAR is a workplace questionnaire in which employees or employers do not answer the questions about their salaries.
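The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the survey table, its column names and all the probabilities are made up, and the depression score is generated independently of sex. It shows how each mechanism decides which answers go missing and how the mean of the observed answers reacts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey (synthetic data): sex and a depression score per respondent.
survey = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "depression": rng.normal(loc=50, scale=10, size=n),
})

# MCAR: every answer has the same 20% chance of being lost.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of a missing answer depends only on the observed 'sex' column.
p_mar = np.where(survey["sex"] == "male", 0.35, 0.05)
mar_mask = rng.random(n) < p_mar

# MNAR: the chance of a missing answer depends on the (unobserved) score itself.
p_mnar = np.where(survey["depression"] > 60, 0.60, 0.05)
mnar_mask = rng.random(n) < p_mnar

# Mean of the scores that remain observed under each mechanism.
true_mean = survey["depression"].mean()
mcar_mean = survey["depression"].mask(mcar_mask).mean()
mnar_mean = survey["depression"].mask(mnar_mask).mean()

# Under MAR the missing rate differs between the groups of the observed variable.
mar_rate_by_sex = pd.Series(mar_mask).groupby(survey["sex"]).mean()

print(f"true mean: {true_mean:.2f}")
print(f"observed mean under MCAR: {mcar_mean:.2f}")
print(f"observed mean under MNAR: {mnar_mean:.2f}")
print(mar_rate_by_sex)
```

In this simulation the MCAR mean stays close to the true mean, while the MNAR mean is pulled downwards because high scores are the ones most often withheld.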
Identifying missing values is straightforward in Python. First, the appropriate libraries should be imported and the dataset should be read.
import pandas as pd
df = pd.read_excel(r'...dataset.xlsx')
The first method of missing data identification is:
df.isnull().any()
which returns a Boolean (“True” or “False”) for each column. “True” means that the column contains missing values, whereas “False” means that it contains none.
The second method of missing data identification is:
df.isnull().sum()
which returns the number of missing values in each column.
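Both checks can be tried on a small hand-made table. The sketch below assumes that the two checks are the pandas calls isnull().any() and isnull().sum(); the table "toy" and its values are hypothetical and merely stand in for the spreadsheet read above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the article's spreadsheet.
toy = pd.DataFrame({
    "GPA": [3.2, np.nan, 3.8, 3.5],
    "IELTS": [6.5, 7.0, np.nan, np.nan],
    "Name": ["A", "B", "C", "D"],
})

# True for every column that holds at least one missing value.
has_missing = toy.isnull().any()

# Number of missing values per column.
n_missing = toy.isnull().sum()

# Share of missing values per column (the mean of a Boolean column is a proportion).
share_missing = toy.isnull().mean()

print(has_missing)
print(n_missing)
print(share_missing)
```

The share of missing values is not mentioned in the text, but it is often a handy companion to the raw counts when columns have different lengths of usable data.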
Various methods have been proposed for dealing with missing values. They fall into two main groups: easy and professional. The easy methods rely on deletion:
· Ignore tuples with missing values: This method is appropriate when the dataset is large and the tuple is missing several values.
· Drop missing values: This is appropriate only when the dataset is large.
Deletion itself is divided into four categories:
· Listwise: Also referred to as “complete case analysis”, this is an easy solution for a large dataset whose values are MCAR. For a small dataset, it can create bias and give misleading results. In this case, every observation (row) that contains a missing value is deleted.
· Pairwise: Each analysis uses all the cases that are complete for the variables involved, so a case is left out only of the calculations that need its missing value. This preserves more information than listwise deletion.
· Entire variables: If 60% of the values in a column are missing, then this column can be deleted entirely.
· Dropping: This is the process of deleting a whole row of data.
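The deletion variants can be sketched with pandas as follows. This is a minimal illustration, not a recipe: the table "toy", the column GRE and all the values are hypothetical, and the 60% threshold is simply the figure quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; GRE is missing in 3 of 5 rows (60%).
toy = pd.DataFrame({
    "GPA":   [3.2, np.nan, 3.8, 3.5, 3.0],
    "IELTS": [6.5, 7.0, np.nan, 7.5, 6.0],
    "GRE":   [np.nan, np.nan, np.nan, 320.0, 310.0],
})

# Listwise deletion (dropping rows): keep only rows without any missing value.
complete_cases = toy.dropna()

# Entire-variable deletion: drop columns in which at least 60% of values are missing.
missing_share = toy.isnull().mean()
reduced = toy.loc[:, missing_share < 0.60]

# Pairwise deletion: for each pair of columns, DataFrame.corr() uses every row
# in which both columns are observed, instead of only the fully complete rows.
pairwise_corr = toy.corr()

# Rows available to the GPA/IELTS correlation under pairwise deletion.
n_pair_gpa_ielts = len(toy[["GPA", "IELTS"]].dropna())

print(len(complete_cases), "complete rows")
print(list(reduced.columns), "columns kept")
print(n_pair_gpa_ielts, "rows used for the GPA/IELTS pair")
```

Here listwise deletion leaves fewer rows than the GPA/IELTS pair can use on its own, which is the sense in which pairwise deletion preserves more information.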
Although dropping is one way of handling missing data, it has a very important disadvantage. Because of a single missing value, an entire row is deleted, including observed values that would have been valuable for the analysis. That is why imputing (filling in) the missing values is often a better choice than dropping.
Imputation replaces each missing value with an estimate. Different imputation methods have been proposed, ranging from simple to complex.
One imputation method uses the mean or the median. The mean or median of a particular column is calculated and filled in for the missing values in that column. For categorical data, however, the mode is used instead.
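# impute missing GPA values with the column mean
missing_col = ['GPA']
for i in missing_col:
    df.loc[df.loc[:, i].isnull(), i] = df.loc[:, i].mean()
# impute missing IELTS values with the column median
missing_col = ['IELTS']
for i in missing_col:
    df.loc[df.loc[:, i].isnull(), i] = df.loc[:, i].median()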
Advantages of mean/median imputation:
· It is quick and easy.
· When the mean is imputed, the mean of the whole column does not change.
· Using the mean makes sense, because it is a reasonable estimate for a randomly selected observation.
Disadvantages:
· It can give poor results for categorical features.
· It reduces the variance of the column and distorts the covariance with the remaining variables.
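The effect on the mean and the variance can be checked on a few numbers. The sketch below uses a hypothetical table (the column Major and all values are invented for illustration) and also shows mode imputation for a categorical column, as mentioned earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a numeric and a categorical column.
toy = pd.DataFrame({
    "GPA": [2.0, 3.0, np.nan, 4.0, np.nan, 3.0],
    "Major": ["CS", "Math", "CS", None, "CS", "Math"],
})

# Statistics of the observed values only (pandas skips NaN by default).
mean_before = toy["GPA"].mean()
var_before = toy["GPA"].var()

imputed = toy.copy()
# Numeric column: fill the gaps with the observed mean.
imputed["GPA"] = imputed["GPA"].fillna(mean_before)
# Categorical column: mode() returns a Series (ties are possible), so take the first entry.
imputed["Major"] = imputed["Major"].fillna(imputed["Major"].mode()[0])

mean_after = imputed["GPA"].mean()
var_after = imputed["GPA"].var()

print(f"mean before/after: {mean_before:.3f} / {mean_after:.3f}")
print(f"variance before/after: {var_before:.3f} / {var_after:.3f}")
print(imputed)
```

The mean is unchanged, while the variance shrinks because every imputed value sits exactly at the centre of the distribution.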
Linear regression can be used as follows. Using the existing variables, predicted values are calculated and imputed into the dataset, so the relationship between the variables is preserved. The drawbacks of this method are that a regression model has to be fitted for every variable with missing values and that, because the imputed values lie exactly on the regression line, the standard error is underestimated.
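A minimal sketch of regression imputation with a single predictor is given below. The data are synthetic: the linear link between IELTS and GPA, its coefficients and the 25% missing rate are assumptions made purely for illustration, and np.polyfit is just one convenient way to fit the line.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic data with an assumed linear relationship between IELTS and GPA.
ielts = rng.uniform(5.0, 9.0, size=n)
gpa = 1.0 + 0.3 * ielts + rng.normal(0.0, 0.2, size=n)
data = pd.DataFrame({"IELTS": ielts, "GPA": gpa})

# Hide roughly 25% of the GPA values.
data.loc[rng.random(n) < 0.25, "GPA"] = np.nan

# Fit GPA ~ IELTS on the rows where GPA is observed.
observed = data["GPA"].notnull()
slope, intercept = np.polyfit(
    data.loc[observed, "IELTS"], data.loc[observed, "GPA"], deg=1
)

# Replace each missing GPA with the value predicted by the fitted line.
reg_imputed = data.copy()
reg_imputed.loc[~observed, "GPA"] = intercept + slope * data.loc[~observed, "IELTS"]

print(f"fitted line: GPA = {intercept:.2f} + {slope:.2f} * IELTS")
print(reg_imputed["GPA"].isnull().sum(), "missing values left")
```

Every imputed point lies exactly on the fitted line, which is why this approach makes the data look less noisy than they really are.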
K-Nearest Neighbours (kNN) Algorithm
This method finds the k observations most similar to the one with the missing value and imputes the mean, median or mode of these neighbours. kNN is a simple algorithm, also used for classification, that relies on ‘feature similarity’ to predict values for new data points. The distance metric between observations is one of the essential parameters of the kNN algorithm. The algorithm is computationally expensive on large datasets, however, which is why it is not always recommended.
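One readily available implementation is KNNImputer from scikit-learn, which is an assumption here since the article does not name a library. The sketch below uses a tiny hypothetical array and k = 2; KNNImputer fills each gap with the mean of the neighbours' values, measuring distance on the features that both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: two numeric features, one gap in each column.
X = np.array([
    [3.0, 6.0],
    [3.2, 6.5],
    [3.8, 8.0],
    [np.nan, 6.4],   # first feature missing
    [3.9, np.nan],   # second feature missing
])

# Each gap is filled with the mean of that feature over the 2 nearest rows
# that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Because the distance mixes all features, it is usually sensible to bring the columns to a comparable scale before imputing; that step is omitted here for brevity.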
In stochastic regression imputation, a random residual error is added to the value predicted by the regression. The method resembles linear regression imputation but adds a random component, which can be an additional advantage of this method.
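The idea of adding a random residual to a regression prediction (often called stochastic regression imputation) can be sketched as follows. The data are synthetic, and drawing the noise from a normal distribution with the residual standard deviation of the fit is an assumption of this sketch rather than the only possible choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000

# Synthetic data with an assumed linear relationship between IELTS and GPA.
ielts = rng.uniform(5.0, 9.0, size=n)
gpa = 1.0 + 0.3 * ielts + rng.normal(0.0, 0.2, size=n)
data = pd.DataFrame({"IELTS": ielts, "GPA": gpa})
data.loc[rng.random(n) < 0.25, "GPA"] = np.nan

# Fit GPA ~ IELTS on the observed rows.
observed = data["GPA"].notnull()
x_obs = data.loc[observed, "IELTS"].to_numpy()
y_obs = data.loc[observed, "GPA"].to_numpy()
slope, intercept = np.polyfit(x_obs, y_obs, deg=1)

# Spread of the observed points around the fitted line
# (ddof=2 because two coefficients were estimated).
residuals = y_obs - (intercept + slope * x_obs)
resid_std = residuals.std(ddof=2)

# Prediction plus a random residual for every missing GPA.
x_mis = data.loc[~observed, "IELTS"].to_numpy()
predicted = intercept + slope * x_mis
noise = rng.normal(0.0, resid_std, size=x_mis.size)

stochastic = data.copy()
stochastic.loc[~observed, "GPA"] = predicted + noise

print(f"residual standard deviation: {resid_std:.3f}")
print(f"variance of imputed values, without noise: {predicted.var():.4f}")
print(f"variance of imputed values, with noise:    {(predicted + noise).var():.4f}")
```

The added noise restores some of the natural scatter that plain regression imputation removes, at the price of results that change from one random draw to the next.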
Multiple Imputation by Chained Equations (MICE):
This method takes the distribution of the data into account and repeats the imputation several times. Several imputed datasets are created, and each one is analysed individually to obtain parameter estimates, which are then combined. In comparison with the other methods, this method can come closer to the true values.
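A rough sketch of the multiple-imputation workflow is shown below. It relies on scikit-learn's IterativeImputer, which is an assumption of this example: the class is still marked experimental, is inspired by MICE, and returns a single completed dataset per call, so several datasets are obtained here by running it with sample_posterior=True and different seeds. The synthetic data, the number of imputations m and the choice of the column mean as the analysed parameter are all illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 300

# Synthetic data: three correlated numeric variables.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
x3 = -0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2, x3])

# Remove about 15% of all entries at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

# Create m completed datasets; each run draws imputations from a predictive distribution.
m = 5
completed = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed.append(imputer.fit_transform(X_missing))

# Analyse each dataset separately (here: the mean of the third variable) ...
estimates = np.array([dataset[:, 2].mean() for dataset in completed])
# ... then pool the estimates by averaging and record how much they disagree.
pooled_estimate = estimates.mean()
between_variance = estimates.var(ddof=1)

print("per-dataset estimates:", np.round(estimates, 3))
print(f"pooled estimate: {pooled_estimate:.3f}")
print(f"between-imputation variance: {between_variance:.5f}")
```

The spread between the per-dataset estimates reflects the uncertainty caused by the missing values, which a single imputation would hide.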
Several other methods for imputing missing values also exist. Whichever method is chosen, understanding the nature of the dataset matters much more, since imputation can never recover the exact values.