Imputation Methods in Data Preprocessing
Note: The entire article is available on the imputation methods page of our site.
Alright, let’s start.
Imputation is a technique for replacing (or imputing) missing data in a dataset with substitute values, so that as much of the dataset’s information as possible is retained. These techniques are used because simply removing incomplete rows is not always feasible: it can shrink the dataset considerably, which not only raises concerns about biasing the data but can also lead to incorrect analysis.
Let’s look at the different types of imputation commonly used in machine learning.
a. Frequent Category Imputation:
- We use this technique with categorical variables.
- Frequent category imputation — or mode imputation — consists of replacing all occurrences of missing values (NA) within a variable with the mode, or the most frequent value.
- You can use this method when the data are missing completely at random (MCAR) and no more than about 5% of the variable’s values are missing.
Here’s an example:
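A minimal sketch with pandas (the column name and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy categorical column with a few missing values (illustrative only)
df = pd.DataFrame({"color": ["red", "blue", "red", np.nan, "green", np.nan, "red"]})

# Replace missing values with the most frequent category (the mode)
most_frequent = df["color"].mode()[0]
df["color"] = df["color"].fillna(most_frequent)

print(df["color"].tolist())  # ['red', 'blue', 'red', 'red', 'green', 'red', 'red']
```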
b. Mean or Median Imputation:
- This method is suitable for numerical variables.
- Mean or median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.
- Example:
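A quick sketch with pandas, again with a made-up column and values:

```python
import numpy as np
import pandas as pd

# Toy numerical column with missing values (illustrative only)
df = pd.DataFrame({"income": [40_000, 52_000, np.nan, 61_000, np.nan, 45_000]})

# Mean imputation
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Median imputation (more robust to outliers and skewed distributions)
df["income_median"] = df["income"].fillna(df["income"].median())

print(df)
```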
c. Multiple Imputation: MICE
Multiple Imputation (MI) is a way to deal with nonresponse bias, the missing research data that occurs when people fail to respond to a survey. The technique allows you to analyze incomplete data with regular data analysis tools like a t-test or ANOVA. Impute means to “fill in.”
With single imputation methods, the mean, median, or some other statistic is used to fill in the missing values. However, using a single value carries a level of uncertainty about which value should be imputed.
Multiple imputation narrows this uncertainty by calculating several different plausible values (“imputations”). Several versions of the same dataset are created, analyzed, and then combined to produce the “best” estimates.
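One way to see this idea in code is scikit-learn’s IterativeImputer, run several times while sampling from the posterior so that each run yields a different plausible completion of the data. This is only a sketch of the idea; a full multiple-imputation analysis would fit the model of interest on each completed dataset and pool the results (e.g., with Rubin’s rules). The data here are made up.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data (age, income) with missing entries; values are made up
X = np.array([[25, 40_000],
              [32, np.nan],
              [np.nan, 61_000],
              [45, 52_000],
              [51, 58_000]])

# Create several imputed versions of the same dataset
imputed_versions = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_versions.append(imputer.fit_transform(X))

# Each version contains a different plausible value for the missing income
print([round(v[1, 1]) for v in imputed_versions])
```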
MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR), which means that the probability that a value is missing depends only on observed values and not on unobserved values.
Multiple imputation by chained equations (MICE) has emerged as one principled method of addressing missing data. The chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary) as well as complexities such as bounds.
The chained equation process can be broken down into the following general steps:
Step 0: The initial dataset is given below, where missing values are marked as N.A.
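The original table isn’t reproduced here, so the sketch below builds a hypothetical stand-in with the same structure (age, income, and gender columns, with missing values marked as NaN); all values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Step 0 dataset (values are made up)
df = pd.DataFrame({
    "age":    [25,     np.nan, 38,     51,     np.nan, 29],
    "income": [40_000, 52_000, np.nan, 61_000, 45_000, np.nan],
    "gender": ["F",    "M",    "F",    np.nan, "M",    "F"],
})
print(df)
```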
Step 1: A simple imputation, such as imputing the mean, is performed for every missing value in the dataset. Each missing value is temporarily set to the mean (or, for a categorical variable such as gender, the mode) of the observed values in its column: age, income, or gender.
Step 2: Start with the variable that has the fewest missing values, here age. The placeholder values imputed for age in Step 1 are set back to missing (N.A.).
Step 3: “age” becomes the dependent variable in a regression model, and all the other variables are independent variables. A Bayesian linear regression of age on income and gender is run using all cases where age was observed.
Step 4: The missing values of “age” are then replaced with predictions (imputations) from that regression equation. At this point, age no longer has any missing values.
Step 5: Moving on to the variable with the next fewest missing values, steps 2–4 are repeated for each variable that has missing data. In this example, income comes next: its originally missing values are set back to missing (N.A.).
Step 6: A linear regression of income on age and gender is run using all cases where income was observed, and imputations (predictions) from that regression equation fill in the missing income values.
The previous steps are then repeated for the variable gender. The originally missing values of gender are set back to missing, a logistic regression of gender on age and income is run using all cases where gender was observed, and predictions from that logistic regression model are used to impute the missing gender values.
Cycling through each of the variables in this way constitutes one iteration or “cycle.” At the end of one cycle, all of the missing values have been replaced with predictions from regressions that reflect the relationships observed in the data.
To summarize: every missing value is first filled with a simple placeholder, and then the procedure cycles through the variables one at a time, setting each variable’s originally missing values back to missing, regressing it on the other variables, and replacing those missing values with the model’s predictions. The cycle is repeated several times until the imputed values stabilize.
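As a rough, self-contained sketch of that loop, the code below runs a few chained-equation cycles on the made-up dataset from Step 0, with gender encoded as 0/1 so a logistic regression can be used. In practice you would reach for an existing implementation such as scikit-learn’s IterativeImputer or statsmodels’ MICE rather than rolling your own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical stand-in for the Step 0 dataset (values are made up)
df = pd.DataFrame({
    "age":    [25,     np.nan, 38,     51,     np.nan, 29],
    "income": [40_000, 52_000, np.nan, 61_000, 45_000, np.nan],
    "gender": [0,      1,      0,      np.nan, 1,      0],   # gender encoded as 0/1
})
missing_mask = df.isna()

# Step 1: simple placeholder imputation (mean for numeric, mode for gender)
filled = df.copy()
filled["age"] = filled["age"].fillna(df["age"].mean())
filled["income"] = filled["income"].fillna(df["income"].mean())
filled["gender"] = filled["gender"].fillna(df["gender"].mode()[0])

# Steps 2-6, repeated for a few cycles
for _ in range(5):
    for col in ["age", "income", "gender"]:
        if not missing_mask[col].any():
            continue
        others = [c for c in filled.columns if c != col]
        observed = ~missing_mask[col]
        # Regress this column on the other (currently filled-in) variables,
        # using only the rows where it was originally observed
        model = LogisticRegression(max_iter=1000) if col == "gender" else LinearRegression()
        model.fit(filled.loc[observed, others], df.loc[observed, col])
        # Overwrite the originally missing entries with the model's predictions
        filled.loc[missing_mask[col], col] = model.predict(filled.loc[missing_mask[col], others])

print(filled)
```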
If you liked this article, you’ll definitely like the other articles on important data science topics written by our team at ml-concepts.com.