Using scikit-learn’s Iterative Imputer

Krish · Published in Analytics Vidhya · Feb 23, 2020 · 4 min read

I have quite a fascination with data patterns, and the outcome of any analysis depends on the robustness of the dataset. But a truly robust dataset is nearly impossible to achieve these days, for many reasons: sensor trouble, survey respondent bias, missing values, incorrect data entry or logging, and so forth.

One of my pet peeves is missing values in a dataset. With carefully laid-out assumptions about a given dataset, I believe certain algorithms can fill in missing values intelligently.

The book “Statistical Analysis with Missing Data” by Little and Rubin discusses several techniques in detail. I recommend checking it out, but overall they touch upon three main ideas for dealing with missing values:

  1. Drop rows with missing values. This is acceptable as long as they make up a very small fraction of the available records.
  2. Fill all missing values in a field/column with a statistic derived from the other values in that column (a quick pandas sketch of the first two ideas follows this list). I discussed some of the shortcomings over here: https://medium.com/swlh/practical-technique-for-filling-missing-values-in-a-data-set-f8d541492b1f
  3. Impute missing values through regression.
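To make the first two ideas concrete, here is a minimal pandas sketch; the tiny frame is invented purely for illustration:

import numpy as np
import pandas as pd

# A toy frame with a couple of gaps, purely for illustration
df_toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

df_dropped = df_toy.dropna()                    # idea 1: drop rows with any missing value
df_mean_filled = df_toy.fillna(df_toy.mean())   # idea 2: fill gaps with each column's mean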

And of course, modern software has made it simple.

I learnt about sklearn’s iterative imputer and found it quite impressive. You can learn about the implementation of sklearn’s experimental IterativeImputer over here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

This class implements some of the algorithms discussed in the book and can be quite useful. I wanted to test it as outlined below:

  1. Fetch a full dataset, assume available data is clean.
  2. Pick some columns at random and knock off some data points at random.
  3. Run the iterative imputer on the defiled dataset to fill in the missing values.
  4. Compare the imputed dataset to the original dataset to assess the iterative imputer’s performance.

I used the Electric Motor Temperature dataset from Kaggle to demonstrate this. It is an all-numerical dataset with around 1 million rows. I loaded the dataset and dropped the column ‘Profile ID’, which is just a record key. The remaining columns are shown below:
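The loading step looked roughly like this; the CSV file name and the exact spelling of the key column are assumptions, so adjust them to match your copy of the Kaggle dataset:

import pandas as pd

# File name and key column name are assumptions; check your download
df = pd.read_csv('pmsm_temperature_data.csv')
df = df.drop(columns=['profile_id'])
df.info()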

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998070 entries, 0 to 998069
Data columns (total 12 columns):
ambient 998070 non-null float64
coolant 998070 non-null float64
u_d 998070 non-null float64
u_q 998070 non-null float64
motor_speed 998070 non-null float64
torque 998070 non-null float64
i_d 998070 non-null float64
i_q 998070 non-null float64
pm 998070 non-null float64
stator_yoke 998070 non-null float64
stator_tooth 998070 non-null float64
stator_winding 998070 non-null float64
dtypes: float64(12)
memory usage: 91.4 MB
[Figure: distribution of each column in the dataset]
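A plot like that can be produced with something along these lines; the bin count and figure size are arbitrary choices of mine, not the original settings:

import matplotlib.pyplot as plt

# Histogram of every column; bin count and figure size are arbitrary
df.hist(bins=50, figsize=(14, 10))
plt.tight_layout()
plt.show()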

I wrote the function below to choose 40% of the columns in the dataset at random and knock off anywhere between 15% and 50% of the values in each of those columns.

import numpy as np

def defile_dataset(df, col_selection_rate=0.40):
    # Pick a random subset of columns (without repeats)
    cols = np.random.choice(df.columns, int(len(df.columns) * col_selection_rate), replace=False)
    df_cp = df.copy()
    for col in cols:
        # Knock out between 15% and 50% of this column's values at random positions
        data_drop_rate = np.random.choice(np.arange(0.15, 0.5, 0.02), 1)[0]
        drop_ind = np.random.choice(np.arange(len(df_cp[col])),
                                    size=int(len(df_cp[col]) * data_drop_rate),
                                    replace=False)
        # .loc avoids the chained-assignment pitfall of df_cp[col].iloc[drop_ind] = np.nan
        df_cp.loc[drop_ind, col] = np.nan
    return df_cp, cols
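Calling it looks roughly like this; the variable names df_miss and defiled_cols are mine:

df_miss, defiled_cols = defile_dataset(df)
df_miss.info()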

Result after calling the above function on the data frame:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998070 entries, 0 to 998069
Data columns (total 12 columns):
ambient 509016 non-null float64
coolant 998070 non-null float64
u_d 998070 non-null float64
u_q 998070 non-null float64
motor_speed 998070 non-null float64
torque 768514 non-null float64
i_d 998070 non-null float64
i_q 998070 non-null float64
pm 628785 non-null float64
stator_yoke 998070 non-null float64
stator_tooth 998070 non-null float64
stator_winding 628785 non-null float64
dtypes: float64(12)
memory usage: 91.4 MB

Note that IterativeImputer is still experimental in sklearn, so you have to explicitly enable it before you can import the class:

from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer

Once that was set up, I used the function below with mostly default arguments for IterativeImputer, but with the maximum number of iterations (max_iter) set to 100 to give the algorithm enough room to converge.

import pandas as pd

def impute_once(df_orig):
    # Knock out values at random, then impute them back for comparison
    df_miss, cols = defile_dataset(df_orig)
    df_orig_slice = df_orig[cols]
    imputer = IterativeImputer(max_iter=100)
    imp_arr = imputer.fit_transform(df_miss.copy())
    # Keep only the defiled columns from the imputed array, in their original order
    df_imp = pd.DataFrame(imp_arr[:, [df_orig.columns.get_loc(c) for c in cols]], columns=cols)
    return df_orig_slice, df_miss[cols], df_imp, imputer.n_iter_

Then I called the function and checked a few things:

df_og, df_def, df_imp, n_iter = impute_once(df)
print(df_og.columns)
print(df_imp.columns)
print(n_iter)
Index(['i_q', 'stator_winding', 'u_q', 'stator_tooth'], dtype='object')
Index(['i_q', 'stator_winding', 'u_q', 'stator_tooth'], dtype='object')
23

Seems like it converged in 23 iterations.

To compare the iterative imputer with the most basic technique of filling every missing value with a single statistic such as the mean (SimpleImputer in sklearn, or fillna in pandas), I filled the missing values with SimpleImputer on a copy of the defiled columns and evaluated the mean squared error against the original values in both cases.
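The construction of the mean-filled baseline isn’t shown above, but assuming df_simimp is built with SimpleImputer on the same defiled columns, it would look roughly like this:

from sklearn.impute import SimpleImputer

# Hypothetical reconstruction of the mean-filled baseline used below
simple_imp = SimpleImputer(strategy='mean')
df_simimp = pd.DataFrame(simple_imp.fit_transform(df_def), columns=df_def.columns)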

Simple Imputer:

from sklearn.metrics import mean_squared_error

for i in range(len(df_og.columns)):
    print("Simple Imputer: MSE for {} is {:.4f}.".format(
        df_og.columns[i], mean_squared_error(df_og[df_og.columns[i]], df_simimp[df_simimp.columns[i]])))
Simple Imputer: MSE for i_q is 0.3498.
Simple Imputer: MSE for stator_winding is 0.1499.
Simple Imputer: MSE for u_q is 0.3909.
Simple Imputer: MSE for stator_tooth is 0.3498.

Iterative Imputer:

for i in range(len(df_og.columns)):
    print("Iterative Imputer: MSE for {} is {:.4f}.".format(
        df_og.columns[i], mean_squared_error(df_og[df_og.columns[i]], df_imp[df_imp.columns[i]])))
Iterative Imputer: MSE for i_q is 0.0016.
Iterative Imputer: MSE for stator_winding is 0.0023.
Iterative Imputer: MSE for u_q is 0.0724.
Iterative Imputer: MSE for stator_tooth is 0.0009.

Conclusion: the MSE is far lower with the iterative imputer. What I will experiment with next, and share, is running the above multiple times to see the distribution of the MSE, and tuning the other arguments of the IterativeImputer class, especially the estimator; the default is BayesianRidge().
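Swapping the estimator is a one-line change; the regressor and its settings below are just an illustration, not a recommendation:

from sklearn.ensemble import ExtraTreesRegressor

# Any sklearn regressor can replace the default BayesianRidge(); these settings are arbitrary
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0), max_iter=100)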

I encourage you to try this out and share your thoughts in comments!
