NoNa: Missing Data Imputation Algorithm

My first open source product.

Timur Abdualimov
Jan 9, 2023 · 5 min read

GitHub — AbdualimovTP/nona: library for filling in missing values using artificial intelligence methods

In real-world datasets, missing values are a problem for any further processing, so imputing them well is valuable. Unfortunately, standard “lazy” methods, such as simply filling with the column median or mean, do not always work as expected.

In 2021, I had the idea of an algorithm that uses machine learning to predict the missing values in each column that has gaps. I first sketched the idea on paper.

(Algorithm schematic: author’s image)

The essence of the algorithm is to fill the gaps with predictions from machine learning models. We loop through the columns; when a column has missing values, we stop and make it the target. The columns preceding it (whose own gaps have already been filled on earlier iterations) become the features: rows where the target is present form X_train, and rows where it is missing form X_test. y_train is taken from the non-missing values of the target column. We fit a model of our choice, for example ridge regression, on X_train and y_train, predict on X_test, and write the predictions into the missing cells. We repeat this for every column with gaps; a minimal sketch follows below.
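
To make the loop concrete, here is a minimal sketch of the column-wise scheme. It is my own illustration rather than the library’s actual source: it assumes a purely numeric pandas DataFrame, uses ridge regression for every column, and skips the first column for simplicity (it has no preceding features); any residual gaps in the feature columns are mean-filled as a safeguard.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def impute_columnwise(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: predict each gappy column from the columns before it."""
    df = df.copy()
    cols = list(df.columns)
    for idx in range(1, len(cols)):
        col = cols[idx]
        mask = df[col].isna()
        if not mask.any():
            continue                          # column is complete, skip it
        features = df[cols[:idx]]             # preceding columns as features
        X = features.fillna(features.mean())  # safeguard for residual gaps
        y = df[col]
        model = Ridge()
        model.fit(X[~mask], y[~mask])         # train where the target is known
        df.loc[mask, col] = model.predict(X[mask])  # fill the gaps
    return df

In the library itself, a classifier (RandomForestClassifier by default) is used instead when a column poses a classification problem, as the quick start below shows.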

I regularly used this algorithm in practice to fill in missing values.

In 2023, I decided to write a Python library based on it and compare it with other gap-filling methods.

Main Features

  • Simple and fast filling of missing values.
  • Customizable choice of machine learning methods.
  • High prediction accuracy.

Where to get it?

The source code is currently hosted on GitHub at: GitHub — AbdualimovTP/nona: library for filling in missing values using artificial intelligence methods

Binary installers for the latest released version are available at the Python Package Index (PyPI):

# PyPI
pip install nona

Dependencies

The examples in this article rely on the standard Python data stack: numpy, pandas, and scikit-learn.

Quick start

Out of the box, nona uses ridge regression for columns that pose a regression problem, and RandomForestClassifier for columns with missing values that pose a classification problem.

# load library
from nona.nona import nona

# prepare your data with NaN for ML
# (only numerical values in the dataset)

# fill the missing values
nona(YOUR_DATA)
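
For example, on a small toy frame (my own example; as the benchmark code below shows, nona fills the frame in place):

import numpy as np
import pandas as pd
from nona.nona import nona

df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 29.0],
    "height": [1.78, np.nan, 1.69, 1.82, 1.75],
    "weight": [70.0, 81.0, 64.0, np.nan, 73.0],
})

nona(df)                      # gaps are filled in place
print(df.isna().sum().sum())  # -> 0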

Accuracy improvement

You can pass other machine learning methods into the function; any estimator that implements the usual fit and predict interface should work.

Parameters:

  • data: prepared dataset
  • algreg: Regression algorithm to predict missing values in columns
  • algclass: Classification algorithm to predict missing values in columns

# load libraries
from nona.nona import nona
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestClassifier

# prepare your data with NaN for ML
# (only numerical values in the dataset)

# fill the missing values
nona(data=YOUR_DATA,
     algreg=make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.1)),
     algclass=RandomForestClassifier(max_depth=2, random_state=0))
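
Any estimator exposing the scikit-learn fit/predict interface can be dropped in the same way; for instance (my own example, not from the library’s docs), gradient boosting for both tasks:

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from nona.nona import nona

nona(data=YOUR_DATA,
     algreg=GradientBoostingRegressor(random_state=0),
     algclass=GradientBoostingClassifier(random_state=0))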

Comparison of accuracy with other gap filling methods

Compared methods:

Baseline — mean imputation: each column’s gaps are filled with the column mean.

KNN — imputation using k-Nearest Neighbors. Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

MICE — a more sophisticated approach based on the IterativeImputer class, which models each feature with missing values as a function of the other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, one feature column is designated as the output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for the known y, then used to predict the missing values of y. This is done for each feature in turn, and the whole cycle is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
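
Note that IterativeImputer is still experimental in scikit-learn, so it has to be enabled explicitly before it can be imported:

# required while IterativeImputer remains experimental in scikit-learn
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer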

MISSFOREST — imputes missing values using Random Forests in an iterative fashion.

NONA — my algorithm for “column-wise” filling in the gaps using various machine learning techniques.

For the comparison, I used the Framingham Heart Study dataset available on Kaggle.

On this dataset, we simulate gaps at rates of 10%, 20%, 30%, 40%, 50%, 70%, and 90% missing values, fill them with each of the described methods separately, and compare the results against the true values. The metric is the root mean squared error (RMSE).

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from missingpy import MissForest
from nona.nona import nona

# data: the Framingham dataset, loaded beforehand as a fully observed numeric DataFrame
shares = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]
dataFrameRMSE = pd.DataFrame(index=['Baseline_MEAN', 'KNN', 'MICE', 'MISSFOREST', 'NONA'],
                             columns=[f'{int(i*100)}%' for i in shares])

for i in shares:
    # random matrix the size of the dataset: NaN with probability i, 0 otherwise
    randomMatrixNA = np.random.choice([0, np.nan], (data.shape[0], data.shape[1]), p=[1 - i, i])
    # adding the matrix turns the chosen cells into missing values
    dataWithNA = data + randomMatrixNA

    # Baseline: fill with the column mean
    Baseline_1_mean = dataWithNA.fillna(dataWithNA.mean())
    rmse = np.round(mean_squared_error(data, Baseline_1_mean, squared=False), 2)
    dataFrameRMSE.loc['Baseline_MEAN', f'{int(i*100)}%'] = rmse
    print(f'Baseline_MEAN, {i*100}, RMSE:', rmse)

    # KNN
    KNN = KNNImputer(n_neighbors=15).fit_transform(dataWithNA)
    rmse = np.round(mean_squared_error(data, KNN, squared=False), 2)
    dataFrameRMSE.loc['KNN', f'{int(i*100)}%'] = rmse
    print(f'KNN, {i*100}, RMSE:', rmse)

    # MICE
    MICE = IterativeImputer(max_iter=10, random_state=0).fit_transform(dataWithNA)
    rmse = np.round(mean_squared_error(data, MICE, squared=False), 2)
    dataFrameRMSE.loc['MICE', f'{int(i*100)}%'] = rmse
    print(f'MICE, {i*100}, RMSE:', rmse)

    # MISSFOREST
    MISSFOREST = MissForest(random_state=0, verbose=0).fit_transform(dataWithNA)
    rmse = np.round(mean_squared_error(data, MISSFOREST, squared=False), 2)
    dataFrameRMSE.loc['MISSFOREST', f'{int(i*100)}%'] = rmse
    print(f'MISSFOREST, {i*100}, RMSE:', rmse)

    # NONA (fills the frame in place, so work on a copy)
    dataWithNA_NonaBase = dataWithNA.copy(deep=True)
    nona(dataWithNA_NonaBase)
    rmse = np.round(mean_squared_error(data, dataWithNA_NonaBase, squared=False), 2)
    dataFrameRMSE.loc['NONA', f'{int(i*100)}%'] = rmse
    print(f'NONA, {i*100}, RMSE:', rmse)

Results

(Results: RMSE by method and share of missing values; author’s image)

Not bad for an algorithm working out of the box.

At 30%, 40%, 50%, 70%, and 90% simulated gaps, the NONA algorithm showed the best RMSE on this dataset. At 10% and 20% it came second, with MICE in first place.

In the future, I plan to check the imputation accuracy on other datasets. I also see opportunities to improve the quality of the predictions, which I plan to implement in the next versions of the library.
