NoNa: Missing Data Imputation Algorithm
My first open-source project.
In real-world datasets, missing values are a problem for downstream processing, so substituting or filling them in is valuable. Unfortunately, the standard "lazy" methods, such as simply filling with the column median or mean, do not always work well.
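For reference, the "lazy" baseline in pandas is a one-liner (the toy frame and column names below are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with gaps
df = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0],
                   "income": [1000.0, np.nan, 3000.0, 4000.0]})

# "lazy" baseline: replace every gap with its column mean,
# ignoring any relationship between the columns
df_mean = df.fillna(df.mean())
```

Every gap gets the same per-column constant, which is exactly why this baseline degrades when columns are correlated.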
In 2021, I had the idea of building an algorithm that uses machine learning to predict the values for each column with gaps. I first sketched the idea out on paper.
The essence of the algorithm is to fill the gaps with predictions from machine learning models. We loop over all the columns; when a column contains missing values, we stop and make it the target. The other columns are split into X_train and X_test, where X_test corresponds to the rows with missing values in the target column. y_train is taken from the rows of the target column that have no gaps. We fit the model of our choice, for example Ridge regression, on X_train and y_train, predict on X_test, and fill the missing values in the column with the predictions. We repeat this for every column with gaps.
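The loop above can be sketched in scikit-learn terms. This is my own minimal illustration of the idea, not the nona library's actual code; I assume the other columns (crudely mean-filled as a stopgap) serve as features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def impute_columnwise(df, make_model=Ridge):
    # sketch of the column-wise idea, not the real nona implementation
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        if not mask.any():
            continue  # nothing to fill in this column
        # features: all other columns, mean-filled as a crude stopgap
        X = out.drop(columns=[col])
        X = X.fillna(X.mean())
        model = make_model()
        # X_train / y_train: rows where the target column is observed
        model.fit(X[~mask], out.loc[~mask, col])
        # X_test: rows where the target column is missing
        out.loc[mask, col] = model.predict(X[mask])
    return out
```

Any estimator with a fit/predict interface can be swapped in via `make_model`, which is the same flexibility the library exposes through its parameters.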
I regularly used this algorithm in practice to fill in missing values.
In 2023, I decided to write a Python library based on it and compare it with other gap filling methods.
Main Features
- Simple and fast filling of missing values.
- Customization of used machine learning methods.
- High prediction accuracy.
Where to get it?
The source code is currently hosted on GitHub at https://github.com/AbdualimovTP/nona.
Binary installers for the latest released version are available at the Python Package Index (PyPI)
# PyPI
pip install nona
Dependencies
- NumPy — Adds support for large, multi-dimensional arrays, matrices and high-level mathematical functions to operate on these arrays
- Pandas — data structures and data-analysis tools for Python
- Scikit-Learn — machine learning in Python
- tqdm — a fast, extensible progress bar for Python and CLI
Quick start
Out of the box, NoNa uses Ridge regression for columns posed as regression problems and RandomForestClassifier for columns posed as classification problems.
# load library
from nona.nona import nona
# prepare your data: only numerical values, with gaps as NaN
# fill the missing values
nona(YOUR_DATA)
Accuracy improvement
You can pass other machine learning methods into the function; any estimator that implements the standard fit and predict interface will work.
Parameters:
- data: prepared dataset
- algreg: Regression algorithm to predict missing values in columns
- algclass: Classification algorithm to predict missing values in columns
# load libraries
from nona.nona import nona
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestClassifier
# prepare your data: only numerical values, with gaps as NaN
# fill the missing values
nona(data=YOUR_DATA, algreg=make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.1)), algclass=RandomForestClassifier(max_depth=2, random_state=0))
Comparison of accuracy with other gap filling methods
Compared methods:
Baseline — filling gaps with the column mean.
KNN — Imputation for completing missing values using k-Nearest Neighbors. Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
MICE — A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
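As a side note, IterativeImputer is still marked experimental in scikit-learn and requires an explicit enabling import. A minimal sketch with made-up numbers:

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# toy matrix: the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 10.0]])

imp = IterativeImputer(max_iter=10, random_state=0)
# the gap is estimated from its relationship with the other column
X_filled = imp.fit_transform(X)
```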
MISSFOREST — imputes missing values using Random Forests in an iterative fashion.
NONA — my algorithm for column-wise filling of the gaps using various machine learning techniques.
For the comparison, I took the Framingham Heart Study dataset available on Kaggle.
On this dataset, we simulate gaps at 10%, 20%, 30%, 40%, 50%, 70%, and 90% missing values, fill them in with each method separately, and compare the results against the true values using the root-mean-square error (RMSE).
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_squared_error
from missingpy import MissForest
from nona.nona import nona

for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]:
    # create a random mask the size of the dataset: 0 keeps a value, NaN makes a gap
    randomMatrixNA = np.random.choice([0, np.nan], (data.shape[0], data.shape[1]), p=[1 - i, i])
    # adding the mask leaves values intact where it is 0 and produces NaN where it is NaN
    dataWithNA = data + randomMatrixNA

    # Baseline: fill with the column mean
    Baseline_1_mean = dataWithNA.fillna(dataWithNA.mean())
    dataFrameRMSE.loc['Baseline_MEAN', f'{int(i*100)}%'] = np.round(mean_squared_error(data, Baseline_1_mean, squared=False), 2)
    print(f'Baseline_MEAN, {i*100}%, RMSE:', dataFrameRMSE.loc['Baseline_MEAN', f'{int(i*100)}%'])

    # KNN
    imputer = KNNImputer(n_neighbors=15)
    KNN = imputer.fit_transform(dataWithNA)
    dataFrameRMSE.loc['KNN', f'{int(i*100)}%'] = np.round(mean_squared_error(data, KNN, squared=False), 2)
    print(f'KNN, {i*100}%, RMSE:', dataFrameRMSE.loc['KNN', f'{int(i*100)}%'])

    # MICE
    mice = IterativeImputer(max_iter=10, random_state=0)
    MICE = mice.fit_transform(dataWithNA)
    dataFrameRMSE.loc['MICE', f'{int(i*100)}%'] = np.round(mean_squared_error(data, MICE, squared=False), 2)
    print(f'MICE, {i*100}%, RMSE:', dataFrameRMSE.loc['MICE', f'{int(i*100)}%'])

    # MISSFOREST
    missforest = MissForest(random_state=0, verbose=0)
    MISSFOREST = missforest.fit_transform(dataWithNA)
    dataFrameRMSE.loc['MISSFOREST', f'{int(i*100)}%'] = np.round(mean_squared_error(data, MISSFOREST, squared=False), 2)
    print(f'MISSFOREST, {i*100}%, RMSE:', dataFrameRMSE.loc['MISSFOREST', f'{int(i*100)}%'])

    # NONA (fills in place, so work on a deep copy)
    dataWithNA_NonaBase = dataWithNA.copy(deep=True)
    nona(dataWithNA_NonaBase)
    dataFrameRMSE.loc['NONA', f'{int(i*100)}%'] = np.round(mean_squared_error(data, dataWithNA_NonaBase, squared=False), 2)
    print(f'NONA, {i*100}%, RMSE:', dataFrameRMSE.loc['NONA', f'{int(i*100)}%'])
Results
These are solid results for an algorithm working out of the box.
At 30%, 40%, 50%, 70%, and 90% simulated gaps, the NONA algorithm showed the best RMSE on this dataset. At 10% and 20% it took second place, with MICE first.
In the future, I plan to check the imputation accuracy on other datasets. I also see opportunities to improve prediction quality, which I plan to implement in the next versions of the library.