Imputing Missing Values Smartly with DataWig
Data, and especially high-quality data, plays a key role (alongside other factors) in the success of a machine learning model. Unfortunately, real-world datasets are far from perfect and are often riddled with noise and missing values. Missing values, in particular, can be problematic. Dropping them altogether leaves you with a smaller dataset, which is hardly ideal when your dataset is small to begin with. And even if you try to fill in those annoying ‘NaNs’, the imputed values might not be representative of the data that is actually missing.
This is where DataWig comes in. As you might have guessed from the name, it is a wig for the bald patches in your data. And look, while there are many methods to impute your data, DataWig belongs right at the top.
We’ll first see what DataWig is and how it works, and then compare it against some of the most popular imputation methods to see how well it performs.
DataWig was developed by AWS Labs around three years back. It tries to understand your data and uses that learning to do the imputation. So if you have three columns, ‘X’, ‘Y’ and ‘Z’, and want to impute ‘Z’, DataWig learns from the contents of the other two columns to do its magic.
DataWig supports imputation of both categorical and numerical columns. A lot of imputation approaches cater only to numerical data, while those that handle categorical data often aren’t accurate or scalable enough for production. This makes DataWig especially useful when you want to impute categorical columns.
Let’s dive a bit deeper into how DataWig works under the hood. This is based on its research paper, so if you don’t want to get into the technicalities, feel free to skip to the next section on how to use it. Though I’d recommend you read this part; it’s super interesting, in my opinion.
Imagine a table with several columns, where the column ‘Color’ is to be imputed. DataWig first determines the type of each column. Each column is then converted into a numeric representation (so that the machine can understand it): categorical columns are one-hot encoded, while sequential (text) columns are converted into a sequence based on the length and the characters in the string. Next comes the most important step: featurizing. One-hot encoded data is passed through an embedding layer, while sequential data is either passed through an LSTM layer or n-gram hashed. Finally, all the features are merged and passed through a logistic layer (since ‘Color’ is categorical) to produce the imputation.
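To make that pipeline a little more concrete, here is a deliberately simplified, self-contained analogue built with scikit-learn. This is purely illustrative and not DataWig’s actual implementation; the tiny dataframe and column names are made up. A categorical column is one-hot encoded, a text column is character n-gram hashed, the features are merged, and a logistic model predicts the column being imputed.

# Simplified analogue of the pipeline described above (illustrative only,
# not DataWig's code): one-hot encode a categorical column, n-gram hash a
# text column, merge the features and fit a logistic model on 'color'.
import pandas as pd
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'description': ['red cotton shirt', 'blue denim jeans', 'red wool scarf', 'blue cotton shirt'],
    'size': ['M', 'L', 'S', 'M'],
    'color': ['red', 'blue', 'red', 'blue'],  # the column we want to impute
})

# Numeric representation: one-hot for the categorical column,
# character n-gram hashing for the text column
onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(df[['size']])
hashed = HashingVectorizer(analyzer='char_wb', ngram_range=(2, 4), n_features=256).fit_transform(df['description'])

# Merge the features and pass them through a logistic "layer"
features = hstack([onehot, hashed])
model = LogisticRegression().fit(features, df['color'])
print(model.predict(features))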
Well, that’s the theory; let’s see how you can use DataWig. It can be installed as a Python package:
pip install datawig
Note: when I last tried, there was an installation issue with the pinned dependency versions. Have a look at this link and replace ‘==’ with ‘>=’ for the dependencies.
Using DataWig is pretty straightforward as well. It provides two types of imputers: SimpleImputer and Imputer. Use SimpleImputer when you don’t care how DataWig does the work underneath, and use Imputer when you want to customize how it works to your requirements; for example, you can choose whether to use an LSTM or n-gram hashing for featurizing. Both imputers come with easy-to-use fit() and predict() methods (similar to sklearn). Below we’ll do some coding where you’ll get to know these functionalities better, but if you want to become a DataWig geek, refer to the documentation.
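As a taste of the customization, here is a rough sketch of what a configured Imputer could look like, following the pattern in DataWig’s documentation. The column names (‘description’, ‘size’, ‘color’) and the train_df/test_df dataframes are hypothetical, and you should double-check the exact encoder and featurizer class names against the docs.

# Sketch of a customized Imputer, following the pattern in DataWig's docs
# (column names and dataframes are hypothetical; verify class names in the docs)
from datawig import Imputer
from datawig.column_encoders import CategoricalEncoder, SequentialEncoder
from datawig.mxnet_input_symbols import EmbeddingFeaturizer, LSTMFeaturizer

imputer = Imputer(
    data_encoders=[CategoricalEncoder('size'), SequentialEncoder('description')],
    data_featurizers=[EmbeddingFeaturizer('size'), LSTMFeaturizer('description')],
    label_encoders=[CategoricalEncoder('color')],  # the column to impute
    output_path='imputer_model'  # where the trained model is saved
)

imputer.fit(train_df=train_df)          # train_df: rows where 'color' is present
predictions = imputer.predict(test_df)  # test_df: rows where 'color' is missing

If I remember the docs correctly, to use n-gram hashing instead of the LSTM you pair a BowEncoder with a BowFeaturizer for the text column.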
Practical
Let’s begin with the example. Here we will do numerical imputation using DataWig and compare it to other popular approaches. We first load the flights dataset, which contains the number of airline passengers (the passengers field) for each month of each year.
import seaborn as sns
import datawig

# Load the flights dataset and treat year and month as (string) categorical columns
flights = sns.load_dataset("flights")
flights['month'] = flights['month'].astype(str)
flights['year'] = flights['year'].astype(str)
flights.head()
Let’s randomly split the data so that we can hide, and then impute, the passengers values in the test split.
flights_train, flights_test = datawig.utils.random_split(flights)
Now it’s showtime. Let’s use DataWig to predict the values in flights_test.
imputer = datawig.SimpleImputer(
    input_columns=['year', 'month'],
    output_column='passengers'
)

imputer.fit(train_df=flights_train)
imputed = imputer.predict(flights_test)
Here we are using the SimpleImputer: we give it the input and output columns, fit it on the training data, and predict the missing values in the test set.
I also compared two other popular approaches: mean and KNN imputation. The code for these two approaches can be found here on my GitHub. After applying both imputation strategies, I compared them with DataWig on the RMSE metric. Here are the results:
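For reference, here is a rough sketch of what the mean and KNN baselines could look like with scikit-learn. This reuses the flights and flights_test objects from above and is an assumed setup, not necessarily the exact code in the linked notebook.

# Assumed sketch of mean and KNN baselines with scikit-learn (not necessarily
# the code in the linked notebook); reuses flights / flights_test from above.
import numpy as np
from sklearn.impute import SimpleImputer as SklearnSimpleImputer, KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# Keep the true passenger counts for the test rows, then hide them
true_values = flights_test['passengers'].to_numpy(dtype=float)
masked = flights.copy()
masked['passengers'] = masked['passengers'].astype(float)
masked.loc[flights_test.index, 'passengers'] = np.nan

# Encode 'year' and 'month' as numbers so KNN can use them as features
masked[['year', 'month']] = OrdinalEncoder().fit_transform(masked[['year', 'month']])

col = list(masked.columns).index('passengers')
for name, imp in [('mean', SklearnSimpleImputer(strategy='mean')),
                  ('knn', KNNImputer(n_neighbors=5))]:
    filled = imp.fit_transform(masked)
    preds = filled[flights_test.index, col]
    rmse = np.sqrt(np.mean((true_values - preds) ** 2))
    print(f'{name} imputation RMSE: {rmse:.2f}')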
Here DataWig performs the best, closely followed by KNN.
The performance of imputers can also vary depending on whether the data is missing randomly or non-randomly. For example, mean imputation assumes data is missing at random, while arbitrary-value imputation assumes the converse. So I ran another comparison with data that is not missing at random: from our flights dataset, I masked the summer months of May and June and compared DataWig to KNN and an arbitrary-value imputer.
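As a sketch of how that masking could be set up (an assumed reconstruction, not the author’s exact code), you can simply put the May and June rows aside as the set to impute:

# Assumed sketch of the non-random masking: treat the May and June rows
# as the rows whose passenger counts are missing.
summer = flights['month'].isin(['May', 'Jun', 'June'])  # covers either month-label style
flights_train_nr = flights[~summer]
flights_test_nr = flights[summer]

imputer_nr = datawig.SimpleImputer(input_columns=['year', 'month'], output_column='passengers')
imputer_nr.fit(train_df=flights_train_nr)
imputed_nr = imputer_nr.predict(flights_test_nr)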
The results are similar to what we observed before, with DataWig coming out on top and KNN in close pursuit. FYI, just because arbitrary-value imputation didn’t perform well here doesn’t mean you should never try it. Its performance depends heavily on the arbitrary value you pick, so if you happen to pick the right value (which is somewhat down to luck) you can get good results.
I’ve also compared categorical imputation using DataWig to other approaches in this Jupyter Notebook, so do have a look if you’re interested. One disadvantage to remember with imputers like DataWig or KNN is that they can be slow when working with lots of categorical and sequential data. In such cases, simple methods like most-frequent or random-sampling imputation can be much faster (though perhaps not as accurate).
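For completeness, here is what those simpler strategies might look like on a tiny, hypothetical categorical column (illustrative only):

# Illustrative sketches of most-frequent and random-sampling imputation
# on a hypothetical categorical column with missing values
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', np.nan, 'red', np.nan, 'green'])

# Most-frequent: fill every NaN with the mode of the column
most_frequent = colors.fillna(colors.mode()[0])

# Random sampling: fill each NaN with a value drawn from the observed values
observed = colors.dropna()
random_sampled = colors.copy()
missing_idx = random_sampled[random_sampled.isna()].index
random_sampled.loc[missing_idx] = observed.sample(len(missing_idx), replace=True, random_state=0).values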
As an interesting exercise to further see how good DataWig is, why not impute your data using different techniques and then compare the performance of the downstream ML model you train for prediction? You might be amazed by the results.
Thanks for reading this article, and I hope you can handle missing values even better now.