The title of this post could be the question of Hamlet, the prince of Denmark, if he were a Data Scientist :-)
Back to serious matters: choosing between dense and sparse datasets for model training and prediction is one of the important decisions Data Scientists make when building Machine Learning solutions. Moreover, the need to make this choice early in a project exposes a fundamental limitation of the textbook machine learning project workflow.
In both academic courses and post-graduate trainings in Data Science and Machine Learning, a typical Data Science and Machine Learning project is broken down into the following phases:
- Exploratory Data Analysis (EDA)
- Data pre-processing and feature engineering
- Model training and forecasting
- (Optionally) Packaging the ML solution as a data product
- Preparing a research report with key discoveries and ML solution description
In reality, things are more complex than academia suggests. Data pre-processing / feature engineering and model training / forecasting are tightly interconnected:
- Some data pre-processing and feature engineering steps are consciously performed with specific predictive algorithms in mind
- Suboptimal forecasting accuracy of generally accurate ML algorithms on a particular project signals that data pre-processing and feature engineering need to change
Using (or not using) sparse datasets in model training is one of the tangible examples of this interplay. Sparse datasets can strongly improve the forecasting accuracy of some algorithms while drastically degrading it for others.
I have made a case study of the impact of sparse vs. dense dataset inputs on the forecasting accuracy of predictive models for the Kaggle Recruit Restaurant Visitor Forecasting competition (https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting).
Machine Learning Experiment Design
In the course of this case study, two variants of data pre-processing and feature engineering were implemented. Both variants generated the same set of numeric features and shared the same missing-data imputation and target variable transformation steps. They differed only in the way categorical features were encoded.
For each data pre-processing and feature engineering variant, the same set of machine learning algorithms was trained with 4-fold CV:
- xgboost (using Python)
- lightgbm (using Python)
- GBM (using Python)
- KNNRegressor (using Python)
- CNN (using the R wrapper for h2o and a local instance of the h2o server)
Each pre-processing and feature engineering variant is briefly described in the following subsections.
Variant 1 used label encoding for categorical features (producing dense datasets as a result). The code for the GBM model implementing this approach is presented below.
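A minimal sketch of Variant 1, assuming scikit-learn's GradientBoostingRegressor as the GBM implementation; the synthetic data frame and its column names are hypothetical stand-ins, loosely modelled on the competition's features:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
# Hypothetical stand-in for the competition data (column names illustrative)
df = pd.DataFrame({
    "air_genre_name": rng.choice(["Izakaya", "Cafe", "Bar"], n),
    "day_of_week": rng.choice(["Mon", "Tue", "Fri", "Sat"], n),
    "reserve_visitors": rng.integers(0, 30, n),
})
df["visitors"] = df["reserve_visitors"] * 2 + rng.integers(0, 5, n)

# Variant 1: label-encode each categorical column in place,
# yielding a compact dense numeric matrix (one column per feature)
for col in ["air_genre_name", "day_of_week"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns="visitors"), df["visitors"]
gbm = GradientBoostingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(gbm, X, y, cv=4, scoring="neg_mean_squared_error")
```

Label encoding keeps the feature matrix narrow, but it imposes an arbitrary ordinal relation on the category codes, which tree-based models can partially work around and distance-based models cannot.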
Variant 2 performed one-hot encoding of categorical features (producing sparser datasets as a result). The code for the GBM model implementing this approach is presented below.
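A minimal sketch of Variant 2 under the same assumptions (scikit-learn GBM, hypothetical stand-in data); the only change from Variant 1 is the encoding step:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
# Hypothetical stand-in for the competition data (column names illustrative)
df = pd.DataFrame({
    "air_genre_name": rng.choice(["Izakaya", "Cafe", "Bar"], n),
    "day_of_week": rng.choice(["Mon", "Tue", "Fri", "Sat"], n),
    "reserve_visitors": rng.integers(0, 30, n),
})
df["visitors"] = df["reserve_visitors"] * 2 + rng.integers(0, 5, n)

# Variant 2: one-hot encode categorical columns, producing a wider,
# mostly-zero matrix (pass sparse=True to get_dummies for a truly
# sparse pandas representation)
X = pd.get_dummies(df.drop(columns="visitors"),
                   columns=["air_genre_name", "day_of_week"])
y = df["visitors"]

gbm = GradientBoostingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(gbm, X, y, cv=4, scoring="neg_mean_squared_error")
```

Each category value becomes its own binary column, so the boosting trees can split on individual categories directly instead of on arbitrary label codes.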
Variant 2 of data pre-processing and feature engineering proved to improve the performance of gradient boosting-type algorithms as follows:
- xgboost — PubLB score 0.495 (vs. 0.498 with Variant 1 pre-processing and feature engineering)
- GBM — PubLB score 0.505 (vs. 0.509 with Variant 1 pre-processing and feature engineering)
- lightgbm — PubLB score 0.501 (vs. 0.502 with Variant 1 pre-processing and feature engineering)
The additional effects of using sparse datasets with gradient boosting machine-style algorithms were
- reduced complexity of models
- decreased time to train the best model (especially noticeable for lightgbm)
However, Variant 2 degraded the performance of the two other algorithms I tried:
- KNNRegressor dropped its performance from 0.524 (on my best attempt) to below 0.53
- A similar situation occurred with the CNN based on H2O
Dataset Sparsity and Performance of ML Solutions
Properly handling sparse data may affect the performance of your Python scikit-learn-based ML solutions. If your data sparsity ratio is high, you may switch from Pandas dataframes to sparse matrices (as suggested in http://scikit-learn.org/stable/modules/computational_performance.html).
You can calculate the sparsity ratio of your input dataset with the simple code fragment below.
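A minimal version, following the approach shown in the scikit-learn computational performance docs (the threshold in the comment comes from that page):

```python
import numpy as np
from scipy import sparse

def sparsity_ratio(X) -> float:
    """Fraction of zero-valued entries in a 2-D array."""
    return 1.0 - np.count_nonzero(X) / float(X.size)

X = np.array([[0, 0, 3],
              [0, 5, 0],
              [0, 0, 0]])
ratio = sparsity_ratio(X)   # 7 of 9 entries are zero
print(f"sparsity ratio: {ratio:.2f}")  # -> 0.78

# If the ratio is high (the scikit-learn docs suggest > ~90%),
# a CSR sparse matrix can save memory and speed up prediction
X_sparse = sparse.csr_matrix(X)
```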
The machine learning experiment performed for this case study illustrated that:
- Sparse training datasets are helpful in improving performance of gradient boosting machine-style algorithms (gbm, lightgbm, xgboost)
- Sparse training datasets do not work well for algorithms operating with multi-dimensional geometry concepts (like KNN regression) or for deep learning algorithms (various types of Neural Networks) — dense datasets will work better for them
As a rule of thumb, you should make a conscious decision about whether to feed dense or sparse datasets to your ML solutions, based on the algorithms you plan to apply.