The title of this post could be the question of Hamlet, the prince of Denmark, if he were a Data Scientist :-)
Back to serious matters: choosing between dense and sparse datasets for model training and prediction is one of the important decisions Data Scientists make when building Machine Learning solutions. Moreover, the need to make this choice early in a project exposes a fundamental limitation of the textbook machine learning project workflow.
In both academic courses and post-graduate trainings in Data Science and Machine Learning, a typical Data Science and Machine Learning project is broken down into the following phases:
- Exploratory Data Analysis (EDA)
- Data pre-processing and feature engineering
- Model training and forecasting
- (Optionally) Packaging the ML solution as a data product
- Preparing a research report with key discoveries and ML solution description
In reality, things are more complex than academia suggests. Data pre-processing / feature engineering and model training / forecasting are tightly interconnected:
- Some data pre-processing and feature engineering steps are consciously performed with specific predictive algorithms in mind
- Suboptimal forecasting accuracy of generally accurate ML algorithms on a particular project signals that data pre-processing and feature engineering need to change
Using (or not using) sparse datasets in model training is one of the tangible examples of this interplay. Sparse datasets can strongly improve the forecasting accuracy of some algorithms while drastically degrading it for others.
I have made a case study of the impact of sparse vs. dense dataset inputs on the forecasting accuracy of predictive models for the Kaggle Recruit Restaurant Visitor Forecasting competition (https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting).
Machine Learning Experiment Design
In the course of this case study, two variants of data pre-processing and feature engineering were implemented. Both variants generated the same set of numeric features and shared the same missing-data imputation and target variable transformation steps. They differed only in the way categorical features were encoded.
For each data pre-processing and feature engineering variant, the same set of machine learning algorithms was trained with 4-fold CV:
- xgboost (using Python)
- lightgbm (using Python)
- GBM (using Python)
- KNNRegressor (using Python)
- CNN (using the R wrapper for h2o and a local instance of the h2o server)
Each pre-processing and feature engineering variant is briefly described in the following subsections.
Variant 1 used label encoding for categorical features (producing dense datasets as a result). The code for the GBM model implementing this approach is presented below.
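A minimal sketch of Variant 1, assuming scikit-learn's GradientBoostingRegressor as the GBM implementation; the synthetic data frame and its column names are hypothetical stand-ins, loosely modelled on the competition's features:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
# Hypothetical stand-in for the competition data (column names illustrative)
df = pd.DataFrame({
    "air_genre_name": rng.choice(["Izakaya", "Cafe", "Bar"], n),
    "day_of_week": rng.choice(["Mon", "Tue", "Fri", "Sat"], n),
    "reserve_visitors": rng.integers(0, 30, n),
})
df["visitors"] = df["reserve_visitors"] * 2 + rng.integers(0, 5, n)

# Variant 1: label-encode each categorical column in place,
# yielding a compact dense numeric matrix (one column per feature)
for col in ["air_genre_name", "day_of_week"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns="visitors"), df["visitors"]
gbm = GradientBoostingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(gbm, X, y, cv=4, scoring="neg_mean_squared_error")
```

Label encoding keeps the feature matrix narrow, but it imposes an arbitrary ordinal relation on the category codes, which tree-based models can partially work around and distance-based models cannot.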
Variant 2 performed one-hot encoding of categorical features (producing sparser datasets as a result). The code for the GBM model implementing this approach is presented below.
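A minimal sketch of Variant 2 under the same assumptions (scikit-learn GBM, hypothetical stand-in data); the only change from Variant 1 is the encoding step:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
# Hypothetical stand-in for the competition data (column names illustrative)
df = pd.DataFrame({
    "air_genre_name": rng.choice(["Izakaya", "Cafe", "Bar"], n),
    "day_of_week": rng.choice(["Mon", "Tue", "Fri", "Sat"], n),
    "reserve_visitors": rng.integers(0, 30, n),
})
df["visitors"] = df["reserve_visitors"] * 2 + rng.integers(0, 5, n)

# Variant 2: one-hot encode categorical columns, producing a wider,
# mostly-zero matrix (pass sparse=True to get_dummies for a truly
# sparse pandas representation)
X = pd.get_dummies(df.drop(columns="visitors"),
                   columns=["air_genre_name", "day_of_week"])
y = df["visitors"]

gbm = GradientBoostingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(gbm, X, y, cv=4, scoring="neg_mean_squared_error")
```

Each category value becomes its own binary column, so the boosting trees can split on individual categories directly instead of on arbitrary label codes.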
Variant 2 of data pre-processing and feature engineering proved to improve the performance of gradient boosting-type algorithms as follows:
- xgboost — PubLB score 0.495 (vs. 0.498 with Variant 1 pre-processing and feature engineering)
- GBM — PubLB score 0.505 (vs. 0.509 with Variant 1 pre-processing and feature engineering)
- lightgbm — PubLB score 0.501 (vs. 0.502 with Variant 1 pre-processing and feature engineering)
The additional effects of using sparse datasets with gradient boosting machine-style algorithms were
- reduced complexity of models
- decreased time to train the best model (especially noticeable for lightgbm)
However, Variant 2 degraded the performance of the two other algorithms I tried:
- KNNRegressor dropped its performance from 0.524 (on my best attempt) to below 0.53
- A similar situation occurred with the CNN based on H2O
Dataset Sparsity and Performance of ML Solutions
Properly handling sparse data may affect the performance of your Python scikit-learn-based ML solutions. If your data sparsity ratio is high, you may switch from Pandas dataframes to sparse matrices (as suggested in http://scikit-learn.org/stable/modules/computational_performance.html).
You can calculate the sparsity ratio of your input dataset with the simple code fragment below.
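A minimal version, following the approach shown in the scikit-learn computational performance docs (the threshold in the comment comes from that page):

```python
import numpy as np
from scipy import sparse

def sparsity_ratio(X) -> float:
    """Fraction of zero-valued entries in a 2-D array."""
    return 1.0 - np.count_nonzero(X) / float(X.size)

X = np.array([[0, 0, 3],
              [0, 5, 0],
              [0, 0, 0]])
ratio = sparsity_ratio(X)   # 7 of 9 entries are zero
print(f"sparsity ratio: {ratio:.2f}")  # -> 0.78

# If the ratio is high (the scikit-learn docs suggest > ~90%),
# a CSR sparse matrix can save memory and speed up prediction
X_sparse = sparse.csr_matrix(X)
```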
The machine learning experiment performed for this case study illustrated that:
- Sparse training datasets are helpful in improving performance of gradient boosting machine-style algorithms (gbm, lightgbm, xgboost)
- Sparse training datasets do not work well for algorithms operating with multi-dimensional geometry concepts (like KNN regression) or for deep learning algorithms (various types of Neural Networks) — dense datasets will work better for them
As a rule of thumb, you should make a conscious decision about whether to feed dense or sparse datasets to your ML solutions, based on the algorithms you plan to apply.