Is Deep Learning facing a reproducibility crisis?
Over the last few years, several papers have observed that the progress attributed to Deep Learning in some research fields, including Recommender Systems and Information Retrieval, may not be as strong as we thought.
Deep Learning in Recommender Systems
Two papers on this topic were presented at the RecSys ’19 conference.
The first one¹ compares 4 neural approaches to session-based recommendation with simpler and older baselines, such as nearest-neighbor methods. Almost all of them performed worse than some of the baselines on all 7 datasets used. The exception was the NARM algorithm, which scored better than the baselines, but on only 1 of the 7 datasets.
In the second paper², researchers systematically evaluated recent deep learning algorithms for the top-n recommendation problem. It turns out that only 7 out of 18 techniques had published their code or provided it upon request. Of those 7, 6 could often be outperformed by older, comparably simple baselines.
So, is Deep Learning facing a reproducibility crisis?
No. The entire field of Machine Learning is.
Although both papers focused on Deep Learning models, the causes the authors identified for these overestimated metrics do not depend on whether a model is “neural” or not.
We will see why in the next part.
Reproducibility issues for non-neural approaches
Another paper³, from 2009, performed a similar analysis in the field of Information Retrieval. Back then, Deep Learning had not yet taken off, so none of the evaluated algorithms were based on Neural Networks. The authors observed that almost all the new techniques proposed from 1998 to 2008 that reported good results on the TREC datasets were compared against baselines whose performance was far below the state of the art.
Weaker baselines were chosen because they were easier to implement and to improve with the new techniques, whereas implementing the state of the art would have taken much more effort. However, the authors showed that the improvements were not always additive. For instance, a new technique might target a baseline’s weakness that a more advanced model had already solved, so it brought no gain when applied to the latter.
Several factors can lead to overestimated performance and reproducibility issues. Open-sourcing the code used for the pre-processing, training, hyperparameter tuning and evaluation stages is critical to addressing them.
Weak baseline evaluation
The chosen baselines are not advanced enough to represent a real challenge for the task at hand, or they are advanced enough but not tuned well enough to be competitive. Moreover, the new, poorly evaluated techniques are then used as baselines themselves, making the whole evaluation process unreliable. The hyperparameter tuning process is extremely important, yet it is often missing from the published codebase.
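As a minimal sketch of what tuning a baseline and publishing the search could look like, the snippet below grid-searches a nearest-neighbor classifier with scikit-learn. The synthetic dataset, the choice of model, and the grid values are all assumptions made for illustration; they do not come from the papers discussed here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): any real task would plug in here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The grid is part of the experiment, so it lives in the published code.
grid = {"n_neighbors": [1, 5, 15, 31], "weights": ["uniform", "distance"]}

# Cross-validation happens inside the training set only.
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
search.fit(X_train, y_train)

baseline_params = search.best_params_
# The test set is used once, after the hyperparameters are fixed.
baseline_test_accuracy = search.score(X_test, y_test)
print(baseline_params, round(baseline_test_accuracy, 3))
```

Giving the baseline the same kind of search as the proposed model is what makes the comparison fair. The grid itself matters less than the fact that it is written down.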
One evaluation dataset
Different datasets can have very different characteristics. For instance, it seems that the best approach for the top-n recommendation task on the Epinions dataset is to suggest the n most popular items², independently of the users’ history and their possible preferences. This variation in a model’s behavior depending on the data highlights the need to evaluate it over multiple datasets.
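To make this concrete, here is a small sketch of a most-popular baseline scored on two toy datasets. The interaction lists, the held-out items, and the simple hit-rate metric are invented for illustration. A real popularity baseline would typically also filter out items the user has already seen, which is omitted here.

```python
from collections import Counter

def top_n_popular(interactions, n):
    """Return the n item ids that appear most often in the training interactions."""
    counts = Counter(item for _user, item in interactions)
    return [item for item, _count in counts.most_common(n)]

def hit_rate(recommended, held_out):
    """Fraction of users whose held-out item is among the recommended items."""
    hits = sum(1 for item in held_out.values() if item in recommended)
    return hits / len(held_out)

# Two hypothetical datasets: (training interactions, one held-out item per user).
datasets = {
    "toy_popular": (
        [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u4", "b"), ("u1", "c")],
        {"u1": "b", "u2": "b", "u3": "a", "u4": "a"},
    ),
    "toy_niche": (
        [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "b"), ("u1", "b")],
        {"u1": "c", "u2": "d", "u3": "e", "u4": "a"},
    ),
}

# The same baseline, with the same n, is evaluated on every dataset.
results = {
    name: hit_rate(top_n_popular(train, 2), held_out)
    for name, (train, held_out) in datasets.items()
}
print(results)  # {'toy_popular': 1.0, 'toy_niche': 0.25}
```

The same trivial method looks perfect on one dataset and poor on the other, which is why a result on a single dataset says little about a model.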
Not-so-random test set sampling
It is crucial to have a random test set whose distribution matches that of the training set. Releasing the training and test sets is useful, but we need further checks on the validity of the split. Some papers with strong results actually used a test set that was very unlikely to have come from random sampling². When the method was run on a new, truly random test set, its performance was worse.
Using a fixed training and test set could be helpful, as long as they are correctly obtained, but the problem is often more complex since it is not always possible to release the data, e.g., for privacy or medical reasons⁴.
In any case, publishing the pre-processing code can be extremely helpful for reproducibility.
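One way to make a split checkable is to publish the splitting code with a fixed seed and run a quick sanity check on it. The sketch below uses only the standard library. The synthetic ratings, the seed, and the comparison of means are assumptions chosen for illustration, and a real check would compare more than a single statistic.

```python
import random
from statistics import mean

def random_split(samples, test_fraction, seed):
    """Shuffle a copy of the samples with a fixed seed and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic ratings (assumption), only to have something to split.
data_rng = random.Random(0)
ratings = [data_rng.gauss(3.5, 1.0) for _ in range(10_000)]

# A seeded random split: anyone can regenerate exactly the same sets.
train, test = random_split(ratings, test_fraction=0.2, seed=42)
random_gap = abs(mean(train) - mean(test))

# A hand-picked split for contrast: the highest 20% of values go to the test set.
ordered = sorted(ratings)
biased_train, biased_test = ordered[:8000], ordered[8000:]
biased_gap = abs(mean(biased_train) - mean(biased_test))

print(f"random split gap: {random_gap:.3f}, hand-picked split gap: {biased_gap:.3f}")
```

A large gap between training and test statistics does not prove that a split was manipulated, but it is a cheap signal that the split deserves a closer look.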
Training/tuning on the test set
This is an important concept that cannot be stressed enough.
Use the test set only for testing!
The more you run your version of the model on the test set, the more you will overfit to it and, thus, obtain unreliable results. Some papers measure the performance on the test set at each epoch and use it to decide the best epoch for early stopping². Never do that. The number of epochs is a hyperparameter and should be chosen based on the validation error.
Data leakage is similar to the previous issue, but more subtle and harder to spot.
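A minimal sketch of early stopping driven by the validation set could look like the following. The helper name, the patience rule, and the error values are hypothetical. In a real run, the list would hold the validation error measured after each training epoch.

```python
def best_epoch_with_patience(validation_errors, patience):
    """Return the index of the best epoch, stopping once the validation error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation errors, one per epoch (illustrative values only).
validation_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.48]

best_epoch = best_epoch_with_patience(validation_errors, patience=3)
print(best_epoch)  # 3

# Only now would the model saved at `best_epoch` be evaluated on the test set, once.
```

The test set never appears in the loop. It is touched a single time, after the number of epochs has been fixed.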
Data leakage is when information from outside the training dataset is used to create the model.⁵
One clear example is feature normalization.
It is recommended to normalize each numerical feature independently so that they all have the same importance in the computation of the error. A common mistake is to fit two distinct scalers, one for the training set and one for the test set.
This leads to data leakage because you are modifying the test set based on information from the test set itself, whereas it should be used only for testing, not for pre-processing. The correct way to normalize it is to fit the scaler on the training set and then transform both the training set and the test set using the scaler’s computed parameters.
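from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training set only, then apply the learned mean and variance to it.
train = scaler.fit_transform(train)
# Reuse the training statistics on the test set; never refit the scaler on it.
test = scaler.transform(test)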
This is also important in a real-world scenario. The samples you need to predict may not arrive all together. If you have only one sample, how can you fit a scaler on it?
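The same rule applies inside cross-validation, where it is easy to scale the whole dataset before splitting it into folds. One way to avoid this, sketched below, is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler is refit on the training part of each fold. The synthetic data and the choice of logistic regression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaler and model travel together, so each fold fits the scaler on its own training part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# At prediction time a single sample is scaled with the statistics learned during fit.
model.fit(X, y)
single_prediction = model.predict(X[:1])
print(single_prediction)
```

The last two lines also answer the single-sample question: the fitted pipeline carries the training statistics with it, so nothing has to be fit on the incoming sample.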
Using a subset of the test set
If we are using a big and complex model, we may not have enough time and computational resources to train and evaluate it on the whole dataset. For this reason, a subset of the test data may be used. Since this test data differs from the data the baselines were evaluated on, a comparison with models evaluated on the entire test set is less reliable.
All the factors mentioned are independent of the type of technique considered. The only one that applies more to Deep Learning methods is the last, because they tend to require more resources.
This highlights the fact that reproducibility issues are not limited to Deep Learning, and that a huge effort is being put, and will have to be put, into incentivizing code publishing and checking for erroneous evaluations.
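If a subset is unavoidable, one option is to draw it once with a fixed seed, publish the indices, and score every model, baselines included, on exactly that subset. The sketch below assumes hypothetical labels and predictions that are generated only to make it runnable.

```python
import random

def sample_test_indices(n_test, subset_size, seed):
    """Draw a reproducible random subset of test-set indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_test), subset_size))

def accuracy_on(indices, predictions, labels):
    """Accuracy restricted to the given test indices."""
    correct = sum(1 for i in indices if predictions[i] == labels[i])
    return correct / len(indices)

# Hypothetical labels and predictions (assumption), standing in for real model outputs.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(1000)]
baseline_predictions = [y if rng.random() < 0.7 else 1 - y for y in labels]
model_predictions = [y if rng.random() < 0.8 else 1 - y for y in labels]

# One shared subset: both models are scored on the same 200 test samples.
subset = sample_test_indices(len(labels), subset_size=200, seed=123)
baseline_accuracy = accuracy_on(subset, baseline_predictions, labels)
model_accuracy = accuracy_on(subset, model_predictions, labels)
print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}")
```

Scores obtained this way are comparable with each other, but still not with numbers reported on the full test set.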
Journals are pushing these guidelines and many tools and platforms are being developed to support researchers in implementing and comparing models.
Hopefully, one day every newly published model will come with public code and a Docker image that can simply be downloaded and run… Hopefully.
This is a blog post published by the PoliMi Data Scientists community. We are a community of students of Politecnico di Milano that organizes events and writes resources on Data Science and Machine Learning topics.
If you have suggestions or want to get in touch with us, you can write to us on our Facebook page.
¹Ludewig, et al. “Performance comparison of neural and non-neural approaches to session-based recommendation.” Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 2019.
²Dacrema, Cremonesi, Jannach. “Are we really making much progress? A worrying analysis of recent neural recommendation approaches.” Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 2019.
³Armstrong, et al. “Improvements that don’t add up: ad-hoc retrieval results since 1998.” Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009.
⁴McDermott, et al. “Reproducibility in machine learning for health.”
⁵Brownlee. “Data Leakage in Machine Learning.”