ML Code Completeness Checklist
Last year Joelle Pineau launched the Reproducibility Checklist to facilitate reproducible research presented at major ML conferences (NeurIPS, ICML, …). Most items on the checklist focus on components of the paper.
One item on that checklist is “provide a link to source code”, but little guidance has been given beyond this. We at Papers with Code host the largest collection of paper implementations in one place, so we collated the best practices we’ve seen used by the most popular research repositories.
We summarized these best practices into the ML Code Completeness Checklist, which is now part of the official NeurIPS 2020 code submission process and will be available to reviewers to use at their discretion.
ML Code Completeness Checklist
With the goal of enhancing reproducibility and enabling others to more easily build upon published work, we introduce the ML Code Completeness Checklist.
The ML Code Completeness Checklist assesses a code repository based on the scripts and artefacts that have been provided within it. It checks a code repository for:
- Dependencies — does a repository have information on dependencies or instructions on how to set up the environment?
- Training scripts — does a repository contain a way to train/fit the model(s) described in the paper?
- Evaluation scripts — does a repository contain a script to calculate the performance of the trained model(s) or run experiments on models?
- Pretrained models — does a repository provide free access to pretrained model weights?
- Results — does a repository contain a table/plot of main results and a script to reproduce those results?
Each repository can score between 0 ticks (has none of the items) and 5 ticks (has all of them). More details on the criteria for each item can be found in our GitHub repository.
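To make the scoring concrete, here is a minimal sketch of how a repository could be checked automatically against the five items. The keyword heuristics below are invented for illustration; the official criteria and scripts live in our GitHub repository.

```python
import re

# Hypothetical keyword heuristics for each checklist item -- the real
# criteria (in the Papers with Code GitHub repository) are more careful.
CHECKS = {
    "dependencies": r"requirements\.txt|environment\.yml|setup\.py|Dockerfile",
    "training":     r"train\.(py|sh)|training",
    "evaluation":   r"eval|test\.(py|sh)|benchmark",
    "pretrained":   r"pretrained|checkpoint|model zoo|\.ckpt|\.pth",
    "results":      r"results|leaderboard|accuracy",
}

def checklist_score(readme_text, file_names):
    """Return a 0-5 tick count for a repository, one tick per item found."""
    corpus = readme_text + "\n" + "\n".join(file_names)
    return sum(bool(re.search(pattern, corpus, re.IGNORECASE))
               for pattern in CHECKS.values())

readme = "## Results\nPretrained checkpoints are available for download."
files = ["requirements.txt", "train.py", "evaluate.py"]
print(checklist_score(readme, files))  # 5
```

A real scorer would inspect file contents and link targets rather than just names, but the idea is the same: each item is a binary check, and the ticks are summed.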
What’s the evidence that checklist items encourage more useful repositories?
The community commonly uses GitHub stars as a proxy for repository usefulness. Therefore, our expectation is that repositories scoring higher on the ML Code Completeness Checklist should also tend to have more GitHub stars.
To verify this hypothesis we selected the 884 GitHub repositories submitted as official implementations to NeurIPS 2019 papers. We randomly selected a 25% subset of these 884 repositories and manually scored them on the ML Code Completeness Checklist.
We grouped this sample of NeurIPS 2019 GitHub repositories by how many ticks they have on the ML Code Completeness Checklist and plotted the median GitHub stars in each group. The result is below:
NeurIPS 2019 repositories with 0 ticks had a median of 1.5 GitHub stars. In contrast, repositories with 5 ticks had a median of 196.5 GitHub stars. Only 9% of repositories had 5 ticks, and most repositories (70%) had 3 ticks or fewer.
We also ran the Wilcoxon rank-sum test and found that the number of stars in the 5-tick class is significantly higher (p-value < 1e-4) than in all other classes, except for the 5-vs-4 comparison, where the p-value is a borderline 0.015. You can see the data and code behind this figure in our GitHub repository.
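For readers unfamiliar with the test, the sketch below computes the Wilcoxon rank-sum statistic with its standard normal approximation, using only the standard library (scipy.stats.ranksums computes the same statistic). The star counts are made up for illustration and are not the study data.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    combined = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # average rank for ties
        i = j
    r1 = sum(ranks[v] for v in x)            # rank sum of the first sample
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2              # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Made-up star counts for illustration only -- not the study data.
five_ticks = [210, 180, 250, 160, 300, 190]
zero_ticks = [1, 3, 0, 2, 5, 1]
z, p = rank_sum_test(five_ticks, zero_ticks)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ≈ 2.88, p ≈ 0.004
```

Because the test only uses ranks, it is insensitive to the heavy right tail of star counts, which is why it suits this comparison better than a t-test.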
To examine if this relationship holds more broadly, we created a script to automate the checklist calculation from the repository README and its associated code. We then repeated the analysis on the whole set of 884 NeurIPS 2019 repositories, and also on a wider set of 8,926 code repositories for all ML papers published in 2019. In both cases, we got qualitatively the same result, with median stars monotonically increasing with ticks in a statistically significant way (p-value < 1e-4). Finally, using robust linear regression, we found that pretrained models and results have the largest positive impact on GitHub stars.
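The post does not say which robust estimator was used, so the following is only an illustrative sketch: a Huber M-estimator fit by iteratively reweighted least squares, which is one standard choice (statsmodels' RLM is the usual library route). The data points are made up, not the study data.

```python
import math

def huber_line_fit(x, y, delta=1.345, iters=50):
    """Fit y = a + b*x by iteratively reweighted least squares with
    Huber weights: points with large residuals get downweighted, so a
    few viral repositories cannot dominate the fit."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # rough robust scale estimate: median absolute residual / 0.6745
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        # Huber weights: 1 inside delta standard scales, shrinking outside
        w = [1.0 if abs(r) / s <= delta else delta * s / abs(r) for r in resid]
        sw = sum(w)
        xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
        b = (sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
             / sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x)))
        a = ym - b * xm
    return a, b

# Made-up (ticks, log10 stars) pairs for illustration -- not the study data.
ticks     = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
log_stars = [0.2, 0.5, 0.6, 0.9, 1.0, 1.4, 1.5, 1.8, 1.9, 2.2, 2.2, 2.4]
intercept, slope = huber_line_fit(ticks, log_stars)
print(f"slope = {slope:.2f}")  # positive: more ticks, more stars
```

In the actual analysis, each checklist item would enter as its own regressor, and the fitted coefficients indicate which items (here, pretrained models and results) contribute most to stars.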
We feel this is useful evidence that encouraging researchers to include all the components stipulated by the ML Code Completeness Checklist leads to more useful repositories, and that the checklist score is indicative of higher-quality submissions.
At this time, we are not claiming that the suggested 5 checklist items are the only, or even the biggest, contributors to a repository's popularity. Other factors are likely to influence popularity as well, such as the size of the scientific contribution, marketing (e.g. blog posts and Twitter), documentation (comprehensive READMEs, tutorials and API documentation), code quality, and previous work.
Some example NeurIPS 2019 repositories with 5 ticks are:
We acknowledge that while we aimed to make the checklist as general as possible, it might not be fully applicable to all types of papers, e.g. theoretical or dataset papers. However, even if the primary goal of a paper is to introduce a dataset, it can still benefit from releasing baseline models with training scripts, evaluation scripts, and results.
Start using it
To make it easier for reviewers and users to understand what is included in a repository, and for us to score it correctly, we provide a collection of best practices around writing README.md files, specifying dependencies, and releasing pretrained models, datasets, and results.
Our recommendation is to clearly lay out these 5 elements in your repository, and to link to any external resources, such as papers and leaderboards, to provide more context and clarity to your users.
These are the official code submission recommendations at NeurIPS 2020.
Let’s work together to improve reproducibility in our field and help advance science!