The Reproducibility Challenge as an Educational Tool

Jesse Dodge
PapersWithCode
Sep 22, 2020


The ML Reproducibility Challenge is a global challenge to reproduce papers published in 2020 in top machine learning, computer vision and NLP conferences. The challenge provides a unique opportunity for researchers from around the world to participate in elevating the quality, visibility, and reliability of machine learning research results. It is also a fantastic opportunity for young scientists to learn about state-of-the-art results, while contributing to scientific knowledge and our research community.

Over the past 3 years, several course instructors have incorporated the ML reproducibility challenge as a component of their course, typically as the final course project. In this blog post we provide a detailed description of how to set this up, as well as several useful complementary resources.

Although the challenge is open to everyone, we found that participation was particularly beneficial to Master’s and PhD students at universities. By reproducing recently accepted papers, students not only learn about state-of-the-art methods, but also learn how to read a paper, reproduce results, build on other people’s research, and write a technical report. These are foundational skills in both industry and academia, and they can be acquired and perfected through practical work. Beyond the advantages listed above, course instructors appreciate the fact that the project “renews” itself every year with a new batch of papers, and students appreciate that they are concretely contributing to progress in AI research and going beyond their academic studies.

We are writing this blog post to share our experiences, which we hope will be useful for other instructors who would like to incorporate the reproducibility challenge as part of their course. In addition to this blog post, we are also happy to answer any questions in person or over email at reproducibility.challenge@gmail.com.

The ML Reproducibility Challenge

For this year’s machine learning reproducibility challenge, we expand the scope to cover 7 top AI conferences in 2020 across machine learning, natural language processing, and computer vision: NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR and ECCV. Similar to previous years, participants will select a published paper, and attempt to reproduce its central claims. The objective is to assess if the conclusions reached in the original paper are reproducible; for many papers replicating the presented results exactly isn’t possible, so the focus of this challenge is to follow the process described in the paper and attempt to reach the same conclusions. In addition to providing a collection of reproductions available to the research community, this provides an opportunity for students to walk through the process of writing a paper guided by published research.

One goal of publishing scientific work is to enable future readers to build upon it, or to take lessons learned and apply them to a new problem of interest. This challenge emulates that process, where often the first step to build on past work is to successfully implement it.

In essence, a successful reproduction acts as a proof-of-concept for the original paper’s ability to be built upon. This also provides an excellent educational opportunity for junior scientists, where they can be guided through the research process by an existing publication, to practice walking through what it takes to run experiments and write a paper. Indeed, reports written as class projects for previous reproducibility challenges have turned into submissions at major AI conferences.

Additional benefits of submitting reports to the official challenge website

While it’s possible to ask students to reproduce papers just as a course assignment or project and keep the results of this work within the confines of the classroom, we believe there are benefits to participating in the official challenge. All submissions to the challenge will be peer-reviewed, so this is an opportunity for students to experience this part of the publication process as well. The collection of reproduction reports will be publicly shared on OpenReview and act as a complementary resource alongside the original papers. A subset of the reports will be selected to appear next to the original paper on Papers with Code. In addition, some of the reports will be published in a special issue of the journal ReScience. Through our support from conference organizers, we also encourage original paper authors to be part of the reproducibility challenge, so students have an opportunity to interact with them, ask them questions, and help improve papers.

Introducing the first-page abstract [page 1 here]

This challenge differs from previous years in that we are requiring reports to begin with a one-page structured abstract, summarizing the results. This is partly motivated by the success of the structured abstracts found as the first page of submissions to many journals in the clinical, behavioral, and biological sciences. This acts as a short summary of what is found in the rest of the report; the scope of the reproduction and the methodology place the reproduction in context, and the sections describing the results provide a glimpse of the conclusions. A reader might find this report when searching for the original paper and this title page will provide information about whether the report contains pertinent information for them.

Introducing the optional report template [pages 2–4 here]

We also propose an optional template for the rest of the reproducibility report. Parts of this template replicate the structure of a paper; as an educational tool, this gives students an opportunity to practice describing the setup of and results from their experiments. The template also provides designated places for the items from the ML reproducibility checklist, reminding authors of scientifically relevant details to report.

In addition to the educational benefits, this template will allow readers of a report to quickly look up information they’re interested in. For example, a practitioner interested in estimating if they have enough compute to use a given model can check Section 3.5, and once they decide to use the model on some task of interest they can check Section 3.3 for hyperparameter values (which may have been omitted from the original paper). In a classroom setting, this also facilitates easy grading — it’s simple to check if a report includes a particular item when you know where to look. The proposed report template was piloted in a UW NLP class, received positive feedback, and has been adapted for this broader challenge.

Some components of the template might be surprising. For example, we explicitly decouple the main claims in the reproduction from the evidence which supports those claims. Clearly articulating the contributions of the original work acts as practice reading and summarizing a paper, and as practice describing ideas so they are testable and appropriately scoped (this will be useful when students write their own papers). It has been noted that there is widespread “failure to distinguish between explanation and speculation” [Lipton and Steinhardt, 2018]; when claims are formulated as something akin to scientific hypotheses (“model X outperforms model Y on dataset Z”) it can be straightforward to see if the evidence presented supports them (evaluations show higher performance for X than Y).

In previous iterations of the challenge some students found that the paper they chose to reproduce underspecified some of the details needed to run the experiments. For example, even when code was released, some hyperparameters like the batch size or dropout probability might not be reported. To recreate the results from the original paper, it can be necessary to try a few experimental setups, and the optional template has a separate section for reporting the results of any additional experiments beyond the original paper (such experiments may be useful to a reader). This section was also useful for helping normalize the amount of work for each project; some papers are naturally easier to reproduce than others, in which case students could run further experiments that provide additional evidence supporting the main claims of the original paper. Some examples include varying the amount of training data, evaluating on an additional dataset, or performing ablations.
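
When hyperparameters such as the batch size or dropout probability are missing, a small sweep over plausible values is often enough to recover a working configuration. The sketch below shows one minimal way to organize such a sweep in Python; the candidate values and the train_and_evaluate stub are hypothetical placeholders rather than settings from any particular paper, and would be replaced by the reproduction’s own training code.

```python
import itertools

def train_and_evaluate(batch_size: int, dropout: float) -> float:
    """Placeholder: train the reimplemented model and return a validation metric.

    Replace this stub with the reproduction's actual training/evaluation loop.
    """
    return 0.0

# Plausible candidate values (hypothetical; not taken from any specific paper).
batch_sizes = [16, 32, 64]
dropout_probs = [0.1, 0.3, 0.5]

# Try every combination and keep track of the validation metric for each.
results = {}
for batch_size, dropout in itertools.product(batch_sizes, dropout_probs):
    results[(batch_size, dropout)] = train_and_evaluate(batch_size, dropout)

# Report the best-performing setting (higher metric is assumed to be better).
best = max(results, key=results.get)
print(f"Best setting: batch_size={best[0]}, dropout={best[1]} "
      f"(metric={results[best]:.3f})")
```

Whichever settings end up being tried, reporting them in the additional-experiments section of the template tells a future reader which configurations were explored and which one was ultimately used.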

Communication with the paper’s original authors

We strongly encourage participants in the challenge to contact the original authors of the paper and work with them, both to clarify details and to help improve the original paper. In the past, we found this communication beneficial for students, who get to interact with some of the top scientists in the field, and for the original authors, who gain someone motivated to reproduce their paper and thereby help improve its readability and reproducibility.

Detailed description and timeline of the challenge

Here we give an example description of this challenge as a course project. Each of the three numbered points can serve as a deliverable in the course. The pilot UW NLP class was an 11-week course, and we chose to have the paper selection due in week 3, the first draft due in week 7, and the final report due in week 11. We describe these three in detail below. See instructions for the challenge here.

  1. Paper selection: Participants choose a paper that was published this year at one of the seven conferences covered by this challenge. This involves writing a “Reproducibility Plan” which briefly describes how the project will proceed.
    * A few considerations for students when selecting a paper include: a) the topic of the paper falls within the scope of their course, b) they understand the technical content of the paper, c) they have access to the data needed to train and evaluate their models, and d) they have a large enough computational budget to successfully rerun the experiments.
  2. First draft: Partway through the project, students should fill out as much of the optional template as they can (pages 2–4 here). If they have preliminary results, they should be included in the results section, with a (brief) description of what experiments they will run. At this point, the students should be able to show that they have access to whatever data and models are required, and that they have made some progress on any implementations.
    * They can describe the scope of the reproduction in Section 2, describe any models used in Section 3.1, the datasets in Section 3.2, and how the models will be evaluated in Section 3.4. Any preliminary results can go in Section 4; some student groups won’t have results until the end, in which case they can include a description of what they have done and still need to do. It’s recommended that participants estimate (roughly) what the computational cost (e.g. GPU hours) will be; a back-of-the-envelope sketch of such an estimate is given after this list.
    * This stage simulates good practice when doing your own research — starting a draft of a paper before all results are in can help an author clearly articulate the ideas they’re interested in testing, and gives perspective to make sure their experiments appropriately test those ideas.
    * At this stage, instructors should evaluate whether the students have access to the resources they need, such as pretrained models and training data, and whether code is available or will need to be reimplemented. If instructors think the project is too narrow or too broad in scope, the students can adjust.
  3. Final report: Finish code, run experiments, and finish the full report.
    * Often students will find that the original paper underspecified some details necessary to reproduce the original experiments. It’s likely some details don’t make it into the original paper — this is an opportunity for the reproduction report to include that information. Any decisions that the students had to make that weren’t clear from the paper should be described (again, think about what future practitioners would want to know). Any experiments run should be reported (there is no need to only report experiments which “work”).
    * Remember that the goal of science is to produce knowledge, so the discussion should focus on what would be useful to a reader. Typically this would be framed in terms of evidence (the experimental results), and whether or not they support the claims.
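
As an illustration of the rough compute estimate recommended in the first-draft stage, the back-of-the-envelope sketch below multiplies out time per step, number of steps, and number of runs. Every number is a hypothetical placeholder; a group would substitute a measured time per step from a short trial run and the training schedule described in their chosen paper.

```python
# Back-of-the-envelope GPU-hours estimate. All numbers below are hypothetical
# placeholders; replace them with measurements from a short timing run and the
# training schedule reported in the chosen paper.
seconds_per_step = 0.45      # measured by timing a few hundred training steps
steps_per_run = 100_000      # e.g. the paper's reported training schedule
num_runs = 9                 # e.g. a 3x3 hyperparameter sweep, or several seeds

gpu_hours = seconds_per_step * steps_per_run * num_runs / 3600
print(f"Estimated budget: {gpu_hours:.1f} GPU hours")  # 112.5 GPU hours here
```

Even a crude estimate like this helps instructors and students judge early on whether a project fits within the computational budget they actually have.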

Additional resources

Some resources from previous courses:
McGill University PhD-level Applied Machine Learning course, Fall 2017
University of Washington PhD-level NLP course, Winter 2020

Lessons learned from classroom participation in previous reproducibility challenges

We have found that group sizes of around three students work well. It is helpful to articulate to the students the motivation behind each part of the project; explicitly pointing out that they will have to describe their models and datasets in the same way when writing their own research papers encourages them to view this as useful practice.

Students are still learning how to do science, so one area we found required further explanation was writing the original paper’s main claims separately from the evidence for those claims. One common mistake here was that students would write a statement about what experiments they would run (e.g. “We will evaluate model X on dataset Y”), or they would write a full paragraph explaining the results they expected to get. Instead, this should be similar to a list of “contributions” of a paper, but with statements summarizing the original paper (e.g. “Pruning a model using algorithm X uncovers trainable subnetworks that reach test accuracy comparable to the original networks.”). Then, in the results section, an experiment could show this worked (or didn’t!) for a particular model and dataset.

The optional template can also act as a grading rubric. One approach is to assign roughly half the points to Section 4 (Results), and half to the other sections (as the project is about reproducibility, clearly describing the setup is important). Some papers are naturally more work to reproduce (for example, some papers have fully automated code repositories, while others release no code at all); to help equalize the amount of work across projects, we recommend having students run ablations or other additional experiments if necessary.

Remember, if a project correctly follows the original paper but finds different results, that’s still a successful project! Showing that with good effort the original results were not replicable is valuable.

This is a guest post written by Jesse Dodge. He would also like to thank Joelle Pineau, Koustuv Sinha and Robert Stojnic for their feedback and help on this post.
