How you could’ve got a silver medal in Kaggle’s 2022 Jigsaw competition

Don’t overfit it!

This could’ve been you
Photo by Lucas Alexander on Unsplash

Arrogant forecasters don’t exist

The interesting thing about forecasting at Gousto is that every week I get a new test set. Just because a model worked well on last week’s data doesn’t mean it’ll work well on next week’s. The challenge lies in properly validating models before putting them into production.

How do you learn the subtleties of cross-validation? With practice, and this is where Kaggle comes into the picture. In the words of Competitions Grandmaster Jean-Francois Puget:

unbiased evaluation is the single most important feature of Kaggle. It teaches kagglers that properly evaluating predictive model performance is key.

Once the test score has been revealed, that’s it, you can’t go back and change your model. If you overfitted, it’s too late. Let’s talk about overfitting in the context of the 2022 Jigsaw Competition.

Don’t you know that you’re toxic?

The goal of the 2022 Jigsaw competition is, given a pair of sentences, to determine which one is more toxic. Because toxicity isn’t objective, the ground truth for this competition has been determined by human annotators. Solutions are evaluated in terms of Average Agreement with Annotators.

For example, suppose we have the sentences “you’re so stupid!” and “shut up!”, and two human annotators thought the first one was more toxic, while a third thought the second one was. If you give a toxicity score of 0.1 to the first sentence and 0.2 to the second, you rank the second as more toxic, which matches only the third annotator, so your Average Agreement with Annotators for this pair would be 1/3. Swap the two scores and it would be 2/3. Your goal is to maximise this number based on the toxicity scores you assign to each sentence you’re provided with.
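
To make the metric concrete, here is a minimal sketch of how agreement could be computed for the two example sentences. The function name and the (less_toxic, more_toxic) layout of the annotations are my own assumptions for illustration, not the competition’s official scoring code.

```python
def average_agreement(scores, annotations):
    """Fraction of annotator judgements that the scores agree with.

    scores: dict mapping each text to its toxicity score.
    annotations: list of (less_toxic, more_toxic) pairs, one per annotator vote.
    (Hypothetical helper; the layout is an assumption made for this sketch.)
    """
    agreed = sum(scores[less] < scores[more] for less, more in annotations)
    return agreed / len(annotations)


# Two annotators judged "you're so stupid!" more toxic; one judged "shut up!" more toxic.
annotations = [
    ("shut up!", "you're so stupid!"),
    ("shut up!", "you're so stupid!"),
    ("you're so stupid!", "shut up!"),
]

# Scoring "shut up!" higher sides with the single dissenting annotator only.
minority = average_agreement({"you're so stupid!": 0.1, "shut up!": 0.2}, annotations)

# Scoring "you're so stupid!" higher sides with the two-annotator majority.
majority = average_agreement({"you're so stupid!": 0.2, "shut up!": 0.1}, annotations)

print(minority, majority)
```

Only the ordering of the two scores matters here: any pair of scores that ranks the first sentence higher gets the same agreement.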

This subtitle overfits

By inspecting the most popular public notebooks, you could notice a couple of red flags. Some of them contained really arbitrary scalings such as

for i in range(801, 1200):
    df_test['score'][i] = df_test['score'][i] * 1.34

They evidently came up with some magic numbers which happened to score high on the public leaderboard (approx. 5% of the final test set) without regard to generalisation. Overfitting doesn’t get much more overt than this.
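
Why is picking magic numbers on a small public subset so dangerous? The toy simulation below is a sketch on entirely synthetic data: every candidate “magic number” is equally good (65% true accuracy, an arbitrary choice), the public subset is 5% of the pairs, and the total number of pairs and candidates are made-up values. Selecting the candidate with the best public score still produces a public score that looks better than its score on the remaining 95%.

```python
import random

rng = random.Random(0)

# All values below are assumptions chosen for the demo, not competition figures.
N_PAIRS = 10_000        # size of the synthetic test set
PUBLIC_SHARE = 0.05     # share of pairs visible on the public leaderboard
N_CANDIDATES = 100      # number of "magic number" variants tried
TRUE_ACCURACY = 0.65    # every candidate is equally good in reality

n_public = int(N_PAIRS * PUBLIC_SHARE)


def simulate_candidate():
    """Return (public_score, private_score) for one equally-skilled candidate."""
    hits = [rng.random() < TRUE_ACCURACY for _ in range(N_PAIRS)]
    public = sum(hits[:n_public]) / n_public
    private = sum(hits[n_public:]) / (N_PAIRS - n_public)
    return public, private


candidates = [simulate_candidate() for _ in range(N_CANDIDATES)]

# Choose the candidate the way a leaderboard chaser would: by public score.
best_public, its_private = max(candidates)

print(f"best public score:     {best_public:.3f}")
print(f"same candidate, private: {its_private:.3f}")
```

The gap between the two printed numbers is pure selection noise, since no candidate is genuinely better than another in this setup.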

But even many of the public notebooks without these magic scalings were overfitting. You could conclude this by forking them and calculating their validation scores, which were much lower than their public leaderboard scores. This kind of overfitting is harder to spot — if you want to detect it, cross-validation is a must.

Keep it simple

Based on the above, I had a strong feeling that most competitors were heavily overfitting to the public leaderboard, probably without even realising it. So I figured I’d try the following:

  • make a simple solution
  • check that its validation score beats that of the overfitted public notebooks
  • submit

This meant ignoring the public leaderboard, which is never easy.

Model-wise, I tried out Detoxify, an open-source pre-trained model which, given some texts, predicts a numeric score for each of: toxic, severe_toxic, obscene, threat, insult, identity_hate. I then found the linear combination of these scores which maximised my local validation score. That’s it, that was the whole solution.
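
One way such a linear combination could be searched for is sketched below. This is an illustration under stated assumptions, not a record of the exact procedure: the per-label scores are random stand-ins rather than real Detoxify outputs, the random search is a technique I have swapped in for simplicity, and all function names are hypothetical.

```python
import random

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def combine(features, weights):
    """Weighted sum of the per-label scores for one text."""
    return sum(w * x for w, x in zip(weights, features))


def agreement(pairs, weights):
    """Share of (less_toxic, more_toxic) feature pairs ranked correctly."""
    hits = sum(combine(less, weights) < combine(more, weights) for less, more in pairs)
    return hits / len(pairs)


def random_search(pairs, n_trials=500, seed=0):
    """Try random weight vectors and keep the best one seen (a simple stand-in optimiser)."""
    rng = random.Random(seed)
    best_weights = [1.0] * len(LABELS)  # baseline: plain sum of all labels
    best_score = agreement(pairs, best_weights)
    for _ in range(n_trials):
        weights = [rng.random() for _ in LABELS]
        score = agreement(pairs, weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def hidden_annotator_view(features):
    # Assumption for the demo only: annotators mostly react to "toxic" and "insult".
    return 0.7 * features[0] + 0.3 * features[4]


# Synthetic validation pairs: random per-label scores, ordered by the hidden rule above.
data_rng = random.Random(42)
pairs = []
for _ in range(300):
    a = [data_rng.random() for _ in LABELS]
    b = [data_rng.random() for _ in LABELS]
    less, more = sorted((a, b), key=hidden_annotator_view)
    pairs.append((less, more))

baseline = agreement(pairs, [1.0] * len(LABELS))
weights, tuned = random_search(pairs)
print(f"plain sum: {baseline:.3f}, tuned weights: {tuned:.3f}")
```

With only six weights there is little room to memorise the validation set, but tuning and scoring on the same pairs is still optimistic, so holding out a fold for the final check keeps the estimate honest.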


My model:

  • validation score: 0.691
  • public leaderboard: 0.753

By contrast, a very highly-upvoted public notebook achieved:

  • validation score: 0.672
  • public leaderboard: 0.873

Bingo. As long as my validation score was higher, I wasn’t bothered by being low on the public leaderboard. This ended up paying off — when the private leaderboard was revealed, my model scored 0.801, while the aforementioned public notebook only scored 0.762.

The takeaway, as with all competitions, is simple: trust your CV.

Want more?

Kaggle’s great, but what if you want to apply your Data Science skills in practice in a truly awesome organisation? If you’re based in the UK, the good news is that Gousto’s hiring! Message me, or apply directly at




Gousto Engineering & Data Blog

Marco Gorelli

Data Scientist, pandas maintainer, Kaggle competitions expert, Univ. of Oxford MSc
