How you could’ve got a silver medal in Kaggle’s 2022 Jigsaw competition

Don’t overfit it!

This could’ve been you
Photo by Lucas Alexander on Unsplash

Arrogant forecasters don’t exist

The interesting thing about forecasting at Gousto is that every week I get a new test set. Just because a model worked well on last week’s data doesn’t mean it’ll work well on next week’s. The challenge lies in properly validating models before putting them into production.

How do you learn the subtleties of cross-validation? With practice, and this is where Kaggle comes into the picture. In the words of Competitions Grandmaster Jean-Francois Puget:

unbiased evaluation is the single most important feature of Kaggle. It teaches kagglers that properly evaluating predictive model performance is key.

Once the test score has been revealed, that’s it, you can’t go back and change your model. If you overfitted, it’s too late. Let’s talk about overfitting in the context of the 2022 Jigsaw Competition.

Don’t you know that you’re toxic?

The goal of the 2022 Jigsaw competition is, given a pair of sentences, to determine which one is more toxic. Because toxicity isn’t objective, the ground truth for this competition has been determined by human annotators. Solutions are evaluated in terms of Average Agreement with Annotators.

For example, suppose we have the sentences “you’re so stupid!” and “shut up!”, and two human annotators thought the first one was more toxic, while a third thought the second one was more toxic. If you give a toxicity score of .2 to the first sentence and .1 to the second, you agree with two of the three annotators, so your Average Agreement with Annotators for this pair would be 2/3. Your goal is to maximise this number based on the toxicity scores you assign to each sentence you’re provided with.
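
Here is a minimal sketch of how this metric could be computed for the example pair. The function name and the (less toxic, more toxic) tuple layout are my own assumptions for illustration, not the competition's official scoring code.

```python
def average_agreement(scores, annotations):
    """Fraction of annotator judgements that the scores agree with.

    scores: dict mapping each sentence to its toxicity score.
    annotations: list of (less_toxic, more_toxic) pairs, one per judgement.
    """
    agreed = sum(scores[less] < scores[more] for less, more in annotations)
    return agreed / len(annotations)


first, second = "you're so stupid!", "shut up!"

# Two annotators found the first sentence more toxic, one found the second more toxic.
annotations = [(second, first), (second, first), (first, second)]

# Scoring the first sentence higher agrees with two of the three annotators.
print(average_agreement({first: 0.2, second: 0.1}, annotations))

# Scoring the second sentence higher agrees with only one of them.
print(average_agreement({first: 0.1, second: 0.2}, annotations))
```

Only the ordering of the two scores matters here: any pair of scores that ranks the sentences the same way gives the same agreement.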

This subtitle overfits

By inspecting the most popular public notebooks, you could notice a couple of red flags. Some of them contained really arbitrary scalings such as

for i in range(801, 1200):
    df_test['score'][i] = df_test['score'][i] * 1.34

They evidently came up with some magic numbers which happened to score high on the public leaderboard (approx. 5% of the final test set) without regard to generalisation. Overfitting doesn’t get much more overt than this.

But even many of the public notebooks without these magic scalings were overfitting. You could conclude this by forking them and calculating their validation scores, which were much lower than their public leaderboard scores. This kind of overfitting is harder to spot — if you want to detect it, cross-validation is a must.

Keep it simple

Based on the above, I had a strong feeling that most competitors were heavily overfitting to the public leaderboard, probably without even realising it. So I figured I’d try the following:

  • make a simple solution
  • check that its validation score beats that of the overfitted public notebooks
  • submit

This meant ignoring the public leaderboard, which is never easy.

Model-wise, I tried out Detoxify, an open-source pre-trained model which, given some texts, predicts a numeric score for each of the labels toxic, severe_toxic, obscene, threat, insult and identity_hate. I then found the linear combination of these scores which maximised my local validation score. That’s it, that was the whole solution.
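
To make the idea concrete, here is a rough sketch of searching for such a linear combination. Everything in it is an assumption for illustration: the per-label scores are made-up random numbers rather than real Detoxify outputs, the label names are simply the ones listed above (the keys a given Detoxify version returns may differ), and random search is just one simple way to pick weights, not necessarily the one I used.

```python
import random

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def combine(weights, features):
    """Weighted sum of one text's per-label scores."""
    return sum(w * features[label] for w, label in zip(weights, LABELS))


def agreement(weights, pairs):
    """Share of (less_toxic, more_toxic) pairs ranked in the annotators' order."""
    hits = sum(combine(weights, less) < combine(weights, more) for less, more in pairs)
    return hits / len(pairs)


def random_search(pairs, n_trials=500, seed=0):
    """Try random non-negative weights and keep the best-scoring set."""
    rng = random.Random(seed)
    best_weights = [1.0] * len(LABELS)  # baseline: plain sum of the label scores
    best_score = agreement(best_weights, pairs)
    for _ in range(n_trials):
        candidate = [rng.random() for _ in LABELS]
        score = agreement(candidate, pairs)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score


def fake_features(rng, low, high):
    """Made-up stand-in for the per-label scores of one text."""
    return {label: rng.uniform(low, high) for label in LABELS}


# Synthetic validation pairs: the "more toxic" text tends to score a bit higher.
data_rng = random.Random(42)
validation_pairs = [
    (fake_features(data_rng, 0.0, 0.8), fake_features(data_rng, 0.2, 1.0))
    for _ in range(200)
]

weights, score = random_search(validation_pairs)
print(dict(zip(LABELS, weights)))
print(score)
```

With only six weights to tune there isn't much capacity to memorise the validation set, which is part of the appeal of keeping the model this simple.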


My model:

  • validation score: 0.691
  • public leaderboard: 0.753

By contrast, a very highly-upvoted public notebook achieved:

  • validation score: 0.672
  • public leaderboard: 0.873

Bingo. As long as my validation score was higher, I wasn’t bothered by being low on the public leaderboard. This ended up paying off — when the private leaderboard was revealed, my model scored 0.801, while the aforementioned public notebook only scored 0.762.

The takeaway, as with all competitions, is simple: trust your CV.

Want more?

Kaggle’s great, but what if you want to apply your Data Science skills in practice in a truly awesome organisation? If you’re based in the UK, the good news is that Gousto’s hiring! Message me, or apply directly at




Gousto Engineering & Data Blog

Marco Gorelli

Data Scientist, pandas maintainer, Kaggle competitions expert, Univ. of Oxford MSc
