Photo by Clay Banks on Unsplash

Machine Learning Competitions Are Unfair

And what can be done about it

Adam Cohn
Aug 30, 2020

Participating in Data Science competitions is one of the main ways Data Scientists can get experience working with “real world” datasets without working professionally in the field. Winning a competition is a huge achievement and can be the ticket to lucrative offers from top AI companies. Starting with competitions is often recommended for entry-level Data Scientists and students to build experience with the craft of Machine Learning.

Kaggle is the most popular and well known of these platforms, and since launching in 2010 it has exploded in popularity alongside the ML and Data Science boom. It currently boasts over 3 million registered users, 50,000 public datasets and 400,000 public notebooks.

The basic format of a Data Science competition, on Kaggle or elsewhere, is pretty standardized: an organization with a business problem publishes a relevant dataset, a metric to judge submissions, and a competition deadline. Data Scientists compete with each other to build models and get the highest score. At the end of the competition, the top submissions (usually the top 3) win a share of the prize money, while everyone else who competed gets nothing.

For reference, Kaggle competitions routinely attract 1000+ teams, meaning thousands of people can be competing for any given prize, including a large number of PhDs and other advanced practitioners. At the end, the winning model is handed over to the competition organizer in exchange for the prize money, which can range from $100 to $1 million. The competitions routinely run for several months and it’s only the final score that matters.

If this strikes you as unfair to all of the Data Scientists who invest considerable time and effort into competitions but walk away empty-handed, you wouldn’t be off base. Let’s do a thought experiment and consider the benefits to companies of using Kaggle, as opposed to developing a solution in-house:

Data Scientist salaries typically range from $80K to $150K a year, and most competitions run for 3 months. Contracting 1000 Data Scientists for the same work at the midpoint yearly salary of $115K would end up costing:

1000 x $115,000 x 1/4 = $28,750,000

That’s nearly $30 million in salary costs alone, and where to find 1000 Data Scientists in the first place is a different headache. By going through Kaggle instead, a company can pay a fraction of that money and get a custom-developed, state-of-the-art solution, all while risking only one year’s salary of an average Data Scientist.
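The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the article's own arithmetic: the headcount, salary, and duration come from the text, and the prize is assumed to equal one year of the $115K salary, as in the comparison above. The function and variable names are made up for illustration.

```python
# Figures from the thought experiment above.
NUM_DATA_SCIENTISTS = 1000
YEARLY_SALARY = 115_000      # midpoint of the $80K-$150K range
COMPETITION_MONTHS = 3

# Assumption: the prize equals one year's salary of an average Data Scientist.
PRIZE_MONEY = 115_000


def in_house_salary_cost(headcount, yearly_salary, months):
    """Salary cost of paying `headcount` people for `months` months."""
    return headcount * yearly_salary * months / 12


cost = in_house_salary_cost(NUM_DATA_SCIENTISTS, YEARLY_SALARY, COMPETITION_MONTHS)
prize_share = PRIZE_MONEY / cost

print(f"In-house salary cost: ${cost:,.0f}")          # $28,750,000
print(f"Prize as share of that cost: {prize_share:.1%}")  # 0.4%
```

Under these assumptions the prize comes to well under one percent of what the same labor would cost at market salaries.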

What do participants get if they don’t win? Some experience working with neatly preprocessed datasets, a chance to try out or show off new techniques and skills, access to data they wouldn’t otherwise have, and… that’s pretty much it.

Considering the time and effort put in by competitors compared to the benefit reaped by the company, this goes pretty strongly against most people’s idea of fairness. When you also factor in that Kaggle, like other competition platforms, is a for-profit corporation that makes its money from helping the companies implement the winning solution, the unfairness seems even more on-the-nose (Kaggle was bought by Google in 2017).

It’s true that participating in a Data Science competition is better than doing nothing if you’re just starting out in the field. It can also be useful for keeping skills sharp while in a non-technical position, or if you just enjoy Data Science.

But the reality is that any way you cut it, the basic structure of these competitions is not one that is built around compensating Data Scientists fairly for the effort or value they contribute and instead relies on other kinds of rewards, such as prestige, status, and access to data. These are all nice to have but are not great for teaching Data Scientists the craft of turning ML code into production-ready solutions.

Although there are competing, but less popular, platforms out there like DrivenData, CodaLab, or CrowdANALYTIX, they all have the same incentive structure, and the only benefit is that there is less competition for first place.

One platform stands out with a different approach, one that aims to distribute rewards among participants more fairly. Telesto.ai is a competition platform that also runs ML competitions for companies, but with key differences:

  1. the top model gets crowned each week for the duration of the competition, not just at the end.
  2. each week’s winning model gets deployed on the platform until the following week’s winner is announced and deployed.

This means that many more participants stand a chance of seeing some kind of money from their participation, and have the opportunity for their model to be tested in a production environment, something most Kagglers can only dream of. Although only in their Alpha launch, they already have a competition running on classifying images of cells as being infected with COVID-19, and the leading model is deployed on their platform.
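To make the difference between the two reward schemes concrete, here is a minimal sketch in Python. It is not Telesto.ai's actual implementation: the submission format, team names, scores, and the rule that the first of two tied scores wins are all assumptions made for illustration.

```python
def weekly_winners(submissions):
    """Return {week: (team, score)} holding the best-scoring entry of each week.

    `submissions` is a list of (week, team, score) tuples; higher scores are better.
    On a tie, the earlier submission in the list keeps the title (an assumption).
    """
    best = {}
    for week, team, score in submissions:
        if week not in best or score > best[week][1]:
            best[week] = (team, score)
    return dict(sorted(best.items()))


# Hypothetical three-week competition.
submissions = [
    (1, "team_a", 0.81), (1, "team_b", 0.78),
    (2, "team_b", 0.84), (2, "team_c", 0.83),
    (3, "team_c", 0.88), (3, "team_a", 0.86),
]

winners = weekly_winners(submissions)

# Weekly format: every week's winner is rewarded and deployed.
rewarded_weekly = {team for team, _ in winners.values()}

# End-only format: only the last week's leader is rewarded.
rewarded_at_end = {winners[max(winners)][0]}

print(winners)
print("Rewarded under weekly format:", sorted(rewarded_weekly))
print("Rewarded under end-only format:", sorted(rewarded_at_end))
```

In this toy data, three different teams are rewarded under the weekly format, while only one would be under the usual winner-takes-all format.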

The other big benefit of this iterative process is that any shortcomings of a deployed model in a production environment will quickly become apparent. Smart and disciplined competitors can also sample the deployed model to find its blind spots and figure out ways to best it.

This approach is a positive evolution for the world of competitive Data Science. It shifts the focus away from narrow optimization of a single metric to actual production performance, something that a lot of highly engineered Kaggle models struggle with. It also makes the competition format more engaging and dynamic, and drives higher engagement throughout the competition lifecycle, since winning or losing isn’t only decided at the end but throughout the competition duration.

In an ideal world competition platforms should also enable any participant to deploy their models to production as an API endpoint regardless of whether they win or not. I think this would give a huge boost to engagement and participation. Obviously the winning model — or models in Telesto.ai’s case — gets the reward, but having the ability to integrate a production-ready model in a web app or on their personal page would be a real achievement, especially for those entry-level Data Scientists that most need to showcase their work.
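As a rough illustration of what "deploying a model as an API endpoint" could look like for a participant, here is a minimal sketch that uses only Python's standard library. It does not reflect how any competition platform works: the `/predict` path, the JSON format, and the `predict` function (a stand-in that simply averages its inputs) are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features) if features else 0.0


def handle_predict(body: bytes) -> bytes:
    """Turn a JSON request body like {"features": [1, 2, 3]} into a JSON response."""
    payload = json.loads(body)
    result = {"prediction": predict(payload["features"])}
    return json.dumps(result).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = handle_predict(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a missing/invalid "features" field.
            self.send_error(400, "Expected a JSON body with a 'features' list")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Serve predictions on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally, and a POST to `/predict` with a body such as `{"features": [1, 2, 3]}` returns the placeholder prediction as JSON. A real deployment would swap `predict` for an actual trained model and add authentication, input validation, and logging.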

While Kaggle has a public API where it’s possible to interact with hosted notebooks and kernels, the experience is far from smooth and not geared toward production use-cases at all. On that front Telesto.ai shows a lot more promise by providing a how-to guide for uploading a model as a Docker image.

I hope that more platforms take a cue from Telesto and move toward bringing Data Science competitions closer to the real world. That is, after all, where Data Science and Machine Learning bring true value.
