Machine Learning Competitions Are Unfair
Participating in Data Science competitions is one of the main ways Data Scientists can get experience working with “real world” datasets without working professionally in the field. Winning a competition is a huge achievement and can be the ticket to lucrative offers from top AI companies. Starting with competitions is often recommended for entry-level Data Scientists and students to build experience with the craft of Machine Learning.
Kaggle is the most popular and best-known of these platforms, and since launching in 2010 it has exploded in popularity alongside the broader ML and Data Science boom. It currently boasts over 3 million registered users, 50,000 public datasets, and 400,000 public notebooks.
The basic format of a Data Science competition — on Kaggle or elsewhere — is pretty standardized: an organization with a business problem publishes a relevant dataset, a metric to judge submissions, and a competition deadline. Data Scientists compete with each other to build models and get the highest score. The top 3 submissions (usually) at the end of the competition win a share of the prize money, while everyone else who competed gets nothing.
For reference, Kaggle competitions routinely attract 1000+ teams, meaning thousands of people can be competing for any given prize, including a large number of PhDs and other advanced practitioners. At the end, the winning model is handed over to the competition organizer in exchange for the prize money, which can range from $100 to $1 million. The competitions routinely run for several months and it’s only the final score that matters.
If this strikes you as unfair to all of the Data Scientists who invest considerable time and effort into competitions but walk away empty-handed, you wouldn’t be off base. Let’s do a thought experiment and consider the benefits to companies of using Kaggle, as opposed to developing a solution in-house:
The average DS salary ranges from $80K-$150K a year, and most competitions run for 3 months. Contracting 1000 Data Scientists for the same work at the mean yearly salary of $115K would end up costing:
1000 x 115,000 x 1/4 = $28,750,000
That’s nearly $30 million just in salary costs — where to find 1000 Data Scientists in the first place is a different headache. Instead, by going through Kaggle they can pay a fraction of that money and get a custom developed, state-of-the-art solution all while only risking one year’s salary of an average Data Scientist.
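The back-of-the-envelope calculation above can be sketched in a few lines (all numbers are the article's assumptions, not market data):

```python
# Rough cost of hiring 1000 Data Scientists for a 3-month project,
# using the article's assumed mean salary of $115K/year.
n_data_scientists = 1000
mean_salary = 115_000       # USD per year, midpoint of the $80K-$150K range
duration_years = 3 / 12     # a typical 3-month competition

cost = n_data_scientists * mean_salary * duration_years
print(f"${cost:,.0f}")  # → $28,750,000
```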
What do participants get if they don’t win? Some experience working with neatly preprocessed datasets, a chance to try out or show off new techniques and skills, access to data they wouldn’t otherwise see, and… that’s pretty much it.
Considering the time and effort put in by competitors compared to the benefit reaped by the company, this goes pretty strongly against most people’s idea of fairness. When you also factor in that Kaggle, like other competition platforms, is a for-profit corporation that makes its money from helping companies implement the winning solution, the unfairness seems even more blatant (Kaggle was bought by Google in 2017).
It’s true that participating in a Data Science competition is better than doing nothing if you’re just starting out in the field. It can also be useful for keeping skills sharp while in a non-technical position, or if you just enjoy Data Science.
But the reality is that any way you cut it, the basic structure of these competitions is not one that is built around compensating Data Scientists fairly for the effort or value they contribute and instead relies on other kinds of rewards, such as prestige, status, and access to data. These are all nice to have but are not great for teaching Data Scientists the craft of turning ML code into production-ready solutions.
There are competing but less popular platforms out there, such as DrivenData, CodaLab, and CrowdANALYTIX, but they all share the same incentive structure; the only real benefit is that there is less competition for first place.
One platform stands out with a different approach, one that aims to distribute rewards among participants more fairly. Telesto.ai is a competition platform that also runs ML competitions for companies, but with key differences:
- the top model gets crowned each week for the duration of the competition, not just at the end.
- each week’s winning model gets deployed on the platform until the following week’s winner is announced and deployed.
This means that many more participants stand a chance of seeing some kind of money from their participation, and have the opportunity for their model to be tested in a production environment, something most Kagglers can only dream of. Although the platform is only in its Alpha launch, it already has a competition running on classifying images of cells as infected with COVID-19, and the leading model is deployed on the platform.
The other big benefit of this iterative process is that any shortcomings of a deployed model in a production environment will quickly become apparent. Smart and disciplined competitors can also sample the deployed model to find its blind spots and figure out ways to best it.
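A minimal sketch of that sampling idea: query the deployed model on systematically perturbed inputs and record which perturbations flip its prediction. Here `query_model` is a hypothetical stand-in for an HTTP call to the deployed endpoint, not any platform's actual API:

```python
def query_model(features):
    # Stand-in for a request to the deployed model's endpoint.
    # This toy "model" only looks at the first feature -- a blind spot
    # that the probing loop below should uncover.
    return "infected" if features[0] > 0.5 else "healthy"

def find_blind_spots(base, delta):
    """Return indices of features whose perturbation flips the prediction."""
    baseline = query_model(base)
    flips = []
    for i in range(len(base)):
        probe = list(base)
        probe[i] += delta
        if query_model(probe) != baseline:
            flips.append(i)
    return flips

print(find_blind_spots([0.6, 0.6], -0.2))  # → [0]: only feature 0 matters
```

The empty result for feature 1 tells a competitor the deployed model ignores it, which is exactly the kind of weakness a challenger can exploit.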
This approach is a positive evolution for the world of competitive Data Science. It shifts the focus away from narrow optimization of a single metric to actual production performance, something that a lot of highly engineered Kaggle models struggle with. It also makes the competition format more dynamic and drives higher engagement throughout the competition’s lifecycle, since winning or losing isn’t decided only at the end.
In an ideal world, competition platforms would enable any participant to deploy their model to production as an API endpoint, regardless of whether they win. I think this would give a huge boost to engagement and participation. Obviously the winning model — or models, in Telesto.ai’s case — gets the reward, but being able to integrate a production-ready model into a web app or a personal page would be a real achievement, especially for the entry-level Data Scientists who most need to showcase their work.
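As a sketch of what such a participant-deployed endpoint could look like, here is a minimal JSON prediction server using only Python's standard library. The `predict` function is a hypothetical stand-in for a competitor's trained model, not any platform's actual interface:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Toy stand-in for a trained classifier: average the features
    # and threshold. A real deployment would load a serialized model.
    score = sum(features) / len(features)
    return {"label": "infected" if score > 0.5 else "healthy", "score": score}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, return JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve locally:
# HTTPServer(("", 8000), PredictHandler).serve_forever()
```

A participant could then POST `{"features": [0.9, 0.8]}` to the endpoint and embed the responses in a portfolio app.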
While Kaggle has a public API where it’s possible to interact with hosted notebooks and kernels, the experience is far from smooth and not geared toward production use-cases at all. On that front Telesto.ai shows a lot more promise by providing a how-to guide for uploading your model as a Docker image.
I hope that more platforms take a cue from Telesto and move toward bringing Data Science competitions closer to the real world. That is, after all, where Data Science and Machine Learning bring true value.