How to scale the machine learning community to 1 Million researchers

Mostapha Benhenda
Published in The AI Lab
4 min read · Jul 13, 2018

The machine learning community is getting huge. Online courses are training a new generation of data scientists. For example, 1.8 million people have enrolled in Andrew Ng’s Machine Learning class on Coursera since 2011. Nvidia is training 100,000 developers in deep learning per year.

The number of AI publications is also growing rapidly.

However, the research community is not prepared for these newcomers. The main platform for research papers is Arxiv, a plain repository of PDF files. Arxiv is as old as the web, dating back to 1991. It hasn’t changed much since then, while the rest of the Internet has changed a lot in order to scale to today’s 4 billion users.

Arxiv in 1994
Arxiv in 2014

Arxiv seems archaic today, but it was a revolution back in 1991. Before Arxiv, things were even worse: communication depended on peer-reviewed journals and conferences, and the editorial process took several months before a result became available. Arxiv democratized and accelerated research. Nowadays, by the time an article appears in a peer-reviewed venue, it is sometimes already outdated, improved upon by a newer preprint on Arxiv.

The Arxiv server disrupted peer-reviewed journals a quarter of a century ago

The next big thing will probably democratize and accelerate research even further. I think the next leap will incorporate ideas from massive collaborative writing, like Wikipedia and GitHub. These kinds of platforms are able to scale to millions of contributors. Many such platforms are already available, like Authorea and Manubot.

Switching to a collaborative writing format widens participation, because micro-contributions become possible. For newcomers, the barrier to entry on Arxiv is high, because writing a paper end-to-end is a hassle. In a system where papers can be edited and forked, on the other hand, it becomes easier to build upon them. The iteration cycle is shorter, and idea dissemination is accelerated.

The iteration cycle is shorter with open collaborative writing

A wiki/GitHub format can also make science more cumulative. Currently, many papers ignore the previous literature (more or less deliberately), which slows down progress. For example, I wrote about papers that did so and still got published at the peer-reviewed conferences ICLR 2018 and ICML 2018. It is harder to ‘forget’ citations in a more open system.

Moreover, in this new system, it is possible to avoid “idea stealing” within the (potentially large) community of co-authors by using public, timestamped communication channels (GitHub, messaging apps like Telegram…). Collaboration becomes possible without the need for mutual trust. These communication records can be used to attribute credit individually and to settle priority claims.
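To make the idea concrete, here is a minimal sketch (not from the article) of what such a timestamped record could look like in Python. The contribution_record helper and its fields are hypothetical; the point is simply that a hash plus a timestamp, posted on a public channel, is enough to later prove who wrote what and when.

```python
# A minimal sketch, assuming a contributor wants a public, timestamped record
# of a draft contribution; the helper and its fields are hypothetical.
import hashlib
from datetime import datetime, timezone

def contribution_record(text: str, author: str) -> dict:
    """Fingerprint the exact text of a contribution and attach a UTC timestamp."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "author": author,
        "sha256": digest,  # identifies the exact wording without revealing it
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Post the record to a public channel (a GitHub issue, a Telegram group, ...),
# so that a priority claim can later be checked against it.
print(contribution_record("Proof sketch of the main lemma ...", "alice"))
```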

However, a question remains: if collaborative writing is so great, why isn’t it adopted more broadly? For example, Authorea has been around since 2012, but it is still small. The technology is here, so my explanation is that the incentives for adoption are not strong enough. This matters because the research community is conservative on this issue (although less so in machine learning than in other disciplines: the Arxiv for biology, Biorxiv, launched only in 2013, and the Arxiv for chemistry, Chemrxiv, only in 2017).

In the current funding system, research grants are awarded to lab teams, and funds are then distributed to individuals. Collaboration outside the team is not encouraged, because of the tough competition between teams.

Competition for funding does not facilitate wider cooperation or the absorption of newcomers

However, this situation can change if research sponsors (agencies, foundations, private companies…) set different rules of the game. For example, for a collaborative article, sponsors can nominate a jury of experts who split the money among individuals, proportionally to their contributions and irrespective of affiliations.
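As a toy illustration of this proportional split, here is a short Python sketch; the split_grant function, the scores, and the names are made up for the example, and in practice the scores would come from the jury.

```python
# A minimal sketch, assuming a jury has already assigned contribution scores;
# the function, names, and numbers are hypothetical.
def split_grant(total, scores):
    """Split `total` among contributors proportionally to their jury scores."""
    total_score = sum(scores.values())
    return {name: total * score / total_score for name, score in scores.items()}

# Example: a 10,000 grant split among three contributors from different labs.
print(split_grant(10_000, {"alice": 5, "bob": 3, "carol": 2}))
# -> {'alice': 5000.0, 'bob': 3000.0, 'carol': 2000.0}
```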

In conclusion, the growth of the ML research population will transform the way research is done. Collaborative formats will gain traction, accelerated by research sponsors (public or private). If you want to experiment with these formats for your research projects, get in touch.
