Solving the AI Data Divide


Cowritten with Nick White

There is an unspoken threat to innovation in AI.

AI is fueled by three principal ingredients: computing power, learning algorithms, and data. Most media coverage revolves around the massive server farms of companies like Microsoft and Amazon, and the latest groundbreaking algorithms discovered in AI research labs like Google DeepMind. Meanwhile, the third ingredient in the recipe for effective AI, data, is hardly mentioned unless it’s embroiled in some public controversy.

Democratization of two ingredients has accelerated innovation in AI. Of the three ingredients, computational resources and algorithms have radically opened up in the last decade. Services like AWS and Microsoft Azure have made access to computing power far simpler and cheaper than ever before. At the same time, it has become common for top AI research groups to release their latest papers publicly along with their source code, making the cutting edge of AI algorithms available to all. One research lab, OpenAI, was founded explicitly for this purpose. These two changes have made AI development easier than it has ever been, and they are why AI leaders like Antoine Blondeau have recently stated, “The ability to impact the world is much higher now than it was 30 years ago.”

Data, the last ingredient, is a different story. While there is a steadily growing number of public datasets available, many of which have been generously donated by the likes of Microsoft, Google and Quora, democratization of data has not taken place. Most of the world’s data sits on the servers of big tech companies, large corporations, or government organizations. Most corporations have little incentive to share their data, since they benefit from the competitive advantage of maintaining their data monopolies. And because of legal issues around privacy and copyright, most of this data cannot be released publicly.

Currently, there are two ways to gain entrance to the data vaults of large corporations and organizations. The first is to obtain a proof of concept (POC) contract, whereby a startup is given limited access to a potential client’s data in order to demonstrate the viability of its product. Such a contract can take an enormous amount of time and effort to secure. The other option is to pay for the data. Some datasets are relatively cheap, but most of the powerful datasets are extremely expensive, meaning only large companies can realistically afford them. For example, Uber recently paid $500M for better maps data.

As a result, most of the data-poor AI community resorts to scrounging together what little private data it can afford or whatever public data it can find, often going to great lengths to organize that data into a usable dataset. A large amount of effort is duplicated as smaller companies and research organizations each independently try to wrangle together the data they need to test their ideas. These data silos and the fragmentation of public data create the Data Divide that hinders AI innovation.

The Data Divide creates a compounding effect in talent. As the Data Divide widens, there will be more incentive for the best and brightest in the community to join companies that have data, thus widening the divide further. Innovation within these organizations will outstrip that of the wider AI community, and any innovations made on the outside could be copied and applied more effectively by these companies because of their larger data resources.

Put simply, this Data Divide is the greatest threat to innovation in AI.

There may be no way to convince the big tech companies to reduce the Data Divide of their own volition, but it would be extremely powerful if the rest of the AI community came together to address this problem and take steps towards overcoming it. There are several ways in which we can begin to move forward and lessen the divide from the outside.

First, we can begin to collaborate to form and share datasets publicly in a more organized way. We can establish standards for certain types of datasets so that data contributed by multiple organizations can be stitched together into a larger and more robust dataset. Perhaps such a collaborative data platform could be open-sourced, allowing AI enthusiasts around the world to work together to build a shared public database. It could be the database equivalent of Wikipedia.
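
To make the idea of a shared standard a little more concrete, here is a minimal sketch in Python of how records contributed by several organizations might be checked against a common schema and stitched into one dataset. Everything in it is an illustrative assumption rather than a description of any existing platform: the field names in REQUIRED_FIELDS, the Contribution class, and the rule that the first contributor of a given id wins are all hypothetical choices.

```python
from dataclasses import dataclass

# Hypothetical shared standard: every contributed record must carry
# these fields with these types. The field names are illustrative only.
REQUIRED_FIELDS = {"id": str, "text": str, "label": str}


@dataclass
class Contribution:
    source: str    # name of the contributing organization (illustrative)
    records: list  # list of dict records supplied by that organization


def conforms(record):
    """Return True if the record has every required field with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def stitch(contributions):
    """Merge conforming records from all contributors, skipping duplicate ids.

    Returns (merged_records, rejected), where rejected is a list of
    (source, record) pairs that did not meet the shared standard.
    """
    merged = {}
    rejected = []
    for contribution in contributions:
        for record in contribution.records:
            if not conforms(record):
                rejected.append((contribution.source, record))
                continue
            # Assumed policy: the first contributor of an id wins,
            # and the record is tagged with where it came from.
            merged.setdefault(
                record["id"], {**record, "source": contribution.source}
            )
    return list(merged.values()), rejected
```

A real platform would need far more than this, such as licensing metadata, privacy checks, and versioning, but the sketch shows why an agreed format matters: without it, contributions from different organizations cannot be combined or deduplicated automatically.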

Second, we can approach data-rich corporations that lack the technical expertise to develop their own AI, and develop a platform that would streamline the process of securing a proof of concept contract. This way, entrepreneurs and researchers can gain access to the datasets they need to prove their ideas, and these companies gain access to the technical expertise they need to learn how to use their data to improve their business.

We are committed to solving this problem at Zeroth. We want to open up the discussion and get opinions and feedback from the community. We want to develop a basic version of this platform and test it with the companies in our cohort, as well as companies in our network, to see how it works. Eventually we will open up the platform to the wider community, so it can be shared by everyone.

If you are interested in this idea or have any suggestions, please reach out to me at tak@zeroth.ai. We look forward to creating a democratic future for data.

Thanks Kailash Ahirwar, James Boyden, Bofu Chen, Adam Huang, Tom Murtagh, Rahul Vishwakarma, and Tammy Yang for contributions to this