AI’s “Data Gap”

Read my lips: 17,428–51=OMG!

Kenneth Cukier
the self-driving company
Jan 22, 2017

TO GET a sense of the huge difference in artificial intelligence abilities between the best and the rest, consider this: the data set that a Google DeepMind and Oxford University program used to train its lip-reading system contained 17,428 unique words, while a rival Oxford project called LipNet had just 51.

The two initiatives released papers just a week apart in November 2016. But they were worlds apart: a matchstick compared to a massive, white-hot industrial furnace. Industry and academia had better get used to this. The same benefits of scale seen in talent, income, salaries and computing power exist in data too.

Artificial intelligence relies on data to train its algorithms, and the larger, better-funded players have an incredible advantage. Often, they lead by tapping the information flowing across their platforms. Google knows what people are interested in; Facebook knows who they know; Amazon knows what they want to buy; and so on.

Yet they also have a big advantage in being able to tailor raw data into usable training data. That was the case with Oxford’s and DeepMind’s lip-reading project. They tapped 5,000 hours of BBC videos, but because the audio and video were sometimes out of sync by as much as a second, they first needed to realign them. That took incredible know-how, time and resources. Few other than the majors are able to make that investment.

The solution, of course, is openly available data sets and consortia. Such was the case with ImageNet for image recognition. And to their credit, Oxford and DeepMind are making the synced-up BBC video data openly available to others, notably the rival LipNet researchers.

But such corporate generosity may not last. So universities, startups, venture-capital funds and organizations like OpenAI need to make the case to the first-party organizations that hold the actual data (groups like the BBC, hospital systems, etc.) that the primary resource of AI needs to be shared on an open basis.

If not, some entities will prosper while others find it difficult to compete. Antitrust regulators have already taken notice. Data, as a raw material, will become a barrier to entry for markets, and we will have a less competitive, less vibrant AI sector in the long term. That holds even as we sing the praises of the majors’ incredible accomplishments.
