Data: The new liquid gold

The AI revolution is causing a paradigm shift in what used to be the barrier to entry for companies of the internet era. With machine learning algorithms open-sourced and easily available through various optimized frameworks (TensorFlow, Caffe, Theano, Keras, Torch, etc.), and with powerful hardware abundantly available at the edge (embedded on-device GPU SoCs) or in the cloud (NVIDIA Grid, GPU instances on AWS, Azure or Google Cloud Platform), companies in an AI-driven world are shifting their focus from protecting hardware and algorithms to protecting the ownership of data. A company aspiring to Artificial Intelligence gains leverage over others only through its unique access to high-quality labelled data.

This is because the best-performing of the currently available approaches to Artificial Intelligence, Deep Learning, relies on training a model on millions of annotated input-output pairs. Training proceeds iteratively until the model learns to predict the right output for a given set of inputs. In most cases, the larger the training data set, the easier and quicker the convergence to an accurately performing model. Algorithmic innovations that reduce the dependency on huge data sets may eventually catch up and achieve equally good results with less training data, but a fast time to market today still requires a very large training set.
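To make the idea concrete, here is a minimal sketch of that supervised training loop in Keras, one of the frameworks mentioned above. The data is a synthetic stand-in, not a real labelled corpus; in practice X and y would be millions of annotated input-output pairs.

```python
# A minimal sketch of the supervised training loop described above, using Keras.
import numpy as np
from tensorflow import keras

# Stand-in "annotated data set": 10,000 samples, 20 features, binary labels.
X = np.random.rand(10000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Iterative training: each epoch nudges the weights until predictions
# converge on the labels. More (and better) data generally means faster,
# more reliable convergence.
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.1)
```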

This becomes clearer if we look at the first companies to deliver meaningful Artificial Intelligence: Google, Apple, Facebook, Uber and Tesla.

Gmail offered 1 GB of free storage at its launch in 2004, which seemed crazy compared to the mere 2 megabytes offered by its competitor, Hotmail. It doesn’t seem so crazy today. The crisp spam classifier Gmail has today would not have been possible without the huge training set provided by more than a billion active users. Billions of search queries helped Google train auto-complete on search results. It is almost as if each software update or new feature was introduced as a mechanism to collect more data on user behavior, which then laid the foundation for the next, better service update. The Google search engine, Gmail and Maps paved the way for Google Now; "OK Google" expanded the voice-to-transcript data set and opened the doors to even better voice assistance in the form of Google Home.

By relying so heavily on Google in our daily lives, we have awarded it a privileged position to collect data about our likes and dislikes, our habitual behavior (the places we frequent, what we read, what we are curious about), our schedules, our circles, the highways we take and plenty more. Google has evolved into a personal assistant who, while wanting to help you in the most seamless of ways, has learnt everything about you.

Having a dedicated app for each of your needs that serves you when asked, which was the revolution of the past decade, is just not enough today. Digital users of this age want more: a digital assistant with the intelligence to serve and anticipate your needs without being asked. This is no easy feat to achieve by conventional methods, and that is where Deep Learning has stepped in. Building something of this magnitude is possible only if user-behavior data collected over days, weeks, months or perhaps years is available to train an accurate prediction model. That is exactly why this space is unbreachable for a startup, and why Google had to be the first to crack it.
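As a toy illustration of what "anticipating needs" from behavior data can mean at its simplest, consider a hypothetical per-user log of (hour, action) events; even a plain frequency count already yields a crude predictor. Everything below is illustrative, not Google's actual system.

```python
# Toy sketch: predict a user's next action from a behavior log.
from collections import Counter, defaultdict

# Hypothetical behavior log for one user: (hour_of_day, action) pairs.
log = [(8, "check_mail"), (9, "maps_commute"), (13, "search_lunch"),
       (8, "check_mail"), (9, "maps_commute"), (13, "search_lunch"),
       (8, "check_mail"), (9, "maps_commute")]

# Simplest possible "model": per hour, the action the user takes most often.
by_hour = defaultdict(Counter)
for hour, action in log:
    by_hour[hour][action] += 1

def predict(hour):
    """Anticipate the likely need before the user asks."""
    counts = by_hour.get(hour)
    return counts.most_common(1)[0][0] if counts else None

print(predict(9))  # -> "maps_commute": surface directions at 9 a.m.
```

A real assistant replaces the frequency count with a learned model over months of such events, which is precisely why the data, not the algorithm, is the moat.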

It is going to be tough for a new player to win the credibility and trust among users that Google has earned over the last decade, and hence to ever enjoy a similar privilege of penetrating our personal lives to collect our data. Access to good-quality annotated data will be the biggest challenge for any startup in the AI space, because machine learning algorithms by themselves can do only so much and are most often limited by the scale and quality of the training data fed to them. That is probably why Ray Kurzweil, one of the champions of Machine Learning and Artificial Intelligence, an entrepreneur and thought leader, chose to join Google rather than set up an independent company: his mission to build a truly intelligent artificial brain is plausible only if backed by Google-scale data.

Other giants enjoying such a coveted data-collecting position in our lives, and using it to make the most of the AI revolution, are Apple, Facebook, Uber and Tesla. With each software update to my iPhone or the Facebook app, it is always fun to see the terms of agreement seek a few more permissions. It always makes me wonder what they could be rolling out next to make our lives better; a pop-up on my iPhone last month was a good example.

It always excites me to see how these companies turn on new data-collection tap points. For example, voice-to-transcript conversion has long been a challenging task to accomplish. With one of the iOS updates last year, we saw the Transcription Beta on iPhones, which transcribes voicemails, most likely in the service of training voice-to-transcript neural networks.

Getting an annotated training data set would have been a herculean task (though a company like Apple could simply pay for it), but adding the prompt "Was this transcription useful or not useful?" looks like an attempt to collect a validation data set from the users themselves.
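Here is a rough sketch of how such one-tap feedback could be turned into labelled validation examples. All names are hypothetical, not Apple's actual schema.

```python
# Sketch: user feedback ("useful" / "not useful") becomes a label attached
# to each (audio, transcript) pair, yielding a validation set almost for free.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TranscriptionExample:
    audio_id: str            # reference to the stored voicemail audio
    transcript: str          # what the model produced
    user_found_useful: bool  # the one-tap answer to the prompt

def to_validation_set(feedback_log: List[TranscriptionExample]) -> List[Tuple[str, str, int]]:
    """Keep only examples the user judged, labelled by their verdict."""
    return [
        (ex.audio_id, ex.transcript, 1 if ex.user_found_useful else 0)
        for ex in feedback_log
    ]
```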

The same probably applies to Facebook’s facial recognition algorithm: they were bound to beat everyone else to it because, after all, they had the biggest data set of tagged, labelled pictures.

True self-driving cars are yet to see the light of day, but I would not be surprised to see Uber win the race to market: with millions of drivers already on the road all over the world, it could train its machines much faster, and in far more diverse conditions.

Tesla started shipping all its cars with hardware support for autonomous driving last year, without turning the feature on. Rolling out the hardware before enabling the feature lets Tesla collect miles of annotated training data for self-driving: a training set pairing the driver’s steering, braking and accelerating behavior with the input conditions gathered by the cars’ cameras, radar and other sensors.
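A sketch of the kind of record such a fleet could log follows; the field names are illustrative, not Tesla's actual schema. Each frame pairs sensor inputs with the human driver's actions, which serve as free ground-truth labels.

```python
# Sketch of a fleet-logged training record: sensor inputs paired with the
# human driver's actions, the raw material for supervised self-driving models.
from dataclasses import dataclass
from typing import List

@dataclass
class DrivingFrame:
    camera_frames: List[bytes]   # images from the car's cameras
    radar_returns: List[float]   # ranges/velocities from the radar
    speed_mps: float             # current vehicle speed
    # Ground-truth labels: what the human driver actually did.
    steering_angle: float        # provided "for free" by the driver
    brake_pressure: float
    throttle: float

# Training then reduces to supervised learning: predict the driver's
# (steering_angle, brake_pressure, throttle) from the sensor inputs.
```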

While the established giants with access to the biggest data sets, or the capability to acquire the best ones, will be the first to achieve Artificial Intelligence successfully, it is exciting to see new challenges emerge for startups to solve. The biggest of these is producing high-quality labelled data in large quantities, and some candidates have already sprung up to grab the opportunity.

CrowdAI enables high-quality image annotation. Lattice Data (which Apple just acquired for $200M) works on turning unstructured data into structured data that machine learning algorithms can consume. Amazon Mechanical Turk attacks the labeling problem with brute human force, and Nexla, which offers data operations as a service, just announced $3.5M in funding.

The time is ripe for businesses that have been collecting and storing big data to finally monetize it because, as Andrew Ng puts it, it is not those who have the best algorithm who will win, but those who have the best data.