The Fuel for Artificial Intelligence: Data

NISHESH AWALE
3 min read · Dec 9, 2018


“It is not so important which machine learning algorithm you use; what’s important is how much data you have.”

A lot has been written about the advancement of Artificial Intelligence (AI) and its applications in the past few years, but one element that is often underemphasized is the importance of data in allowing AI to function. One of the most notable applications of AI is the self-driving car. Building a self-driving car requires a huge amount of data, ranging from infrared sensor signals and digital camera images to high-resolution maps. NVIDIA estimates that a single self-driving car generates 1 TB of raw data per hour. All that data feeds the development of the AI models that actually drive the car.

Machine learning (ML) is a sub-field of AI. In machine learning, the models are the engine and data is the oil. An average model trained on a huge amount of data will vastly outperform a great model trained on a small amount of data. Like humans, AI improves with experience: more training examples from the real world help an AI system make correct predictions. For example, the availability of data through ImageNet transformed computers’ ability to understand images and pushed them toward human-level performance.

We can think of an AI application as a three-legged stool. The first leg is the algorithm itself. The second leg is computing power, both raw processing power and large-scale data storage. The last leg is data. So, for any company building AI products, it is worth considering how much data it has. Google is one of the best AI and ML companies in the world. Why? Peter Norvig, Research Director at Google, famously stated: “We don’t have better algorithms. We just have more data.”

Now consider an analogy: to create the perfect meal, it helps to know the tastes of your diners. Similarly, data is essential for tailoring an AI model to the needs of specific users. For instance, by knowing what types of products people buy and how expensive those products are, Amazon recommends similar products to its users. Furthermore, techniques such as collaborative filtering, which makes recommendations based on the similarity between users, improve with access to more data: the more user data the system has, the more likely it is that the algorithm finds a genuinely similar user, as the sketch below illustrates.
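Here is a minimal sketch of user-based collaborative filtering with cosine similarity. The tiny ratings matrix and the single-neighbor strategy are assumptions made purely for illustration; a production recommender works with far more users and items, which is exactly why more data helps.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not yet rated".
# These ratings are invented purely for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar_user(ratings, user):
    """Index of the user whose ratings are closest to `user`'s."""
    best, best_sim = None, -1.0
    for other in range(len(ratings)):
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        if sim > best_sim:
            best, best_sim = other, sim
    return best

def recommend(ratings, user):
    """Suggest items `user` hasn't rated, ranked by the neighbor's ratings."""
    neighbor = most_similar_user(ratings, user)
    unseen = np.where(ratings[user] == 0)[0]
    return unseen[np.argsort(-ratings[neighbor][unseen])]

print(recommend(ratings, user=0))  # -> [2], the one item user 0 hasn't rated
```

With only four users, the “most similar user” can be a poor match; as the pool of users grows, the chance of finding a close neighbor, and hence a useful recommendation, increases.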

Having more data certainly does not hurt, but unfortunately it does not always help as much as you might hope. Let’s dive into some technical details to understand why. There are two common cases in which a machine learning model might not perform well. In the first case, you might have a model that is too complex for your training set, so the training error is much lower than the test error. This situation is called high variance, and it leads to overfitting. High-variance problems can be addressed by adding more data to the training set. In the second case, you might have a model that is too simple to explain the data you have. This case is called high bias, and it leads to underfitting. A high-bias problem cannot be fixed by increasing the number of training examples; it is solved by increasing the number of relevant features in the model.
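Here is a minimal sketch of how this diagnosis looks in practice, comparing training and test error as model complexity varies. It assumes scikit-learn is available; the synthetic sine-wave data and the choice of polynomial degrees are illustrative, not anything from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy sine wave, invented for this example.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # High bias: both errors are high. High variance: train error << test error.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Roughly: when both errors are high, the model is too simple (high bias) and more data will not help; when training error sits far below test error, the model is overfitting (high variance) and more data usually narrows the gap.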

In summary, both the quantity and the quality of data matter. Think of machine learning data like survey data: if the sample is not big enough, it will not capture all the variation in the population, and your machine may reach inaccurate conclusions, learn patterns that do not actually exist, or fail to recognize patterns that do. Likewise, an AI system can only perform correctly based on what it has learned from good-quality data. But above all, what we need are good approaches that help us interpret data, models, and the limitations of both in order to produce the best possible output.
