Artificial Intelligence — How Computers Really Learn

Mars Xiang · Published in The Startup · May 9, 2020

Cover image source: Florian Weihmann, Pexels

When research on artificial intelligence began sixty years ago, computers were far too weak to do anything useful with it. However, thanks to Moore's Law, the observation that the computational power of computers grows roughly exponentially over time, modern computers are more than capable of handling the tasks that early artificial intelligence researchers could only hope for. Applications of artificial intelligence are scattered across every industry, from self-driving cars and YouTube recommendations to automated manufacturing. Artificial intelligence does not only shape the present; it is the future of this world.

How Artificial Intelligence Becomes Intelligent

Artificial intelligence is a large category that encompasses many different algorithms for solving different problems. However, in the beginning, none of these algorithms can do more than generate random guesses. The first step towards a model becoming intelligent is a process called machine learning.

The purpose of machine learning is to use existing data to form new hypotheses on unseen data. Machines cannot think on their own, but they can do numerical computation and data processing extremely efficiently.

Linear regression is one of the most common algorithms in machine learning. It attempts to fit a continuous line through the existing data set so as to minimize the distance between the data points and the line, known as the error. This line is also called a line of best fit. The line of best fit does not have to be straight; it could be a polynomial expression, such as x²-2x+1. However, for the sake of this example, it is a straight line.

Using the line of best fit, the model can predict y-values for new x-values. Source: Wikipedia
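To make this concrete, here is a minimal sketch of fitting and using a line of best fit with NumPy; the data points are made up for illustration.

```python
import numpy as np

# Toy data set (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line y = m*x + b that minimizes the squared error
m, b = np.polyfit(x, y, deg=1)

# Generalize: predict the y-value for an unseen x-value
x_new = 6.0
print(f"Predicted y at x = {x_new}: {m * x_new + b:.2f}")
```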

Another way that machines can learn to map inputs to outputs is through a neural network. Inspired by the human brain, a neural network consists of several layers through which information flows in one direction. Each layer passes information to the next, building a higher-level representation of the input it receives. By the end, the model is able to approximate remarkably complex relationships between inputs and outputs.

For example, consider a neural network designed to classify an image as a cat, dog, or human. The first few layers will only be able to make out simple shapes and patterns, like circles or textures. In deeper layers, the model is able to work with more complex information, such as identifying teeth, fur, or legs. By the end, the model predicts one of the three classes.

Simple information from the “input” layer is passed to the “hidden” layer. The “hidden” layer forms a more complex hypothesis and passes it to the “output” layer, which makes the final decision. Source: Wikipedia, CC-BY-SA-3.0, Glosser.ca
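As a rough sketch of how information flows through such layers, here is a minimal forward pass in NumPy. The layer sizes and random weights are illustrative assumptions; a real network would learn its weights from data.

```python
import numpy as np

def relu(z):
    # Simple nonlinearity: negative values become zero
    return np.maximum(0.0, z)

def softmax(z):
    # Turn raw scores into probabilities that sum to one
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 4 input features, 5 hidden units, 3 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)  # input -> hidden
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)  # hidden -> output

x = rng.normal(size=4)     # one input example
h = relu(W1 @ x + b1)      # hidden layer: higher-level features
p = softmax(W2 @ h + b2)   # output layer: one probability per class

print(p)  # e.g. probabilities for "cat", "dog", "human"
```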

Although the principles discussed below apply to both linear regression and neural networks, for the sake of simplicity, I will use a linear regression model in the examples.

Bias and Variance

As its name suggests, a computer is supposed to “learn” while performing machine learning. But machines cannot learn; they can only do math. The way a model finds the best linear regression line is by using calculus to find the minimum, or lowest point, of a function, in this case the error. The linear regression model wants to minimize the error between the line and the data set, so it fits a line that passes very close to each of the data points, making the error very small.
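As a sketch of what that minimization can look like in practice, here is gradient descent on the mean squared error of a straight line. The data, learning rate, and iteration count are illustrative assumptions; a closed-form least-squares solution also exists.

```python
import numpy as np

# Toy data (made up): roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = 0.0, 0.0  # initial guess for slope and intercept
lr = 0.01        # step size (an illustrative choice)

for _ in range(5000):
    error = (m * x + b) - y
    # Partial derivatives of the mean squared error
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step downhill toward the minimum of the error function
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)  # close to the closed-form least-squares line
```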

However, doing calculus does not mean the machine is really thinking. In fact, a machine can run into many problems that a human with common sense could avoid. One of these problems is called overfitting, or high variance. It occurs when there is not enough data, or the model uses high-degree polynomial terms, so the model can fit a line perfectly through the training data but does not generalize to real data well.

An nth degree polynomial can be manipulated to pass through any n+1 points. If we have five data points and we fit a fourth degree polynomial, the model will produce a curve that passes through all five points perfectly, but is not able to generalize well.

A fourth degree polynomial passing through all five points. Source: Desmos

While this polynomial expression does pass through all five points, we instinctively say that it is a rather bad fit, since it shoots upwards near x=1 and x=10 and changes direction between the data points.

A better fit. Source: Desmos
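The same effect can be reproduced numerically. This is a minimal sketch with made-up points, comparing a degree-four interpolating polynomial to a straight line.

```python
import numpy as np

# Five made-up training points following a roughly linear trend
x_train = np.array([1.0, 3.0, 5.0, 7.0, 10.0])
y_train = np.array([1.4, 2.8, 5.3, 6.9, 9.7])

# A fourth degree polynomial passes through all five points exactly,
# so its training error is essentially zero...
quartic = np.polyfit(x_train, y_train, deg=4)
print(np.polyval(quartic, x_train) - y_train)  # residuals near 0

# ...but it strays far from the trend outside the training range
line = np.polyfit(x_train, y_train, deg=1)
print(np.polyval(quartic, 12.0))  # well above the linear trend
print(np.polyval(line, 12.0))     # stays close to the trend
```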

Another common problem that machine learning runs into is underfitting, or bias. This occurs when the underlying answer is complicated, but our model or our data does not carry enough information to find it.

Consider the case where you are trying to use linear regression to map the previous day’s temperature to the current day’s temperature. The data might look something like this:

x-axis: previous day’s temperature; y-axis: current day’s temperature. Source: Desmos

Clearly, there seems to be no correlation, and it is impossible to predict the current day’s temperature given only the previous day’s temperature. There is no way to fit a line that will generalize well on unseen data, because we do not have enough information to predict the current day’s temperature. Temperature depends on a variety of meteorological conditions, and to predict it, we need to gather more information about weather conditions. This illustrates the problem of underfitting, where the information we have is not enough to make an accurate guess.
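Here is a minimal sketch of the same problem with synthetic, uncorrelated data; the temperatures are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: today's temperature is unrelated to yesterday's
prev = rng.uniform(0, 30, size=100)  # previous day's temperature
curr = rng.uniform(0, 30, size=100)  # current day's temperature

# The best possible line barely beats predicting the mean, because
# this single feature carries no information about the target
m, b = np.polyfit(prev, curr, deg=1)
print("line MSE:     ", np.mean((m * prev + b - curr) ** 2))
print("mean-only MSE:", np.var(curr))
```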

Learning Curves

These two issues are very common reasons why artificial intelligence models do not perform as well as we would expect them to. To debug these issues, we use a tool called learning curves. Learning curves help us visualize how well our model is doing relative to the amount of data we give it.

To implement this, we split our data set into two groups: a training set and a test set. We then create a graph, with the x-axis being the number of training examples we are allowed to use to fit the line of best fit, and the y-axis measuring how poorly the line of best fit is doing, or the error. Specifically, for every x-value, we take x random data points from our training set and fit a line of best fit through them. Then, we measure how well that line does on the x training points we chose and on the whole test set. Plotting these two errors gives us a training curve and a testing curve.
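A minimal sketch of this procedure, assuming a made-up noisy linear data set and a simple 70/30 split, might look like this.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data set: a noisy linear trend
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)

# Split into a training set and a test set
x_train, y_train = x[:70], y[:70]
x_test, y_test = x[70:], y[70:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# For each training-set size, fit a line and record both errors
for n in range(2, 71, 10):
    idx = rng.choice(70, size=n, replace=False)
    m, b = np.polyfit(x_train[idx], y_train[idx], deg=1)
    train_err = mse(y_train[idx], m * x_train[idx] + b)
    test_err = mse(y_test, m * x_test + b)
    print(f"n={n:2d}  train error={train_err:5.2f}  test error={test_err:5.2f}")
```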

If we are only allowed to use two examples to fit our line, our training error is zero, since a straight line can fit those two examples perfectly. However, since two points are not enough to make a line that generalizes to unseen data, the testing error is very high. Therefore, when we are only allowed to use a small number of data points, the training error is very low, but the testing error is very high. This is true for almost every data set.

The black points are the two random points we chose for our training data set, and the blue points are our test data points. The black line fits the training data perfectly, but does poorly on the test data. Source: Desmos

As we increase the number of training data points we are allowed to use, a good data set will bring the training and testing errors closer and closer together, since we have more data to form a better approximation of the ideal line of best fit. In other words, as we increase the amount of data, overfitting decreases. If, at the end of our graph, where we have used every training example to fit the line of best fit, there is still a large gap between our training curve and our testing curve, our model is overfitting the data set, and we need more data.

Blue line: training curve; red line: testing curve; black line: desired error. If there is a gap between the two curves, and the testing curve has not reached our desired error yet, the problem is overfitting. Visually, it looks as though the testing error would drop below our desired error if we had more data. Source: Desmos

If we have the problem of underfitting, no line fits the data well, similar to how we could not predict the current day’s temperature from only the previous day’s temperature. As we add more data points, the model still cannot find a correlation, so both the training error and the testing error remain very high.

If there is very little gap between the training and testing curves, and both are above the desired error, the model is suffering from underfitting. Visually, it looks as though neither curve will drop below our desired error value, even with more data. Source: Desmos

Summary

  • Machine learning is the process of using existing data to find patterns and make hypotheses about new data.
  • Computers cannot learn; they can only do math and use calculus. Therefore, they run the risk of overfitting and underfitting.
  • Overfitting is doing very well on training data, but failing to generalize, resulting in poor performance on unseen data.
  • Underfitting is not having enough information to make the right decision.
  • To debug these issues, we plot learning curves, with training set size on the x-axis, and error on the y-axis.
  • Assuming the testing curve error is greater than the desired error, if the training curve is below the desired error, the model is overfitting. If the training curve is above the desired error, the model is underfitting.

Conclusion

This article gives intuition about machine learning algorithms and common problems that machine learning practitioners face. There are many more applications of artificial intelligence not covered in this article, such as content creation with generative adversarial networks, and natural language processing with recurrent neural networks. Artificial intelligence is becoming less artificial and more intelligent in our world every day. It is an extremely disruptive technology with immense power. The human brain is only a physical object, and machines are capable of surpassing us and impacting the world in ways we would never have thought possible.
