Terms like “machine learning,” “deep learning,” “neural networks,” “artificial intelligence” or “A.I.,” “data science,” and more have been the buzzwords of the last few years in technology. Because of advances in computing power and an increase in the amount of data available, techniques that have been known about for decades can now be put into meaningful practice.

But what do they actually mean?


Most of us are aware of the 10,000-foot explanation along the lines of “It’s all about teaching computers to solve problems for us,” but many people probably aren’t aware of what is actually going on under the hood. The basics of machine learning are simple enough, intuitive enough, and, more importantly, interesting enough to be picked up by anyone in a relatively short amount of time.

This simple explanation of how machine learning is used to teach a computer to solve a problem is targeted toward those with no knowledge of machine learning or those who want to start from the ground up.

### The Line of Best Fit

Many of us might remember something from school called the “line of best fit” in reference to data points plotted on a graph. The line of best fit is a line drawn through points in such a way that it represents what the data is showing. It might look like this:

This concept is actually machine learning at its most basic. Instead of plotting these points ourselves and trying to draw our own line of best fit, we can give the data to a computer.

For example, we can imagine that the data shown in the graph above is the shoe size and height for a number of people. The point in the bottom left represents a person shorter than the others who has smaller feet, and the point in the top right represents a person who is taller and has larger feet. Because shoe size and height are not completely correlated, not all of the points fit the statement that “taller people have larger feet,” but the line of best fit suggests it is true in general.

With the line of best fit, we can make educated guesses about new data. Suppose you find a shoe. You can determine what size it is and then refer to the graph to make an educated guess about the height of the shoe’s owner:

Simple, right? In machine learning, this is known as “linear regression.” Don’t let the name scare you. If you understand everything above, then you understand linear regression. It’s a simple machine-learning technique used to help make predictions about data sets that have a linear shape.

The process of linear regression for machine learning goes like this:

- Collect data points.
- Give the data points to a program that can apply linear regression to them to give a line of best fit.
- Use the line of best fit to make predictions about new data.
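The steps above can be sketched in a few lines of plain Python. This is only an illustration, not a production implementation: the `fit_line` and `predict_height` helpers are hypothetical names, and the shoe-size/height numbers are made up for the example.

```python
# A minimal sketch of linear regression, using the closed-form
# least-squares formulas for the slope and intercept of the line.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Step 1: collect data points (made-up shoe sizes in EU and heights in cm)
shoe_sizes = [36, 38, 40, 42, 44, 46]
heights = [155, 162, 168, 175, 181, 188]

# Step 2: apply linear regression to get a line of best fit
slope, intercept = fit_line(shoe_sizes, heights)

# Step 3: use the line to make predictions about new data
def predict_height(size):
    return slope * size + intercept
```

Finding a shoe of size 42 and calling `predict_height(42)` is exactly the “educated guess” described above, just done with the line’s equation instead of by eye.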

It’s called “machine learning” because the computer (or machine) has learned how shoe size and height are related by creating a mathematical equation (in this case, the equation of a line). The mathematical equation gives the machine a basic understanding of something we have learned as humans: In general, taller people have larger feet.

Other scenarios where you could use linear regression are guessing the cost of a house based on how many rooms it has or guessing how many aunts and uncles a child has based on how many presents they have under their Christmas tree.

### The Problem With Linear Regression

Linear regression is great when the data is shaped a bit like a line as in the example above. But how well does it learn about the shapes of data sets that don’t look like lines? Maybe the data looks something like this:

Adding a line of best fit to this data might look like this:

The line of best fit does an okay job of matching the data, but it seems like it could do a lot better. Since the shape of the data is not quite a straight line, the line of best fit drawn doesn’t properly fit the data. This is a problem in machine learning known as “underfitting”: The line of best fit doesn’t really fit the data well. But if we change the line of best fit to be curved, it may do a better job.

We can more easily imagine using this curve to make accurate, educated guesses in the same way we did with a straight line of best fit. This is a simple extension of linear regression known as “polynomial regression.” Again, don’t let the name scare you. If you understand why curved lines can be more useful than straight lines in working out the shape of a data set, then you understand how polynomial regression is useful.
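To make the idea concrete, here is a rough sketch of polynomial regression in plain Python: it fits a degree-2 curve by solving the least-squares normal equations. The helper names (`solve`, `fit_polynomial`, `evaluate`) are hypothetical, and the data points are made up so they lie on a known curve.

```python
# A minimal sketch of polynomial regression: fit y = a + b*x + c*x^2
# by solving the least-squares normal equations directly.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients [a0, a1, ..., a_degree]."""
    n = degree + 1
    # Normal equations: (X^T X) coeffs = X^T y, where X[i][j] = xs[i]**j
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve(A, b)

def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Made-up curved data: these points lie exactly on y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3, 4]
ys = [1, 6, 17, 34, 57]
coeffs = fit_polynomial(xs, ys, degree=2)
```

Note that setting `degree=1` reduces this to ordinary linear regression, which is exactly the sense in which polynomial regression is an extension of it.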

Now that we can create a model—that is, find a best-fit line or curve for a set of data points—for data that has either a straight or curved shape, we’re done, right? Not even close. But we can build on these concepts.

### Choosing the Accuracy of a Model

The problem with polynomial regression is that we have to decide how much flexibility to give it before we use it, which can be tricky. Let’s return to our first set of data:

We previously used linear regression to put a straight line through these data points. But instead, we could have used polynomial regression to put a curve of best fit through the data that would work better than a straight line. It might look something like this:

One thing about polynomial regression is that we can tell it exactly how curvy we want the best-fit curve to be. The curvier it is, the more flexibility it has in describing the data set. The curve of best fit above is fairly simple, but we could have gone further and found a best-fit curve like this one:

Or we could have made the best-fit line even curvier for something like this:

Each of the best-fit curves above seems to do a better and better job of describing the data set, but something feels a bit wrong, especially in the last example. By giving the polynomial regression lots of flexibility in deciding the curviness of the best-fit curve, it has tried too hard to go directly through many of the data points. The result is a curve that seems less useful for predicting than a straight line of best fit.

For example, if we apply the shoe size and height example again, we can see by adding some guesses to the graph that the very curvy best-fit curve gives the same height for two different shoe sizes:

This problem in machine learning is called “overfitting” and is the opposite of underfitting. It means the best-fit curve we’ve created doesn’t generalize very well. It does a great job of matching the data we have, but it doesn’t help make sensible guesses for any new data. One of the main concerns in machine learning is finding a best-fit line or curve that is just curvy enough to mimic the general shape of a data set but isn’t so curvy that it can’t be generalized to allow for good guesses about new data points.

This is where polynomial regression falls over. We have to explicitly tell polynomial regression how curvy we want the best-fit curve to be before we use it, and that isn’t an easy thing to decide, especially when the data is more complicated.

In the examples so far, our data points have been in only two dimensions—such as a value for shoe size and another for height—which means we have been able to plot them on two-dimensional graphs. In doing that, it is fairly easy to see the general shape of the data. But this is not often the case in machine-learning problems that have more than two dimensions. And if we don’t know what shape the data is, we can’t really tell polynomial regression how curvy to make a best-fit line.


One option is to try polynomial regression many times with different levels of flexibility and see which one works best. But what we really need is a machine-learning technique that has the flexibility to be as curvy as it needs to be but also limits its curviness to be able to do well in generalizing new data.
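That “try many flexibilities and see which works best” option can be sketched as follows, under some assumptions: the data is made up (roughly quadratic with small wobbles), the held-out validation points are chosen by hand rather than split randomly, and the helper names are hypothetical. A very flexible fit matches the training points better but tends to do worse on the held-out points, which is overfitting in miniature.

```python
# A rough sketch of choosing a model's flexibility: fit polynomials of
# increasing degree to training data, then keep the degree that best
# predicts held-out validation points it never saw during fitting.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve(A, b)

def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Made-up training points: roughly y = x^2, with small wobbles
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.2, 0.9, 4.15, 8.8, 16.1, 24.85]
# Held-out points on the true curve, used only to judge each model
val_x = [0.5, 2.5, 4.5]
val_y = [0.25, 6.25, 20.25]

val_errors = {}
for degree in range(1, 6):
    coeffs = fit_polynomial(train_x, train_y, degree)
    val_errors[degree] = sum((evaluate(coeffs, x) - y) ** 2
                             for x, y in zip(val_x, val_y))

# Degree 1 underfits and degree 5 overfits; a middling degree wins
best_degree = min(val_errors, key=val_errors.get)
```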

This flexibility issue is where data scientists generally move on from linear and polynomial regression to a neural network instead. On its own, a neural network is very much like polynomial regression in that it can learn data sets that have very curvy shapes. Neural networks don’t solve the problem of overfitting on their own, but when combined with a technique called regularization, everything tends to work out.

The implementation details of how neural networks and regularization work aren’t really important to understanding the basics of machine learning. The key things to remember are that neural networks are very good at learning the shapes of complicated data sets—more so than linear or polynomial regression—and that regularization helps prevent the neural network from overfitting the data.

### Getting Computers to Answer Questions

For the techniques covered so far—linear regression, polynomial regression, and neural networks—we’ve only looked at how we can train computers to give us a number based on the data we give them. The shoe size and height model gives us a height when we give it a shoe size; similarly, the house-cost model gives us a cost when we give it a number of rooms.

But a number output isn’t always what we want. Sometimes we want a machine-learning model to answer a question instead. For example, if you are selling your house, you may not only want a machine to work out how much your house is worth; you might also want to know whether the house will sell within six weeks.

The good news is that there are machine-learning techniques available—similar to the ones we’ve already seen—that answer a specific question instead of giving a number. A machine-learning model can be set up to give a yes/no answer to the selling-within-six-weeks question when we supply some basic input data, like the number of rooms, the cost, and the square footage. Obviously, it could never be a perfect model because the housing market doesn’t follow exact rules, but machine-learning models are used to give answers to these types of questions with high levels of accuracy (depending on the quality of the data).

For linear regression, the analogous technique is logistic regression. (Again, don’t let the terminology put you off. The underlying methods are actually very intuitive.) It can answer questions like “Is this email spam?” or “Will it rain tomorrow?” Both methods—linear and logistic regression—calculate a line of best fit, but they differ in how they use that line. As a reminder, here’s the linear regression model we used earlier to predict a number from the data:

Logistic regression works similarly but finds a best-fit line that separates the data into two groups. This line can then be used to predict whether a new data point lies in one group or the other depending on what side of the line it is on.
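A minimal sketch of that idea in plain Python, trained by gradient descent on the log-loss: the model squashes a line through a sigmoid so its output can be read as a probability, and the yes/no answer is whether that probability passes 0.5. The housing data and the `fit_logistic` / `will_sell` names are made up for illustration.

```python
# A minimal sketch of logistic regression on one input variable,
# answering a yes/no question ("will this house sell within six weeks?").
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, steps=5000, lr=0.1):
    """Learn w and b so that sigmoid(w*x + b) approximates P(label = 1)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average log-loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, labels)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data: asking price in units of $100k; cheaper houses sold fast
prices = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
sold = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = sold within six weeks

w, b = fit_logistic(prices, sold)

def will_sell(price):
    """Yes/no answer: which side of the learned boundary is this price on?"""
    return sigmoid(w * price + b) > 0.5
```

The point where `sigmoid(w * price + b)` crosses 0.5 is the separating line from the text: prices on one side of it predict “yes,” prices on the other side predict “no.”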

Just like with linear regression, logistic regression can be extended to use a curvy-lined polynomial model that has more flexibility in fitting the shape of the data. With a little extra effort, neural networks can also be used to answer yes/no questions about the data instead of returning numbers.

If we want to answer questions more complicated than those looking for yes/no responses, we could either use a technique known as multinomial logistic regression, or we could adapt neural networks to be able to handle these cases as well. Models created in this way would be able to answer a question like “Will tomorrow be rainy, sunny, or snowy?” The “multinomial” part just means that the answer can be one of many options. In the example, the three possible answers would be rainy, sunny, or snowy.
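The “one of many options” part can be illustrated with a softmax, the function multinomial models typically use to turn one score per class into probabilities. The scores below are invented stand-ins for a trained model’s outputs, not the result of any real weather model.

```python
# A minimal sketch of the multinomial idea: a softmax turns one score
# per class into probabilities that sum to 1, and the prediction is
# the class with the highest probability.
import math

def softmax(scores):
    # Subtracting the max score first keeps exp() numerically stable
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["rainy", "sunny", "snowy"]
scores = [2.0, 0.5, -1.0]  # hypothetical model outputs for tomorrow

probs = softmax(scores)
prediction = classes[probs.index(max(probs))]
```

With only two classes, the softmax reduces to the sigmoid used in logistic regression, which is why multinomial logistic regression is the natural extension of the yes/no version.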