From Linear Regression to Neural Networks: Why and How

Part 4 of the “Getting Started in Deep Learning” Series

Inês Pedro
Deep Learning Sessions Portugal
9 min read · Apr 28, 2021


Photo by Nathan Dumlao on Unsplash.

Motivation

In the first blog post of this series, we motivated the use of Machine Learning models to address complex problems, such as identifying animals in images or transcribing speech to text, that are infeasible to solve with typical rule-based computer programs (how do you recognize a dog from a set of pixels?). Such tasks are easy for humans, who learn to recognize objects and to speak and write at a very young age, but they are very hard to formalize. Like children, machine learning models learn by example and by trial and error.

In the previous blog post, we implemented a particular machine learning model called linear regression. This model assumes that

  • the output is a linear function of the inputs
  • the input variables are independent

which makes it inappropriate for many real-world problems. Thus, we need a more powerful model that does not impose these kinds of assumptions and is able to solve more complex problems.

From linear to nonlinear models

Our goal is to find a function

f : X → Y

that models the relationship between the input and the output. In our previous example, f would map an image (a set of pixels) to the animal it contains.

Function that maps images to animals.

The only machine learning model we’ve seen so far assumes that the output is a linear combination of the inputs, but recognizing objects from a set of pixels seems far too complicated a task to be solved with a simple linear model. To overcome this, we can compute a linear combination not of the input itself, but of nonlinear transformations of the input. We can obtain such a transformation by linearly combining the input’s values and then applying a nonlinear function g, as illustrated below.

a = g(θᵀx + b) is a nonlinear transformation of x, where θ are the linear weights, b is the bias and g is a nonlinear function.

We can compute k different representations of x in a similar way, as depicted below.

k nonlinear transformations of the input.

These new representations take into account nonlinear relations between the input features and are thus inherently more complex. Maybe now we can model the output as a simple linear combination of these representations.

Moving from a linear to a nonlinear model.

Vectorization

To make the notation simpler (and the code more efficient when you implement this), let us vectorize our computations. We define

  • the input vector x of dimension n as x = (x₁, …, xₙ)ᵀ;
  • the linear weights as the matrix W of dimension k × n (we replaced θ with W, as it is more common in the literature);
  • the bias as the vector b of dimension k × 1, b = (b₁, …, bₖ)ᵀ;
  • the nonlinear function g, applied elementwise, so that g(z) = (g(z₁), …, g(zₖ))ᵀ maps a k-dimensional vector to another k-dimensional vector.

In our vectorized version, we first compute the k linear combinations of the input as

z = Wx + b

Then we apply the nonlinearity elementwise as

a = g(z)

We can now compute our prediction ŷ by applying a linear combination to the resulting vector a.

Going from a linear to a nonlinear model: vectorization of the computations at the top of the image.
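For concreteness, here is a minimal NumPy sketch of these vectorized computations; the dimensions, the random values, and the choice of tanh as the nonlinearity g are all placeholders of ours:

```python
import numpy as np

n, k = 4, 3                   # input dimension and number of representations (arbitrary)

x = np.random.randn(n)        # input vector x, shape (n,)
W = np.random.randn(k, n)     # weight matrix W, shape (k, n)
b = np.random.randn(k)        # bias vector b, shape (k,)

z = W @ x + b                 # the k linear combinations of the input
a = np.tanh(z)                # elementwise nonlinearity g (tanh chosen arbitrarily)

w_out = np.random.randn(k)    # output weights for the final linear combination
y_hat = w_out @ a             # our prediction ŷ
```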

What if our model is still not complex enough?

Bearing in mind our object recognition example, if we still can’t identify the animal from these nonlinear transformations of the image’s pixels, we can continue to add nonlinear transformations in a similar fashion until we build a function that has enough capacity:

Modeling the output as a linear combination of highly nonlinear features of the input.

We introduced subscript indexes to denote the i-th nonlinear vector transformation.

By iteratively composing simple nonlinear functions, we end up with a highly complex model that can approximate almost any function.

Feedforward neural networks are based on this idea of composing several nonlinear functions, creating at each step a richer representation of the input, until it becomes relatively easy to compute the output (with a linear combination, for example).

Why Feedforward Neural Networks?

These models are called feedforward because they start from the input, compute some features, from which they compute other, more complex features, and so on until they reach the prediction. Information thus flows from the input to the output, without any feedback loops.

Illustration of how information flows from input to output in feedforward neural networks.

The name network comes from the fact that these models are a composition of multiple functions. The result of each of these functions (each mapping a vector to another vector) is called a layer. Using our previous example, we start with the input layer, then compute a transformation resulting in the first hidden layer,

a₁ = g₁(W₁x + b₁)

from which we compute a second transformation, obtaining the second hidden layer,

a₂ = g₂(W₂a₁ + b₂)

and so on until we reach the output layer that holds our prediction. All layers between the input and the output are called hidden because we don’t provide them to the model. Instead, the model learns them as intermediate representations of the input, so that it can compute the predicted output. The beauty of neural networks is that we don’t need to engineer these nonlinear transformations of the input ourselves (which can be very time-consuming): the model learns which features are most suitable for predicting the output.

The number of functions stacked together, i.e. the number of hidden layers, gives us the depth of the model. The more layers we have, the more complex our model is, which is where the term “deep learning” comes from.

Neural networks are composed of input and output layers, with hidden layers in between.

You’ve probably heard about how neural networks resemble the brain. We showed how you can go from a linear to a nonlinear model by computing nonlinear transformations of the input. Each of these transformations is called a unit and is computed from all the units in the previous layer, which mimics the behavior of a neuron.

How a unit in neural networks resembles the behavior of a neuron.

All of this and much more is explained in detail in Chapter 6 of the Deep Learning¹ book.

How Neural Networks Work

At this point, we understand the motivation behind neural networks but we still need a more rigorous definition of how they make predictions and how they learn from their mistakes.

Forward Pass: Prediction

The neural network makes its prediction in what is called the forward pass, where information flows from the input to the output. At each layer we perform 2 steps:

1 — Compute multiple linear transformations of the units in the previous layer.
Assuming we are at the i-th layer, we compute linear combinations of the previous layer’s output as

zᵢ = Wᵢaᵢ₋₁ + bᵢ

2 — Apply a nonlinear function to each of the obtained linear transformations.
Typically, at each layer we apply a single nonlinear function g elementwise to the vector z:

aᵢ = g(zᵢ)

Below we show the most popular nonlinear functions used in neural networks, although a detailed discussion of them is out of the scope of this blog post.

Most frequent nonlinear functions used in neural networks.
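For reference, three of the usual choices (sigmoid, tanh, and ReLU) can be written in a few lines of NumPy; whether these are exactly the functions shown in the figure is our assumption:

```python
import numpy as np

def sigmoid(z):
    # squashes each value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes each value into (-1, 1)
    return np.tanh(z)

def relu(z):
    # keeps positive values, zeroes out negative ones
    return np.maximum(0.0, z)

# all three apply elementwise to a vector z
z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```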

With these two steps, we can create multiple nonlinear transformations of the input layer. Now we repeat for the next layers!

Example of the forward pass.
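Putting the two steps together, a minimal sketch of the forward pass might look like this, assuming (our choice, not the post’s) that the hidden-layer parameters are stored as a list of (W, b) pairs, that the same nonlinearity g is used at every hidden layer, and that the output is a simple linear combination:

```python
import numpy as np

def forward(x, hidden_layers, w_out, g=np.tanh):
    """Forward pass: repeated (linear step + nonlinearity), then a final linear output."""
    a = x
    for W, b in hidden_layers:
        z = W @ a + b          # step 1: linear transformation of the previous layer
        a = g(z)               # step 2: elementwise nonlinearity
    return w_out @ a           # output layer: a simple linear combination

# a network with two hidden layers: 4 -> 3 -> 2 -> output
hidden_layers = [(np.random.randn(3, 4), np.random.randn(3)),
                 (np.random.randn(2, 3), np.random.randn(2))]
w_out = np.random.randn(2)
y_hat = forward(np.random.randn(4), hidden_layers, w_out)
```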

Backward Pass: Learning

When we are training a machine learning model, the first step is to make a prediction, which, in the case of a neural network, translates to performing the forward pass. Then, the model adjusts its parameters in order to decrease its prediction error, using a gradient-based algorithm such as Gradient Descent. Let ℒ be a loss function that computes the model’s error based on its prediction and the true output. The gradient descent update rule tells us how to adjust the weights of each layer,

Wᵢ ← Wᵢ − α ∂ℒ/∂Wᵢ

and, similarly, the bias, as

bᵢ ← bᵢ − α ∂ℒ/∂bᵢ

where α is the learning rate.

As the loss depends on the output of the neural network, which is a composition of multiple functions, the derivative of the loss w.r.t. the parameters can be very hard to compute. Let us look at the computational graph of a neural network with 2 hidden layers.

Computational graph of a neural network with 2 hidden layers.

As the prediction of the model is a composition of several functions, we can take advantage of the chain rule for computing the partial derivatives of the loss w.r.t. each parameter in an efficient way.

Chain rule: suppose that a variable z depends on y, which in turn depends on x. Then

dz/dx = (dz/dy) · (dy/dx)
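As a quick sanity check of the rule, here is a tiny numeric example of our own (not from the post):

```python
# z depends on y (z = y²) and y depends on x (y = 3x), so z = 9x² and dz/dx = 18x
x = 2.0
y = 3.0 * x              # dy/dx = 3
z = y ** 2               # dz/dy = 2y

dz_dx = (2 * y) * 3.0    # chain rule: (dz/dy)·(dy/dx) = 2·(3x)·3 = 18x
print(dz_dx)             # 36.0, matching 18x at x = 2
```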

Starting at the end of the neural network, the last parameters used were the weights and biases of the output layer (the third set of parameters in our two-hidden-layer example). Using the chain rule, we get

∂ℒ/∂W₃ = (∂ℒ/∂ŷ) · (∂ŷ/∂W₃)

and, similarly,

∂ℒ/∂b₃ = (∂ℒ/∂ŷ) · (∂ŷ/∂b₃)

Illustration of the gradient being propagated backward.

The red arrows represent the gradients being multiplied from the end of the network back to the start. As we can see, the gradient computations made to update the parameters of deeper layers can be reused, via the chain rule, to compute the gradients of parameters in shallower layers. This efficient way of computing gradients is called backpropagation, and it provides the gradients that gradient descent needs.

Putting it All Together

To wrap up, we wrote very high-level pseudo-code with the basic building blocks needed to train a neural network (a runnable sketch follows below). First, we randomly initialize our parameters. Then, for a given number of steps, we iteratively:

  • Take a forward step to compute the model’s prediction, followed by the loss;
  • Take a backward step to compute the gradients;
  • Update the parameters using gradient descent.

And we keep doing this until some stopping criterion is met.

Very high-level pseudo-code on how to train a neural network.
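To make the pseudo-code concrete, here is a minimal NumPy sketch of the full loop for a network with one hidden layer, squared-error loss, and per-example gradient descent; the architecture, hyperparameters, and toy data are all our own choices, not prescriptions from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D regression problem (our own choice): learn y = sin(x)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

n_in, n_hidden, lr, steps = 1, 16, 0.05, 300

# randomly initialize the parameters
W1 = rng.normal(0, 1, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0, 1, n_hidden)
b2 = 0.0

for step in range(steps):
    for x_i, y_i in zip(X, y):
        # forward step: prediction, with loss L = ½(ŷ − y)²
        z = W1 @ x_i + b1                  # linear step of the hidden layer
        a = np.tanh(z)                     # elementwise nonlinearity
        y_hat = w2 @ a + b2                # output: linear combination of the hidden units
        # backward step: gradients via the chain rule
        d_yhat = y_hat - y_i               # ∂L/∂ŷ
        d_w2, d_b2 = d_yhat * a, d_yhat
        d_z = d_yhat * w2 * (1 - a ** 2)   # tanh'(z) = 1 − tanh(z)²
        d_W1, d_b1 = np.outer(d_z, x_i), d_z
        # parameter update with gradient descent
        W1 -= lr * d_W1; b1 -= lr * d_b1
        w2 -= lr * d_w2; b2 -= lr * d_b2
```

In practice, the fixed number of steps here would be replaced by whatever stopping criterion you choose.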

To test your understanding, we dare you to implement a simple (vectorized!) neural network without fancy libraries!

References

[1] Deep Learning book (2016), by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

Acknowledgments

This publication was written with the help of André Pinto, also an organizer of the Deep Learning Sessions Lisboa.

Up next

Why Deep Learning? Some cool applications, Part 5 of the “Getting Started in Deep Learning” Series
