Understanding Neural Networks — Part 1/3: Intuition of Forward Propagation

Bennett Cohen
9 min read · Jul 28, 2022


Neural networks (NNs) are incredibly powerful and complex ML algorithms, but they are also much more intuitive than people think.

This series of blog posts is meant to be less of a gentle introduction to NNs and more of a “lightbulb-over-the-head AHA!” moment for you. We will move from the high-level intuition of NNs, down through the mathematical intricacies, before finally building one from scratch in Python!

If you’re a data science student, professional, or just trying to understand more about how our robotic overlords are going to be operating in the future, keep reading.

This part will focus on Forward Propagation.

Btw, I’m going to assume you have some familiarity with basic data science and machine learning concepts such as linear regression, train/test splits, vectors, etc., but I won’t assume you understand anything about neural networks. Let’s get into it.

What the hell is a neural network?

Basically, it’s just a type of ML algorithm that was built to emulate connections in a brain. It can be used for classification and regression tasks. Today, we’re going to go over a classification task. The big thing about NNs is that they are “universal function approximators,” meaning they can approximate any function (duh). Compare this with linear regression, which can ONLY approximate linear functions. All you need to know is this:

  • NNs are made up of neurons (the circles in the picture below). Sometimes we call them units/nodes. It’s all the same thing.
  • Neurons are just linear functions! Do you remember y = mx + b from middle school? That’s it.
  • Neurons are grouped into vertical layers (explanations below).
The architecture of a simple neural network with 4 input neurons, 1 hidden layer w/ 2 neurons, and 1 output layer with 2 neurons.

The first layer is called the input layer and has as many neurons as we have features in our data. We don’t actually consider it the first layer because it’s just a starting point for our data. We can label each node here with a lowercase x1, x2, x3, and x4 for our four features. These lowercase x values are just scalars.

The last layer is called the output layer. In classification tasks, it will have as many neurons as we have possible outcomes (classes). There is a special case in binary classification where we can use one output node, but it isn’t necessary so we are going to be using two neurons for better intuition.

All layers in between the input and output are called hidden layers. You can have as many hidden layers with as many nodes as you’d like. Here, we have one hidden layer with two nodes. In the picture above, we have these values labeled as y1 and y2 (scalars) to represent the output of this layer.

The red lines represent connections between neurons, which have some weight (or coefficient) associated with them. Like all coefficients, we multiply our x values by these values as we move through the network. We’ll get deeper into it in a minute. The notation we’ll use for weights is that wij refers to the weight from neuron i in one layer to neuron j in the next layer.

In most neural networks, we call layers Dense or Fully-Connected if every neuron in one layer connects to every neuron in the next layer. All of our layers here are dense. Can you see why?

Remember, neurons are just linear functions, and layers are just groupings of neurons.

Let’s zoom in on the hidden layer. In the first node, we know the output is going to be some scalar value y1. Neurons are linear functions, so we need to create a linear equation, but how do we do that? From the larger diagram, we saw that all the values x1, x2, x3, and x4 had connections going towards this first neuron within the hidden layer (shown below).

Zooming in on the first node of the hidden layer. We see how the weights and feature data multiply together to create a scalar y.

Our knowledge from earlier also tells us that each of these values gets multiplied by the weight (coefficient) on its connection into the neuron. We can then just add them up. Just like in linear regression, we don’t always want to go through the origin, so we’ll add a bias (intercept) term called b. NOTE: We are still just talking about a single node. Everything is a scalar value. The weight on the bias term can just be thought of as 1.

The inside of the neuron in the diagram above shows this operation. Don’t be scared…this is just an equation of a line!
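Spelled out with the weight notation from earlier (w11 is the weight from input 1 to hidden neuron 1, w21 is from input 2, and so on), the first hidden neuron is computing roughly this:

y1 = w11·x1 + w21·x2 + w31·x3 + w41·x4 + b1

Four products, one sum, one intercept: just a multi-variable version of y = mx + b.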

Okay, but how do we actually compute the sum of the products of all these x and w values? Well, let’s just collect all the x values and w values into vectors and multiply them together as matrices. REMEMBER: we are just getting a scalar out of this, so let’s be careful with our dimensions.

Our w vector is just all the weights that lead into neuron 1 and has dimensions 1x4. We then can write our X vector as a column vector of 4x1. Multiplying these two matrices is just the dot product. We then add our b, which is still a scalar. Let’s add some subscripts to this to represent we are talking about the first neuron.
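Here’s a tiny NumPy sketch of that single-neuron dot product. The numbers are made up purely to show the shapes; they don’t come from the diagram.

```python
import numpy as np

# Made-up values, chosen only to illustrate the dimensions
x = np.array([[0.5], [1.0], [2.0], [3.0]])    # X: 4x1 column vector of features
w1 = np.array([[0.1, -0.2, 0.3, 0.4]])        # w1: 1x4 row of weights into hidden neuron 1
b1 = 0.25                                      # b1: scalar bias for hidden neuron 1

y1 = w1 @ x + b1      # (1x4) times (4x1) gives 1x1, i.e. a single scalar, then add the bias
print(y1.shape)       # (1, 1)
print(y1.item())      # the scalar output y1 of the first hidden neuron
```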

The same thing applies to the second neuron. Just swap the subscripts on y, w, and b to be 2. We don’t need to change the X subscript because the X is the same for neurons 1 and 2 so its subscript actually refers to the layer (not the node). It will become clearer in a minute.

Everything in neural networks is done with matrices.

In the end, we want the output of this little linear function operation to be a column vector with [y1, y2]. To do this, let’s just stack our equations above on top of each other.

First, let’s combine the w1 and w2 vectors into a matrix W of size 2x4. Remember X1 stays the same. We then just stack b1 and b2 into vector B. That’s it! The operation to go from the input to the hidden layer is just represented by this simple equation. Then, we just slap some subscripts on there.
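In symbols, the equation being described is:

Y1 = W1X1 + B1

where W1 is the 2x4 weight matrix, X1 is the 4x1 input vector, and B1 is the 2x1 bias vector, so Y1 comes out as exactly the 2x1 column vector [y1, y2] we wanted.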

Again, W is now a matrix, X is the same vector of inputs, and B is a vector too. The subscript of 1 on capital letters refers to a LAYER not a node anymore.

Intuition Break: Read that math again ^

The key points from that math section are that neurons are just linear functions in which we multiply each scalar in our X vector by some weight that connects it to the neuron, and then we add a bias term. Then we just combine those scalar operations into a simple matrix equation.

Okay, so we have the values for the hidden layer. What’s next?

Now, we repeat this for the output layer. The progression from single neuron scalars, to combining into vectors, finally out to the matrix form is the same as with the first layer.
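In symbols, it’s the same equation with the layer subscript bumped to 2:

Y2 = W2X2 + B2

Here W2 is a 2x2 weight matrix (two hidden neurons feeding two output neurons), X2 is whatever goes into the output layer, and B2 is its 2x1 bias vector.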

That’s it. That’s forward propagation…except not really.

What we’ve just done is show how to take an observation/row of data, and then multiply it through the NN to make a prediction, but two key issues arise:

Problem #1: What the hell is X2?

Recall that X2 is the input into the second layer of our NN.

The input into the second layer.

Wait a minute 🤔…that’s just the output of the layer before!!

In other words X2 = Y1. Look at the first, zoomed-out diagram to see how the input into one layer is just the output of the previous layer. Let’s rewrite Y2 with this knowledge.

That’s great, right? Well, it creates another issue.

Problem #2: Y2 is a linear transformation of Y1.

Linear transformations formally mean any operation that “preserves scalar multiplication and vector addition,” but basically means all we do is multiply by something and then add some other something. No exponents, no logs, nada.

Clearly, we see that once we plug in Y1 for X2, all we do is multiply Y1 by some matrix and add another vector. Why is this even an issue? Let’s break it down by replacing Y1 with the equation for it. I’m not gonna do more vector stuff than what we’ve already done so don’t worry.
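Substituting Y1 = W1X1 + B1 in for X2 and distributing, we get:

Y2 = W2(W1X1 + B1) + B2
Y2 = (W2W1)X1 + (W2B1) + B2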

Let’s disregard matrix notations and just think intuitively for a moment. If we multiply two matrices filled with coefficients by each other, we are just creating another weight matrix, right? Similarly, if we multiply some weight matrix by a bias vector, aren’t we just creating some other bias vector? If we represent these products generally as W and B, we have the following:

Y2 = (W)(X1) + B + B2

Y2 = WX1 + B

B and B2 combine to make yet another bias vector, so all we’ve done is recreate linear regression! You wasted all this time reading just to end up back at basic linear regression. You must be so embarrassed.

Also, to my linear algebra professor, I’m sorry you had to watch that math. I told you I wanted to build intuition.

Linear terms can only combine to make other linear terms.

Fortunately, the solution is super simple. If we can make Y1 non-linear, then we can’t distribute out and combine the terms as we did above, so the final output won’t be linear either. Then, we can build super complex functions to approximate anything. I hope that intuition sticks.

This is where we introduce the Activation Function. All we do is pass the output of a layer to some function that makes it non-linear.

Literally, just create some Z = F(Y)!…where F is some non-linear function.

The most widely used activation function for hidden layers is called the Rectified Linear Unit (ReLU), which is just R(x) = max(0, x).

I am not going to get into why it’s so widely used now, but I think it’s very cool that applying such a simple function works. ReLU is clearly not a line so we can’t just combine terms as we did earlier. This is a good thing!

A plot of the ReLU = max(0,x) function.
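In code, ReLU is about as short as an activation function gets. A minimal NumPy sketch (the input vector is made up just to show the effect):

```python
import numpy as np

def relu(y):
    """Rectified Linear Unit: max(0, x), applied element-wise."""
    return np.maximum(0, y)

# Negative values get clipped to 0, positive values pass through unchanged
print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
```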

Wait, wait, wait, what about the output layer? I thought we were doing classification.

In regression tasks, we don’t need to pass the output values (Y2) through any activation function. But for classification, we do. We need to pass the outputs through something that will create a probability distribution…basically, we need to squash the numbers into values that add up to 1 and represent the probability of being in each class.

Note: If you’ve ever done logistic regression, this is the same exact idea as the sigmoid function. Remember, we pass the linear regression equations to the sigmoid function to squash the values? If you haven’t done logistic regression, carry on.

To do this, we use the softmax function, which will take an array of values and normalize them so that higher values reflect higher probabilities, and all the probabilities add up to one. Take a look at the example below and see if you can understand how it works.
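Here’s a rough NumPy sketch of softmax with made-up scores for our two classes (subtracting the max before exponentiating is just a standard numerical-stability trick; it doesn’t change the result):

```python
import numpy as np

def softmax(y):
    """Exponentiate each score, then divide by the total so the outputs sum to 1."""
    exps = np.exp(y - np.max(y))   # shifting by the max keeps np.exp from overflowing
    return exps / np.sum(exps)

scores = np.array([2.0, 0.5])      # made-up raw outputs for two classes
probs = softmax(scores)
print(probs)                        # roughly [0.82, 0.18]: higher score, higher probability
print(probs.sum())                  # 1.0
```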

Finally, we can rewrite our functions from Y1 and Y2 into Z1 and Z2. Remember, just pass our Y1 to the ReLU, and then pass that output as the input into the next layer.
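In symbols:

Z1 = ReLU(W1X1 + B1)
Z2 = softmax(W2Z1 + B2)

And here is the whole forward pass as a NumPy sketch for the same 4-2-2 architecture from the diagram. Every number below (the weights, biases, and the input row) is a made-up placeholder, just as they would be before training:

```python
import numpy as np

def relu(y):
    return np.maximum(0, y)

def softmax(y):
    exps = np.exp(y - np.max(y))
    return exps / np.sum(exps)

# Randomly initialized parameters for a 4-input, 2-hidden-neuron, 2-output network
W1 = np.random.randn(2, 4)   # weights into the hidden layer (2x4)
B1 = np.random.randn(2, 1)   # hidden layer biases (2x1)
W2 = np.random.randn(2, 2)   # weights into the output layer (2x2)
B2 = np.random.randn(2, 1)   # output layer biases (2x1)

X1 = np.array([[0.5], [1.0], [2.0], [3.0]])   # one made-up observation as a 4x1 column

Z1 = relu(W1 @ X1 + B1)      # hidden layer output (2x1)
Z2 = softmax(W2 @ Z1 + B2)   # predicted class probabilities (2x1), sums to 1

print(Z2)
```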

Okay, that’s it. That’s forward propagation…for real this time.

Brief recap: We learned how to take a row of data and pass it through a neural network by creating these linear functions using weight matrices (which are just a bunch of coefficients) and applying activation functions to make them non-linear so that we can do more complex things than just linear regression.

Here’s a quick peek at how the NN algorithm actually works (there’s a bare-bones code sketch of this loop after the list):

  • Initialize the W matrices and B vectors with random values
  • Perform forward propagation
  • Calculate the error using any error metric
  • Use backpropagation to adjust the W matrices and B vectors
  • Repeat until some stopping condition is met
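As a bare-bones sketch of that loop (steps 3 and 4 are left as comments because error metrics and backpropagation are the subject of the next part; everything here is placeholder code, not a working training routine):

```python
import numpy as np

def relu(y):
    return np.maximum(0, y)

def softmax(y):
    exps = np.exp(y - np.max(y))
    return exps / np.sum(exps)

# 1. Initialize the W matrices and B vectors with random values (4-2-2 network)
W1, B1 = np.random.randn(2, 4), np.random.randn(2, 1)
W2, B2 = np.random.randn(2, 2), np.random.randn(2, 1)

X1 = np.array([[0.5], [1.0], [2.0], [3.0]])   # one made-up observation

for epoch in range(100):                      # 5. repeat until a stopping condition is met
    # 2. Perform forward propagation
    Z1 = relu(W1 @ X1 + B1)
    Z2 = softmax(W2 @ Z1 + B2)
    # 3. Calculate the error using any error metric (e.g. cross-entropy vs. the true label)
    # 4. Use backpropagation to adjust W1, B1, W2, B2 (covered in Part 2,
    #    so the parameters stay untouched here)
```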

In the next part, we are going to discuss how NNs actually learn and improve through backpropagation. It is a little bit more mathematically intensive, but we’ll work through the intuition and math in tandem.

Cheers,

BC
