The Magic of Feed-Forward. Deep Learning with PyTorch Part #3

Conquering the Foundations of Deep Learning

Jinal Shah
Apr 25 · 6 min read

It is one thing to know what an artificial neural network (ANN) is; it is another to know how it actually works. If you ever want to use deep learning to build cool projects, it is essential that you understand ANNs, because their concepts carry over to nearly every other type of neural network. When it comes to ANNs, there are 2 major concepts to understand: feed-forward and back-propagation. Both require a lot of in-depth understanding, so I will split them into separate articles. In this article, we will get our hands dirty with feed-forward. Fair warning: in order to understand feed-forward, you will need to understand basic matrix/vector operations. In case you need a refresher, here is some reference material: Basic Matrix Operations Review.



Feed-forward is a rather simple concept. Essentially, all you are doing is sending your data through your network. In other words, you are sending your data through the input layer to the hidden layers and, finally, to the output layer. Here is a great visualization that illustrates this concept:

A Feed-Forward Neural Network, Image by Stanford University, Public Domain

As the illustration above shows, feed-forward is essentially the process of taking your inputs (your features) and passing them through your ANN to get your output(s). However, one question remains unanswered in this discussion: what happens at each node? Let’s answer that next.

What Happens at Each Node?

The calculations that occur at each node are very, very important. Each node takes a weighted summation of its inputs and then puts that sum through an activation function. Confused? Don’t worry, let’s break it down.

Examples of Cat & Dog Images, Image by Adrian Rosebrock on PyImageSearch

Each input has a weight applied to it because some inputs are more important than others. For example, if I were building a neural network to classify images of cats and dogs, I would be more concerned with the shape of the eyes, ears, and so on than with whether the image was taken indoors or outdoors. I hope you don’t determine something to be a cat or a dog solely based on location (indoors or outdoors); I imagine you determine it based on features that define each animal, such as eyes or ears. Clearly, it is not justifiable to give every input the same weight. Some inputs definitely deserve higher weights than others.

In order to apply the weights to a given node’s inputs, we perform a weighted summation. If you are familiar with machine learning, a light bulb might be going off in your head: this process is exactly the same as linear regression!

Linear Regression Equation, Image Reference: Pearson Correlation and Linear Regression

The image above shows a basic 1-variable linear equation. b0 represents the bias term, b1 represents the weight applied to the input X, and Y is your output. This is what would occur at a node with only 1 input and no activation function. In a real-world scenario, you would have many, many inputs to your node, so we can expand this equation to become: Y = b + W1*X1 + W2*X2 + … + Wn*Xn, where Wn represents the nth weight for the nth input Xn. We can condense this equation using linear algebra, since the weights and inputs are usually stored in matrices. When we view the weights and inputs as matrices, we get the following equation: W^T*X + b. Note that ^T stands for transpose (I am assuming that you know what transpose means, but in the event that you don’t, please refer to the Resources below). Also, note that the “W^T * X” part may be rearranged at times depending on the shapes of the weight matrix and the input matrix. This calculation is performed at every node in the ANN.
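To see that the expanded and condensed forms agree, here is a minimal NumPy sketch. The input values, weights, and bias are made up purely for illustration (the article itself contains no code, so the library choice is mine):

```python
import numpy as np

# Hypothetical node with 3 inputs; all values are made up for illustration.
X = np.array([0.5, -1.2, 3.0])  # inputs (features)
W = np.array([0.4, 0.1, 0.7])   # one weight per input
b = 2.0                         # bias term

# Expanded form: Y = b + W1*X1 + W2*X2 + W3*X3
expanded = b + W[0] * X[0] + W[1] * X[1] + W[2] * X[2]

# Condensed form: W^T * X + b (for 1-D arrays, @ already computes
# the dot product, so no explicit transpose is needed)
condensed = W @ X + b

print(expanded, condensed)  # both give 4.18 (up to floating-point rounding)
```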

Now that we have generated a concrete understanding about the weighted summation that occurs at each node, let’s develop a concrete understanding regarding activation functions.

Activation Functions

An activation function is very, very simple to understand. It is essentially a function that takes in some input and transforms it in a way that is specific to that function. For example, a common activation function is the sigmoid function, which prides itself on squeezing its values between 0 & 1. Likewise, there are many other activation functions that perform some kind of transformation on your input. With this in mind, I would like to now introduce the formal mathematical way of looking at what happens at a single node:

Output of a Node = y(W^T * X + b)

  • y = activation function
  • W = weights matrix
  • X = inputs matrix
  • b = bias term
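
Putting the pieces together, the computation at a single node can be sketched as follows in NumPy, using sigmoid as the activation function. The weights, inputs, and bias are hypothetical values chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def node_output(W, X, b):
    # y(W^T * X + b): weighted summation, then the activation function
    return sigmoid(W @ X + b)

X = np.array([0.5, -1.2, 3.0])  # inputs
W = np.array([0.4, 0.1, 0.7])   # weights
b = 2.0                         # bias term
print(node_output(W, X, b))     # a value between 0 and 1
```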

Essentially, you are taking your weighted summation & passing it through an activation function. I just have 2 final comments I want to make on this subject:

  • It isn’t strictly necessary to have an activation function; your ANN will still run without one. However, without activation functions every layer is just a linear transformation, so the whole network collapses into a single linear model. Activation functions introduce non-linearity into your nodes’ outputs, which allows your ANN to learn more complex patterns and achieve better performance.
  • The choice of activation function for each hidden layer depends on your problem.
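
As a quick illustration of the kind of transformation an activation function performs, here is a minimal sigmoid sketch (pure Python; the sample inputs are my own):

```python
import math

def sigmoid(z):
    # The sigmoid "squeezes" any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

No matter how large or small the input, the output always lands strictly between 0 and 1.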

Final Comments

Feed-forward is a critical concept to understand. With that being said, I have made a list below that summarizes some of the key points and concepts. Even though I have made this list, I highly encourage you to actually go through the feed-forward section above if you haven’t already.


Key Concepts:

  • Feed-forward refers to data being sent through the ANN (data -> input layer -> hidden layers -> output layer).
  • The feed-forward process is usually used in making predictions and in back-propagation (we will talk about this concept in a future article).
  • At each node of the ANN, the node performs a weighted summation & puts that sum through an activation function. Mathematically, it looks like this: Output of Node = activation function(Weights^T * Inputs + bias).
  • Activation functions transform the weighted summation so that the node’s output is appropriate for the context of the problem.
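
The key concepts above can be tied together in one short sketch: feed-forward through a network with one hidden layer. This is a minimal NumPy version with arbitrary layer sizes and random weights (PyTorch’s nn.Linear performs the same W^T*X + b computation under the hood):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary architecture: 3 input features -> 4 hidden nodes -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer parameters

def feed_forward(x):
    h = sigmoid(W1 @ x + b1)  # hidden layer: weighted sums + activation
    y = sigmoid(W2 @ h + b2)  # output layer: same computation again
    return y

x = np.array([0.5, -1.2, 3.0])  # one sample's features
print(feed_forward(x))          # a single prediction in (0, 1)
```

Every layer repeats the same node-level recipe: a weighted summation followed by an activation function.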

If you made it this far through the article, I thank you. It really means a lot to me to see people reading my content and learning something new. Let me know your thoughts on this article in the comments below.


About the Author

I am an undergraduate student @ Rutgers University-New Brunswick majoring in Computer Science & Cognitive Science. Furthermore, I am pursuing a minor in Business Administration and a certificate in Data Science. I have been applying machine learning for a little over a year, and recently I dipped my toes into deep learning. I am very much intrigued by the power of artificial intelligence and can’t wait to share my learnings with the community! Feel free to contact me via LinkedIn or by email.

Geek Culture
