Cracking the Neural Network (Part 1)

Start seeing artificial neural networks for something more than a web of circles and lines.

Chances are, if you are reading this article, you have heard some of the buzz about AI and/or Deep Learning, and want to know more about the inner workings of an artificial neural network. When I first started learning about neural nets while taking Andrew Ng’s Intro to Machine Learning course on Coursera (a fantastic introductory course btw) as a high-school student, I found them to be quite daunting. Maybe it was because of the many super/sub-scripts in the mathematical notation, or maybe because the whole backpropagation business just seemed like a meaningless black box, or maybe because of the fancy name.

After successfully completing the intro course, I was fascinated by the power of neural networks and other machine learning techniques. Even so, I did not fully grasp how a neural network functioned, as many of the computations were just presented as formulas and algorithms to take for granted. If you’re like me, this just does not do. I feel the need to understand every nuance within my reach of any new concept. And the math enthusiast in me really wanted to work everything out by hand and arrive at the results on my own.

Through this article, my goal is to share my treatment of neural networks with you, so you can build a deep and intuitive understanding of them. Additionally, I’d love to get feedback from others in the field. In part one, we will introduce the notion of an artificial neural network, as well as something called the forward pass (a.k.a. forward propagation). In the second part, we will scratch the surface of how the network learns when presented with labeled data, and in the third part, we’ll really put our math and computer science hats on and dive deep into something called backpropagation.

Before we get started, let me warn you: this is usually not something that can be understood and implemented in a breeze. In order to achieve full understanding, you will need to play around with the math and code yourself, and be okay with sitting down for some time and pondering certain aspects yourself.

A couple tidbits before we get started:

  1. Please review some differential calculus fundamentals, like basic derivative rules (huge emphasis on Chain Rule and how it works with partial derivatives, as this is the main backbone of backpropagation.)
  2. Many other implementations of neural networks involve using matrices, vectors, tensors, etc., but we will stick to our summations and for-loops for simplicity and better intuition, for those of us who (like myself at the time this article is being written) have not studied linear algebra yet.

Let me start off by saying that the whole “neural network” analogy is a very loose one. Our bodily functions are significantly more intricate than the mathematical algorithms we will come up with for our purposes of Machine Learning. While this is true, we must stick with the loose analogy for intuition.

Now the crux of ML is that rather than us having to tell the machine what logic to follow, it is the machine that develops its own logic to solve a problem. This problem can be anything from diagnosing disease, to translating text between two languages, to piloting an autonomous vehicle. It is often the case that when the machine learns on its own, through many iterations of trial and error, it is able to solve problems faster and better than other machines using traditional hard-coded logic — and in quite a few instances, even faster and better than humans!

The Neural Networks we’ll be dealing with in this article fall in the branch of supervised machine learning, meaning that we feed our system with several different training examples, from which it infers a pattern / logic. As an example, if we wanted to train a neural network to classify pictures of an animal as a “cat” vs. a “dog”, we would show it dozens of sample pictures of each animal and explicitly tell it which animal is illustrated in each picture. The network will then appropriately adjust its inner structure upon seeing each sample picture and its corresponding correct answer. After enough training (after a certain time its performance will start to cap off), we’ll be able to present it a totally new picture it has never seen before, and it will apply its newfound logic on the picture to wager a guess as to whether it is a “cat” or “dog”. This is an example of a classification problem. Neural networks are also known for solving a second class of problems known as regression problems. Examples of this would be analyzing housing data (# of floors, sq. ft area, etc.), to predict the cost of a house, or analyzing any such dataset and finding a method to the madness (stock-market prediction comes to mind). Neural networks for regression are easier to implement, so we will be focusing on one of these for now, but you can build off the ideas in this mini series to build your own classifier too.

So the question we want to answer by the end of this journey is: Can a neural network be created which can learn any function or correlation in a set of data? For example, if we present the neural network with what is known as labeled data (which could follow any rule we like: the sine of the product of two numbers, the Pythagorean theorem, the arithmetic mean, etc.), will the network be able to approximate the correlation between the inputs and the output, starting off with zero prior knowledge of such functions?

Here are some examples of possible labeled training data (this one just follows elementary addition):

  • 2, 2 → 4
  • 3, 4 → 7
  • -1, 0 → -1
  • … → … (the more we train it, the better its logic will get, the better its predictions will be)

Then we can ask it: “What output should I get if we have 7 and 8?” and hopefully, the neural network will produce an answer close to 15.
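To make the idea of labeled data concrete, here is a small sketch of how such examples could be stored in Python as (inputs, target) pairs. The names `make_addition_examples` and `training_data` are made up for this illustration, and the rule is just the elementary addition from the list above.

```python
import random

def make_addition_examples(count, low=-10, high=10, seed=0):
    """Build `count` labeled examples of the form ((a, b), a + b)."""
    rng = random.Random(seed)  # seeded so the generated data is reproducible
    examples = []
    for _ in range(count):
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        examples.append(((a, b), a + b))  # inputs first, correct answer second
    return examples

# The three hand-written examples from the list above, in the same format.
training_data = [((2, 2), 4), ((3, 4), 7), ((-1, 0), -1)]

# A few more generated examples following the same rule.
training_data += make_addition_examples(5)
```

Swapping the `a + b` for any other rule (the arithmetic mean, say) would give labeled data for a different function without changing the format.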

A neural network “thinks” through a series of layers, nodes, and weights connecting the nodes. In the neural network graphically depicted below, we can see that it contains an input layer, two hidden layers, and an output layer. Every neural net has an input and output layer, but the number of hidden layers may vary as much as you like. (Industrial-scale applications use neural nets with significantly more hidden layers.) Each layer has a set of nodes, depicted as white circles. The web of arrows, each connecting two nodes in adjacent layers, is meant to represent weights.

The all famous diagram of an artificial NN. Courtesy of

The input layer is where we would insert our set of features (e.g. 2, 2, 3), one per node. After performing its calculations, the network will assign the single node in the output layer a value, its “guess” (e.g. 7) for what it “thinks” the features will yield based on the logic it has developed.

The value of each node in the input layer is multiplied by a weight, and the products are added together to be sent to a node in the next layer. As an example, node 1 in the input layer will be multiplied by a weight, added to the product of node 2 and another weight, which will in turn be added to the product of node 3 and yet another weight. This weighted sum of all three input feature nodes gets passed on to a node in the next layer. Another such weighted sum, with different weights, gets passed to a different node in the next layer. Yet another such weighted sum, with different weights still, gets passed down to yet another node in the next layer. Naturally, the total number of weights between any two adjacent layers is equal to the product of the number of nodes in each of the two layers. Finally, each weighted sum received by each of the nodes in the first hidden layer is plugged into an activation function g(x), whose output will be the value assigned to the node itself. The most common activation function and the one we will be using for our regression neural network is the logistic function:

The logistic function, 1 / (1 + e^(-x)), also known as the “sigmoid” function and written sigmoid(x). Its derivative happens to be just sigmoid(x) * (1 - sigmoid(x)), which comes in quite handy. Courtesy of
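Here is a minimal Python sketch of the logistic function and its derivative. The function names are my own choice for illustration, and the two-branch form of `sigmoid` is just one common way to avoid overflow for very negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)); squashes any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) so that
    # math.exp never receives a huge positive argument.
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_derivative(x):
    """Derivative of the logistic function, written in terms of sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The derivative only needs the value of `sigmoid(x)`, which is what makes it handy: once a node's value has been computed, its slope comes almost for free.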

Other examples of activation functions are tanh(x), ReLU(x), softplus(x), etc., but the logistic function is best for our purposes. Very important: if you are going to experiment with different activation functions, make sure the one you pick is non-linear. A linear activation such as g(x) = x makes your neural network collapse into one with no hidden layers (you can ponder this later yourself). ReLU and softplus, for the record, are non-linear and do not cause this collapse. Anyway, this weighted sum + activation function business is called forward propagation, and it goes on throughout the network from layer to layer for every neuron (an equivalent term for node) until a weighted sum from the last hidden layer gets sent to the output node in the last layer. For a regression neural network, there is NO activation function for the last node! The presence of one would restrict any guess to be within the bounds of zero and one, so the last node just takes on the value of the weighted sum itself.
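The forward pass described above can be sketched with plain for-loops. This is an illustration under a few assumptions of mine, not a definitive implementation: the function name `forward_pass`, the nested-list layout of `weights`, and the weight values themselves are all made up, and only weighted sums are used, exactly as in the description (no extra terms).

```python
import math

def sigmoid(x):
    """Logistic activation, written to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def forward_pass(features, weights):
    """
    Run one forward pass.

    Assumed layout: weights[l][j][i] is the weight connecting node i of
    layer l to node j of layer l + 1.
    """
    values = list(features)
    last = len(weights) - 1
    for layer_index, layer_weights in enumerate(weights):
        next_values = []
        for node_weights in layer_weights:
            # Weighted sum of every node value in the current layer.
            weighted_sum = 0.0
            for value, weight in zip(values, node_weights):
                weighted_sum += value * weight
            if layer_index == last:
                # Regression network: the output node has no activation.
                next_values.append(weighted_sum)
            else:
                next_values.append(sigmoid(weighted_sum))
        values = next_values
    return values

# Arbitrary example weights: 3 inputs -> 2 hidden nodes -> 1 output node.
weights = [
    [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]],  # 2 x 3 = 6 weights
    [[1.0, -1.0]],                         # 1 x 2 = 2 weights
]
guess = forward_pass([2, 2, 3], weights)
```

Note how the number of weights between two layers is the product of their node counts, and how the output is a one-element list because the output layer has a single node. With untrained weights like these, the guess is of course meaningless.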

Now let us take two steps back. What the heck does all of this mean, intuitively? To answer this question, let us consider a simpler version of the current neural network we have in mind: one with no hidden layers.

Extremely simple neural net with zero hidden layers. Courtesy of

Let us say that our three inputs are features of housing data (those could be sq. ft. area, the number of floors, and the number of bedrooms), and we are looking for the neural net to figure out some logic between the data and the resulting cost of the house. Essentially, what we are doing is trying to fit a multidimensional function to the data (think about this for a moment). We are trying to figure out some function for house price in the form

(x1 * weight1) + (x2 * weight2) + (x3 * weight3) = $ price .

So then, the only logical way to manipulate this function would be by changing its weights. Intuitively, this means giving more importance to certain features (maybe the area the house takes up, or the floors it has) in arriving at the final answer of how much the house would cost.
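As a sketch, the zero-hidden-layer network is nothing more than this weighted sum. The feature values and weights below are invented purely for illustration; they are not real housing figures.

```python
def predict_price(features, weights):
    """Zero-hidden-layer network: one weighted sum, no activation."""
    total = 0.0
    for x, w in zip(features, weights):
        total += x * w
    return total

# Hypothetical numbers, chosen only to show the mechanics.
features = [1500.0, 2.0, 3.0]        # sq. ft. area, floors, bedrooms
weights = [100.0, 5000.0, 10000.0]   # importance given to each feature

price = predict_price(features, weights)

# Giving more importance to the area (a larger first weight) raises the guess.
pricier = predict_price(features, [120.0, 5000.0, 10000.0])
```

Changing a weight is the only lever available here, which is why "learning" comes down to adjusting weights.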

In this sense, “teaching” the neural network to “learn” how to solve a problem ultimately boils down to just teaching it to give varying amounts of importance to the input features of the problem! Now, taking this one step forward, what would the hidden layers mean intuitively?

Assuming you’ve had a go at pondering these questions for a moment, maybe it struck you that the first hidden layer is just a different way of looking at the input features, but with some features given either more or less importance in the decision-making process. In this way, the first hidden layer comprises features of its own, each of which arises from a unique blend of features from the input layer. Applying this same thinking, the second hidden layer is a different way of looking at the first; the third a different way of looking at the second, and so on and so fourth! (lol get it?)

In other words, instead of the final decision on the price of the house being based on just the hardcoded set of features we gave it, the final decision would be based on some complicated blend of the features we started off with, like: the ratio of the number of floors to the total area, coupled with the number of repairs the house has had in total, where the number of repairs is given slightly less importance than the ratio (blah blah blah, you get the idea). Consider how, as we add more and more layers between the input layer and the output, it becomes harder to explain in English what new (and considerably more specific) features are actually being considered when making the final decision. This is actually a common criticism of neural nets when it comes to things like self-driving cars, which operate on deep neural networks with many hidden layers, because it is difficult to put into words why a certain decision was made. (This does not say anything of their ability to perform, however. As a quick aside, self-driving vehicles actually use other techniques of ML in addition to neural nets, like Bayesian learning.)

Finally, the need for an activation function arises for a few reasons:

  1. To keep the value of any one node from blowing up, so that no one node overpowers the whole system.
  2. To introduce a non-linearity in the mix. Without activation functions, the neural network’s guesses would be limited to linear functions. A network with thousands of hidden layers would be the same as one with zero hidden layers! (I’ll let you write out the weighted sums for a neural net with no activation function to figure out why this might be, on your own.) Note that this collapse only happens when the activation is linear or missing; ReLU and softplus are non-linear, so they do not cause it.
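If you would like to check the collapse in point 2 numerically before working it out by hand, here is a small sketch. The helper names `linear_layer` and `collapse` and all the weight values are hypothetical, picked only for this demonstration.

```python
def linear_layer(values, layer_weights):
    """Weighted sums only, with no activation function applied."""
    outputs = []
    for node_weights in layer_weights:
        total = 0.0
        for value, weight in zip(values, node_weights):
            total += value * weight
        outputs.append(total)
    return outputs

def collapse(first, second):
    """Merge two activation-free layers into one equivalent set of weights."""
    merged = []
    for out_weights in second:            # one row per node after the second layer
        row = []
        for i in range(len(first[0])):    # one entry per original input
            total = 0.0
            for j, w in enumerate(out_weights):
                total += w * first[j][i]  # route input i through hidden node j
            row.append(total)
        merged.append(row)
    return merged

hidden = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]  # 3 inputs -> 2 hidden nodes
output = [[2.0, -4.0]]                           # 2 hidden nodes -> 1 output
x = [2.0, 2.0, 3.0]

two_layer = linear_layer(linear_layer(x, hidden), output)
one_layer = linear_layer(x, collapse(hidden, output))
```

Both `two_layer` and `one_layer` hold the same value, because a weighted sum of weighted sums is itself just a weighted sum of the original inputs. Putting a sigmoid between the two layers breaks this equivalence, which is the whole point of the activation function.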

So now I hope you have a reasonable understanding of how a regression neural network goes about determining what its prediction should be, given preset values for the weights and the input features. As mentioned earlier, this is known as forward propagation. The actual “learning” bit comes from adjusting the values of the weights so that the features start pointing to correct outputs. This process is known as backpropagation. It deals with questions like: How do you know which weight(s) to change and by how much? Buckle up, because this is about to get very involved and mathematical, but also deeply interesting! Really do make sure you have a solid understanding of what was covered up till now, before you move on to part 2 of this mini series.

Until next time.