Neural Networks Part 1: Terminology, Motivation, and Intuition

Ali H Khanafer
Published in Geek Culture · May 25, 2021

This is part four of a series I’m working on, in which we discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find links to all the previous pieces in the series. I suggest you read them in sequence, simply because they introduce concepts that are key to understanding neural networks, and I’ll refer back to them on numerous occasions.

In this article, we go through the theory behind neural networks, introduce the motivation for such a model, and define some key terms. In the next part, we’ll look more closely at gradient descent and the backpropagation algorithm.

Let’s get right into it.

Neural Networks

Have you ever wondered how your brain functions so seamlessly and efficiently? How all the physics behind your body movements is instantaneously calculated, without even breaking a sweat? No? Me neither. The point is, it’s impressive. Very impressive. So impressive, that humans have spent countless hours trying to understand and mimic it. And that’s how neural networks came about.

Your brain is filled with a bunch of neurons. Every neuron receives an electric pulse as input and outputs a response that can be received by another neuron for further processing. Neural networks aim to copy this behavior. Inputs to the network will go through numerous layers, each outputting new information to be received by the next layer until we reach a conclusion (prediction).

Motivation

We aim here to answer the following question: Why do we need neural networks?

We’ve seen the following equation on numerous occasions:

h_Theta(x) = Theta_0*x_0 + Theta_1*x_1 + ... + Theta_j*x_j

Equation 1: Multivariate Linear Regression

Where {x_0, x_1, ..., x_j} represent our features and {Theta_0, Theta_1, ..., Theta_j} are the weights that minimize our cost function. When discussing linear regression, we said that this equation works well when the features and the target are linearly related. For one-class logistic regression, we had to squash the value outputted by h so that it lands between zero and one. What happens, then, if we have a problem where the relationship between the dependent and independent variables isn’t linear?
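As a quick refresher, here’s a minimal sketch of the two hypotheses above; the function and variable names are my own, and the numbers are made up:

```python
import numpy as np

def h_linear(theta, x):
    # Equation 1: Theta_0*x_0 + Theta_1*x_1 + ... + Theta_j*x_j
    return theta @ x

def h_logistic(theta, x):
    # Logistic regression: the same weighted sum, squashed into (0, 1)
    return 1.0 / (1.0 + np.exp(-h_linear(theta, x)))

theta = np.array([0.5, -1.2, 2.0])  # weights {Theta_0, Theta_1, Theta_2}
x = np.array([1.0, 0.3, 0.8])       # features, with x_0 = 1 as the bias term

print(h_linear(theta, x))    # an unbounded real value
print(h_logistic(theta, x))  # a value between zero and one
```

Either way, the best these hypotheses can do in the original features is a straight line (or plane). As an example of data where that isn’t good enough, consider the following graph: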

Figure 1: Non-Linear Graph

Can we use equation 1, in the exact form shown above, to fit a line through this data? Technically, yes. Will it produce valuable results? Clearly not. We haven’t said whether this is a classification or a regression problem, but it doesn’t matter: a straight line won’t fit this dataset properly as a regression, and it won’t separate the data points into one group of reds and another of blacks as a classifier. So what solutions do we have?

One solution is to increase the order of equation 1. Here are some examples:

Equation 2: Equation To The Second Order With Two Features
Equation 3: Equation To The Third Order with Three Features
Equation 4: Equation To The Second Order with Four Features

As we increase the number of features and their orders, we can begin to draw some pretty creative lines and, hence, get a better description of our dataset.

Imagine, however, if we had a problem with hundreds of features, and we wanted to include all the second-order terms in our equation. You can imagine how large our equation, as well as our feature space, would become. In fact, the number of features grows at a rate of O(n^2), where n is the number of original features. The same analysis applies to an equation of cubic order, like equation 3: in that scenario, the feature space grows at a rate of O(n^3).
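To get a feel for that growth, here’s a small sketch (the helper function and numbers are mine, purely for illustration) that counts how many distinct terms of a given order can be built from n features:

```python
from math import comb

def num_terms(n_features, order):
    # Number of distinct products of the given order built from n features,
    # allowing repeats (e.g. x1*x2, x1^2, x1*x2*x3, ...): C(n + k - 1, k)
    return comb(n_features + order - 1, order)

for n in (10, 100, 1000):
    print(n, num_terms(n, 2), num_terms(n, 3))
# 10   ->       55         220
# 100  ->     5050      171700
# 1000 ->   500500   167167000
```

The second-order count grows roughly like n^2/2 and the third-order count like n^3/6, which is exactly the O(n^2) and O(n^3) behaviour described above.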

Another, much more efficient, solution is to use neural networks.

Network Representation

Let’s look at how we represent neural networks before we dive into the theory. Consider the figure below:

Figure 2: Basic Neural Network Representation

Let’s look in detail at the different parts:

  1. Activation nodes: The circles are referred to as activation nodes. Apart from those in the input and output layers, the goal of an activation node is to run some sort of computation on the inputs it receives and send its output on as input to the activation nodes in the next layer.
  2. Input layer: The inputs to your neural network. These are equivalent to the features fed, for example, into equation 1. The input layer is normally referred to as the 0th layer, the layer after it as the 1st layer, and so on.
  3. Output layer: The final result. Notice that h does not necessarily have to be equivalent to what we have in equation 1; depending on the model you’re using (logistic regression, linear regression, etc.), h will be different. Also note that the output layer can have more than one activation node. In a multi-class classification problem, for example, we have one activation node per class in the output layer.
  4. Hidden layers: All layers in between the input and output layers are hidden layers. Note that there can be more than one hidden layer. The indices used represent the activation node number as well as the layer number; for example, a_11 is the first activation node in the first layer.
  5. Theta: Each Theta is individually represented by an arrow, and the vector of thetas used at an activation node is the one formed by the thetas of all the arrows pointing to that node. The outputs of one layer’s activation nodes, weighted by these thetas, become the inputs to the next layer. This will become clearer once we start looking at the intuition behind neural networks, and in the sketch right after this list.
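To make Figure 2 concrete, here’s a minimal forward-pass sketch, assuming one hidden layer with three activation nodes and a single output node; the layer sizes, weight values, and function names are made up for illustration, not taken from the figure:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1), as in logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """One pass through the network: input layer -> hidden layer -> output layer."""
    a0 = np.insert(x, 0, 1.0)   # input layer, with the bias term x_0 = 1 prepended
    a1 = sigmoid(Theta1 @ a0)   # activations of the hidden (1st) layer
    a1 = np.insert(a1, 0, 1.0)  # prepend the bias term for the next layer
    h = sigmoid(Theta2 @ a1)    # output layer: the prediction h
    return h

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])      # three input features x_1, x_2, x_3
Theta1 = rng.normal(size=(3, 4))   # 3 hidden nodes x (3 features + bias), randomly initialized
Theta2 = rng.normal(size=(1, 4))   # 1 output node x (3 hidden nodes + bias), randomly initialized
print(forward(x, Theta1, Theta2))  # a prediction between 0 and 1
```

Each row of Theta1 holds the thetas on the arrows pointing into one hidden activation node, which is exactly the vector described in point 5 above.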

Intuition

Once you understand the basic representation of a neural network, understanding how it works isn’t really that difficult. The complication lies mostly in how gradient descent works, which we’ll see in the next part.

In part 3 of this series, we saw the equation used for logistic regression:

h_Theta(x) = 1 / (1 + e^(-Theta^T x))

Equation 5: Equation For Logistic Regression

This is what’s being calculated at each activation node in the hidden layers. The only difference is that the inputs to an activation node come from the layer before it: the raw features for the first hidden layer, and the activations of the previous layer after that, each multiplied by its own Theta. Here’s an example of the equation used at a_11:

a_11 = f(Theta_0*x_0 + Theta_1*x_1 + Theta_2*x_2 + Theta_3*x_3)

Equation 6: a_11

Where f is the logistic function shown in equation 5, {Theta_0, Theta_1, Theta_2, Theta_3} are (initially) randomly selected, and {x_1, x_2, x_3} are the inputs. Notice that the figure above doesn’t show Theta_0 and x_0; these are normally assumed to be included. Theta_0 is randomly chosen, while x_0 is always set to one (it’s referred to as the bias term).
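Putting equation 6 into code, a single activation node like a_11 could be computed as in the following sketch; the theta values are arbitrary stand-ins for the random initialization:

```python
import numpy as np

def f(z):
    # Equation 5: the logistic function, squashing any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.1, -0.4, 0.9, 0.3])  # {Theta_0, Theta_1, Theta_2, Theta_3}, randomly chosen
x = np.array([1.0, 0.5, -1.2, 2.0])      # {x_0 = 1 (bias term), x_1, x_2, x_3}

a_11 = f(theta @ x)  # equation 6: the weighted sum of the inputs, passed through f
print(a_11)
```

Every other activation node in the network does exactly the same thing, just with its own thetas and its own inputs.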

As usual, all that’s left to do now is find the Theta vector that’ll minimize our model’s cost function. How? You guessed it, gradient descent.

Conclusion

In this article, we introduced the basics of neural networks. We saw the inefficiencies of using simpler models for problems with large feature spaces and how we can use a neural network to solve such issues. Finally, we looked at the theory behind neural networks, as well as some basic terminology.

In the next article, we’ll discuss, in detail, how gradient descent and the backpropagation algorithm are used to find the best Theta vector for our network. Until then, I leave you with the following points to think about:

  • Why do we need a new cost function for neural networks?
  • We saw that the feature space for linear and logistic regression can grow at a rate of O(n^2) or O(n^3). How much better do neural networks perform?
  • How do we choose the number of hidden layers we need for our problem? How do we choose the number of activation nodes in our input layer? How do we choose the number of activation nodes in our output layer?

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation

References

  1. Andrew Ng’s Machine Learning Coursera Course
