Neural Networks — the Basics

Yaman Habip
Published in Analytics Vidhya · Nov 16, 2020

What if we used 100% of the brain? Or better yet: what if we could teach computers to learn like our brains? This is the fundamental concept behind neural networks (NNs), a crucial subset of machine learning (ML) and artificial intelligence (AI) that emulates the human brain. In this article, I’ll explain the how and why behind neural networks and look at some specific applications.

Defining Terms

To discuss the very basics of neural networks, we have to define some very basic terms. I’ll explain more complex vocabulary as it becomes useful.

  • A neuron is one of the little circles in the network above. Each neuron contains some numerical value. Neurons look and behave a lot like nodes in graph theory.
  • An input neuron is a neuron whose value is set by the user; these are the ones highlighted in blue in the initial image. Input neurons start the network off.
  • An output neuron is highlighted in green in the above diagram. Output neurons answer the questions we asked our neural network (e.g. which team will win a game, whether it will be rainy, sunny, or snowy, how a stock will be priced tomorrow). Networks can have multiple outputs; for example, one could describe the expected temperature and another could describe the expected humidity.
  • A layer is a vertical stack of neurons. There can be an input layer of input neurons, an output layer of output neurons, or hidden layers filled with hidden neurons, which we will discuss more later.
  • A weight is one of the lines that connects all the neurons, like edges in graph theory. Weights also each have numerical values. Every neuron has a weight between itself and every neuron in the layers to its left and right.
  • Normalization is the process of taking any numerical value and mapping it to a value between zero and one. One way to do so is to plug the value into a sigmoid function, also called a logistic curve (see the short sketch after this list). Normalization is important when we have a wide range of interacting values, because we don’t want one particularly large neuron to determine a disproportionate amount of the output.
  • An activation function determines how much of a neuron’s value is used in a calculation. There are several such functions; I’ll discuss them later.
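To make normalization concrete, here is a minimal Python sketch of the sigmoid function described above; the sample values are just for illustration:

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(-4))  # ~0.02: large negative values land near zero
print(sigmoid(0))   # 0.5: zero maps to the midpoint
print(sigmoid(4))   # ~0.98: large positive values land near one
```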

Two-layer Networks

Now that we’ve got the basic definitions, let’s put them to good use with our first neural network. This one will have two layers: an input layer and an output layer. For simplicity, the input layer will have three neurons, each of which connects to the single neuron in the output layer via a weight. Our network will take the following steps to generate a useful output (a short code sketch follows the steps):

Three inputs become an output in a two-layer network

1. Get the three inputs from the user.

2. Multiply all three by the weight that connects them to the output.

3. Add the resulting products together.

4. Normalize the sum and you have an output.
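Here is a minimal Python sketch of those four steps for the three-input, one-output network described above; the input and weight values are made up for illustration:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def two_layer_network(inputs, weights):
    """Steps 2-4: multiply each input by its weight, add the products, normalize."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return sigmoid(weighted_sum)

inputs = [0.9, 0.4, 0.7]    # step 1: three inputs from the user (illustrative values)
weights = [0.5, -0.2, 0.3]  # illustrative weights
print(two_layer_network(inputs, weights))  # a single output between zero and one
```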

The above is really abstract, so let’s look at a more specific example: trying to figure out whether a high school student is going to pass a class. We can take the percentage of classes they attended, their GPA in all other classes, and how much homework they completed as input. Then we follow steps two through four, and if the normalized output is above some threshold (0.5 probably makes sense here, since we are looking at a binary outcome), we predict that the student will pass the class. Otherwise, our neural network thinks they will fail.
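In code, that prediction might look something like the sketch below; the student’s numbers and the weights are hypothetical:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical inputs: attendance rate, GPA scaled to 0-1, homework completion rate.
student = [0.85, 0.75, 0.60]
weights = [1.2, 0.8, 1.0]  # hypothetical weights the network might have learned

output = sigmoid(sum(i * w for i, w in zip(student, weights)))
print("pass" if output > 0.5 else "fail")  # threshold the normalized output at 0.5
```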

A couple of key ideas emerge from these four steps and the example:

  • First, notice how there was no activation function used. That’s because we don’t have any hidden layers yet, but we’ll add those soon.
  • Second, neural networks are just really big functions. That’s much easier to see on such a small scale, given that we literally took six numbers (three inputs, three weights), did some basic arithmetic, and normalized the output, but it’s useful to keep in mind as the networks grow more complex.

Hide and Seek – Hidden Layers

This network has blue and green hidden layers

I’ve mentioned hidden layers a couple times now, and I think it’s a good time to provide a definition: A hidden layer is a stack of neurons of any size in between the input layer and the output layer. It’s hidden from the user because the user only gives input and receives output but doesn’t interact with the hidden layer at all. There can be just one hidden layer, for a total of three layers, or there can be several hidden layers, each coming one after the other. Like all neurons, those in a hidden layer have weights pointing to them from all the neurons in the previous layer, and weights pointing from them to each of the neurons in the layer to the right.

Networks with hidden layers work the same way as those with only two layers. Each neuron in a hidden layer gets its value by multiplying all the neurons in the previous layer by their respective weights and adding the products together. This process repeats for every layer, so in a network with two hidden layers, for example, the first hidden layer would determine its values from the input layer, the second hidden layer would get its values from the first, and finally the output’s values would be determined from the second hidden layer.
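A hidden layer just repeats the same multiply-and-add step, which is easy to see in a small NumPy sketch; the layer sizes and weight values below are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.9, 0.4, 0.7])                  # 3 input neurons
w_hidden = np.array([[0.2, -0.5, 0.1, 0.4],
                     [0.7, 0.3, -0.2, 0.1],
                     [-0.1, 0.6, 0.5, -0.3]])       # weights: input layer -> 4 hidden neurons
w_output = np.array([[0.3], [-0.6], [0.8], [0.2]])  # weights: hidden layer -> 1 output neuron

hidden = inputs @ w_hidden           # each hidden value is a weighted sum of the previous layer
output = sigmoid(hidden @ w_output)  # the output is a weighted sum of the hidden values, normalized
print(output)
```

In practice an activation function (covered in the next section) is also applied to each hidden value.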

A two-layer network would have a cakewalk with this set

So why have these 007-like layers? The answer lies in the functional definition of a neural network. When a network has only an input layer and an output layer, it can only draw a linear decision boundary between two possible outputs, because the formula for the output is the linear function input1 * weight1 + input2 * weight2 + input3 * weight3, and so on. A two-layer neural network would be fine separating blue from orange in the data on the left.

But would be totally lost here:

A two-layer network would have a tough time here

Fortunately, with hidden layers (combined with the nonlinear activation functions discussed next), the function can be much more complex because many more weights and operations are added, and thus the decision boundary can take all sorts of different shapes, like the circle above.

Activation Functions

Hidden layers have a number of benefits, but one key drawback is that with every hidden layer come more and more neurons swirling around our computer’s proverbial head, and sometimes it’s hard to know which data is relevant and how much weight to give it. Activation functions help with that by transforming each neuron’s value in the hidden layers, changing how much it contributes relative to other neurons. You may also recall that we perform a sort of activation function on our output layer by normalizing it with a logistic curve. Activation functions are an important component of neural networks, so it’s valuable to know some of the common ones featured below.

Linear

Linear activation functions are very simple: they take the value of a neuron and keep it the same, sometimes multiplying the value by a ‘slope,’ just like a linear function in graphing.

ReLU

ReLU, or rectified linear unit, is exactly the same as the linear function, except any negative value is mapped to zero instead.

Leaky ReLU

Leaky ReLU uses a linear activation function for values above zero, much like ReLU, but unlike ReLU it also uses a linear function for values below zero. The slope for values below zero must be smaller than the slope for values above zero.

Sigmoid

As we’ve already covered, the sigmoid function turns everything into a value between zero and one; the further a value gets from zero, the closer its output gets to one of those extremes (near one for large positive values, near zero for large negative values).
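All four activation functions fit in a few lines of Python; the leaky-ReLU slope of 0.01 below is just a common choice, not a fixed rule:

```python
import math

def linear(x, slope=1.0):
    return slope * x                   # keep the value, optionally scaled

def relu(x):
    return max(0.0, x)                 # negative values become zero

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x   # small slope below zero instead of a flat zero

def sigmoid(x):
    return 1 / (1 + math.exp(-x))      # squash everything into (0, 1)

for f in (linear, relu, leaky_relu, sigmoid):
    print(f.__name__, f(-2.0), f(2.0))
```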

The Big Question

Right now, one big question is probably etching itself into your mind: how do we decide the values of the weights? (Remember, the weights are how we go from one layer to the next.) The process for setting weights is actually quite simple: guess and check. We take a set of data, called the training set (for example, attendance’s relationship to passing or failing), and guess different weights until one set of weights can accurately tell us what we want to know within a certain tolerance.

To illustrate, the tolerance could be 5%, meaning that 95% accuracy is good enough, or 1%, meaning that we demand 99% accuracy. Each guess is adjusted based on the success of the previous one through an algorithm called gradient descent. Let’s now look at a more formal, step-by-step description of how we can set the weights (a toy code sketch follows the steps).

1. Pick random weights.

2. Use them on a piece of data from our training set of data.

3. Plug the results into a gradient descent function layer by layer, starting with the output layer and moving backwards. The formula for gradient descent is complicated and not worth memorizing, but the basic idea is simple: if our estimate was too high, decrease the weights; if it was too low, increase them. This process is called backpropagation, because we move our adjustments backwards through the network.

4. Repeat the above process until our network is accurate within the bounds of the tolerance.
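Here is a toy version of that guess-and-check loop for the earlier pass/fail network; the training data, learning rate, and number of repetitions are all illustrative, and real libraries handle the gradient formula for you:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Made-up training set: (attendance, GPA, homework completion) -> passed (1) or failed (0).
training_set = [([0.9, 0.8, 0.9], 1), ([0.3, 0.5, 0.2], 0),
                ([0.7, 0.6, 0.8], 1), ([0.4, 0.4, 0.3], 0)]

weights = [0.0, 0.0, 0.0]  # step 1: starting weights (random values in practice)
learning_rate = 0.5

for _ in range(1000):                    # step 4: repeat many times
    for inputs, target in training_set:  # step 2: use each piece of training data
        prediction = sigmoid(sum(i * w for i, w in zip(inputs, weights)))
        error = prediction - target      # positive if we guessed too high, negative if too low
        # step 3: decrease the weights when the estimate was too high, increase when too low
        weights = [w - learning_rate * error * i for w, i in zip(weights, inputs)]

print(weights)
```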

Through this process, we can teach multi-layer neural networks to make very accurate predictions using massive amounts of data in which no human could ever identify a pattern.

Avoiding Overfitting

Overfitting is one of the great pitfalls of machine learning and arises when our AI learns too well. It becomes so good at getting the result it wants from the specific set of training data that it can no longer apply itself to the general case. It’s kind of like if a quarterback got so good at throwing a deep ball that he lost the ability to connect on shorter passes. Let’s look at a couple methods for correcting overfitting:

  • Use more data: the more holistic a data set is, the more difficult it is to overfit.
  • Dropout: one of the most popular methods to stop overfitting. Dropout periodically removes random neurons from the network during training, forcing the network to adjust to learning with different internal structures and become better equipped to handle any input (see the sketch just after this list).
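A rough sketch of what dropout does to one hidden layer during training is shown below; the 50% drop rate is just an example, and the rescaling keeps the surviving values at the right overall scale:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_values = np.array([0.7, 0.2, 0.9, 0.4, 0.6])  # values of one hidden layer

keep_probability = 0.5                                # example rate
mask = rng.random(hidden_values.shape) < keep_probability
dropped = hidden_values * mask / keep_probability     # zero out some neurons, rescale the rest
print(dropped)
```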

Special Networks and Applications

Classification

Neural networks can classify items based on characteristics of those items given as input. Classification is the most popular application of NNs. We’ve already looked at classifying the weather as rainy or sunny, but this is just one of many examples. Other real-world applications include figuring out whether bank notes are counterfeit and deciding the optimal price for a product.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are especially useful for image recognition. Researchers are using them to build self-driving cars that understand their environment. CNNs work by taking the RGB numbers of each pixel of an image as input. They then apply a filter to the image at each layer. This filter looks like an m-by-n matrix and works by multiplying each m-by-n rectangle of pixels by the corresponding numbers in the matrix and adding the results together, creating a new, smaller set of values on which the network performs its next operations. Each filter allows the CNN to ‘learn’ something different about images. For example, some filters isolate edges between different objects in the image, while others remove colors and just look at the shape of the image’s contents.
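The filtering step itself is just the familiar multiply-and-add, slid across the image. Below is a minimal sketch for a single color channel; the 5-by-5 image and the vertical-edge-style filter are illustrative:

```python
import numpy as np

def convolve(image, kernel):
    """Apply an m-by-n filter to every m-by-n patch of a 2D image (no padding)."""
    m, n = kernel.shape
    rows = image.shape[0] - m + 1
    cols = image.shape[1] - n + 1
    output = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + m, c:c + n]
            output[r, c] = np.sum(patch * kernel)  # multiply each pixel by the filter and add
    return output

image = np.arange(25, dtype=float).reshape(5, 5)  # stand-in for one color channel
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])        # a simple vertical-edge style filter
print(convolve(image, edge_filter))               # a new, smaller 3-by-3 grid of values
```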

Key Takeaways

  • Neural networks continually adjust their weights, which control how values pass between neurons, to accurately predict an outcome based on some input.
  • Weights are adjusted through gradient descent.
  • Activation functions determine how much a given neuron’s value contributes to the next calculation.
  • Networks have several applications, such as image recognition.

Further Reading

If you want to learn more, there are loads of resources online. I would specifically recommend edx’s CS50: Introduction to AI course. If you want to follow up with me, have a conversation, or have further questions, here is my contact info:

Email | LinkedIn | GitHub

And please subscribe to my newsletter!
