Image Source: Alina Grubnyak (https://unsplash.com/photos/ZiQkhI7417A)

How do Neural Networks REALLY Work?

Christopher Landschoot
7 min read · Mar 18, 2023


The world seems to have fallen in love with neural networks in recent years. Who can blame us? They're an incredible technological innovation that lived in theory books for years before computing power caught up to make them practical.

“Machine learning” and “neural networks” have become buzzwords at many corporations, to the point where even some tech-averse people have heard of them. And while these words prop up massive marketing campaigns and stoke panic about the Singularity, the algorithms themselves are more than tech bro jargon. What many see as a magic black box that spits out answers to problems actually has its roots in fairly straightforward logic and math.

Of course, most techies would tell you a neural net is supposed to simulate the processes of the human brain, with neurons organized into layers passing information to one another. While this analogy may be somewhat true at a high level, anything beyond this generalization would be a stretch. On top of this misconception, many people who use this incredible technology don't actually know what's happening under the hood. While I'm not going to build another ChatGPT in this piece (your secret is safe with me, Sam Altman), I hope to illuminate some of the basics of how neural networks actually work.


I’d like to share a basic example of a Multi-Layer Perceptron (MLP) Neural Network that I found while watching the wonderful Valerio Velardo’s Sound of AI YouTube channel. The name may sound scary, but it’s really not that bad once it’s broken down. I chose this example because an MLP can be a very simple feedforward neural network built from “dense” (fully connected) layers; “feedforward” basically means that data flows through the network in only one direction: from input to output.

Supporting code that implements the concepts of this piece can be found here. The code is split into two examples:

  1. How to build a simple MLP neural network from scratch.
  2. How to build a simple MLP neural network with TensorFlow (Keras).

Neural Networks (Generally…)

Let’s begin by conceptualizing a few key things. Neural networks are indeed made up of a bunch of interconnected neurons (aka nodes). At its base level, a node can be defined as something that receives data, performs a computation on the data, and then sends along the processed data. You can think of it like a cashier receiving a payment, subtracting the price from the cash, and returning the change to the customer.

Cash Register

This action from a single cashier doesn’t seem too special (sorry cashiers). But if you chain together an entire city of cashiers, you have a massive flow of commerce that one person could never fully comprehend. The real power lies in the number of these cashiers and how money flows through the city between them. The same goes for nodes.

These nodes (our cashiers) are organized into various layers which define how they are connected with one another. The image below provides a good visualization of this process. In our example, data will follow the flow of the arrows in the diagram, traveling from left to right, passing through each layer. Each node will receive the data, perform a calculation, and then send it along.

Neural Network
Image Source: https://www.geeksforgeeks.org/multi-layer-perceptron-learning-in-tensorflow/

There is one final piece to the puzzle: the connections between nodes. Not all connections are created equal; some are stronger than others. This variation in connection strength is the secret sauce that allows a neural network to learn complex trends from data.

Now that we have some basic concepts, let’s dive into the technical details.

Multi-Layer Perceptron (MLP) Neural Network

An MLP consists of three types of layers: the input layer, the hidden layer(s), and the output layer. The input layer receives the raw data, while the output layer produces the final prediction. An arbitrary number of hidden layers process the information and are responsible for the bulk of the network’s computation. Adjusting the number of hidden layers and nodes at each layer is an important means of tuning a neural network.

Each node in the MLP is connected to nodes in the previous layer and the next layer. The strength of these connections is represented by weights, which are automatically adjusted as the model trains and learns to optimize the network’s performance.
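To make this structure concrete, here is a minimal sketch of a three-layer MLP using TensorFlow's Keras API, in the spirit of the second linked code example. The layer sizes and activations here are arbitrary choices of mine for illustration, not values from that code:

    import tensorflow as tf

    # A minimal MLP: 2 inputs -> one hidden layer of 5 nodes -> 1 output.
    # Each Dense layer holds a weight for every connection to the previous layer.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),                      # input layer
        tf.keras.layers.Dense(5, activation="sigmoid"),  # hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
    ])
    model.summary()  # prints each layer along with its trainable weights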

Let’s step through how data flows through the network, beginning with forward propagation.

Forward Propagation

The process of forward propagation involves passing the input data through the network and calculating the output. As data is transferred from one node to the next, it is multiplied by the weight of the connection between the nodes. Each node in the hidden layer(s) then sums the weighted values it receives from the previous layer and applies an activation function to produce its output. This output is passed on to the next layer, and the process repeats until the data reaches the output layer, where the final predictions are made.
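In code, a from-scratch forward pass is just repeated matrix multiplication plus an activation. Here is a minimal NumPy sketch, assuming a sigmoid activation (one common choice) and using my own function names:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward_propagate(inputs, weights):
        # Pass the inputs through each layer: weigh, sum, activate, repeat.
        activations = inputs
        for w in weights:
            net_inputs = np.dot(activations, w)  # multiply by connection weights and sum
            activations = sigmoid(net_inputs)    # apply the activation function
        return activations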

Backpropagation

So now that we have made a prediction, we’ve made a neural network and we’re done, right? Not quite. The purpose of a neural network, and machine learning more generally, is for the model to update itself (or “learn”) to get better at predicting the correct answer.

Because we have predictions, we can compare them to the data that we are using to train the model to determine how far off we were from the actual answer, also known as the ground truth. Then we need to use this information to adjust the weights between each node through a process known as backpropagation.

In technical terms, this is done using the chain rule of calculus, taking the derivative of the activation function, which allows us to calculate the gradient of the error with respect to each weight.
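For instance, the sigmoid activation used in the sketch above has a conveniently simple derivative, which makes the chain-rule step easy to write down. A rough illustration (again, the function names are mine, not from the linked code):

    import numpy as np

    def sigmoid_derivative(activation):
        # The sigmoid's derivative, written in terms of the sigmoid's own output.
        return activation * (1.0 - activation)

    def layer_gradient(layer_inputs, layer_output, error_signal):
        # Chain rule for one layer: how the error changes with each of its weights.
        delta = error_signal * sigmoid_derivative(layer_output)
        return np.dot(layer_inputs.T, delta)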

Gradient Descent with Mean Squared Error (MSE)

In order to adjust the weights to minimize the error, we use an optimization algorithm called gradient descent. Gradient descent works by iteratively adjusting the weights in the direction of the steepest descent of the error function.
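The adjustment itself is essentially a one-line update: nudge each weight a small step against its gradient. A sketch, with an arbitrary learning rate:

    def gradient_descent_step(weights, gradients, learning_rate=0.1):
        # Nudge each weight matrix a small step against its error gradient.
        return [w - learning_rate * g for w, g in zip(weights, gradients)]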

Gradient Descent
Image Source: https://www.javatpoint.com/gradient-descent-in-machine-learning

Breaking down the terminology: gradient = slope, and descent = downward. So in simpler terms, gradient descent translates roughly to “downward slope”.

Basically, you can think of it like a ball rolling down a hill toward the bottom of a valley. The ball is our current set of weights, and each point on the hill represents a different set of weights for all of the node connections. The bottom of the valley is our lowest error, aka our best-performing model. Our goal is for the ball to reach the bottom of the valley, so at each step (or iteration) we check which direction slopes downhill and adjust all of the weights accordingly, until we arrive at our optimal model.

Many different error metrics can be used with gradient descent depending on the problem at hand. For example, the mean squared error (MSE) is a commonly used metric for regression (i.e. number prediction) problems.
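MSE is just the average of the squared differences between the predictions and the ground truth. As a one-function sketch:

    import numpy as np

    def mean_squared_error(y_true, y_pred):
        # Average of the squared differences between ground truth and predictions.
        return np.mean((y_true - y_pred) ** 2)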

Training the Model

Now that we have our process set up, we can begin training our model. We start by initializing the weights randomly. We then pass the input data through the network using forward propagation and calculate the error of the output using a metric such as MSE. We then use backpropagation to calculate the gradients of the error with respect to each weight and adjust the weights using gradient descent.
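Putting the earlier pieces together, a toy end-to-end training loop might look like the following. This is a schematic sketch, not the linked code: the dataset is invented, and the 2-5-1 architecture, learning rate, and epoch count are all arbitrary.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Invented toy data: learn y = x1 + x2 (targets kept inside sigmoid's 0-1 range).
    rng = np.random.default_rng(0)
    X = rng.random((200, 2)) * 0.5
    y = X.sum(axis=1, keepdims=True)

    # Randomly initialize the weights of a 2 -> 5 -> 1 network.
    w1 = rng.standard_normal((2, 5))
    w2 = rng.standard_normal((5, 1))

    learning_rate = 0.5
    for epoch in range(5000):
        # Forward propagation.
        hidden = sigmoid(X @ w1)
        output = sigmoid(hidden @ w2)

        # Error, then backpropagation via the chain rule through each sigmoid.
        error = output - y
        delta2 = error * output * (1 - output)
        delta1 = (delta2 @ w2.T) * hidden * (1 - hidden)

        # Gradient descent: step each weight matrix downhill.
        w2 -= learning_rate * hidden.T @ delta2 / len(X)
        w1 -= learning_rate * X.T @ delta1 / len(X)

        if epoch % 1000 == 0:
            print(f"epoch {epoch}: mse = {np.mean(error ** 2):.5f}")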

This process is repeated for a fixed number of iterations, or until the error is minimized to a satisfactory level. The number of nodes within each hidden layer and the number of hidden layers can be tuned to achieve optimal performance for a specific problem. However, increasing the number of hidden layers or the size of the hidden layers can lead to overfitting, which essentially means that the model is memorizing the training data rather than learning the underlying patterns in the data.

To prevent overfitting, techniques such as regularization, early stopping, and dropout can be applied. Regularization adds a penalty term to the error function to discourage large weights. Early stopping halts training when the error on a validation dataset starts to increase. Dropout randomly ignores (“drops out”) a fraction of nodes during training so that nodes don't rely too heavily on one another.
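In Keras, for example, dropout is just another layer and early stopping is a callback. A minimal sketch (the layer sizes, dropout rate, and patience are arbitrary, and X_train/y_train are hypothetical names for training data):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.2),  # randomly silences 20% of nodes while training
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Halt training once validation error stops improving for 5 straight epochs.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
    # model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])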

Once the model is trained, it can be used to make predictions on new, unseen data. The new input data is passed through the network using forward propagation, and the output is calculated. The prediction is then compared to the true output to evaluate the performance of the model.
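With a trained Keras model like the sketch above, that final forward pass is a single call (new_X here is a hypothetical stand-in for unseen data):

    import numpy as np

    new_X = np.array([[0.3, 0.1]])      # hypothetical unseen sample
    predictions = model.predict(new_X)  # forward propagation only; the weights stay fixed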

Recap

In summary, neural networks are powerful machine learning algorithms that can be used for a wide range of tasks, such as image recognition, natural language processing, and time series forecasting. Various other forms of neural networks, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), can be leveraged for better performance regarding certain tasks.

The processes of forward propagation and backpropagation, combined with gradient descent on the error, allow neural networks to learn from data and adjust their weights to optimize performance. While many aspects of neural networks can vary and become far more complicated, all neural networks are built upon this same set of basic processes.

Understanding the underlying concepts of any powerful tool, rather than just memorizing syntax, can allow us to maximally leverage and optimize these technologies to solve our problems.


Christopher Landschoot

I am an audio machine learning engineer and researcher with a focus on generative AI and spatial audio as well as a lifelong musician.