In-Depth Machine Learning for Teens: Neural Networks

Endothermic Dragon
13 min read · Aug 21, 2022


A neural network is a concept that tries to mimic the brain. Without going too deep, let’s dive into the basics of how the brain (and the neurons that make it up) works.

Note From Author

As of now, this article is part of a series with 5 others. It is recommended that you read them in order, but feel free to skip any if you already know the material. Keep in mind that material from the previous articles may (and most probably will) be referenced at multiple points in this article — especially gradient descent, the keystone of all machine learning.


Survey

If you wouldn’t mind, please fill out this short survey before reading this article! It is optionally anonymous and would help me improve the quality of these articles.

Neural Networks Survey

Biological Neurons

Neurons are used throughout the human body to relay information. On a broad scale, humans typically have three classes of neurons — sensory neurons, interneurons, and motor neurons.

Sensory neurons sense what’s happening in and outside of the human body, and send signals to the central nervous system (CNS). Interneurons help with relaying information within the CNS. Finally, motor neurons receive signals and translate them into a “reaction” — such as smiling. To draw a few analogies, you can think of our input data as sensory neurons, the processing layers as interneurons, and the output as motor neurons.

Neurons work by transmitting electrical signals between themselves. Greatly simplified, they are “activated” at a certain threshold voltage, at which point they pass the electrical signal to other nearby neurons. Neural networks work in a similar fashion.

An artistic depiction of neurons and electrical messages. Image source

How do neural networks work?

You’ve probably seen the classic neural network diagram before:

But what does this actually mean? Let’s break it down step by step.

Input Layer

First, you have your input layer. This is where all of your input data starts. The most common way to “feed” this data into the neural network is as a large array, where each row is a single data point, and each column is a feature.
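As a minimal sketch (assuming NumPy, with made-up numbers purely for illustration), that input is just a 2-D array:

```python
import numpy as np

X = np.array([
    [5.1, 3.5, 1.4, 0.2],   # each row is a single data point
    [6.2, 2.9, 4.3, 1.3],
    [7.7, 3.0, 6.1, 2.3],
])                           # each column is a feature

print(X.shape)               # (3, 4): 3 data points, 4 features each
```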

Hidden Layers

As the name implies, these layers are “hidden” from view. This is where all the “black box” magic happens. Each connection from a node in one layer to a node in the next represents multiplication by an arbitrary but fixed value. Then, at each node, the incoming values are summed up, and a sigmoid is applied to transform the sum into a value between 0 and 1.

The sigmoid that’s applied is referred to as the “activation function”, as it either makes the neuron “active” (close to 1) or “inactive” (close to 0).

The multiplication values are what the training process updates. Each “multiplier” is a separate value of θ, and the values are put into matrices/arrays to deal with them quickly and easily. Each individual θ is called a weight.
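Here is a rough sketch of one layer’s computation, assuming NumPy and randomly initialized weights (the bias nodes discussed in the next section are left out for now):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical sizes: 4 input features, 3 hidden nodes.
rng = np.random.default_rng(0)
X = rng.random((5, 4))       # 5 data points, 4 features each
theta = rng.random((4, 3))   # one weight per connection (4 x 3 = 12 weights)

z = X @ theta                # multiply by the weights and sum at each node
a = sigmoid(z)               # squash each sum to a value between 0 and 1
print(a.shape)               # (5, 3): one activation per data point per hidden node
```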

Bias Terms

We’re back to bias terms, baby! At the top of each layer, there is an independent node of fixed value 1 that connects to the next layer, but not the previous one. This node is called the “bias node”, and collectively they’re called the “bias nodes”. These nodes can be thought of as the intercept term for the next layer, as they shift the sigmoid’s center over to a “decision boundary” where a shift towards 0 or 1 has greater significance.

As an analogy, you can think of this as the term which raises a neuron near its threshold potential (threshold voltage for activation in biological neurons). At this point, if the incoming inputs are large enough, then the neuron gets activated; otherwise, it remains inactive.

The bias term is a lot like a resting potential.
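A tiny illustration of the idea, with made-up numbers: adding a constant (the weight on the bias node) shifts where the sigmoid “decides” between active and inactive.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = 0.3                      # a hypothetical incoming value
w = 2.0                      # a hypothetical weight

print(sigmoid(w * x))        # ~0.65: without a bias, the node sits near the middle
print(sigmoid(w * x - 3.0))  # ~0.08: a bias of -3 keeps the node "inactive"
print(sigmoid(w * x + 3.0))  # ~0.97: a bias of +3 pushes it past its "threshold"
```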

In the end, nobody can really visualize how neural networks actually “work” since it simply multiplies by a bunch of values, performs a bunch of sigmoids, and outputs a prediction. This is why a neural network is often referred to as a “black box” — you put something in and get something out, but the inner mechanics are unknown (at least from a human perspective). However, in the end, it’s like a network of on and off switches that filters and propagates only the necessary information forward — much like how neurons in our body function.

Output Layer

The pattern continues! Once again, the last hidden layer is multiplied by a bunch of values, the values are summed up at each node, and a sigmoid is applied.

“Confidence”

If you research neural networks further, you might see models with “confidence scores”. These models typically use something called the softmax function instead of the sigmoid function in the last layer. This ensures that the output nodes sum up to 1, allowing the prediction to be expressed not just as a value but with a confidence level as well.

While this is a slightly advanced topic, something to note is that neural networks are notorious for being “overconfident” with their predictions. To fix this, researchers came up with something called “temperature scaling”, which essentially dials down the confidence levels of the neural net.

The “confidence” doesn’t always reflect the “accuracy” of the prediction, and many times neural networks tend to be overconfident. To fix this, data scientists use temperature scaling. Image source
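A hedged sketch of both ideas, softmax and temperature scaling, using a made-up set of output-layer values (a temperature above 1 flattens the distribution, dialing the confidence down):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits) / temperature
    z = z - z.max()              # subtract the max for numerical stability
    exps = np.exp(z)
    return exps / exps.sum()

scores = [2.0, 1.0, 0.1]         # hypothetical raw output-layer values
print(softmax(scores))           # ~[0.66, 0.24, 0.10], sums to 1
print(softmax(scores, 3.0))      # ~[0.45, 0.32, 0.24], "cooled down"
```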

Cost Function

We pretty much use a modified version of the logistic regression cost function. Instead of breaking down the equation symbol by symbol, let’s go over it conceptually.

  1. First, since we have multiple output nodes, we take the log loss at each node.
  2. Second, we take the sum of the log losses and add up the summed losses for all inputs.
  3. Finally, we add a regularization term to prevent overfitting. Note that there are other ways to avoid overfitting, but we will use L2 regularization. We will go over exactly how that works for neural networks in more detail in a later section.

For now, we’ll leave it at that as it’s a lot easier to understand and less intimidating than the equation.
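Putting those three steps into a rough sketch (treat this as illustrative, since the exact scaling constants and averaging conventions vary between sources, and the bias weights are assumed to already be excluded from the regularization):

```python
import numpy as np

def cost(Y, predictions, thetas, lam):
    """Y, predictions: arrays of shape (num data points, num output nodes).
    thetas: list of weight matrices, excluding the bias weights.
    lam: the regularization constant lambda."""
    eps = 1e-12                                           # avoid log(0)
    # Steps 1 & 2: log loss at each output node, summed over nodes and inputs
    log_loss = -np.sum(Y * np.log(predictions + eps)
                       + (1 - Y) * np.log(1 - predictions + eps))
    # Step 3: L2 regularization term
    reg = lam * sum(np.sum(theta ** 2) for theta in thetas)
    return log_loss + reg
```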

Regularization

When training a neural net, it’s important to have some sort of regularization to ensure that the multipliers don’t blow up all of a sudden and that the model does not overfit the data. This can be achieved in multiple ways, but the most basic and popular approaches are listed below.

Much like Bollywood police in Indian movies, regularization helps keep everything in order. Image source

L1

Take all the weights (except those that connect to bias nodes), take their absolute value and add them up, and multiply by a fixed constant lambda. Alter lambda until optimal regularization is achieved (in other words, the model doesn’t overfit).

Note that this regularization is not applied to the weights attached to the bias terms. This is because the bias term helps “shift” the decision boundary, and restricting such a shift makes it harder for nodes in the neural network to reach a point where values close to 0 or 1 have a significant difference in meaning. Of course, we cannot logically “extract” what this means in human language, but it makes sense to the neural network internally and makes the whole thing work.

L2

Take all the weights (except those that connect to bias nodes), square them and take their sum, multiply by a fixed constant lambda. Alter lambda until optimal regularization is achieved (in other words, the model doesn’t overfit).

This is typically regarded as better than L1 regularization, as it “punishes” the cost function much more for large values of θ (the penalty grows with the square of each weight), whereas L1’s penalty grows only in proportion to each weight.

This regularization is also not applied to the weights attached to the bias terms.
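Both penalties are only a line or two of code. A sketch, assuming the bias weights have already been left out of `thetas`:

```python
import numpy as np

def l1_penalty(thetas, lam):
    # sum of absolute values of all non-bias weights, scaled by lambda
    return lam * sum(np.sum(np.abs(theta)) for theta in thetas)

def l2_penalty(thetas, lam):
    # sum of squares of all non-bias weights, scaled by lambda
    return lam * sum(np.sum(theta ** 2) for theta in thetas)

rng = np.random.default_rng(1)
thetas = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
print(l1_penalty(thetas, lam=0.1), l2_penalty(thetas, lam=0.1))
```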

Dropout

Ignore each node in a layer with some probability during training. Intuitively, this means that each layer cannot fully rely on the previous layer to provide 100% accurate and consistent values, so it adapts to work more flexibly.

In the real world, dropout is the most common of the three because it prevents overfitting while still being robust.
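A minimal sketch of the idea (this is the common “inverted dropout” variant, where the surviving activations are scaled up so nothing changes at prediction time):

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Randomly zero out nodes during training."""
    if not training:
        return activations                   # no dropout at prediction time
    keep_prob = 1 - drop_prob
    mask = np.random.random(activations.shape) < keep_prob
    # scale the survivors so the expected total stays the same
    return activations * mask / keep_prob

a = np.array([0.2, 0.9, 0.7, 0.4, 0.8])
print(dropout(a, drop_prob=0.4))             # some entries become 0
```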

Updating Weights

Whether you have or haven’t taken calculus, you’re probably wondering — how the heck do you take the gradient of all the thetas? There are a lot of web-like formations, which makes going back layers increasingly complex and difficult to visualize. Luckily for you, some very big-brain researchers have already figured this out for you.

This is what those researchers look like in real life. Image source

We’ll be going over a simplified version of the explanation. First, you perform your forward pass as usual. Next, you perform a process termed backpropagation. It’s exactly what it sounds like — you calculate “error” vectors for each layer by traveling backward, and then relate them to the weights to calculate the partial derivatives of each θ.

Even the cha-cha dance can backpropagate! Image source

Backpropagation

Before doing anything, we first split the cost function into two parts — the regularization term and the modified log loss. We can easily deal with partial derivatives involving the L2 regularization term (simply 2λθ for any value of θ). Now, let’s deal with the rest of the cost function. We start this backtracking algorithm by finding the “error” for the output layer, before the sigmoid. We denote this with lowercase delta with index L, as such: δₗ (the l is the index of the last layer). Note that the procedure to do this first step will be in the lab, in case you don’t know calculus.

Now, remember how, to get from one layer to the next, we multiplied by some weights? We do the same thing, backward. We take the errors at each node, multiply them by the connecting weights, and sum them up at each of the previous nodes. After that, we “reverse” the sigmoid by multiplying the half-backpropagated errors with the sigmoid’s derivative. Note that since the derivative of σ(x) is σ(x)*(1-σ(x)), we don’t actually need to know the value of x; just σ(x) will suffice when calculating the derivative. This fact will help you code this algorithm efficiently, as in the end, you don’t need to “remember” the summed value of each node pre-sigmoid, just the values post-sigmoid.

Now, we have the error for that node in the previous layer. We repeat this process for all nodes in that layer, and all previous layers as well. Intuitively, you can think of this algorithm as distributing the errors by “weight” — higher values of theta get more of the error propagated backward.

After this first step, we calculate the partial derivatives of the weights. This is very simple (at least compared to the previous step). For any given weight, we isolate the nodes it connects. Then, we multiply the forward propagation output of the left node (after the sigmoid is applied) and the δ error of the right node.

Intuitively, you can think of this second step as judging how much of the error on the right was due to the node on the left and adjusting the weight accordingly. For example, if the left node was inactive (output closer to 0), then adjusting the weight wouldn’t do much good, and as such, it isn’t adjusted significantly. On the other hand, if the left node was active (output closer to 1) and there was a high level of error, then the weight is adjusted much more. Note that the direction of the adjustment (positive or negative) depends on the sign of the error.

Next, you have to repeat this entire process for all the sets of features. After this step, you should have a large list of “adjustments” that need to be made to each individual weight. Simply take the average of all the adjustments for all θ’s, and now you have the gradient of each individual weight. Plug that into the gradient descent equation and update the weights, and you’ve officially completed one iteration of the process.
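To make the whole loop concrete, here is a very small sketch of one iteration for a network with a single hidden layer. It follows the steps above, but to keep it short it leaves out the bias nodes and the regularization term, and all the sizes and data are made up:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((6, 4))                        # 6 data points, 4 features
Y = rng.integers(0, 2, (6, 2)).astype(float)  # 6 targets, 2 output nodes

W1 = rng.standard_normal((4, 5)) * 0.1        # input layer -> hidden layer (5 nodes)
W2 = rng.standard_normal((5, 2)) * 0.1        # hidden layer -> output layer (2 nodes)
alpha = 0.5                                   # learning rate for gradient descent

# Forward pass (the post-sigmoid values are all we need to remember)
a1 = sigmoid(X @ W1)                          # hidden-layer activations
a2 = sigmoid(a1 @ W2)                         # output-layer predictions

# Backpropagation
delta2 = a2 - Y                               # "error" at the output layer
# push the errors back through the weights, then "reverse" the sigmoid
delta1 = (delta2 @ W2.T) * a1 * (1 - a1)

# Partial derivatives: left node's output times right node's delta,
# averaged over all data points
grad_W2 = a1.T @ delta2 / len(X)
grad_W1 = X.T @ delta1 / len(X)

# Gradient descent update: one full iteration is now complete
W2 -= alpha * grad_W2
W1 -= alpha * grad_W1
```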

Whew! That was a lot of info. If you don’t get it the first time, that’s ok! When learning this for the first time, it took me three entire days just to get the process at an intuitive level, not to mention I didn’t even know calculus. Even when writing this, it took me a few hours to break down the calculus to a level that made sense.

By now, you can probably see how much a library like numpy helps you deal with large amounts of data effectively. If I had to write loops for each of these steps, not only would I go crazy, but my code would be a lot more prone to errors and would take forever to run.

As a connection back to biology, our brains do something similar. As we learn and process more information, certain neural pathways in our brain get strengthened, and others dissipate over time. In the end, both of these systems try to optimize the “output” as it gains more experience.

Note that the above methodology is based on this resource, and if you want to read more about it in detail, I would recommend taking a glance — but fair warning, it’s quite long and assumes an undergraduate audience at the very least, if not higher.

Extensions of a Neural Network

There are many extensions of neural networks. The two most common examples are CNNs and RNNs.

A convolutional neural network (CNN) is often used to deal with images, such as object detection, facial recognition, OCR, and more. ResNets and DenseNets are variations of this.

Conceptual diagram of a CNN

A recurrent neural network (RNN) is used to process variable-length data, such as audio files, voice-to-text, sentiment detection, and more. It consists of a single segment that essentially loops back onto itself. LSTMs are an extension of RNNs with better “memory carryover” of past inputs.

Conceptual diagram of an RNN

The Big Question — Why?

What is the point of using neural networks? If there’s a pattern in the data, can’t you find it yourself? The answer is yes, you can — but it’s going to be a heck of a lot more difficult to explicitly define your “logic” in ways that your computer can read. An example of this is looking at the image of a car.

Your computer just sees this as a bunch of numerical pixel values.

Of course, you can look at it and right away tell that it’s a car. But how can a computer do the same? The simple answer is that it just can’t. It’s not capable of “thinking” like humans or being conscious of itself and the world around it.

In addition, even if you were able to identify an explicit pattern for a computer to identify, it would take an extraordinary amount of code to write it down in such a way that it applies to any scenario.

But, by simulating how we “think”, the computer can also “learn” like humans. This means we don’t actually have to worry about the “problem space” (a term which is used to represent all possible combinations of the problem, aka inputs), and our computer can find the patterns for us.

As a side note, humans also require a few months for the biological neural net to train when they are born 😉.

To show how effective this is, let’s take an example. In the lab, you’ll be classifying 28*28 pixel black-and-white images, whose pixel values range from 0 to 255. The problem space’s size is exactly 256⁷⁸⁴, which is slightly larger than 10¹⁸⁸⁸. You don’t want to be dealing with such an enormous number of combinations, and optimizations won’t do you much good either at this scale (for scale, my desktop computer starts struggling when dealing with a billion data points, and it only gets worse from there).
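You can sanity-check that figure with a couple of lines of arithmetic:

```python
import math

pixels = 28 * 28                               # 784 pixels per image
values_per_pixel = 256                         # each pixel is 0-255

# 256^784 has too many digits to print comfortably, so look at its magnitude:
print(pixels * math.log10(values_per_pixel))   # ~1888.1, i.e. roughly 10^1888
```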

When using a neural net to tackle this problem, I was able to achieve a max accuracy of about 89% (though it took half an hour to train). Clearly, this process is a tremendous improvement over the mind-boggling number of cases we had to consider previously. In addition, you only have to train a model once — meaning, I can reuse my trained neural net to classify more digits! Now, instead of dealing with 10¹⁸⁸⁸ cases, I only have to perform about 40 thousand arithmetic operations and can identify a digit from a 28 by 28 black-and-white image with relatively high confidence — fast enough to run in real-time.

Due to their efficiency, neural nets can be used to automate certain tasks with a high input volume. An example of this is a chatbot on a website — at any particular point in time, there are simply too many users on it to handle manually. In addition, conversing with another person is a pretty simple task, but not one you can explicitly program. Here, neural nets come to the rescue! They can help many users quickly find what they need in a human-friendly way.

Hands-On Lab

Parting Notes

While we used a sigmoid as our activation function in this article, there are others too! Personally, I’ve seen that ReLU and its variants are very popular, especially the leaky ReLU. To learn more, I would recommend starting with this link and exploring other corners of the internet. Keep in mind, though, that some activation functions have special requirements — for example, the ReLU family should only be used within hidden layers, and the softmax activation function should generally only be used on the output layer.
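For reference, those two alternatives are only a line of code each (a sketch, assuming NumPy):

```python
import numpy as np

def relu(z):
    # passes positive values through unchanged, zeroes out the rest
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # like ReLU, but lets a small fraction of negative values "leak" through
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))          # [0.    0.    0.    1.5]
print(leaky_relu(z))    # [-0.02  -0.005  0.    1.5]
```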

Now you know everything there is to know about the basics of machine learning. Of course, there’s a lot more to learn — but if you’re like me, you’re now a lot more comfortable with the subject after knowing how it works, instead of blindly trusting libraries.

Now, go on! Start your own company! Make a conscious and self-aware AI! Conquer the world! Or, do the unimaginable — start using ML and AI libraries *gasp*! By now, you should know exactly why these libraries exist — so you don’t have to do all of this from scratch every time.

Done reading? Feel free to refer back to any previous articles in this series:
Gradient Descent
Linear Regression
Training Faster and Better
Logistic Regression


Endothermic Dragon

My name is Eshaan Debnath, and I love computer science and mathematics!