The Neural Network — From Math to Code to Impact

Building an Artificial Brain

Tobias Grether-Murray
11 min read · Jul 5, 2022

Since 1950, when Alan Turing asked the question “Can machines think?” in his paper “Computing Machinery and Intelligence”, researchers and engineers have worked extremely hard to prove that yes, they can. The issue is that “thinking” is hard to define. Thinking is not the ability to recall as much information as possible. Nor is it the ability to do calculations as quickly and as accurately as possible. Computers have already proven that they can do such tasks way better than we can.

The complex philosophical challenge of defining consciousness is one of the many difficulties of building artificial intelligence. Because if you don’t know where you’re going, how do you plan to get there? Instead of diving into a never-ending philosophical debate about consciousness, AI researchers decided to turn to human biology to understand intelligence. Maybe we couldn’t define consciousness yet, but if we understood how the brain works and were able to recreate one, that would be a big step in the right direction. And that is how artificial neural networks were created…

A neural network is a mechanism our brain uses to output ideas/answers to us from the inputs/information we receive.

In programming, an algorithm is often abstracted to a “black box”. Inputs go into this black box and outputs come out of it. It’s called a black box because, usually with AI, you don’t know exactly what the computer is doing to get the outputs. The algorithm has learned by itself, and it cannot explain how it knows what it knows. It can get results, but it is oblivious as to how it gets them.

Think of how you come up with ideas. You probably have a place where ideas come to you most naturally; in bed, in the shower, in the car. You might be driving down the highway when suddenly an insight pops up in your head, the feeling of a lightbulb turning on. Where did that come from? It came from your giant neural network.

A neural network is the interconnection of neurons through synapses that carry out a function when activated.

Think of your brain as an enormous circuit. Each neuron is a tiny part of that circuit, and it is connected to billions of other neurons. This enormous circuit in your mind is also a black box: it outputs your actions, thoughts, and ideas from the inputs you receive through the following steps…

  1. Your senses (sight, touch, smell, taste, and hearing) input the stimulus you receive from the outside world in the form of chemical signals to the dendrites of your “input neurons”.
  2. The soma (see image) interprets the signals and, if the new signal the soma releases is strong enough, this signal travels across the axon to the axon terminals.
  3. The synapses at the axon terminals transmit the new signal to the dendrites of the next set of neurons in your brain, and the process repeats.
The round ends of the axon terminals are called synapses.

When I was five, my dad and I were cooking pasta and he went to the bathroom. Before he did, he told me to NOT touch the stove. So obviously I touched the stove and burned my index finger. I immediately started crying and feeling pain. The neurons in my brain were working like this:

  1. Dendrites received strong input signals of heat from my index finger.
  2. These signals were extremely strong, so they made their way through many somata (the plural of soma) and axon terminals.
  3. The very strong signals travelled through many neurons and activated my output reaction: take my finger off the stove and start crying.

Of course, I’m simplifying here; I’m not a neuroscientist, but this analogy is accurate enough for the purpose of understanding computer “brains”. What’s awesome about human brains working like a circuit is that we can represent circuits in computers — and that means we can make artificial brains: artificial neural networks.

Artificial neural networks are a set of mathematical equations interconnected to turn inputs into outputs.

Essentially, a neural network algorithm is a function just like the ones you learn in math class. Give it an x value (input) and the neural network will return a y value (output). However, for impactful applications, the function the computer creates will likely be more complex than anything you learned in high school.

Enough with the abstractness, let’s look at what a neural network for a real problem looks like.

Digit classification using images

Credit due: I took inspiration and learned a lot from Samson Zhang’s incredible video which you can watch here.

Quick background: the MNIST database is a common database used for learning how to program machine-learning algorithms. It has a training set of 60,000 examples and a test set of 10,000 examples. These examples are 28x28-pixel images of handwritten digits.

Example images in the MNIST database.

The goal is to have our algorithm glance at an image of a handwritten digit and accurately label what digit (0–9) it is. I’ll use this example problem to explain the math and code behind a neural network.

Math theory before code.

Step 1: Understand the data

First, we need to understand the data being inputted into our black box. The data is a bunch of grayscale 28x28-pixel images. That means each image has 784 pixels, and each of these pixels has a number in the range of 0–255 corresponding to how dark it is, with 0 being white and 255 being black.

Therefore, our input data starts out as a matrix with “m” rows and 784 columns, with “m” corresponding to the number of example images we use to train the algorithm (we’ll transpose it so that each column is one example, which makes the matrix math below line up). Since this is a supervised learning algorithm, the training data is also labelled with the correct output digit: a matrix with “m” rows and 1 column, where the single column holds the correct digit for each example.
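Here’s a minimal sketch of what preparing that data can look like in code, assuming the Kaggle “train.csv” version of MNIST where each row is a label followed by 784 pixel values (the file name and the 0–1 scaling are my own choices, not something fixed by the algorithm):

import numpy as np
import pandas as pd

# Assumes the Kaggle "train.csv" layout: each row is [label, pixel0, ..., pixel783].
data = pd.read_csv("train.csv").to_numpy()   # shape (m, 785)
np.random.shuffle(data)                      # shuffle the examples

data = data.T                                # shape (785, m): one column per example
Y_train = data[0]                            # labels, shape (m,)
X_train = data[1:] / 255.0                   # pixels, shape (784, m), scaled to 0-1
m = X_train.shape[1]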

Step 2: Understand the high-level structure of the neural network

Just like in a brain, an artificial neural network has layers through which the signals/data are interpreted to output a result. In the neural network we will build, there will be three layers (an input layer, one hidden layer, and an output layer).

The neural network I’m building in the example for this article only has 1 hidden layer, not 2.

The image above is a general visual of what a neural network looks like. For ours, the input layer will have 784 nodes, the hidden layer will have 10 nodes (though it could have any number), and the output layer will have 10 nodes. Each node of the output layer represents a digit from 0 to 9.
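With those sizes fixed, here’s a minimal sketch of how the parameters for this architecture could be initialized (small random weights and zero biases are just one common starting point):

def init_params():
    W1 = np.random.randn(10, 784) * 0.01   # hidden layer weights: 10 nodes x 784 inputs
    b1 = np.zeros((10, 1))                  # hidden layer biases
    W2 = np.random.randn(10, 10) * 0.01     # output layer weights: 10 digits x 10 hidden nodes
    b2 = np.zeros((10, 1))                  # output layer biases
    return W1, b1, W2, b2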

Step 3: Forward propagation

Forward propagation is the process of data being processed/modified as it moves through the successive layers. Forward propagation happens in two steps at each node:

  1. Preactivation: the input from the previous node experiences a transformation by being multiplied by a “weight” and having a number called a “bias” added to it.
  2. Activation: the calculated weighted sum of inputs is passed to the activation function. This mathematical function turns the inputs into compressed outputs that go to the following layer. An activation function is akin to when a real neuron “decides” whether to send the signal through the axon because it is strong enough or not.
    Activation functions are important because, without them, the neural network would just be a giant linear regression algorithm. Activation functions enable more complex relationships to be made between the input and output values.
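To make that concrete with made-up numbers: for a single node receiving one input x with weight w and bias b, preactivation computes z = w·x + b and activation computes a = ReLU(z). If x = 0.5, w = 0.8, and b = 0.1, then z = 0.8 × 0.5 + 0.1 = 0.5, and ReLU(0.5) = 0.5, so 0.5 is what this node passes on to the next layer.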

It’s important not to mix up forward propagation and back-propagation. Forward propagation runs all the way through the network, undisturbed, to produce an output. Only after that forward pass has finished do we use back-propagation to figure out how best to adjust the weights and biases of the nodes.

Our neural network looks like this:

We can use vector notation to do the calculations in forward propagation all at once. First, let’s compute the first layer by using the data from the input layer:

We start with A⁰, which is the matrix of values from the input layer. These values are unchanged until they reach the first layer, where a weight matrix w¹ and a bias b¹ are applied to them. We want the output matrix of layer 1 to be 10 x m, i.e., 10 rows by m columns. Since A⁰ is 784 rows by m columns, we take the dot product of w¹ (a 10 x 784 matrix) and A⁰, which gives a matrix with 10 rows and m columns.

Here, the dot product means matrix multiplication. For matrix multiplication to be defined, the number of columns in the first matrix must equal the number of rows in the second, and the product of an m x n matrix and an n x k matrix is an m x k matrix.

The dot product operation.

For this reason, the dimensions of the Z¹ matrix are 10 x m, resulting from the dot product of the w¹ and A⁰ matrices. We then apply a bias by adding a 10 x 1 matrix b¹ (its single column is added to every column of Z¹). Lastly, we apply an activation function of our choice (here, ReLU) to the Z¹ matrix to get our A¹ matrix, which is the output of the first layer. We then repeat the process for the second layer, but use the softmax activation function instead of ReLU at the end.

The softmax activation function outputs probabilities which is why it’s relevant to use at the end in our case (we want to know the probability of an image representing a certain digit and pick the highest probability one).
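Written out with the same symbols (where w² and b² are the second layer’s weights and bias), the whole forward pass is:

Z¹ = w¹ · A⁰ + b¹,  A¹ = ReLU(Z¹)
Z² = w² · A¹ + b²,  A² = softmax(Z²)

where softmax turns each column of Z² into ten probabilities that add up to 1: each value zᵢ becomes e^(zᵢ) divided by the sum of e^(zⱼ) over all ten values in that column.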


Step 4: Back-propagation

Think of back-propagation as the step before adjusting the weights and biases of each node. Before adjusting them, we need to know how much the value of each weight and bias is contributing to errors. After having understood that, we go and adjust them using gradient descent.

First, we calculate how far off the predicted values are from the correct values using the mean squared error (MSE) function. We will denote the average error of our algorithm as the “cost” C.

Lowercase n is the number of examples in our dataset. The mean squared error is the average of the squared differences between the algorithm’s predictions and the correct values.
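Written out, with ŷᵢ as the prediction for example i and yᵢ as the correct value:

C = (1/n) · Σᵢ (ŷᵢ − yᵢ)²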

Next, we use the chain rule to calculate the derivatives of the cost with respect to the weights and biases. Calculating these derivatives allows us to know by how much to adjust each parameter. Essentially, we are using calculus to find out how much each weight and bias is contributing to errors in our algorithm. These are the derivatives we arrive at:

And so on and so on… We do similar calculations for each layer. If you’re eager to understand exactly how all of these derivatives are found through the chain rule, watch this video:
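Here is a minimal sketch of what that back-propagation step can look like in code for our two-layer network. One caveat: to keep the gradients simple, this sketch uses the common pairing of softmax with a cross-entropy cost, which makes the output-layer error just A² minus the one-hot-encoded labels, rather than working through the MSE derivatives directly.

def one_hot(Y, num_classes=10):
    # Turn labels like [5, 0, 4, ...] into a (10, m) matrix of 0s and 1s.
    one_hot_Y = np.zeros((num_classes, Y.size))
    one_hot_Y[Y, np.arange(Y.size)] = 1
    return one_hot_Y

def backward_prop(Z1, A1, A2, W2, X, Y):
    m = Y.size
    # Output-layer error (softmax + cross-entropy simplification).
    dZ2 = A2 - one_hot(Y)
    dW2 = (1 / m) * dZ2.dot(A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    # Push the error back through the hidden layer's ReLU.
    dZ1 = W2.T.dot(dZ2) * (Z1 > 0)
    dW1 = (1 / m) * dZ1.dot(X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2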

Step 5: Run gradient descent

In calculus, the gradient of a function gives you the direction of steepest ascent, i.e., the direction you should move in to increase the function most quickly. Taking the negative of that gradient gives you the direction that decreases the function most quickly, which is exactly what we want for the cost function. That’s what we do in gradient descent: we subtract the derivatives we calculated in back-propagation (scaled by a learning rate) from the current value of each weight and bias, and we repeat for many iterations until we reach a low cost, thereby making our algorithm more accurate.

Alpha (in red) is the hyperparameter/learning rate that we set ourselves. “:=” is the symbol for assignment.
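In code, that update step might look like this (a minimal sketch, with alpha being the learning rate mentioned above):

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # Each parameter takes a small step against its gradient, lowering the cost.
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2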

And that’s it! That’s what goes on under the hood of a neural network.

Math to code!

If you’re expecting this part to be different from the math, I have a surprise for you: it’s not. All we need to do to get this algorithm to work is replace the variables above with variables in code, define functions like ReLU and softmax, and run many iterations. For example, this is what our forward propagation function looks like:

def ReLU(Z):
    return np.maximum(0, Z)

def softmax(Z):
    # Subtracting the max keeps the exponentials from overflowing.
    exp = np.exp(Z - np.max(Z))
    return exp / exp.sum(axis=0)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1      # preactivation, layer 1
    A1 = ReLU(Z1)            # activation, layer 1
    Z2 = W2.dot(A1) + b2     # preactivation, layer 2
    A2 = softmax(Z2)         # activation, layer 2 (probabilities)
    return Z1, A1, Z2, A2

The most difficult part of turning the math into code is making sure that the matrix dimensions align for each operation you do. Once that’s achieved, it’s just a matter of translating math into code.

This is the result: the algorithm reached 90% accuracy after a relatively small 1,000 iterations of training on 42,000 training images.
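For completeness, here’s a sketch of the training loop that ties the pieces above together (get_predictions and get_accuracy are small helpers I’m assuming here, and alpha = 0.10 is just an example learning rate):

def get_predictions(A2):
    # The predicted digit is the output node with the highest probability.
    return np.argmax(A2, axis=0)

def get_accuracy(predictions, Y):
    return np.mean(predictions == Y)

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, A2, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 100 == 0:
            print("iteration", i, "accuracy", get_accuracy(get_predictions(A2), Y))
    return W1, b1, W2, b2

W1, b1, W2, b2 = gradient_descent(X_train, Y_train, alpha=0.10, iterations=1000)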

Finally, now that I’ve shown how it works from the ground up, I can move on with my human existence and use tools like PyTorch so I never have to do this again! But at least now you and I both know how a neural network works.

Impact

I am just starting my journey with AI; the MNIST dataset is the machine-learning equivalent of “Hello world”. I wouldn’t have gone through all this math if there wasn’t a more exciting goal in mind. For a long time, the only thing AI could do was make some existing process more efficient, for example, optimizing the likelihood that you click on an ad while scrolling through Instagram. But recently, we have started seeing advances in AI that demonstrate creativity and ingenuity. For example,

  • Creating art and music.
  • The materials genome project: using artificial intelligence to accelerate the discovery of better materials for different applications.
  • Accelerating drug discovery.
  • Creating smart robots (machines that could be our servants, smart machines that can fulfill ever more complex tasks).
  • Programming self-driving vehicles.
  • Discovering new electrocatalysts to reduce the cost of scaling green hydrogen. (See my article on green hydrogen here.)
DALL-E 2, OpenAI’s image-generation algorithm, can generate beautiful art from any text prompt in less than 10 seconds. This is an image it generated for the prompt, “A rabbit detective sitting on a park bench and reading a newspaper in a victorian setting”.

It’s hard not to drink the Kool-Aid when thinking about how different our world will be twenty years from now because of AI. The scientific revolution happened in the 1500s. In 500 years, we went from living on farms in housing with dirt floors to going to the moon and creating the internet. And that was all thanks to our dumb mortal brains. In my lifetime, the world may change even more.

Thank you for reading! Visit tobiasgm.com to find more of my work or subscribe with this link to receive my monthly newsletter, a collection of my thoughts and projects 🙂.
