A Single-Layer Artificial Neural Network in 20 Lines of Python

Michael DelSole
21 min read · Apr 5, 2018


So you want to learn about artificial intelligence? Maybe you’ve searched up and down Google looking for a beginner tutorial, but all you’ve come across are tutorials that throw math equations and code at you. Well, hopefully I can help. Here, we’ll be building a very simple neural network that we will train to identify something. We’ll keep it simple, and while there will be math equations, I’ll explain every part of them instead of just throwing them at you. All you need to know is some high school calculus.

Let me note something first though: this tutorial will be in python. Don’t run away if you don’t know Python, that’s okay. Python is very easy to follow, and you’ll probably pick it up as you go through the tutorial. If you’d like to look at the code before we start, you can go to my github. I can’t promise that it’ll mean anything to you if you look at it right now though. Let us begin!

Biological Neural Networks:

Loose Depiction of a Biological Neural Network

First off, what is an Artificial Neural Network? If you’ve taken high school biology, you’re probably aware that the human brain is made up of neurons, tiny cells that send and receive messages to each other. A neuron is the basic computation unit of the brain. Based on many factors, it either sends an electric communication signal (think of it like a small ping) to other neurons or does nothing. The electric signal is referred to as an action potential, but I find that name hard to digest, so I’ll stay away from it.

Each neuron is connected to other neurons through synapses. If sufficient synaptic inputs to a neuron fire, that neuron will also fire. In other words:

  • Synapses connect each neuron to many other neurons, through which neurons send their electric signal
  • Each electric signal has a charge, but the actual amount of charge delivered from neuron to neuron is determined by the strength of the synapse between the two; each synapse has a weight. Synapse weight is a fancy way of saying connection strength. Good connection strength = lots of charge delivered, bad connection strength = little charge delivered.
  • If the combined weighted electric signal is enough, the neuron can fire, where fire means that neuron too will send out an electrical signal. Thus we get this whole network of neurons firing electric signals
  • The weight of each connection/synapse is determined as the brain learns; the adjustment of the connection strengths is what we call “learning”. Synapses are weakened when their firing led to a bad result, and strengthened when their firing led to a good one
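
If it helps to see that rule concretely, here’s a toy sketch in Python (the numbers are made up, not a real simulation):

# three incoming signals: did each connected neuron fire (1) or not (0)?
incoming_signals = [1, 0, 1]
# made-up connection strengths for the three synapses
synapse_weights = [0.9, 0.4, 0.3]
threshold = 1.0

# sum up the weighted charge delivered by all the synapses
combined_charge = sum(s * w for s, w in zip(incoming_signals, synapse_weights))
print(combined_charge >= threshold)  # True: 0.9 + 0.3 = 1.2, so this neuron fires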

While this is, of course, a very simplified explanation of what’s going on up there, it will be enough for us at the moment. Time to get into the real meat!

Artificial Neural Networks:

Where to start? Let me first come out and say something: Artificial Neural Networks do not work exactly the same as biological ones. Something that trips many people up is that when they try to think intuitively about artificial neural networks, they think in terms of biological ones, because those are easier to understand. While artificial neural networks (ANNs) apply the same higher-level rules, their structure is a little bit different.

An artificial neural network has 3 main parts: the input layer, the hidden layer, and the output layer. In terms of neurons, the input layer is your sensory neurons (your 5 senses), the output layer is your motor neurons (your mobility and actions), and the “hidden” layer is your interneurons, where the thinking and processing happens (inside your brain). I should point out that you can have n hidden layers; if you think about your brain, you have billions of neurons that process in between your input and your output. You may have heard of the term deep learning: it implies multiple hidden layers, hidden because they’re not part of the visible input or output.

An Artificial Neural Network

In this image, each circle is an individual neuron, and each line is a synapse.

But we’re getting a little bit ahead of ourselves. It’s best to start out simple, so we’ll be focusing on just one neuron for now. Biological neurons are fairly useless all by themselves, but a single artificial neuron can make quite a large amount of progress.

Here’s the problem that we’re going to start out with:

Inputs       Output
[1, 0, 1]    0
[0, 1, 0]    1
[0, 0, 1]    1
[1, 0, 0]    0
[0, 1, 1]    ?

Can you see the pattern? As long as there’s a 0 in the left-most column, we output a 1; the output is simply the opposite of the left-most input. So in place of the question mark, we should have a 1. Don’t worry if you didn’t see the pattern; we’re making something that’s hopefully smarter than us, after all.

Training Process:

Now for the big question: how do we get a neural network to learn? Recall what I said earlier about biological learning: synapses (connections between neurons) are weakened when their firing led to a bad or wrong result, and strengthened when their firing led to a good or right one. This is the underlying principle in an ANN.

Learning in an ANN is usually broken down into two steps: Forward Propagation and Backpropagation. Don’t ask me why forward propagation is two words and backpropagation is one. Whoever named them obviously had some learning of their own to do.

There’s a catch-22 here: it’s difficult to understand what forward propagation is doing without understanding what backpropagation is doing, and vice versa. So for now you’ll have to trust me that there’s a point to everything we’re doing. Maybe this picture of a kitten will help calm you down:

The kitten fills you with determination

Actually, what would probably be more helpful is a picture of an artificial neuron that you can follow along with as we go. Here it is:

One Artificial Neuron

If you look at it, you might be slightly confused as to why there are synapses (those lines with the w above them) if this is just a single neuron. The artificial neuron itself is only the middle part: the hidden layer. The other stuff is necessary for the neuron to do anything; otherwise it’d just be sitting there with nothing to interact with.

The main goal of Forward Propagation is to get an output from your neuron. Following along with the picture, the steps are:

  • We begin with some inputs x. Let’s just focus on the first training example right now, [1,0,1]. So x1 = 1, x2 = 0, and x3 = 1. Each training example has the same number of inputs (three, in our case).
  • Next, we’ll multiply each input x by the weight of its connecting synapse, w. The weight determines how much the neuron is affected by the input. Our training process will involve tuning these weights.
  • These now weighted inputs, each represented as w(i)x(i), are summed together, just like a biological neuron sums together all of its synaptic inputs when determining whether to fire.
  • They are then passed through a special formula in the hidden layer, referred to as the activation function. Applying this formula, which I’ll reveal later, will give us the neuron’s output. In the picture, the circle with a line through it represents this activation function, analogous to a simple f(x) you’ve seen in math classes.

That’s all forward propagation is! At the end of it, the neuron will have computed its output (output being what it thinks the correct answer should be, 0 or 1). Of course it hasn’t learned anything the first time it does forward propagation, so the answer it first gives is typically wildly off.
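
To make those steps concrete, here’s a miniature sketch of forward propagation for one training example, with made-up weights and a placeholder activation function (the real one is revealed later):

def activation(z):
    # placeholder: the real formula comes later in the article
    return z

x = [1, 0, 1]          # inputs for one training example
w = [0.5, 0.2, 0.8]    # hypothetical synapse weights

# steps 2 and 3: weight each input, then sum the weighted inputs
weighted_sum = sum(x_i * w_i for x_i, w_i in zip(x, w))

# step 4: pass the sum through the activation function to get the output
print(activation(weighted_sum))  # 1.3 with these made-up numbers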

There’s a step in between forward propagation and backpropagation that we need to do. Our neuron has given us its answer, but before it can attempt to improve its answer, it needs to know how wrong the answer was. In other words, to improve the model we need to first quantify how wrong the model is:

  • Each training example for ANNs should come with an answer. In our case, for [1,0,1], the correct output was 0. This correct answer is called y, and the neuron’s output is ŷ. If you’ve taken statistics, this step resembles taking a residual; we compare ŷ with what the answer should be to get an error. It may be a bit confusing from the picture I drew earlier, since the neuron’s output is labeled y, but that y was for use within the model. Outside of the model, it’s referred to as ŷ.
  • This error can be computed in a number of ways; the formula we choose is called the cost function. The basic one that we’ll be using is simply y-ŷ.

Now, onto Backpropagation, the more difficult yet more productive sibling of forward propagation. Its main goal is to adjust the weights based on our error so as to provide us with more accurate results the next time around. The actual “learning” happens here. Once again, think back to biological learning: we want to weaken weights that lead us to the wrong answer and strengthen ones that lead us to the right answer.

How are we going to do this? We’re going to minimize the cost function, i.e. get the error as low as possible. There are a couple of ways we could do this. You might be thinking brute force: try every possible weight, and see which one results in the lowest cost function. Sadly, that’d be like trying to find a needle in a haystack. You’d be there all day.

Ask yourself this: what’s the best way to find a needle in a haystack? Bring a flamethrower. And that’s where we’re going to introduce Gradient Descent. It’s a fancy name for a simple concept. If you’ve taken calculus, you may have seen it before, but if not that’s okay. Let’s explain it. Forget about the neuron for a second. I just want you to think about a ball in a curve:

Visual of gradient descent

The ball starts up high, then rolls to a lower point, then to an even lower one, and eventually bottoms out and stops rolling. The point at the bottom, where the ball has stopped is called the minimum.

Let’s say the ball, starting at the top, has the goal of getting to the bottom. What information can the ball use to adjust its position to find the lowest point? All it has is the slope of the curve at its current position. Our ultimate goal is to get to the bottom where the slope is zero. To accomplish this, when the slope is negative, the ball should move to the right. When the slope is positive, the ball should move to the left. As you can see in the picture, after a few iterations of this, the ball has found the bottom, or the minimum. The process of finding the minimum is called gradient descent. Gradient is just a fancy word for slope, and descent just refers to the fact that we are descending to the minimum.

So, some simplified steps of gradient descent can be described as follows:

  • Calculate the slope at our current position
  • If the slope is negative, move right by some amount
  • If the slope is positive, move left by some amount
  • Repeat until the slope is 0

But by how much should the ball move right or left? If you look at the picture, a steeper slope means we are farther away from the minimum, while a less steep slope means we are close to the minimum. We can use that! Also, bringing in some more math, our curve is in the (x,y) plane. Increasing our ball’s “x” position will make it go right, and decreasing it will make it go left. So our new steps will be:

  • Calculate the slope at our position “x”
  • Change x by the negative of the slope, since we want to move it in the opposite direction: x = x - slope
  • Remember, as we get closer to the bottom, the slope gets closer and closer to zero (our goal). So, as we update, we’ll slowly be adjusting x by less and less. Eventually, after many iterations, we’ll come to a solid number that’s being adjusted in negligible ways, and this number will be our minimum.
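
Here’s a minimal sketch of those steps in Python, using the curve y = x² (whose slope at any point x is 2x) as a stand-in for the picture above. One assumption worth flagging: I multiply the slope by a small step size, 0.1, so the updates don’t overshoot the minimum; the simpler x = x - slope from the list is the same idea with a step size of 1.

# gradient descent on the curve y = x^2, whose slope at x is 2x
x = 5.0                    # start the ball up high
for _ in range(50):
    slope = 2 * x          # derivative of x^2 at our current position
    x = x - 0.1 * slope    # move against the slope, a small step at a time
print(x)                   # very close to 0, the minimum of the curve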

Now let’s put this in terms of our neuron. The “minimum” is where our model produces an output that’s the same as the actual answer. Remember, we’re minimizing the difference between our answer and the actual answer. To get to that minimum, we’re trying to find the “x” where it occurs. The “x” is our weighted inputs summed together. We’re trying to find the weights at which we reach the minimum, just like in the ball example we were trying to find the x at which the curve bottoms out.

Hopefully that’s not too difficult to understand, because now we can begin to explain how to do this with the neuron. First off, how do we even get the slope? And the slope of what? We’re going to take the derivative of our activation function, with respect to the weighted inputs. This is because our “curve” is going to be the activation function, so we’ll want to find the slope of it (derivative = slope). It’s with respect to the weighted inputs because like I said earlier, our “x” is going to be the weighted inputs summed up. Think about the dy/dx you’re familiar with; our activation function is the y and our summed weighted inputs are the x.

I glossed over it, but why is our curve going to be the activation function? It’s because that was the function we used to produce our neuron’s output; it was the f(x). A curve is a representation of a function, and in our case, our curve is a representation of our activation function. I know I haven’t revealed what that function is yet, and that’s because it’s not important to know yet. You’ll just have to keep on trusting me.

Now that we’ve learned about Gradient Descent, we can finally reveal the Backpropagation algorithm! It’s pretty much just gradient descent. The steps are:

  • Using the derivative of the activation function, calculate the adjustment. For us, the adjustment is a simple x = x - slope (i.e. x = x - derivative)
  • We also want to make our adjustment proportional to the error: as the error approaches zero, we want to adjust by less and less. Our slope already behaves this way, flattening out as we approach the minimum, but scaling by the error gives extra insurance that the steps shrink enough to settle on the minimum (something you may run into with larger data sets). So, we multiply our derivative by the error, leaving us with x = x - (derivative*error) as our adjustment
  • Finally, adjust the weights in our neuron using the calculated adjustment. Just like biological neural networks, we are adjusting the connections from the inputs; in essence, the neuron “learns” (well actually computes) which inputs are important and which aren’t.

Hopefully these concepts all make sense, because we’re about to put them all into code form finally! Before we move on, a quick summary of what we’re doing:

  1. Forward Propagation: Send some inputs, multiplied by their weights, into our activation function (I’ll finally reveal it when we start coding). This will give us our neuron’s output, ŷ.
  2. In between: Compute the error of our neuron by computing the cost function, y-ŷ. This tells us how wrong our neuron is, and how much to adjust by.
  3. Backpropagation: Calculate the adjustment for each weight using the derivative of the activation function, and then adjust the weights.
  4. Each cycle of Forward and Backpropagation is called one epoch. We’ll do many, many epochs to determine our final answer (1,000 for us in our problem).

If you’re still having trouble understanding, hopefully you’ll be able to solidify the ideas as we code. It also might help to think of an artificial neuron like a feedback loop:

  • You put some inputs through your neuron, you get a result, compare it with what the answer should be to get an error. Then you send it back, making tweaks to the algorithm. Once it’s made its way back to the beginning, you send it through the network again, computing a little bit differently. You once again compare it to get the error, send it back to update your algorithm, etc.

Coding the Network

If you’ve made it this far, congrats! You’ve proved yourself a worthy student. Take a moment to pat yourself on the back. Now we get to apply everything we’ve learned up until this point.

First things first: We’ll be using the python library numpy, mainly for the np.array function, which turns your data into matrices. Why matrices? If you’ve had linear algebra, the concept is probably familiar to you. If not, that’s okay; I’ll explain. Matrices allow us to do the calculations for all m training examples in one line. (In our problem, we have m = 4 training examples, each with 3 inputs.)

It might help to explain this in programming terms. Without matrices, we’d have to use a for loop for our first step of forward propagation:

# m is the number of training examples
weighted_inputs = []
for i in range(m):
    # weight each of this example's inputs by its synapse weight
    weighted_inputs.append([x_ij * w_j for x_ij, w_j in
                            zip(training_inputs[i], synaptic_weights)])

It might not look bad now, but once we start doing more advanced stuff this kind of looping gets slow, and more importantly it requires more typing from us than necessary. Instead, we can condense it to just a dot product between two matrices:

np.dot(training_inputs, synaptic_weights)

If you don’t remember how to do a dot product, it might be worthwhile to refresh your memory.
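
If you’d like a quick refresher, here’s a small worked example (the numbers are arbitrary):

import numpy as np

a = np.array([[1, 0, 1],
              [0, 1, 0]])  # a 2x3 matrix
b = np.array([[2],
              [3],
              [4]])        # a 3x1 matrix

# each row of a is multiplied element-by-element with b's column, then summed:
#   row 1: 1*2 + 0*3 + 1*4 = 6
#   row 2: 0*2 + 1*3 + 0*4 = 3
print(np.dot(a, b))  # a 2x1 result: [[6], [3]]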

In any event, let’s begin the program. We’ll start by typing out things we’re given. Let’s put our training inputs in there as x:

import numpy as np
x = np.array([[1,0,1], [0,1,0], [0,0,1], [1,0,0]])

You might have noticed there’s double brackets in there. That’s because the training inputs form a 4x3 matrix: a 2-dimensional array with four rows, one per input set (i.e. [1,0,1] is our first row, [0,1,0] is our second, etc.), and three inputs in each. While we’re at it, we might as well add in the outputs as y:

y = np.array([[0,1,1,0]]).T

We have double brackets again because this is a 1x4 matrix; the single row [0,1,1,0] is the element. If our element was x, we’d have np.array([x]).

But wait, what’s that .T? Well, we actually have a shape problem. We need to compare each training example’s output with the network’s answer for that example, but as written our outputs sit in a single 1x4 row, while our four training examples are stacked as four separate rows. What we want is a 4x1 column, one output per row, so everything lines up example-by-example. .T is numpy’s transpose, which flips an “n x 1” matrix into a “1 x n” one and vice versa. So we use that.

Visual representation of transposed vectors
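
In code form:

import numpy as np

row = np.array([[0, 1, 1, 0]])  # shape (1, 4): one row
col = row.T                     # shape (4, 1): one column
print(row.shape, col.shape)     # (1, 4) (4, 1)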

You might be thinking about trying this:

y = np.array([[0],[1],[1],[0]])

This would also be valid, but with huge data sets later on you won’t want to type out brackets around each element; you’ll want a quick and easy function, like .T.

Next we’ll want to code in our weights. Now where do we start with our weight values? Well we have to start somewhere. We have 3 inputs for each training example, so let’s make a 3x1 array:

synaptic_weights = np.random.random((3,1))

Yes, we start with random weights. Our network’s job is to change the weights anyway, so why would it matter what they start at? The only constraint is that we want to start with numbers between 0 and 1, hence np.random.random. Keeping the weights close-ish to 0, rather than out at, say, 1,000, saves us a lot of computation. This should make a little sense, but to fully grasp it we’ll have to finally introduce something that I’ve been teasing about: the activation function.

To jog your memory, the activation function takes in all of the weighted inputs and gives our neuron’s output. Seems magical, doesn’t it? Biological neurons probably have their own “activation function”, transforming the amount of electric charge poured into a neuron into an output signal, but unfortunately we haven’t figured it out yet, so we can’t use that.

So what actually is our activation function? There are actually many different activation functions. You could even make one yourself if you really wanted to. But the one we’ll be using, and one of the most common ones, is called the Sigmoid Function. The graph looks like this:

Graph of the Sigmoid Function

The equation is y = 1/(1+e^(-x)). Why do we use this specific equation? Sadly it wasn’t as simple as someone writing a paper proving this is the ultimate learning algorithm. In fact, the reason there are so many activation functions is because sigmoid is not the one learning algorithm to rule them all. Sigmoid just happens to work for us here.
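
You can get a feel for it by plugging in a few values; no matter what x you feed it, the output gets squashed between 0 and 1:

import numpy as np

for x in [-5, -1, 0, 1, 5]:
    print(x, 1/(1+np.exp(-x)))
# -5 -> 0.0067, -1 -> 0.269, 0 -> 0.5, 1 -> 0.731, 5 -> 0.9933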

There are many mathematical reasons why Sigmoid fits, but they’re hard to grasp without a solid overall understanding of neural nets; the technical explanation is better saved for farther down the road. Still, I can provide a little explanation. The most important thing is that the Sigmoid function is easily differentiable, saving us a lot of computation because, remember, we need the slope (derivative) in order to perform backpropagation. It’s also real-valued, non-linear, and smooth. Its first derivative never changes sign, which makes it consistent when using the slopes for adjustment like we do in backpropagation… the list goes on and on. It just fulfills everything we’re looking for in an activation function. That’s the simplest way to put it.

I know I’m probably not leaving you with a great understanding of it, but I really do think it’s counterproductive to try to understand why we use this activation function before you understand ANNs completely. Let’s move on, just knowing that this is what we’ll be using.

We’re going to start coding forward and backpropagation now, and hopefully you remember that we’re going to do the cycle (epoch) many times. That means we’re going to put all the code that comes next in a for loop:

for iteration in range(1000):

We’re going to call Sigmoid’s input z (not x like you may have been expecting). This is because we want to differentiate it from the x and y we used for inputs and outputs. The variable z represents all of our weighted inputs (weight*input) summed up. Let’s make z in python using a dot product:

z = np.dot(x, synaptic_weights)

A dot product multiplies each row of matrix x element-by-element with matrix w and adds up the result for that row, which is just what we want. Take a look below to make this a little clearer:

z from a random iteration

Here we have z from a random iteration. Each row is the z value for its corresponding training example. Remember, matrices allow us to do all the training examples at once.
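
You can reproduce the idea with some made-up weights (your actual weights are random, so the real values will differ):

import numpy as np

x = np.array([[1,0,1], [0,1,0], [0,0,1], [1,0,0]])
synaptic_weights = np.array([[0.5], [0.2], [0.8]])  # hypothetical values
z = np.dot(x, synaptic_weights)
print(z)
# [[1.3]   row 1: 1*0.5 + 0*0.2 + 1*0.8, for training example 1
#  [0.2]   row 2: training example 2
#  [0.8]   row 3: training example 3
#  [0.5]]  row 4: training example 4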

Since z is the input to our activation function, we treat it as the standard “x” in the (x,y) coordinate system on the Sigmoid graph. And since we just made z, it’s as simple as plugging it into the activation function:

sigmoid = 1/(1+np.exp(-z))

This will give us a value between 0 and 1, representing a probability. As with any probability, just multiply the decimal by 100 to get the percentage. Above 50% indicates our neuron thinks the output should be 1; below 50% indicates it thinks the output should be 0. How close it is to 100% or 0% reflects the neuron’s confidence level. This is another reason why we like the sigmoid for our problem: its outputs are easily interpreted as probabilities. We want probabilities because in larger problems, the computer isn’t going to have 100% confidence in its outputs, just like humans rarely have 100% confidence in their answers.

We’re done with forward propagation, so the next step is easy: we need to calculate the error, to see how far off our neuron’s answer was from the correct answer:

error = (y - sigmoid)

Allow me to reiterate that since we’re using matrices, we’re actually doing each of these steps for all training examples in one line. Here it’s computing the error for each of our examples. That leads us to a huge plot twist: in effect, this isn’t just one neuron anymore; it’s as if we’re running 4 copies of it at once, one for each training example. This should make sense: forward and backpropagation for 1 example is one neuron’s worth of work, and we’re doing forward and backpropagation for all 4 examples simultaneously.

  • Doing all 4 training examples at once allows the network to have more information in each iteration, so it can learn faster. Think of it like if you were playing a memory matching game, you would be able to win a lot faster if you were allowed to flip over 8 cards at a time instead of just 2. This is why neurons have strength in numbers.

Now we can begin our backpropagation. Think back to the ball example. What do we use to adjust the ball’s position? We use the slope, so the first step is to calculate that. From before, do you remember what we are finding the slope of? Well, what function have we used in the network? The activation function! We’ll be taking the derivative of that, giving us the slope of the curve. I’ll save you the calculus and just provide what it is:

sigmoidDerivative = sigmoid * (1 - sigmoid)
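
If you’d like to sanity-check that formula without doing the calculus, you can compare it against a numerical estimate of the slope (this check is just an aside, not part of our program):

import numpy as np

def sig(x):
    return 1/(1+np.exp(-x))

x0, h = 0.7, 1e-6
numerical = (sig(x0 + h) - sig(x0 - h)) / (2*h)  # rise over run, a tiny step apart
analytical = sig(x0) * (1 - sig(x0))             # the formula above
print(numerical, analytical)                     # both about 0.2217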

With that, we’re almost done! We just need to apply our adjustment. Remember, our adjustment was going to be x = x - (slope*error), so you might expect a -= in the code, but we’re actually going to use +=. Wait! We have to be very careful here: the sign of the update changes what the network is trying to learn, and the right sign depends on how we defined the error. I want the network to find the probability that we will output a 1, and += is what accomplishes that. Let me explain better:

Take a look at the sigmoid graph: the output sits near 0 for negative z values and near 1 for positive ones. We defined the error as y - sigmoid, which is positive when our output is too low and negative when it’s too high. The error, in other words, already points in the direction we need to move, so we add the adjustment. With +=, each update pushes our output toward the correct answer, so the network learns when to output a 1. If we used -= instead (without also flipping the error around to sigmoid - y), every update would push the output away from the correct answer, and the network would learn the opposite pattern: when to output a 0.

Now that we’ve explained that, let’s adjust the weights:

synaptic_weights += error*sigmoidDerivative

But wait! Our network is trying to learn which inputs are important and which aren’t. So shouldn’t we include the inputs in our adjustment? After all, our adjustment is supposed to be related to the inputs; in effect, we want to make the adjustment proportional to the inputs. We also want to sum up what we get in each column, so the adjustment takes into account the results from all the training examples at once. We can accomplish this easily with another dot product. So let’s modify the code from before to get this:

synaptic_weights += np.dot(x.T, error*sigmoidDerivative)

I think it’s worthwhile to take a look at what’s going on in this step from the computer’s point of view. Take a look at this snapshot from a random iteration:

Computer calculating the adjustment

Our input comes in three slots (i.e. [1,0,1]), and we have 4 training examples. X is transposed, so we have all four “input slot 1’s” in the first row, all four “input slot 2’s” in the second row, and all four “input slot 3’s” in the last row. Then, according to the rules of dot products, each of those rows is multiplied element-by-element with the error*derivative column (look at the picture below to make this clearer). We need to do this because each row of error*derivative is specific to a particular training example; obviously there are different adjustments for each training example.

Continuing with the rules of dot products, the products within each row of x.T are added together, and this creates the 3x1 matrix called adjustment. It’s relatively easy to see now that each row of adjustment correlates to an input slot: row 1 is the weight adjustment for “input slot 1”, row 2 is the weight adjustment for “input slot 2”, and so on. We just add that adjustment in to the synapse weights.

Computer calculating the adjustment
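
Here’s the same computation with made-up error*derivative values, just to show the shapes at work:

import numpy as np

x = np.array([[1,0,1], [0,1,0], [0,0,1], [1,0,0]])
# hypothetical error*sigmoidDerivative values, one row per training example
errDeriv = np.array([[-0.10], [0.20], [0.15], [-0.05]])

adjustment = np.dot(x.T, errDeriv)  # (3x4) . (4x1) -> (3x1)
print(adjustment)
# [[-0.15]   adjustment for the weight on input slot 1
#  [ 0.2 ]   adjustment for the weight on input slot 2
#  [ 0.05]]  adjustment for the weight on input slot 3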

Close the loop, and we’re done! Now, let’s test it with our new situation, [0,1,1]. In order to test it, we’ll just need to perform one final forward propagation, to get our network’s output:

print("Considering new situation: [0,1,1]")
# calculate weighted inputs
newZ = np.dot(np.array([0,1,1]), synaptic_weights)
# put weighted inputs into our activation function to get the network’s output
activationOutput = 1/(1+np.exp(-newZ))
print(activationOutput)

Pretty good! Our network correctly found the answer to be 1, and with [(0.9977 - 0.5)/0.5] = 99.54% confidence. You could try this again with 10000 iterations, and you’d probably get even closer to 100%. Think of it like if you were allowed 100 turns in a memory matching game rather than 10; you’d probably get better results. Personally though, 99.54% is enough for me.
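
For reference, here is the whole program assembled from the snippets above, all in one place (right around the 20 lines promised in the title):

import numpy as np

# training inputs: four examples, three inputs each
x = np.array([[1,0,1], [0,1,0], [0,0,1], [1,0,0]])
# correct outputs, transposed into a 4x1 column (one row per example)
y = np.array([[0,1,1,0]]).T

# start with random weights between 0 and 1
synaptic_weights = np.random.random((3,1))

for iteration in range(1000):
    # forward propagation: weighted inputs, then the sigmoid activation
    z = np.dot(x, synaptic_weights)
    sigmoid = 1/(1+np.exp(-z))
    # in between: how wrong was each answer?
    error = (y - sigmoid)
    # backpropagation: adjust the weights in proportion to the error,
    # the slope of the sigmoid, and the inputs
    sigmoidDerivative = sigmoid * (1 - sigmoid)
    synaptic_weights += np.dot(x.T, error*sigmoidDerivative)

print("Considering new situation: [0,1,1]")
newZ = np.dot(np.array([0,1,1]), synaptic_weights)
print(1/(1+np.exp(-newZ)))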

Conclusion:

Isn’t it amazing what we managed to do with just one layer of neurons? Imagine what we’ll be able to do with multiple layers of these things! Join me next time as we make just that: a full-fledged, multi-layer network. We’ll be using it to classify objects as cats or not cats. Maybe you’ll then be able to build a “Hotdog or Not Hotdog” app, like the one that was sold for millions in HBO’s “Silicon Valley”. Until next time, happy coding!
