How neural networks work

Gabriel Rado
Published in Sinch Blog
9 min read · Aug 17, 2020

It’s unbearable how much marketing AI has received among developers in the last few years. And by unbearable I mean: what am I really doing with this massive amount of information to generate any useful knowledge?

Every time I hear “it’s learning” or “this has some intelligence” I think “well, it’s just lots of if/else statements, isn’t it?”. Well, technically yes, but in a far different way from what I thought. To settle my dilemma, I decided to reinvent the wheel and build one from scratch. Was this useful to any commercial solution? Of course not. Was your first “Hello World” commercially useful? Neither was that, but I think it was crucial to start scratching the tip of the iceberg. I think the same applies here. I want to show what is called learning and what a neural network actually does. Spoiler: it’s just some derivatives!

Instead of taking control of the world, our journey starts with a rather simple objective: recognizing a digit in a picture:

calm down Sherlock, I know it’s an 8

It has a total of 784 pixels (28*28), each one somewhere between full black and full white. Imagine writing down some if/else’s to check pixel by pixel whether they form an 8: recognizing two closed circles? First, what is a circle (in terms of pixels)? What if the number is a little tilted? As one can see, the 8 above is not entirely closed. A bad 8 is still an 8, right? Is an upside down 3 still a 3? What about 6 vs 9? How to deal with these cases? This is the moment when we stop using conventional programming logic and start applying a neural network to solve this.

First of all, let’s consider a black box system that answers “does this image have an upper closed circle?” and another one that answers the same about a lower closed circle, both of them returning a boolean. I know this is very abstract and these are complex tasks, and I’m sorry for that, but please bear with me. Let’s call them a1 and a2, respectively. So to identify an 8 we’ll have isThisAnEight = a1 && a2. To be more flexible, let’s use a float between 0 and 1 instead of a boolean, where 1 is 100% true and 0 is fully false. Now our amazing fictional machines a1 and a2 could return .6 and 1. Is this still an 8? What about .7 and .9? We can tune how we answer this dilemma by being arbitrarily specific:

isThisAnEight = 1.2*a1 + .8*a2 + .4

Now we’re giving more weight to a1 than to a2 when deciding if it is an eight, and even adding a good amount of bias (.4), meaning something like “we’re already expecting an 8, let’s require less certainty from our machines”. Just to be sure this value doesn’t explode to very large magnitudes, we can wrap the right side of the equation in a function that receives any value and outputs something between 0 and 1, keeping our previous logic about true or false. Let’s name these constants and wrap everything with that bounding function, called s:

w1*a1 + w2*a2 + b1 outputs any unbounded value

isThisAnEight = s(w1*a1 + w2*a2 + b1), outputting between 0 and 1 because of s
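This whole idea fits in a few lines of Python. The sigmoid is a common choice for the bounding function s (the exact function is my assumption; any function squashing into (0, 1) works the same way):

```python
import math

def s(x):
    # Sigmoid: squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def is_this_an_eight(a1, a2, w1=1.2, w2=0.8, b1=0.4):
    # Weighted sum of the two "machine" outputs plus a bias,
    # bounded to (0, 1) by s
    return s(w1 * a1 + w2 * a2 + b1)

print(is_this_an_eight(0.6, 1.0))  # high: probably an 8
print(is_this_an_eight(0.0, 0.0))  # lower, but the bias still pulls it up
```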

Now imagine we’re dealing with the number 5. How many machines should we consider to identify a 5? And the number 6? Regardless of how many, the idea is the same:

someOutput = s(w1*a1 + w2*a2 + ... + wn*an + b1), where we have a unique weight for each input and a single bias for each output.

The catch is that any input can itself be the output of another expression composed of its own several inputs! This way we could have layers where the first ones represent abstract lines (straight and curved), the intermediary ones are circles and specific angles (right angles, etc) and the last one is exactly the 10 possible digits we could have (from 0 to 9). Another catch: we can reuse a2 to identify 6’s lower circle and a1 to identify 9’s upper circle, meaning the same input can feed into different outputs! This way all these arbitrary decisions and connections, with their weights and biases, turn into an ugly network:

ignoring the exact amount of values on each layer and the amount of layers itself, we can see how exponential this grows :(
a closer neural network to our current example
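One layer of such a network is just that same someOutput formula applied once per output neuron. A minimal forward pass in plain Python (the layer sizes here are illustrative, not the 784/30/10 of the real example):

```python
import math
import random

def s(x):
    # Sigmoid bounding function
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    # One output per neuron: s(w1*a1 + w2*a2 + ... + wn*an + b)
    return [
        s(sum(w * a for w, a in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

random.seed(0)
inputs = [random.random() for _ in range(4)]          # e.g. 4 pixel values
weights = [[random.uniform(-1, 1) for _ in range(4)]  # 3 neurons, 4 weights each
           for _ in range(3)]
biases = [random.uniform(-1, 1) for _ in range(3)]

print(layer_forward(inputs, weights, biases))  # 3 values, each between 0 and 1
```

Stacking calls to `layer_forward`, feeding each layer’s output as the next layer’s input, gives the whole network.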

So we know for sure we’ll have 784 values (the pixels of the original image) on the input layer and 10 values (one for each possible digit) on the output layer. But which values should fill all the weights and biases? Besides that, how many layers and how many values per layer will we have?

First of all, let’s name each node on this network a neuron. As for the weights and biases on each neuron, we don’t have to decide! This is the beauty of machine learning: we can set all of them at random and let the network itself regulate and find better guesses for these values! As for the number of layers and the number of neurons on each layer, it’s also very empirical for each problem (really), meaning we should try as many architectures as needed until we find a model that fits our needs. To achieve this “regulate itself” behavior, we need:

  1. A mathematical way to check if the neural network is doing well or being very dumb
  2. Another mathematical way of knowing how to tweak its weights and biases based on the previous measure, something to call a learning process (damn, how I hate this expression)

About the first need, we could establish a cost function that:

  • is zero if the neural network is absolutely correct (meaning no error)
  • increases the farther away the predictions are from the correct expected values

Let’s translate this to math. Calling x the expected result, y the actual result from the neural network and C the cost function, we could say:

  • C >= 0, meaning it is always zero or positive
  • y ~ x -> C ~ 0, meaning the closer y gets to x, the smaller C gets.

There’s no unique right equation for this cost function. Instead there are several options, each with its pros and cons (which I don’t intend to discuss here). For now I want to focus on need number 2, finding a way to change the parameters based on that cost function. Remember, we want C ~ 0, so considering a single fixed input (the number 8 from the beginning, for example), we want to find a set of weights and biases that decreases C the most, that gets the minimum value out of the cost function. We can use some derivatives to achieve this behavior:
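One common choice that satisfies both properties is the mean squared error (picking this specific one is my assumption; the text doesn’t commit to any particular cost function):

```python
def cost(expected, actual):
    # Mean of squared differences: always >= 0,
    # and shrinks toward 0 as actual approaches expected
    return sum((x - y) ** 2 for x, y in zip(expected, actual)) / len(expected)

# Expected output for an 8: neuron 8 fires, the rest stay at 0
expected = [1.0 if digit == 8 else 0.0 for digit in range(10)]

good_guess = [0.0] * 8 + [0.9, 0.0]   # confident and correct
bad_guess = [0.1] * 10                # clueless

print(cost(expected, good_guess))   # small
print(cost(expected, bad_guess))    # much larger
```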

derivative of a single parameter function: receives one value, outputs one value
  1. set any starting random point on the cost function graph. By random point I mean a collection of randomly valued weights and biases
  2. derive the cost function at this point. This way we can discover in which direction it decreases
  3. take a very small step towards decreasing the cost function, changing the weights and biases accordingly
  4. repeat 2 and 3 from this new point as many times as you want:
the four previous steps applied to all input data at once is also called batch gradient descent. google it!
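For a single-parameter cost function the four steps above look like this (the toy cost C(w) = (w - 3)^2 and the step size are made up purely for illustration):

```python
import random

def C(w):
    # Toy cost function with its minimum at w = 3
    return (w - 3) ** 2

def dC(w, h=1e-6):
    # Step 2: numerical derivative tells us which direction C decreases
    return (C(w + h) - C(w - h)) / (2 * h)

random.seed(42)
w = random.uniform(-10, 10)   # step 1: random starting point
learning_rate = 0.1           # the "very small step"

for _ in range(100):             # step 4: repeat
    w -= learning_rate * dC(w)   # step 3: step against the derivative

print(w)  # lands very close to 3, the minimum of C
```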

Now that I’ve presented the happy path, let’s add some sadness to it. First of all, small steps towards the bottom take longer than bigger steps. But bigger steps bring the risk of overshooting the minimum and starting to go up again.

I know 2 dimensions are far away from 23860 😞, but I’m sure it will help you understand it! Think of it as climbing down a hill

Second, our cost function is not a function of a single parameter. Considering a network with 784, 30 and 10 neurons across 3 layers, we actually have 23860 parameters (23820 weights and 40 biases). For now let’s keep it simple, as in that picture: a function of two parameters, whose graph is a surface.
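The 23860 figure is easy to verify: each layer contributes (inputs × outputs) weights, plus one bias per neuron outside the input layer:

```python
layer_sizes = [784, 30, 10]

# One weight per connection between consecutive layers
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron outside the input layer
biases = sum(layer_sizes[1:])

print(weights, biases, weights + biases)  # 23820 40 23860
```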

Another main point is that for each input (each different digit picture), we have a different cost function surface, with different local and global minima. By the way, we have 60000 cataloged numbers here. So even if our network has already minimized its cost function for this specific number 8 with precision, it still knows nothing about any other numbers or even any other different 8s!

For each different input we have a different cost function surface. Let’s consider a small subset of all the digit picture inputs and apply the following steps simultaneously:

  1. Take the same random point on each different surface and average the general direction needed to decrease the cost on all surfaces.
  2. After updating all the values towards this single general direction on every surface, these inputs are “discarded” and a new subset of the unused inputs, called a mini batch, takes their place. We keep taking small steps on all weights and biases, repeating these steps until, after several mini batches, we have used all available inputs.
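The steps above can be sketched with the same one-parameter toy as before. Here each “digit picture” is reduced to a single target value, so each input gives its own cost surface (w - target)^2, and averaging the per-input gradients over each mini batch drives w towards a compromise that works for all of them:

```python
import random

random.seed(0)

# Each input gets its own cost surface (w - target)^2
targets = [random.uniform(0, 10) for _ in range(60)]

def gradient(w, target):
    # Derivative of the per-input cost (w - target)^2 with respect to w
    return 2 * (w - target)

w = random.uniform(-10, 10)   # random starting point
learning_rate = 0.05
batch_size = 10

for epoch in range(200):
    random.shuffle(targets)
    # Walk through the inputs in mini batches
    for i in range(0, len(targets), batch_size):
        batch = targets[i:i + batch_size]
        # Average the direction over every surface in the batch...
        avg_grad = sum(gradient(w, t) for t in batch) / len(batch)
        # ...then one small shared update before moving to the next batch
        w -= learning_rate * avg_grad

print(w)  # settles near the mean of all targets, the best compromise
```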

If this is too abstract, imagine this: you’re playing GTA V, but with a 3-strip split screen and one main character on each strip. With only one single keyboard and one single mouse (sorry, PC master race here), you command all 3 characters simultaneously and in real time, each of them in a different map section. Now comes the issue: your goal is to make their altitude as low as possible starting from random initial positions. But trying to climb down a hill with Trevor in the middle of the desert may lead Michael uphill in Vinewood. Instead of 3 moving points imagine something close to your mini batch size (5, 10, 100?) and instead of a three dimensional surface like any physical map, imagine a 23860 dimension surface (even the word surface doesn’t make sense here anymore haha). The real problem on our network is exactly the same, only more cumbersome.

So we have:

  • batch gradient descent: uses all available inputs for one single averaged parameter update. Good when all inputs generate the same global minimum on the cost surface. The fastest method, when we can generalize the data
  • stochastic gradient descent: uses every single input alone for several small updates. Good when we have very different input data. The most precise method when we can’t generalize any data; takes the longest
  • mini batch gradient descent: divides all available inputs into n subsets and applies n averaged updates. Good when a small set of data still points towards the average global minimum. A middle ground between the above methods
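All three variants are really the same loop with a different batch size, which a small sketch makes concrete (the `make_batches` helper is mine, not anything from the post):

```python
def make_batches(inputs, batch_size):
    # batch_size == len(inputs) -> batch gradient descent (1 update per pass)
    # batch_size == 1           -> stochastic gradient descent (1 update per input)
    # anything in between       -> mini batch gradient descent
    return [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]

inputs = list(range(10))
print(len(make_batches(inputs, len(inputs))))  # 1 update:  batch GD
print(len(make_batches(inputs, 1)))            # 10 updates: stochastic GD
print(len(make_batches(inputs, 5)))            # 2 updates:  mini batch GD
```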
silence, some drunken machine is learning!

Notice that in a single graph the path from mini batch gradient descent looks more like a drunken person and less like an intelligent sober machine. But don’t be fooled, it’s fitting the same values for several different inputs at the same time! So while stochastic gradient descent is slow and precise, and full batch gradient descent is very generalized but also very fast, trying mini batches of different sizes can give us precious info about the best mini batch size for our data.

Last but not least, we talked about:

  1. What is a neural network
  2. How to measure the accuracy of a neural network
  3. How to use this measure to improve it

But we didn’t talk about actually improving it, I mean, I left all the math aside! And being quite serious, that part is simply too complex and painful for this post, and it’s very detailed on my github (please, feel challenged to prove this as I did). The main point is that to update any value (be it a weight or a bias) one only has to know beforehand the outputs it generates. This method is called back propagation. Instead of feeding an input and computing values across the network until a value comes out, we start updating our output parameters and propagate this difference back through the entire network:
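For a small taste of the math being skipped, here is back propagation for a single sigmoid neuron under a squared-error cost, checked against a numerical derivative (a deliberately tiny case of my own; the full multi-layer version is the painful part):

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, b, a):
    # One sigmoid neuron: y = s(w*a + b)
    return s(w * a + b)

# One training example: input a, expected output x
a, x = 0.5, 1.0
w, b = 0.3, -0.2

# Forward pass
y = forward(w, b, a)
error = (y - x) ** 2

# Backward pass: chain rule, starting from the output and
# propagating the error back to each parameter
dcost_dy = 2 * (y - x)            # d/dy of (y - x)^2
dy_dz = y * (1 - y)               # derivative of the sigmoid at z = w*a + b
dcost_dw = dcost_dy * dy_dz * a   # dz/dw = a
dcost_db = dcost_dy * dy_dz       # dz/db = 1

# Sanity check against a numerical derivative
h = 1e-6
numeric_dw = ((forward(w + h, b, a) - x) ** 2
              - (forward(w - h, b, a) - x) ** 2) / (2 * h)
print(dcost_dw, numeric_dw)  # the two should agree closely
```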

go watch 3Blue1Brown! right now!

I hope you now have a new and clearer way of seeing this concept, without any hidden intelligence, only the explicit needs of the cost function and the desired output. The main point being: neural networks are just a tool, and as with every other tool, they need an operator that knows how to use them and wants something very specific!

Any mistakes I made, points left unclear or additions, feel free to write them down or contact me!
