But what is artificial intelligence exactly?

Alessio Bandiera · Published in CodeX · Apr 2, 2022 · 13 min read

The beginning

This is the story about my journey into experimenting with neural networks.

The journey begins the way many stories do: on YouTube, when you are bored. I don’t know about your YouTube home page, but mine is peculiar, filled with random short videos, ugly memes and math videos from many math-related channels. And, one day, the algorithm blessed me with a video that caught my attention:

The video illustrates an implementation of a so-called “neural network”, which acts as the brain of the Chrome dino; as time progresses, the dino learns how to avoid obstacles by learning from its environment. This concept, to me, was astonishing to say the least: I’ve been programming for quite some time, but I had never seen anything like it. Can you really teach something to a bunch of numbers? How can we talk about learning in this context? And what is a “neural network” anyway?

So I immediately knew how I was going to spend the rest of my day, searching on YouTube for other videos explaining in much more detail what’s going on in the dino’s brain.

Neural… what?

Basically, unless you’ve been living under a rock, you may have noticed that artificial intelligence is all around us nowadays, and many companies are implementing these kinds of algorithms, which can interpret the data they gather from us better and better. These algorithms are, for the most part, nothing but “neural networks”: let me show you a little graph that illustrates what we are talking about, even though you’ve probably already seen this diagram in some other article online.

But, first things first: the term “neural network” implies that we are dealing with a “network” of some kind of “neurons”, exactly how brains work if you think about it (I’m not a biologist but you get the point).

A picture of a neuron, which I stole from Wikipedia.

Let’s say that a neuron can be easily represented by two parts: a circle, which for now I’ll call the head, and a line, the tail. Thus, a network of such neurons would look something like this picture I stole from Wikipedia (exactly like the neuron, by the way):

A picture of a neural network model, which I also stole from Wikipedia.

I know, this looks daunting at first, but it is a fairly good representation of a neural network. As the arrows suggest, this diagram has a direction: what we usually do is feed our input into the network through the green dots (on the left), and get the resulting output out of the yellow dot (on the right). The diagram also suggests that the number of neurons is completely arbitrary: the number of inputs, the number of outputs, and even the number of those weird blue dots at the center that we haven’t talked about yet. But before that, we need to introduce another very important term in the machine learning world: layers. If you look closely, the graph is divided into three vertical sections, defined by the columns of neurons: those are the layers of our small neural network. What we can conclude is that networks like this surely have one input layer and one output layer, while the middle column is most commonly called the hidden layer (and usually a network has more than one single hidden layer).

As our intuition would suggest, this sketchy representation of a brain has to match some kind of mathematical concept, otherwise this whole network thing is nothing but a weird drawing. And in fact, this is the drawing of a mathematical model, which is able to learn by tweaking and adjusting its parameters.

Yeah, that’s cool… but in practice?

And now comes the most tedious part of the journey, where we need some prior knowledge to grasp the idea behind the formulas that I’m about to show you. In particular, the branch of mathematics we need in this discussion is mostly linear algebra, which, to be honest, is a subject I’m currently studying myself. But for this article, I’d like to keep things as intuitive as I can, so let’s have a look at the formula, shall we?

The super easy formula of neural networks.

As simple as that.

I know what you are thinking, something like “what on earth is that” and “that escalated quickly, at first I thought I could read this article without any prior knowledge”. Truth be told, this formula is just a way to express what a neural network really does: it takes as input some number of values (which we call the input vector), transforms these values multiple times by passing them through the hidden layers (this is what the composition of functions denotes), and returns the result of these transformations. As soon as you understand this idea, you know that what we are really interested in finding is just how to transform the input values to get the expected output values.
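Since the formula above is only shown as an image, here is roughly what it expresses, written out in my own notation (the exact symbols in the original image may differ): the network is a chain of per-layer functions, each one an affine transformation followed by a non-linear function σ.

$$\mathrm{NN}(\mathbf{x}) = (f_L \circ f_{L-1} \circ \cdots \circ f_1)(\mathbf{x}), \qquad f_k(\mathbf{v}) = \sigma(W_k \mathbf{v} + \mathbf{b}_k)$$

Each f_k is one layer: W_k is its weight matrix, b_k its bias vector, and σ is a non-linear function we’ll come back to later in the article.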

To make things simple, imagine that you have a collection of 2D points, and you want to find the best line that can describe the relationship between the x values and the y values. For example, let’s say we have these numbers:

[(-2, 5), (3, 7), (6, 8.2), (-10, 1.8), (-5, 3.8)]

As we all learn in school, since we need to find a line, we only need 2 points, because through any 2 distinct points there passes exactly one line; but, for the sake of the experiment, we need more than just 2 points, and you’ll see why later on.

OK, so where do we start? Well, a line is basically represented by two numbers: the coefficient, which gives the slope of the line, and the y-intercept, which tells us the “altitude” at which the line sits. School also taught us that there is a simple equation from which we can derive these two numbers, but it’s not a very general solution, because what we would like is a general algorithm that can do better than just lines (because, you have to admit, lines are boring, and the world is more complicated than just linear relationships).
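Concretely, the line we are looking for has the familiar form (I’ll call the slope a and the y-intercept b, names I’ll reuse in the code sketches below):

$$y = ax + b$$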

Let’s try another idea: let’s say that the solution is y = 3.6x + 5.2, with the two numbers in the equation picked at random. Now, let’s check whether we are wrong, for example by plugging in the x coordinate of the first point, and we get

3.6 * (-2) + 5.2 = -2

which is definitely not 5, so we know that our line is incorrect (who would have thought that throwing random numbers at the problem wouldn’t give the correct solution). This result surely doesn’t surprise us, but we would like to evaluate how wrong we are: the expected value was 5, but we got -2, so we are 5 - (-2) = 7 units off. How do we adjust the parameters of our line to match the exact answer, then? The full answer to this question goes way beyond the scope of this article, so instead we are going to use the algorithm I used to train my neural network, which is called “random search”.

The idea is fairly simple: generate some amount of random adjustments to your solution, evaluate how wrong you are with each of these adjustments, and then choose the best one — and repeat this process over and over until the solution is good enough. Sounds reasonable, don’t you think?

But we are missing an important step: how do we choose the best tweak? Let’s introduce the cost function:

The formula used to evaluate the average cost.

And here it is, another bad-looking formula to express a simple concept. What this formula says is: compute the sum of the squares of the differences between the outputs of the network and the expected outputs, and divide the result by the number of outputs (which means that we are evaluating the average cost). And if you think about it, we just did the first step of this computation manually, with the result of 7, except that we didn’t square the difference. So why is there a square there?

We usually square the difference for two reasons. The first is that we want to interpret the error as a distance from the correct value, so it makes much more sense for it to always be positive (also because we are averaging, and positive and negative errors would otherwise cancel out). You might say “just use the absolute value then”, and you would have a point, but the second reason is that squaring gives more importance to bigger errors: the higher the error, the disproportionately higher its square.
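In symbols, the average cost being described is the familiar mean squared error (this is my reconstruction of the formula in the image above; the original notation may differ), where the a_i are the values the model produces and the y_i are the expected ones:

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(a_i - y_i\right)^2$$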

So let’s write a simple Python script that spares us the hassle of computing the average cost by hand:

Small Python code that computes the average cost.
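The original snippet is embedded as an image, so here is a minimal sketch of what it computes (the data set and the guess y = 3.6x + 5.2 come from the article; the variable names are mine):

```python
# The data set of 2D points we want to fit a line through
points = [(-2, 5), (3, 7), (6, 8.2), (-10, 1.8), (-5, 3.8)]

# Our random guess for the line y = a*x + b
a, b = 3.6, 5.2

def average_cost(a, b, points):
    # Mean of the squared differences between predictions and expected values
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

print(average_cost(a, b, points))  # roughly 362.856
```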

As we can see, the result is not very good: in fact, the lower the number is, the lower the error is, and 362.856 is definitely not the lowest we can go.

To minimize the error, we need to modify our code a little bit, since we need to perform this operation multiple times with many random adjustments, until we are satisfied with the solution we get.

Small Python code to approximate any line from the given data set.
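Again, the actual code is in an image, so here is a rough reconstruction of the random-search loop it describes. The epoch count, the use of the normal distribution and the existence of STEP_SIZE come from the explanation below; the values of STEP_SIZE and CANDIDATES, and the rest of the structure, are my guesses:

```python
import numpy as np

points = [(-2, 5), (3, 7), (6, 8.2), (-10, 1.8), (-5, 3.8)]

EPOCHS = 40_000   # how many rounds of adjustments we try
STEP_SIZE = 0.01  # how big the random adjustments are
CANDIDATES = 16   # random adjustments generated per epoch (my choice)

def average_cost(a, b, points):
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

# Start from a random line
a, b = np.random.randn(2)

for _ in range(EPOCHS):
    # Generate a batch of random adjustments drawn from the normal distribution
    tweaks = np.random.randn(CANDIDATES, 2) * STEP_SIZE
    # Keep the adjustment with the lowest average cost, if it improves on
    # the current solution
    costs = [average_cost(a + da, b + db, points) for da, db in tweaks]
    best = int(np.argmin(costs))
    if costs[best] < average_cost(a, b, points):
        a, b = a + tweaks[best][0], b + tweaks[best][1]

print(a, b, average_cost(a, b, points))
```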

This code includes some small details we haven’t talked about yet:

  • 40_000 is the number of “epochs”, which is basically how many times we adjust our approximation (try changing this number and see what happens);
  • we are generating random values without the standard random Python library, because we need every adjustment to be equally likely in every direction, and this can be achieved by drawing samples from the normal distribution, centered at 0 with a standard deviation of 1;
  • STEP_SIZE is a multiplicative constant, which defines how big the generated adjustments should be.

I intentionally omitted the results, by the way; check them out for yourself, I think it’s much more satisfying.

What were we talking about?

But what does all of this have to do with the “neural networks” we were talking about earlier? Well, in reality, neural networks are nothing more special than this small example, except that they are able to approximate non-linear functions, and they are not limited to 2D; in fact, they routinely tackle problems with a huge number of dimensions. Take the MNIST data set, for example: if you’ve never heard of it, it’s just a massive collection of pictures of handwritten digits, each associated with the digit it represents. From this, you can build your own neural network, able to recognize handwritten digits!

Now, let’s dive deeper into what I’ve been doing over the past few days. Have you ever heard of the Rust programming language? If you haven’t, don’t worry, the gods will eventually put you on the right path. Rust is probably the one programming language that completely changed the way I write code and think about its design, and if you don’t know what to do in your spare time, go check it out! It’s not an easy language, by any means; but it’s awesome, trust me. Not only Rust itself: the whole Rust ecosystem, and also the community, is amazing, which is a very important detail.

This Rust praise was just a way to introduce what I’ve been doing this week, which is basically a neural network from scratch, because that’s what “Rustaceans” do. This is the repository with all the code I used to perform the experiments.

For the nerds: usually the computations for neural networks are performed on proper hardware to process the data faster, such as GPUs or Google’s TPUs; instead, here we are computing the matrix multiplications on our CPU, because we are dumb and also because GPU code is stupidly hard. If you need good results and you need them fast, rent some GPU online and don’t bother me. Also, if you decide to run this code on your machine, be warned: depending on the hyper-parameters you choose and the function you want to approximate, the CPU can get very hot (74°C), because the library I’m using to run the computation in parallel uses every bit of CPU available (you can change this behavior if you want, but I don’t know how, because I couldn’t be bothered).

Basically, all the relevant code is here:

The actual relevant code.

This is the code that performs the transformations we talked about earlier, and we also have this other snippet, which is pretty important:

The actual-actual relevant code.

which, as the name suggests, performs the random search, optimizing the parameters of the network.

For the nerds, again: this code is a little different from our previous Python example. To speed up the computations, I generated random seeds instead of random adjustments directly, evaluated the adjustment corresponding to each seed, and then rebuilt the best adjustment after working out which seed gives the minimum average cost.

If you didn’t understand that, read the sentence again, eventually it will make sense.
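To make that a bit more concrete, here is a small Python sketch of the seed trick (this is not the actual Rust code from the repository, just an illustration; params, average_cost and the constants are hypothetical stand-ins):

```python
import numpy as np

STEP_SIZE = 0.01
CANDIDATES = 64

def adjustment_from_seed(seed, shape):
    # The same seed always reproduces the same random adjustment,
    # so we only need to remember the seed, not the adjustment itself
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape) * STEP_SIZE

def random_search_step(params, average_cost):
    seeds = np.random.randint(0, 2**31, size=CANDIDATES)
    # Evaluate the cost of every candidate adjustment...
    costs = [average_cost(params + adjustment_from_seed(s, params.shape)) for s in seeds]
    best = seeds[int(np.argmin(costs))]
    # ...and rebuild only the winning adjustment from its seed
    return params + adjustment_from_seed(best, params.shape)
```

(In a real loop you would also compare against the cost of the current parameters and keep them if no candidate improves on them, as in the earlier Python example.)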

…but wait, what parameters are we optimizing here? There is no “coefficient” in the diagram we looked at earlier, so what are we even talking about?

What are the parameters to tweak here?

Do you remember the picture at the beginning of the article, showing the weird lines-and-circles maze? Now, I want you to imagine that every single circle and line in that drawing represents a number (that’s a lot of numbers to imagine). A random number for each component of the model: that’s the way we initialize neural networks, exactly as we did with our line. And how do we calculate things?

Warning: this is going to get a bit more technical, but if you made it this far you already have a decent grasp of the discussion. If you want to skip this part you are free to move on, but I’d suggest taking a moment to understand the core idea of this paragraph, because it’s definitely the most important section of the whole article.

Each layer is nothing but a vector, and each “tail-maze” between layers is a matrix; that’s where linear algebra comes in. Now let’s add some proper terminology: each circle represents the bias of a neuron (which you can interpret as the y-intercept of a higher-dimensional curve), and each line represents a weight, because to evaluate things we now have to deal with matrix multiplication, and this matrix gives a weight to each input (and also changes the size of the vector).

So, to get some numbers out of that thing, do these steps:

  • put your inputs into the green dots
  • multiply the input vector by the weight matrix, to get a vector with the same size as the blue hidden layer
  • add the bias of the hidden layer
  • apply a non-linear function
  • repeat until you reach the end of the network

One step of the network.
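The image isn’t reproduced here, but the step in question can be written out like this (my notation: a is the vector of values in the current layer, W the weight matrix, b the bias vector, σ the non-linear function):

$$\mathbf{a}^{(k+1)} = \sigma\!\left(W^{(k)}\,\mathbf{a}^{(k)} + \mathbf{b}^{(k)}\right)$$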

Look at this piece of the formula at the beginning of the article once more, because now it should make sense! Easy, right?

…but wait, what about the fourth step? See that sigma symbol right there? That’s the non-linear function, often called the activation function: roughly speaking, its purpose is to allow the neural network to learn non-linear functions. If you want the technical wording, “the composition of affine transformations is still affine”, so without it the network would never be able to learn anything beyond linear relationships.
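Putting the whole forward pass together, a toy version in Python/NumPy might look like this (the layer sizes and the choice of tanh as activation are just examples, not necessarily what the Rust code uses):

```python
import numpy as np

def init_layer(n_inputs, n_outputs):
    # One random number per line (weight) and per circle (bias)
    weights = np.random.randn(n_outputs, n_inputs)
    biases = np.random.randn(n_outputs)
    return weights, biases

def forward(layers, x):
    a = x
    for weights, biases in layers:
        # Multiply by the weight matrix, add the bias, apply the non-linearity
        a = np.tanh(weights @ a + biases)
    return a

# A network with 1 input, two hidden layers of 4 neurons each, and 1 output
layers = [init_layer(1, 4), init_layer(4, 4), init_layer(4, 1)]
print(forward(layers, np.array([0.5])))
```

A real implementation would typically skip the activation on the output layer (otherwise the outputs get squashed into tanh’s range), but this is enough to show the mechanics.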

The final results

Let’s look at some results of our Rust program then:

Approximation of sin(x) from -5.0 to 5.0.

Isn’t that amazing? And what about this one:

Approximation of sin(2x) + x from -5.0 to 5.0.

I think that the results speak for themselves — in particular, in these examples I used networks with 3 hidden layers made out of 32 neurons each.

Conclusion

And with that, I hope I have sparked your interest in this beautiful field, which combines mathematics and computer science so elegantly. As I said, I’m still learning a lot, but I think that nothing compares to the feeling of putting into practice what theory tells you. It took me a lot of time to fix many bugs in my Rust code, but eventually here we are, with these amazing curves plotted on Desmos.

Last, but definitely not least, huge thanks to Modesti Dennis, who helped me through every step of this journey; he knows way more math-related stuff than I do. Thanks also for introducing me to the Rust programming language, and for teaching me how to approach things differently.

If you are really interested in the details of this…

This is the end of the story, but I would like to list some bugs I encountered while writing the Rust code, so if you try to write a neural network from scratch and the average cost doesn’t decrease, check these things first:

  • don’t build an entire library to make a single experiment, it’s dumb because more code = more errors, inevitably
  • check that you are evaluating the average cost of the starting values plus the adjustments, and not the cost of the adjustments alone, otherwise don’t expect the code to work anytime soon
  • make sure to generate the random adjustments with the normal distribution, because the tweaks must be equally probable in every direction
  • make sure that the network has an appropriate size: this part is very important and the answer to this is completely empirical, you need to play around with the parameters, and eventually you will find the best parameters for your problem
  • don’t draw any conclusion unless you know what you are doing

In the README.md of the repository I left the results of some experiments for reference.
