Implementing a deep learning network from scratch. A Scala example.

Vsevolod(aka Seva) Dolgopolov
Towards Data Science
6 min read · Mar 7, 2019


Deep learning has received a lot of attention in the past few years. What started as a black-magic voodoo cult is on its way to becoming a fairly standard engineering task: less magic, more a mature toolset for solving a wide range of data-related problems.

Nevertheless, some mystery remains, since it is not obvious how such a network is actually able to learn by itself, and even to do it “deep”, without any direct intervention by a programmer. Let’s try to understand that.

Neural Network in a nutshell.

The notion of “deep learning” refers to an artificial neural network that mimics, to some degree, the way our brain works. Basically, it is about sending an input through a chain of connected layers, where each layer makes its own contribution to the end result.

The actual learning happens through an iterative search for the best possible contribution, or weights, each layer has to provide in order to get the output we need.

Figure 1. Neural Network with 2 hidden layers

But before we look at an actual implementation, it is important to understand the purpose of all those layers, also known as hidden layers. The XOR problem makes it clear. As you can see in Figure 2, you cannot find any linear function that separates the area of A’s from the area of B’s, as you can for AND and OR. There is an overlap in between that does not let us decide whether we are in the A or the B segment.

Figure 2. AND, OR, XOR gates

In order to find an answer we extend our 2D space with an additional dimension (or many more), so that we can eventually separate one class from the other.

Figure 3. 3D space of a XOR gate

In terms of a neural network, an additional dimension is just another hidden layer. So all we need is to figure out whether this extra dimension lets us solve our XOR problem. To do that we will apply the backpropagation algorithm, a key concept published in 1975 that enables interconnected layers to learn their own weights, or in other words to learn how much they contribute to separating A from B in XOR.

With the backpropagation algorithm on board, we apply essentially three steps to enable our “deep” learning to do its job:

  1. Forward pass
  2. Backward pass
  3. Update weights

The forward pass is about making a prediction with the currently available weights. The backward pass tells us how much each layer’s prediction contributed to missing the target. With this information in hand we adjust our weights and hope that in the next iteration our prediction gets closer to the target.

There are different approaches to adjusting the weights. One of them is gradient descent, which we will also use here.
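
Concretely, and jumping ahead to the Scala we are about to write, a single gradient descent step boils down to the update below. This is just a sketch: Breeze’s DenseMatrix stands in for a weight tensor, and gradientDescentStep is an illustrative name of my own, not anything prescribed by the algorithm itself.

```scala
import breeze.linalg.DenseMatrix

// A single gradient descent step (sketch): move the weights a small step
// against the gradient of the error. The learning rate is a hyperparameter
// we pick ourselves, e.g. 0.1 or 0.5.
def gradientDescentStep(weight: DenseMatrix[Double],
                        gradient: DenseMatrix[Double],
                        learningRate: Double): DenseMatrix[Double] =
  weight - (gradient * learningRate)
```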

Implementation in Scala.

In Scala code this three-step process could look like this:

Listing 1. Basic Neural Network Pattern
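
A minimal sketch of that pattern could look as follows; the names trainStep, forward, backward and updateWeights are placeholders of my own, Breeze’s DenseMatrix stands in for the tensors, and the stubbed bodies (???) are filled in under the later listings:

```scala
import breeze.linalg.DenseMatrix

// 1. forward pass: layer-by-layer predictions for the given input
def forward(input: DenseMatrix[Double],
            weights: List[DenseMatrix[Double]]): List[DenseMatrix[Double]] = ???

// 2. backward pass: one error delta per layer
def backward(target: DenseMatrix[Double],
             predictions: List[DenseMatrix[Double]],
             weights: List[DenseMatrix[Double]]): List[DenseMatrix[Double]] = ???

// 3. update: nudge every weight matrix by its layer's delta
def updateWeights(weights: List[DenseMatrix[Double]],
                  input: DenseMatrix[Double],
                  predictions: List[DenseMatrix[Double]],
                  deltas: List[DenseMatrix[Double]],
                  learningRate: Double): List[DenseMatrix[Double]] = ???

// One training iteration wires the three steps together.
def trainStep(input: DenseMatrix[Double],
              target: DenseMatrix[Double],
              weights: List[DenseMatrix[Double]],
              learningRate: Double): List[DenseMatrix[Double]] = {
  val predictions = forward(input, weights)
  val deltas      = backward(target, predictions, weights)
  updateWeights(weights, input, predictions, deltas, learningRate)
}
```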

Let’s take a closer look at our first step, the forward function. We will implement it as a recursive one:

Listing 2. Making prediction with forward pass
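
One way to write such a recursive forward pass is sketched below. I am assuming Breeze here, with breeze.numerics.sigmoid as the activation; the acc parameter collects the predictions head-first (LIFO), which the next step relies on.

```scala
import breeze.linalg.DenseMatrix
import breeze.numerics.sigmoid

// Recursive forward pass (sketch): weight the current layer's input,
// activate it with the sigmoid and prepend the result, so the returned
// List is in LIFO order: its head is the last layer's prediction.
def forward(input: DenseMatrix[Double],
            weights: List[DenseMatrix[Double]],
            acc: List[DenseMatrix[Double]] = Nil): List[DenseMatrix[Double]] =
  weights match {
    case Nil => acc
    case weight :: rest =>
      val net       = input * weight   // 1. apply the weights to the layer
      val activated = sigmoid(net)     // 2. pass it through the sigmoid activation
      forward(activated, rest, activated :: acc)
  }
```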

The forward pass takes care of:

1. applying the weights to the network’s layers

2. passing each weighted layer through the sigmoid activation function

As an outcome we get a new List of per-layer predictions within the network. With this List in place we can go and look for the error each layer’s prediction contributes to missing the target (the gate’s output).

We start with the difference between the target we want to achieve and the prediction we made in the very last step. Since our List of predictions was generated in a LIFO manner, the first element in the List is also the last prediction we made, so we can take the target we were given and see how far off we are. With that first error magnitude in hand we go and look for the rest using the backpropagation pattern.

Listing 3. Finding the first predictions error and passing it to backpropagation
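
A sketch of that first step, again with Breeze matrices; firstDelta and sigmoidPrime are names of my own choosing. The delta is the raw difference scaled by the derivative of the sigmoid:

```scala
import breeze.linalg.DenseMatrix

// Derivative of the sigmoid, expressed through its own output a: a * (1 - a).
def sigmoidPrime(a: DenseMatrix[Double]): DenseMatrix[Double] = a - (a *:* a)

// First error delta (sketch): the head of the LIFO prediction list is the
// very last prediction the network made, so that is where we start.
def firstDelta(target: DenseMatrix[Double],
               predictions: List[DenseMatrix[Double]]): DenseMatrix[Double] = {
  val prediction = predictions.head          // the last prediction we made
  val error      = target - prediction       // how far we are from the target
  error *:* sigmoidPrime(prediction)         // the delta we hand to backpropagation
}
```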

Since we have discovered the error magnitude (or delta) at the very end of the trained network, we can use it to discover the error delta of the layer before, since we know the prediction that layer produced. And so forth for every other hidden layer before it, which leads us to another recursive function.

Listing 4. Backpropagation pattern
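
A sketch of that recursion could look as follows. It expects the weights reversed (output layer first, and without the very first layer’s weights, since there is nothing below the first hidden layer to propagate to), the remaining predictions in the same output-first order, and reuses the sigmoidPrime helper from the previous sketch:

```scala
import breeze.linalg.DenseMatrix

// Same sigmoid derivative helper as in the previous sketch.
def sigmoidPrime(a: DenseMatrix[Double]): DenseMatrix[Double] = a - (a *:* a)

// Backpropagation pattern (sketch): starting from the output delta, walk
// backwards through the network and derive each previous layer's delta.
// `reversedWeights` holds the weights from the output layer down to the
// second layer; `reversedPredictions` the remaining predictions, output-first.
// The result is the list of deltas back in forward order (first layer first).
def backpropagate(delta: DenseMatrix[Double],
                  reversedWeights: List[DenseMatrix[Double]],
                  reversedPredictions: List[DenseMatrix[Double]],
                  acc: List[DenseMatrix[Double]] = Nil): List[DenseMatrix[Double]] =
  (reversedWeights, reversedPredictions) match {
    case (weight :: restW, prediction :: restP) =>
      // push the delta back through this layer's weights and scale it
      // by the sigmoid derivative of the previous layer's prediction
      val previousDelta = (delta * weight.t) *:* sigmoidPrime(prediction)
      backpropagate(previousDelta, restW, restP, delta :: acc)
    case _ =>
      delta :: acc   // reached the first hidden layer: collect its delta too
  }
```

With the firstDelta sketch from above, the call would look like backpropagate(firstDelta(target, predictions), weights.reverse.init, predictions.tail).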

You may have noticed that we provided our weights in reversed order; that is because the only way to compute the backward pass is to do it from the end of the network.

In general we are done here. We know how to compute predictions and their error deltas, and how to use them to update the weights. The only thing left to do is to start iterating over our dataset, applying the updated weights to the next round of predictions, and to keep doing so until we get weights that meet the targets as closely as possible.
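
To make the iteration concrete, here is a rough sketch of such a training loop for the XOR data, composed from the sketches above (forward, firstDelta, backpropagate). The layer sizes, the learning rate of 0.5 and the 10000 iterations are arbitrary choices of mine, and the bias handling discussed at the end of the post is left out for brevity, so take it as an illustration of the pattern rather than a finished implementation.

```scala
import breeze.linalg.DenseMatrix

// The XOR truth table: one sample per row, one target per row.
val xorInput  = DenseMatrix((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0))
val xorTarget = new DenseMatrix(4, 1, Array(0.0, 1.0, 1.0, 0.0))

// Randomly initialised weights for a 2-4-1 network
// (random initialisation is discussed at the end of the post).
val initialWeights = List(DenseMatrix.rand(2, 4), DenseMatrix.rand(4, 1))

// One pass over the dataset: forward, backward, gradient descent update.
def epoch(input: DenseMatrix[Double], target: DenseMatrix[Double],
          weights: List[DenseMatrix[Double]],
          learningRate: Double): List[DenseMatrix[Double]] = {
  val predictions = forward(input, weights)                        // LIFO, head = output
  val deltas      = backpropagate(firstDelta(target, predictions),
                                  weights.reverse.init, predictions.tail)
  val layerInputs = input :: predictions.reverse.init              // what each layer saw
  weights.zip(layerInputs).zip(deltas).map { case ((w, a), d) =>
    w + (a.t * d) * learningRate   // plus, because the delta carries (target - prediction)
  }
}

// Repeat until the predictions meet the targets closely enough.
val trainedWeights = (1 to 10000).foldLeft(initialWeights) { (w, _) =>
  epoch(xorInput, xorTarget, w, learningRate = 0.5)
}
```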

To verify how close you are at every single iteration, you need to compute the network’s loss. As our network learns, the loss has to decrease.

Listing 5. Calculation of a prediction’s loss
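
One simple choice, sketched below, is the mean squared error between the target and the network’s final prediction (the head of our LIFO list); any measure that shrinks as the predictions improve would do the job:

```scala
import breeze.linalg.{DenseMatrix, sum}

// Mean squared error between the target and the final prediction (sketch).
def loss(target: DenseMatrix[Double], prediction: DenseMatrix[Double]): Double = {
  val error = target - prediction
  sum(error *:* error) / target.rows
}
```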

If you run the example implementation provided for this blog post, you will see:

Neural Network learning log

So a decreasing loss means that our predictions are getting closer to the target values they are supposed to have. And if we take a look at the predictions made at the end of the training process, they are pretty close to the values we expect.

Training results. The XOR gate is fulfilled

A bit of a trick.

Maybe the last thing that has not been covered yet is the network’s initial weights. We know pretty much how to update them, but where do we get them in the first place? To clarify that, we need to take a step back and recap the definition of a layer’s prediction. What we already saw is that to make one we need two steps:

  1. Scalar product of input and weight: net = np.dot(input, weight)
  2. Activation of that product with sigmoid function: 1/(1+ np.exp(-net))

But in theory the first step actually has to look like this:

net = np.dot(input, weight) + b

where b stands for the bias, or threshold, and has to be another tensor, responsible for shifting the resulting net before it gets activated by the sigmoid. So we actually need a bias as well, not just the weights as we had before. Sounds like we need to implement a few more things. But there is a trick.

To avoid that additional complexity we just do the following:

  1. add an additional column of ones to our training set tensor (Listing 6. Line 3)
  2. extend the layers’ weights with the same column of ones (Listing 6. Line 11)

and so integrate the bias into our optimisation problem.

Listing 6. Prepare network weights and bias
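
A rough sketch of that preparation step: generateRandomWeight draws the initial weights at random (the uniform range of (-1, 1) is just one reasonable choice of mine), and withBiasColumn, a helper name of my own, appends the column of ones to the training set; the shapes assume the 2-4-1 XOR network from above. Extending the layers’ weights follows the same horzcat idea.

```scala
import breeze.linalg.DenseMatrix

// Random initial weights (sketch): small values uniformly drawn from (-1, 1).
def generateRandomWeight(rows: Int, cols: Int): DenseMatrix[Double] =
  DenseMatrix.rand(rows, cols).map(x => x * 2.0 - 1.0)

// Append a column of ones to the training set, so the bias becomes just
// another weight and is learned along with everything else.
def withBiasColumn(data: DenseMatrix[Double]): DenseMatrix[Double] =
  DenseMatrix.horzcat(data, DenseMatrix.ones[Double](data.rows, 1))

val xorInput    = DenseMatrix((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0))
val biasedInput = withBiasColumn(xorInput)         // 4 x 3 instead of 4 x 2

val initialWeights = List(
  generateRandomWeight(3, 4),   // first layer: 2 inputs + bias column -> 4 hidden
  generateRandomWeight(4, 1)    // output layer: 4 hidden -> 1 output
)
```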

And back to the question of where we get the weights from. Take a look at the generateRandomWeight function on Line 1. That is where our weights initially come from, and they do so more or less randomly. It is pretty weird to realize for the first time that the backbone of a prediction, the weights, can be generated at random and still produce a proper prediction after we have updated them a few times.

Conclusion.

So hopefully you were able to see that “deep learning” is pretty close to a regular programming task. The mystery around this piece of software basically boils down to two main patterns:

  1. Identifying how far our neural network’s prediction is from its actual target, by applying the backward propagation pattern.
  2. Gradually reducing this error by updating the layers’ weights with the help of the stochastic gradient descent pattern.

And maybe a few useful links:

PS. I guess it is worthwhile to mention that this is not a blog post about a production-ready implementation of a neural network in Scala. Maybe next time ;) The main focus here was to show the basic patterns as transparently and clearly as possible. I hope you enjoyed it.
