A Dive into Deep Learning — Part 2

Karim ElGhandour
14 min read · Jun 28, 2022


Preface

[Disclaimer: I am just getting started in the field. The articles I write are a way for me to learn more; writing forces me to dig deep into a topic and understand it better. Question everything that I write. I will include references to the resources I use at the end of each article. To understand why I am doing this, check my very first post.] Additionally, there are many resources online created by professors and experts in the field that can guide you through deep learning far better than I possibly can. IN2346, taught by Prof. Niessner and Prof. Leal-Taixé and available online here, is a great example of that.

This is part 2 of the series. You can find part 1 here.

Supervised Learning

Linear Problems, Linear Regression and Maximum Likelihood Estimate

In the first part we reached the conclusion that we can use the Least Squares Estimate (L2 loss) to help find the best separator between classes. Now, why not try to find a probabilistic model instead?

Maximum Likelihood Estimate (MLE) tries to find the parameter values that maximize the likelihood (probability) of predicting the correct class given an input image and the weights.

θ_MLE = arg max_θ p(Y | X, θ)

But what is arg max_θ? It is the operation that finds the θ that gives the maximum value of the probability. Okay, now we can expand the function. Since the probability of all possible predictions can be represented as the product of the individual probabilities, the function can be written as follows
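
(Reconstructing the missing equation from the description above, assuming n training samples x_i with labels y_i:)

θ_MLE = arg max_θ ∏_{i=1}^{n} p(y_i | x_i, θ)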

The issue with using the product, however, is that it adds unnecessary complexity. The solution is to apply the logarithm to the function: the logarithm is monotonically increasing, so it does not change the arg max, and by using the fact that log(AB) = log A + log B we can turn the product into a summation as follows
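
(Again reconstructing the missing equation:)

θ_MLE = arg max_θ ∑_{i=1}^{n} log p(y_i | x_i, θ)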

But what is the actual probability? We assume our probability distribution is actually a Gaussian distribution (also known as the normal distribution), where there is one class with the highest probability for an image, while the others have a small probability.

The red curve is the normal probability distribution curve. (Source: Wikipedia)

We can then write our probability as follows
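
(A reconstruction of the missing equation, assuming the mean is the linear prediction, µ = θᵀx_i:)

p(y_i | x_i, θ) = 1 / (σ√(2π)) · exp( −(y_i − µ)² / (2σ²) )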

In the first equation, µ (pronounced: mu) stands for the mean of the distribution (the x where the probability is highest), and σ (lowercase sigma) stands for the standard deviation, which is the amount of variation of the points from the mean. Visually, µ translates the probability distribution along the x axis, and σ controls how “wide” the curve is.

Now we have everything we need to find the optimum weight to maximize the probability (and minimize the loss function). Here comes the mathematically intensive part.

If we want to maximize (or minimize) a function, we can take its derivative and set it equal to 0. To do so we need to get rid of the summation signs. Since the left side of the equation (green rectangle) is a constant, we can replace that summation with n, which makes it −n/2 · log… For the right-hand side we can use matrix notation to get rid of the summation sign. After we do so, we can take the derivative, set it to 0 and find the optimum weights.
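
(A sketch of the missing steps, reconstructed from the description above; ‖·‖ is the Euclidean norm:)

maximize: −(n/2) · log(2πσ²) − (1/(2σ²)) · ‖y − Xθ‖²
set derivative to zero: ∂/∂θ ‖y − Xθ‖² = −2Xᵀ(y − Xθ) = 0
⇒ XᵀXθ = Xᵀy ⇒ θ = (XᵀX)⁻¹Xᵀy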

As you can probably figure out, we have reached the same θ as we did using the Least Squares Estimate, under two assumptions: the training samples are independent, and the training samples are generated from the same probability distribution.

We need to note that we are assuming the log is to the base e (the natural logarithm, a.k.a. ln). That is why the derivations are done as though we are differentiating ln, not log₁₀.

Classification vs Regression

So far we have mentioned the word regression many times, but we have not really explained what it means, or what the difference between regression and classification is.

Let’s first deal with the definitions:

1- Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome’ or ‘response’ variable, or a ‘label’ in machine learning parlance) and one or more independent variables. [Source: Wikipedia]

If we take the age/height example from part 1, regression would mean finding the relationship between the dependent variable (height) and the independent variable (age). Having found this relationship (the line equation), we can use it to predict the height of a new person given their age.

2- Classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to. [Source: Wikipedia]

Taking that same example, we can divide the problem into classes: [Tall, Average, Short]. Given a person’s age and height, we can “classify” which class they belong to: if their point on the graph lies above the line they are tall, if it lies on the line they are average, and if it lies below the line they are short.

Disclaimer: Please note that a person’s height does not increase forever as they get older, so the line would plateau after a certain age. We are talking about the growth period of a normal human (4–18 in the diagram). The diagram is not scientifically accurate, just an illustration.

Linear Regression: Recap

To recap, the linear regression equation can be written as
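
(Reconstructing the missing equation, assuming a single feature x with weight θ₁ and bias θ₀:)

ŷ = θ₀ + θ₁x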

Or in matrix form as: ŷ = Xθ

Since we randomly initialize the weights and biases, we want to find an equation that maximizes our probability of getting the correct class, or in other words, minimizes the loss, which is the difference between our prediction and the true (expected) class. This can be done through either:

1- Least Squares Estimate:

2- Maximum Likelihood Estimate:

Both of which led us to derive that the best weights and biases can be calculated through:
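
(Reconstructing the missing result from the derivation above, this is the normal equation θ = (XᵀX)⁻¹Xᵀy. Below is a minimal NumPy sketch of it; the age/height numbers are made up purely for illustration.)

```python
import numpy as np

# Toy age/height data (made-up values, for illustration only).
ages = np.array([4.0, 8.0, 12.0, 16.0, 18.0])
heights = np.array([104.0, 128.0, 152.0, 173.0, 178.0])

# Design matrix X with a bias column of ones, so theta = [intercept, slope].
X = np.column_stack([np.ones_like(ages), ages])

# Normal equation: theta = (X^T X)^(-1) X^T y.
# np.linalg.solve is preferred over explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ heights)
print(theta)
```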

Logistic Regression, Sigmoid and (Binary) Cross-Entropy

Instead of following linear regression’s approach of drawing a line, how about we normalize the values and map them between 0 and 1? We can use that for classification: if the outcome is above 0.5 the sample belongs to class 1, and if it is below 0.5 it belongs to class 0.

The sigmoid function accomplishes exactly that; in fact, it is usually introduced as the very first activation function (more on that later).
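
(The formula, reconstructed since the original figure is missing, is σ(x) = 1 / (1 + e⁻ˣ). A one-line sketch in NumPy:)

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)): squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))   # 0.5, the decision boundary
print(sigmoid(4.0))   # close to 1 -> class 1
print(sigmoid(-4.0))  # close to 0 -> class 0
```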

If we are dealing with a binary classification problem (only 2 possible outputs) with two possible probabilities p and q = 1 − p, Jacob Bernoulli described a distribution named after him, the Bernoulli distribution, which dictates that for a random variable X we define X = 1 in the case of success and X = 0 in the case of failure. We can represent this probability as:

Source: sciencedirect.com

This is useful because we can then apply it to our problem and write it as follows:
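
(Reconstructing the missing equation:)

p(y_i | x_i, θ) = ŷ_i^{y_i} · (1 − ŷ_i)^{1 − y_i}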

Where ŷ_i is the output of the sigmoid function applied to the dot product of the weights and the input x_i. As we said, the sigmoid function normalizes values to lie between 0 and 1, so it can be used as a probability, and the same rules for probabilities apply. We can then get rid of the product, as we did with MLE, by using the natural logarithm

log p(y | X, θ) = ∑_{i=1}^{n} [ y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i) ]

Now we can find the loss of a single object (not the cost), which would be
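
(Reconstructing the missing equation:)

L_i = −[ y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i) ]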

This is called the Binary Cross-Entropy loss. In case our target class y_i is 1, the (1 − y_i) part of the equation cancels out (= 0), so we are left with
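
(Reconstructing the missing equation:)

L_i = −log ŷ_i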

We would like ŷ_i to become as close to 1 as possible, which makes our loss as close to 0 as possible (our desired outcome: to always minimize the loss).

But now we want to find the cost function (which, as we explained before, is the average of the loss functions)
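
(Reconstructing the missing equation, averaging the per-sample losses over n samples:)

Cost = −(1/n) ∑_{i=1}^{n} [ y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i) ]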

Notice the negative sign: the natural logarithm goes from negative infinity at x = 0 up to 0 at x = 1, so the summation results in a negative number, and we multiply by a negative sign to obtain a positive number.

Source: Wikipedia

Okay, so now we want to minimize the loss, which is usually done by taking the derivative directly. But do not forget that for logistic regression we first compute the linear regression value, then apply the sigmoid function to that output to obtain the logistic regression. So we cannot simply solve for the weights in closed form. Instead, we compute the gradient using the chain rule: we start from the derivative of the last step and multiply backwards to the derivative of the first. In this case, we take the derivative of the loss, multiply it by the derivative of the sigmoid function, and then by the derivative of the linear regression, finally obtaining the derivative of the logistic regression. We then repeatedly update the weights in the direction that reduces the loss, a procedure known as Gradient Descent. (More about gradient descent in later parts.)
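
(To make this concrete, here is a minimal NumPy sketch, my own illustration rather than something from the course: it trains logistic regression with gradient descent. The function name train_logistic and the toy data are hypothetical; a handy fact is that the chain-rule gradient of binary cross-entropy through the sigmoid collapses to Xᵀ(ŷ − y)/n.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_logistic(X, y, lr=0.1, steps=1000):
    """Logistic regression trained by gradient descent on the BCE cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        y_hat = sigmoid(X @ theta)            # forward pass: linear score -> sigmoid
        grad = X.T @ (y_hat - y) / len(y)     # chain rule collapses to this for sigmoid + BCE
        theta -= lr * grad                    # step against the gradient
    return theta

# Tiny example: class 1 iff the single feature is positive (bias column first).
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(train_logistic(X, y))
```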

Perceptrons and Neural Networks

If we are faced with a linearly separable problem, our first train of thought would be to go with linear regression. We would like to find a line (for a 2D problem), a plane (for a 3D problem) or a hyperplane (for an nD problem). In this case, we can use linear regression to randomly assign weights and biases, and use the L1 or L2 loss functions to finally find the correct weights and biases that separate the data. The output of the linear regression is called the linear score function. There are actually easier solutions than starting from random initial weights and biases and then optimizing, but more on that later.

This is our basic network so far. For each output z, we can calculate the value by
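
(Reconstructing the missing formula, assuming weights w and a bias b:)

z = w₁x₁ + w₂x₂ + … + w_d·x_d + b = wᵀx + b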

Now, if we want to add complexity, why shouldn’t we just add additional layers (like z) after the z? Because the composition of linear functions is still linear, so stacking linear layers adds nothing, as the short check below shows. Hence adding non-linearity to the function is necessary.
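
(A one-line sanity check, using hypothetical weight matrices W₁, W₂ and biases b₁, b₂:)

W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

which is again of the form Wx + b, i.e. a single linear layer.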

Activation Functions

Activation functions are non-linear functions that take the previously computed linear score as input. Their task is to add non-linearity, so that as we add extra layers we actually add complexity to the model. That is why activation functions are necessary.

What are some examples of Activation functions?

Sigmoid, which was explained earlier in this article.

Sigmoid function (Source: Wikipedia)

TanH is simply a rescaled version of the sigmoid function which, instead of normalizing values from 0 to 1, normalizes them from −1 to 1.

Source: (Wikipedia)

ReLU: the Rectified Linear Unit, ReLU(x) = max(0, x), which is in fact an interesting function. If x is less than 0 it returns 0; if x is more than 0 it returns the value as it is. When we go in depth into the differences between activation functions, we will understand why it is still one of the best and most widely used activation functions. For now, keep in mind that it still adds non-linearity to the model: during weight initialization the weights can actually be negative, which leads to negative values that are then zeroed by the ReLU function. There are other variations of ReLU, like LeakyReLU, which instead of zeroing negative values directly reduces them by multiplying with a small constant, so LeakyReLU(x) = max(0.01x, x). Both are sketched in code after the figure below.

Source: (Medium — Article by Kanchan Sarkar)
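
(A minimal sketch of both activations in NumPy; my own illustration, with hypothetical function names:)

```python
import numpy as np

def relu(x):
    # Zeroes out negative values, passes positive values through unchanged.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative values are scaled by a small constant instead of being zeroed.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # [-0.02  -0.005  0.  0.5  2. ]
```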

New Architecture

So now, instead of going directly from the input to the output, there is a transition where we apply an activation function.

For simplicity, the linear score and the activation function are represented in the same “block”.

If we plot a loss against time graph, we expect it to look like this: we begin with a high loss and over time reduce it by continuously modifying the weights. Our goal is to lower the loss just to the point where we get good, generalized guesses.

Loss (y-axis) against Time/rounds trained(x-axis)

Now if we plot a Loss against Parameters (Weights and biases) graph, we expect it to look like so.

Loss (y-axis) and Parameters [Weights and biases] (x-axis)

What this graph means is that we would like to minimize the loss to reach a (local or global) minimum in the graph, and by doing so find the (locally or globally) best weights and biases for our function. Mathematically, when we want to minimize or maximize, we take the derivative of the function. However, the gradient points towards the maximum; to minimize, we need to multiply it by a negative sign to go in the other direction.

Weight update

Now for the real part: how should we update the weights? This can be done using a formula as follows
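
(Reconstructing the missing update rule, with learning rate α:)

θ_new = θ_old − α · ∂L/∂θ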

Where the new weights θ are equal to the old weights minus the derivative of the Loss function multiplied by a constant we call the learning rate.

The learning rate controls how fast our weights are updated each iteration. The smaller it is, the slower the updates; the bigger it is, the faster our model changes, but with the added risk of overshooting the optimum we want to reach.

What we have reached so far

Our model is now learning: we modify the parameters as we train to minimize the loss. Now our model can start to separate some “non-linear” data.

This is basically a Perceptron

Perceptrons and the XOR problem

A perceptron has a single input layer and a single output unit; it is the simplest neural network. As it learns, it tries to find a separation between our two possible classes by continuously modifying the weights.

But the problem comes when we are exposed to something like the XOR (Exclusive OR) problem, which states that for two inputs X1 and X2, the output is true if an odd number of inputs is 1. So if only X1 or only X2 is 1 then the output is 1, but if both are 1 (or both are 0) the output is 0. Its plot looks as follows

XOR Graph

Now that is a problem for our simple perceptron: it is impossible to find a single line that properly separates the two classes. That discovery contributed to a “Dark Age of AI”, and it wasn’t until much later that researchers came up with the idea: why not add extra units (neurons)?

With a model such as this one, where Y1 and Y2 also pass through a linear score and an activation function before producing an output, we no longer have a single input layer directly connected to a single output layer; instead we pass through what is called a hidden layer. A hidden layer is basically a transition step where we calculate values that are then used as input to the next layer. There is no limit to the number of hidden layers we can have, but of course as we add layers we add complexity, resulting in much slower training, and sometimes this is not needed.

But now we have found that, instead of looking for a single separating line, by adding the new layer we can achieve non-linear separation, as the small sketch below demonstrates.
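
(As a sanity check, here is my own sketch, not from the course: a tiny two-layer network with hand-picked weights and a hard-threshold activation that solves XOR exactly. The hidden units compute OR and AND, and the output fires only when OR is active but AND is not.)

```python
import numpy as np

def step(z):
    # Hard-threshold activation: 1 if the linear score is positive, else 0.
    return (z > 0).astype(float)

def xor_network(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: first unit fires for "x1 OR x2",
    # second unit fires for "x1 AND x2".
    h = step(np.array([[1.0, 1.0], [1.0, 1.0]]) @ x + np.array([-0.5, -1.5]))
    # Output unit: OR minus AND, so exactly one active input -> XOR.
    return step(np.array([1.0, -2.0]) @ h - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", int(xor_network(a, b)))
```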

This is basically the idea behind Deep Learning and Neural Networks, and that is why it was such an exciting transition: we can now separate non-linear problems. These can be complex image classification tasks, speech translation or image segmentation. Of course, it is a lot more complex than what I describe in this article.

Conclusion

This concludes the second part of the series. So far we have covered both linear and logistic regression, explored the different loss functions and what they represent, and finally introduced the perceptron, the XOR problem and deep neural networks.

So What now?

If you like the series so far, make sure to follow me on Medium to be updated on the next articles. Also make sure to start a conversation, whether in the comments or with your friends and family; our knowledge advances only when we talk about it and question it. Lastly, you should definitely check the resources I mentioned and look at other articles!

References:

1- This series is heavily influenced by IN2346, taught by Prof. Niessner and Prof. Leal-Taixé, available online here

2- I am also using Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, available online here

3- Arg Max — Wikipedia

