Deep, Deep Learning Concepts
Deep dive into face recognition and 1x1 convolution concepts. Also, I will discuss various optimization algorithms that I learned from Coursera. Let's start.
Before we can understand the face recognition problem, let us first understand the face verification concept.
— What is face verification?
In the face verification problem, we verify whether an input face image belongs to the particular name/ID of the person, which is also passed as input to the face verification system. In other words, it is a 1:1 problem. In the real world, verification systems generally reach around 99% accuracy, and sometimes a little more.
Face verification is also used as a building block inside face recognition.
— What is face recognition?
Face recognition is a broader problem which definitely uses the face verification concept. Here, we recognize an input face image among a large database of people and output the name/ID of the person if the image belongs to anyone in that database. Let us assume that we have a database of 1000 persons and we use a face verification component with 99% accuracy. That means there is a 1% probability of an error for each verification, and that error compounds across the 1000 comparisons, so the chance of making at least one mistake is far higher than 1%. Thus, if we can push our face verification to higher accuracy, we improve the face recognition accuracy as well.
— What is One-shot learning?
One of the problems that we face while building face recognition is one-shot learning. It means recognizing the face of a person using only 1 sample image, because we do not have any considerable amount of data per person to train the model on. A standard convolutional neural network with a softmax output over all known persons does not learn well from a single example per person, so instead we train a network to learn a similarity function, as explained below.
In order to overcome this problem, we compare the features of the two images and compute the degree of difference between them. If the difference between the images is more than a threshold value, then we say that the images belong to different persons, and to the same person otherwise.
As long as you can learn this function d, which takes a pair of images as input and basically tells you whether they show the same person or different persons, then if someone new joins your team, you can add that person's image to your database and it just works fine.
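For concreteness, here is a minimal sketch of how such a function d could be used for both verification (1:1) and recognition (1:K). The encodings are assumed to come from an already trained network, and the threshold tau is purely illustrative:

```python
import numpy as np

def d(emb1, emb2):
    # degree of difference: squared L2 distance between the two encodings
    return np.sum((emb1 - emb2) ** 2)

def verify(query_emb, claimed_emb, tau=0.7):
    # 1:1 verification: same person if the difference is below the threshold tau
    return d(query_emb, claimed_emb) < tau

def recognize(query_emb, database, tau=0.7):
    # 1:K recognition: compare against every stored encoding and keep the closest match
    name, dist = min(((n, d(query_emb, e)) for n, e in database.items()),
                     key=lambda pair: pair[1])
    return name if dist < tau else None

# hypothetical usage with random 128-d encodings standing in for real ones
database = {"alice": np.random.randn(128), "bob": np.random.randn(128)}
print(recognize(np.random.randn(128), database))
```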
In order to learn the difference function as explained above, we use Siamese Networks.
— What are Siamese Networks?
Imagine if we could compare the final feature vectors of the two convolved images, i.e. the vectors obtained just before passing them through the softmax function. That encoded feature vector contains information about the distinctive features of the image, so if we compare these vectors, we are certainly comparing the features of the two input images. This way we can easily tell whether the images belong to the same person or not. I hope you imagined properly!!
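As a rough sketch (the layer sizes and the 128-dimensional encoding are illustrative choices, not the exact architecture from the course), a Siamese set-up is simply one encoder network whose weights are shared across both input images:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_shape=(96, 96, 3), embedding_dim=128):
    # a single shared ConvNet; "Siamese" means the same weights encode both images
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(embedding_dim)(x)
    # L2-normalise so that distances between encodings are comparable
    out = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return Model(inp, out)

encoder = build_encoder()
emb1 = encoder(tf.random.normal((1, 96, 96, 3)))     # encoding of image 1
emb2 = encoder(tf.random.normal((1, 96, 96, 3)))     # encoding of image 2 (same weights)
difference = tf.reduce_sum(tf.square(emb1 - emb2))   # this is the d(x1, x2) from above
```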
In order to train the model and learn the parameters, we need to define a cost function. For face recognition systems, we define a triplet loss function.
— What is the triplet loss function?
As the name suggests, the loss function is defined over three images. The three images are defined below:
- Anchor image (A) — the image of the person we are trying to recognize
- Positive image (P) — another image of the same person as the anchor
- Negative image (N) — an image of a different person
Now, the loss function is defined as below:
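For reference, the standard triplet loss (as defined in the FaceNet paper and the course) is L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0), where f(·) is the encoding and α is the margin. A minimal numpy sketch, with α = 0.2 chosen purely for illustration:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # squared distances between the anchor and the positive / negative encodings
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    # penalty only while the positive is not at least `alpha` closer than the negative
    return max(pos_dist - neg_dist + alpha, 0.0)

f_a, f_p, f_n = np.random.randn(3, 128)   # stand-in 128-d encodings for A, P, N
print(triplet_loss(f_a, f_p, f_n))
```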
During training, if A, P, and N are chosen randomly, the constraint is satisfied too easily and the network learns little. Therefore, choose triplets that are "hard" to train on, i.e. where the anchor-positive distance is close to the anchor-negative distance.
In order to learn more about the above concept, I would suggest the readers read this paper.
Just a little more knowledge!
— 1x1 convolution
- In the above convolution procedure, we multiply a 6x6x1 matrix by a 1x1x1 matrix (a single number).
- We get a 6x6x1 convolved output. Since only 1 filter is used, there is a single channel in the output.
Let us now evaluate using a 6x6x32 input matrix.
- If we have a 6 by 6 by 32 input instead of 6 by 6 by 1, then convolution with a 1 by 1 filter makes much more sense.
- The one by one convolution will look at each of the 36 different positions, and at each position it will take the element-wise product between the 32 numbers of the input and the 32 numbers in the filter and sum them up.
- At last, it will apply a ReLU non-linearity to that result.
If there is more than 1 filter, then the output will look something like below.
blue lines represent different layers
- #filters corresponds to the number of filters, which determines the number of channels in the output (see the sketch below).
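To make the arithmetic concrete, here is a tiny numpy sketch of a 1x1 convolution over a 6x6x32 volume with 16 filters (the filter count is arbitrary): every spatial position becomes a dot product across the 32 channels, one per filter, followed by ReLU.

```python
import numpy as np

def conv_1x1(x, w, b):
    # x: (H, W, C_in) volume, w: (C_in, n_filters) 1x1 filters, b: (n_filters,) biases
    z = x @ w + b              # at each of the H*W positions: dot product over C_in channels
    return np.maximum(z, 0)    # ReLU non-linearity

x = np.random.randn(6, 6, 32)   # 6x6x32 input volume
w = np.random.randn(32, 16)     # sixteen 1x1x32 filters
b = np.zeros(16)
print(conv_1x1(x, w, b).shape)  # (6, 6, 16): one output channel per filter
```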
The above concept is also known as Network in Network and it is best explained in Network In Network research paper.
— Using 1x1 convolutions
- If you want to shrink the height and width, you can use a pooling layer. But what if we want to shrink the number of channels without affecting the width and height of the image? In that case, 1x1 convolutions help a lot. Let us understand this using an example.
- We have a 28x28x192 dimensional volume as the input and we want to decrease it to 28x28x32, i.e. keeping the width and height the same and reducing the number of channels.
- We can achieve this by convolving with 32 filters, each of dimension 1x1x192.
- Thus we will have a 28x28x32 dimensional volume in the output with the same width and height but a different number of channels.
- The one by one convolution also adds non-linearity to the network (through the ReLU), as the sketch below shows.
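As a quick sanity check of the shapes, here is a minimal Keras sketch of the example above (this is only a single layer, not a full network):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(28, 28, 192))                                  # 28x28x192 input volume
out = layers.Conv2D(filters=32, kernel_size=1, activation="relu")(inp)   # 32 filters of size 1x1x192
print(out.shape)  # (None, 28, 28, 32): width and height unchanged, channels shrunk to 32
```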
The above concept of Network in Network has influenced many other network architectures including the Inception networks. In order to learn more about inception networks, read the blog post added in the references section.
— What is the Exponentially weighted average?
Let us understand this concept using an example.
Consider the below graph showing the temperature of London for the whole year.
If we analyze the graph, the temperature values zig-zag, which makes the graph look quite noisy. If you want to compute the trend, a local average or moving average of the temperature works well. In order to compute this moving average, also called the exponentially weighted average, we keep a running value that weights the previous average by β and the current temperature by (1 − β).
In more generic terms, we can write it as V_t = β·V_(t−1) + (1 − β)·θ_t, where θ_t is the temperature on day t; this behaves roughly like averaging over the last 1/(1 − β) days.
If β = 0.9, then the graph would look like this (in red):
β = 0.9 corresponds to averaging over roughly the last 10 days.
If β = 0.98, then the graph would look like this (in green):
β = 0.98 corresponds to averaging over roughly the last 50 days, which gives a smoother but more slowly adapting curve.
If β = 0.5, then the graph would look like this (in yellow):
β = 0.5 corresponds to averaging over roughly the last 2 days, which gives a noisier curve that adapts faster.
Bias Correction
When we set β = 0.98, we actually get the purple curve below rather than the green curve.
If you look at it carefully, the purple curve starts very low. This happens because we initialize V0 = 0.
Thus,
V1 = β·V0 + (1 − β)·θ1
⇒ V1 = (1 − β)·θ1
which is not a good estimate of the first day's temperature (with β = 0.98 it is only 2% of θ1). In order to handle this scenario, we add a bias-correction term: instead of taking Vt, we use Vt / (1 − β^t).
This corrected formula gives much better estimates during the early phase; as t grows, β^t goes to 0 and the correction makes almost no difference.
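Here is a short Python sketch of the exponentially weighted average with and without bias correction; the fake temperature series is just for illustration:

```python
import numpy as np

def exp_weighted_average(thetas, beta=0.98, bias_correction=True):
    v, averaged = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta          # V_t = beta * V_(t-1) + (1 - beta) * theta_t
        averaged.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(averaged)

# made-up daily temperatures with some noise, standing in for the London data
temps = 10 + 5 * np.sin(np.linspace(0, 3, 365)) + np.random.randn(365)
print(exp_weighted_average(temps)[:3])   # with correction, the early values are no longer near zero
```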
— What is Gradient Descent with Momentum?
- Let’s say that you’re trying to optimize a cost function that has contours as shown below. So the red dot in the middle denotes the position of the minimum.
- You may be doing gradient descent on the whole batch or on a mini-batch; either way, the path of gradient descent would look like below.
- These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate.
- In particular, if you were to use a much larger learning rate, you might end up overshooting and end up diverging like below.
- In order to handle the above behavior, we can add momentum to gradient descent. With momentum, the update would look like below.
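The standard momentum update keeps an exponentially weighted average of the gradients and steps along that average instead of the raw gradient. A minimal numpy sketch on a made-up, elongated quadratic cost (the values of α and β are illustrative):

```python
import numpy as np

def momentum_step(w, dw, v, alpha=0.1, beta=0.9):
    v = beta * v + (1 - beta) * dw   # exponentially weighted average of the gradients
    w = w - alpha * v                # step along the smoothed direction, not the raw gradient
    return w, v

# toy bowl-shaped cost J(w) = 0.5 * w^T A w, elongated so the steep axis causes oscillations
A = np.diag([1.0, 50.0])
w, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    dw = A @ w                       # gradient of the toy cost
    w, v = momentum_step(w, dw, v)
print(w)                             # ends up very close to the minimum at the origin
```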
— What is Gradient Descent with RMSProp?
You saw above that momentum can really speed up your gradient descent process. There is one more algorithm with a similar goal, known as Root Mean Square Prop (RMSProp). Let us understand it.
The RMSProp algorithm is similar in effect to the momentum algorithm; only the update equations differ. Please look at the image below to see the algorithm.
🎶 The β used in the RMSProp algorithm is different from the one used in the momentum algorithm. Do not think of them as the same.
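For reference, a minimal numpy sketch of the standard RMSProp update; the hyperparameter values and the made-up gradient are purely illustrative:

```python
import numpy as np

def rmsprop_step(w, dw, s, alpha=0.01, beta=0.999, eps=1e-8):
    # keep an exponentially weighted average of the *squared* gradients
    s = beta * s + (1 - beta) * dw ** 2
    # dividing by sqrt(s) shrinks the step in directions where the gradient swings widely
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s

w, s = np.array([5.0, 5.0]), np.zeros(2)
dw = np.array([1.0, 50.0])        # a made-up gradient, much steeper in one direction
w, s = rmsprop_step(w, dw, s)
print(w)                          # both coordinates move by a similar amount despite the steepness gap
```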
— ADAM Algorithm
The Adam algorithm's name comes from Adaptive Moment Estimation. It combines both the momentum and RMSProp ideas and turns out to be a very efficient algorithm. Please look below to understand the algorithm.
Generally, we use the following values for the hyperparameters in the above algorithm (they are also used in the sketch after this list):
- α — needs to be tuned
- β1 — 0.9
- β2 — 0.999
- ε — 10^(-8)
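Putting the pieces together, here is a minimal numpy sketch of the Adam update using the default values listed above; the toy cost and iteration count are purely illustrative:

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw           # momentum-style average of the gradients
    s = beta2 * s + (1 - beta2) * dw ** 2      # RMSProp-style average of the squared gradients
    v_hat = v / (1 - beta1 ** t)               # bias correction, exactly as described earlier
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([5.0, 5.0])
v, s = np.zeros(2), np.zeros(2)
A = np.diag([1.0, 50.0])                       # same toy elongated cost as before
for t in range(1, 5001):                       # t is the 1-based iteration counter
    w, v, s = adam_step(w, A @ w, v, s, t)
print(w)                                        # heads towards the minimum at the origin
```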
I would like to thank Coursera and deeplearning.ai for the above concepts and would suggest that readers go through them. Links are added in the references section.
— References
- Coursera(CNN by deeplearning.ai) — **highly recommended**
- Optimization algorithms by Coursera — **highly recommended**
- https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202