Faster AI: Lesson 4 — TL;DR version of Part 1

This is Lesson 4 of a series called Faster AI. If you haven’t read Lesson 0, Lesson 1, Lesson 2 and Lesson 3, please go through them first.

In this lesson, we are going to learn about different types of optimization algorithms, semi-supervised learning, and collaborative filtering.

As usual, for the sake of simplicity, I have divided this lesson into 4 parts:

  1. Gradient Descent Algorithms in Excel [Time: 11:28]
  2. State Farm Distracted Driver Detection Competition [Time: 56:57]
  3. Pseudo Labeling (Semi Supervised Learning) [Time: 1:23:45]
  4. Collaborative Filtering (Path to NLP) [Time: 1:36:01]

Before diving into these parts, Jeremy quickly goes through convolutional networks in Excel, showing how convolution works by explaining each part in detail using spreadsheets.

Link to Excel file

1. Gradient Descent Algorithms in Excel

Jeremy goes through all of these algorithms in Excel, explaining how each one works and demoing them inside the spreadsheet.

SGD (Stochastic Gradient Descent)

As we discussed in our previous lesson, SGD calculates the partial derivatives of the loss function with respect to the weights of the system. Each derivative is multiplied by a small number such as 0.001 or 0.0001, called the learning rate, and the result is subtracted from the corresponding weight. Repeating this same process over many iterations optimizes the weights, and we can then predict the desired value from the system.
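As a quick illustration of that update rule, here is a minimal NumPy sketch on a toy least-squares problem (my own example, not the lesson's spreadsheet):

```python
import numpy as np

# Toy least-squares problem: loss = mean((x @ w - y)^2).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = x @ true_w

w = np.zeros(2)
lr = 0.01  # the learning rate: a small number like 0.001 or 0.01
for _ in range(1000):
    grad = 2 * x.T @ (x @ w - y) / len(x)  # partial derivatives of the loss w.r.t. w
    w -= lr * grad                         # subtract learning rate * gradient

print(w)  # converges close to [3, -2]
```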


Momentum

What momentum does is keep, at every step of gradient descent, a running average of the gradients, and this average gives us a sense of the direction the function is heading. Multiplying this average by some factor adds a momentum-like aspect to the optimization algorithm and speeds up the process. Instead of just the partial derivative, this new calculated value is used to update the weights of the system.
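A minimal NumPy sketch of that idea, using a toy quadratic loss (the function and constants are my own choices, not the lesson's Excel demo):

```python
import numpy as np

def minimize_with_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    v = np.zeros_like(w)  # running average of the gradients
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(w)  # update the average direction
        w = w - lr * v                          # step along it, not the raw gradient
    return w

# Toy quadratic with its minimum at [1, 2].
grad = lambda w: 2 * (w - np.array([1.0, 2.0]))
print(minimize_with_momentum(grad, np.zeros(2)))
```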


ADAGRAD

It is often the case that a single learning rate for the whole system does not work, since it may fail to optimize all of the parameters correctly. While one parameter updates properly, others might lag behind and update more slowly than required.

To overcome that, ADAGRAD was introduced. It uses a concept called dynamic learning rates, which basically means providing a different learning rate to each parameter of the system.

In Keras, when you execute model.summary(), it displays the number of parameters in each layer; ADAGRAD gives each of these parameters its own dynamic learning rate.

Because of this, every parameter updates at its own pace and the process is much better optimized.
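A hedged sketch of the per-parameter update (toy NumPy code, not the Excel demo). Note how the parameter with the much steeper gradient still converges, because its accumulated history scales its step down:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.5, eps=1e-8):
    cache += grad ** 2                       # per-parameter history of squared gradients
    w -= lr * grad / (np.sqrt(cache) + eps)  # bigger history -> smaller effective lr
    return w, cache

# Quadratic whose two parameters have very different curvatures; minimum at [1, 1].
w, cache = np.zeros(2), np.zeros(2)
for _ in range(500):
    grad = np.array([200.0 * (w[0] - 1.0), 2.0 * (w[1] - 1.0)])
    w, cache = adagrad_step(w, grad, cache)
print(w)
```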


RMSPROP

It was first introduced by Geoffrey Hinton in his famous Coursera class. It uses a similar concept to momentum, but instead of a running average of the gradients, it keeps a running average of the squared gradients.

One nice thing about this algorithm is that it doesn’t overshoot or explode when the learning rate is too high, which is often the case with the previous algorithms. Instead of overshooting, it circles around the optimum, nudging the value slightly up or down to keep it within the desired range.
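A toy NumPy sketch of the update (my own constants, not Hinton's slides): dividing by the root of the running average of squared gradients keeps the step size bounded.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.05, beta=0.9, eps=1e-8):
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2  # running average of squared gradients
    w -= lr * grad / (np.sqrt(avg_sq) + eps)         # normalized, bounded step
    return w, avg_sq

# Quadratic with minimum at [3, -1]; the step stays bounded instead of exploding.
w, avg_sq = np.zeros(2), np.zeros(2)
for _ in range(500):
    grad = 2 * (w - np.array([3.0, -1.0]))
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w)
```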

ADAM

It uses the concept of averaged gradients from momentum and dynamic learning rates from ADAGRAD.


To put it simply, ADAM is RMSPROP + Momentum.

It uses both a running average of the gradients, as momentum does, and a running average of the squared gradients, as RMSPROP does.
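Combining the two previous sketches gives a minimal ADAM update (toy NumPy code with my own constants; the bias-correction terms are part of the published algorithm):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2  # RMSPROP: running average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for the first steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Quadratic with minimum at [1, 5].
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    grad = 2 * (w - np.array([1.0, 5.0]))
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```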


EVE

EVE is an extension of the ADAM optimizer. It uses a concept called automatic learning rate annealing.

Learning rate annealing is the process of decreasing the learning rate as training approaches the optimum value of the loss function.

EVE automates this process by automatically decreasing the learning rate.

It does so by keeping track of the loss value from the previous epoch and the epoch before that, and calculating the ratio by which they are changing. If the change is too large, the learning rate is decreased; if the loss is roughly constant, the learning rate is increased.
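A hedged sketch of that loss-ratio rule in the spirit of EVE (the function name, thresholds, and factors here are my own illustrative choices, not the paper's):

```python
def anneal_lr(lr, prev_loss, curr_loss, big_change=0.2, decay=0.5, grow=1.1):
    """Shrink lr when the loss swings a lot; grow it when the loss is flat."""
    ratio = abs(curr_loss - prev_loss) / max(abs(prev_loss), 1e-12)
    if ratio > big_change:
        return lr * decay  # loss changing too fast -> cool down
    return lr * grow       # loss roughly constant -> speed up

print(anneal_lr(0.1, prev_loss=1.00, curr_loss=0.55))  # big change: lr is halved
print(anneal_lr(0.1, prev_loss=1.00, curr_loss=0.99))  # flat: lr grows
```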

ADAM with Annealing

This is Jeremy’s own idea, based on automatic annealing and the ADAM optimizer.

Instead of using and comparing loss values from the epochs, he uses the average of the sum of squared gradients.

He then compares the sum of squared gradients from the previous epoch with the current one. The gradients should always be decreasing; if they happen to be increasing, the algorithm decreases the learning rate.
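That comparison can be sketched in a few lines (my own toy names and decay factor, not Jeremy's spreadsheet):

```python
def maybe_decay_lr(lr, prev_sq_grads, curr_sq_grads, decay=0.5):
    # Gradients should shrink as training converges; if their sum of squares
    # rises between epochs, treat it as overshooting and cool the lr down.
    if curr_sq_grads > prev_sq_grads:
        return lr * decay
    return lr

print(maybe_decay_lr(0.1, prev_sq_grads=4.0, curr_sq_grads=6.5))  # rising -> decayed
print(maybe_decay_lr(0.1, prev_sq_grads=4.0, curr_sq_grads=3.0))  # falling -> unchanged
```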

The excel file used to demonstrate and explain above concepts is available here.

2. State Farm Distracted Driver Detection Competition

Jeremy here introduces another Kaggle competition, called State Farm Distracted Driver Detection. A set of images is given in both the train and test sets, each showing the behavior of a driver while driving. Basically, if the driver is looking away from the road, he or she is considered distracted; if he or she is looking forward, he or she is not distracted.

Jeremy then goes on to explain how he tackled this problem and came to a solution.

He followed these steps to solve this problem.

  1. Before using your model on the real dataset, test it on a small sample dataset; this saves time and helps you debug the model.
  2. Start with a simple model of one dense layer, and make the first layer a BatchNormalization layer. This normalizes the input automatically, so there is no need to calculate the statistics and standardize the input yourself.
  3. Always add a Flatten layer before the dense layer, to collapse the output of the previous layer into a single vector.
  4. When first training on the small sample with the default learning rate, decrease the learning rate if it overshoots.
  5. If, after substantially decreasing the learning rate for a second run, the accuracy starts to increase, that suggests a rule of thumb: when accuracy doesn’t move with the default learning rate, decreasing it should be your first instinct.
  6. If the validation accuracy of that small sample model is around 0.5, it’s a good sign; if it’s below that, something is wrong.
  7. If you are solving a computer vision problem, the obvious choice is a convolutional model architecture.
  8. To avoid overfitting, use data augmentation.
  9. The best way to tune data augmentation is to try each type one at a time on a sample with a big enough validation set, find a good value for each augmentation parameter, and then combine them all.
  10. The right amount of regularization cannot be found on the sample data, because it is correlated with dataset size: more data needs less regularization, so a value tuned on a fixed small sample will be too strong for the larger real dataset.
  11. You can use Dropout to reduce overfitting.
  12. It’s always a good idea to use ImageNet features if your dataset is similar to ImageNet.

State Farm code file is available here.

3. Pseudo Labeling (Semi Supervised Learning)

The State Farm dataset in particular has around 80,000 images in the test set, all unlabelled. We can use this huge amount of unlabeled data to our advantage with pseudo labeling.

This is how it works:

  1. Run the unlabeled data through some model and predict labels for it. These predictions are called pseudo labels.
  2. Now take the training labels and concatenate them with the pseudo labels.
  3. If you are using precomputed convolutional features, likewise concatenate the convolutional features of the pseudo-labeled data with the training features.
  4. Use this concatenated data as training data to train a new model.
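The steps above can be sketched end to end with a stand-in model, here a nearest-centroid classifier instead of a CNN (the data and model are my own toy choices; the workflow is the point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled training data: two well-separated 2-D clusters (classes 0 and 1).
x_train = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

# Unlabeled "test" data drawn from the same distribution.
x_unlab = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

def predict(x, centroids):
    # Label each point by its nearest class centroid.
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Step 1: fit on labeled data, then predict pseudo labels for the unlabeled set.
centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in (0, 1)])
pseudo = predict(x_unlab, centroids)

# Steps 2-4: concatenate real and pseudo-labeled data and train a new model.
x_all = np.vstack([x_train, x_unlab])
y_all = np.concatenate([y_train, pseudo])
centroids = np.stack([x_all[y_all == c].mean(axis=0) for c in (0, 1)])
print(predict(x_train, centroids))
```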

This approach uses the unlabeled data to our advantage and increases the accuracy of the model.

More on Semi Supervised Learning here.

4. Collaborative Filtering

While building recommender systems, two main approaches are used:

  1. Meta-data based approach
  2. Collaborative Filtering

Experimentation has shown that the metadata-based approach adds very little value to a recommender system, while collaborative filtering improves the results to a much greater extent.

In collaborative filtering, if we are building a movie recommender system, the idea is: find people similar to you, find what they like, and assume you will like the same things.

Now the lecture breaks down into two parts:

  1. Concept

As said above, we need to understand the existing users to better recommend movies to other users. Once we understand the current users, we will know how similar the targeted user is to the existing ones.

To do that, Jeremy creates a vector of random values for each user; these values conceptually represent the character, or tastes, of that user.

The same approach is applied to movies: a similar random-value vector is created to represent the character of each movie.

Using these two vectors as weights, and the real movie ratings as the target values, Jeremy uses gradient descent to predict his own ratings for these movies, just like the predicted Y values from our earlier linear-function method.

Using the loss function and gradient descent, the process is iterated; when the predicted ratings are close to the real ratings, the weights are optimal, and in our case the weights represent the character of users and movies. Based on these weights we can compare users and recommend movies to them via collaborative filtering.
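That concept can be sketched as a small matrix factorization trained by gradient descent (toy NumPy code; the sizes and constants are my own, not the MovieLens setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, n_factors = 5, 4, 3

# Toy "real" ratings generated from hidden user and movie factors.
ratings = rng.normal(size=(n_users, n_factors)) @ rng.normal(size=(n_movies, n_factors)).T

# Random vectors representing the character of each user and each movie.
u = rng.normal(scale=0.1, size=(n_users, n_factors))
m = rng.normal(scale=0.1, size=(n_movies, n_factors))

lr = 0.05
for _ in range(3000):
    err = u @ m.T - ratings  # predicted rating minus real rating
    # Gradient descent on both sets of weights at once.
    u, m = u - lr * err @ m, m - lr * err.T @ u

print(np.abs(u @ m.T - ratings).max())  # reconstruction error is now small
```

When the predictions match the ratings, the rows of `u` and `m` play the role of the optimized "character" weights described above.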

2. Implementation in Keras

Jeremy applies the above concept, using Pandas to construct the dataset and Keras to create the model.

In this case, he uses something called the Functional Model, instead of the Sequential Model we used previously.

He uses Embedding layers to map user IDs to user characteristics and movie IDs to movie characteristics.

What an Embedding layer does is look at the user id from the users table and grab the corresponding column from the user-characteristics matrix: for user id 1 the embedding looks up the first column, and for user id 3 the third.
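In code, that lookup is nothing more than indexing (a minimal NumPy sketch of my own, using one column per user as in the spreadsheet description):

```python
import numpy as np

# One column of "characteristics" per user, as in the spreadsheet description.
n_factors, n_users = 3, 4
user_factors = np.arange(n_factors * n_users).reshape(n_factors, n_users)

def embed(user_id, table):
    # The embedding is just a lookup: grab the column for this (1-based) user id.
    return table[:, user_id - 1]

print(embed(1, user_factors))  # first column
print(embed(3, user_factors))  # third column
```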

Implemented with neural nets, this concept performed better than the existing state of the art on the MovieLens dataset. Surprisingly, it took only about 10 seconds to train the model.

Collaborative Filtering code file is available here.
All the notes for this lesson are available here.
I encourage you to watch the Full Lecture. You can also jump to any particular topic on video by following the video timeline.
All the codes and Excel files are available here.

In this lesson we went through optimizers and briefly touched on the Functional Model in Keras, which we will discuss further in our next lesson.

See you there.

Next Post: Lesson 5