Faster AI: Lesson 5 — TL;DR version of Fast.ai Part 1

Kshitiz Rimal
Deep Learning Journal
8 min read · Sep 9, 2017

--

This is Lesson 5 of a series called Faster AI. If you haven’t read Lesson 0, Lesson 1, Lesson 2, Lesson 3 and Lesson 4 please go through them first.

This lesson is more conceptual and goes into lots of different ideas in deep learning.

As usual, for the sake of simplicity and convenience, I have divided this lesson into 3 parts:

  1. Analyzing Collaborative Filtering results [Time: 00:10:00]
  2. Natural Language Processing [Time: 00:34:00]
  3. Recurrent Neural Networks [Time: 01:43:06]

1. Analyzing Collaborative Filtering results

In our previous lesson, we briefly covered collaborative filtering and how it can be used to build recommender systems. Here we will talk about some of the techniques behind this approach.

Latent Factors

To characterize any particular set of data, for example a list of movies and a list of users, we need values that properly describe them. In the case of movies, just having a list of movies with the ratings users gave them is not enough.

We further need to describe the movies in terms of various factors like action, comedy, dialogue, sci-fi, etc.

This can be done using latent factors. These factors are the values in a pair of matrices that describe various aspects of each movie. And just as movies have latent factors, users have latent factors as well.

At first these values are initialized at random. The two matrices are then multiplied together, giving another set of values: initial guesses of the rating each user would give each movie. Using gradient descent, these predicted ratings are optimized to be close to the real ratings. As the predictions get closer to the real ones, the latent factors themselves get optimized, just like the weights of any other network. So at the end we have predicted ratings approximately close to the real ratings, and latent factors that properly describe the movies and users.
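The loop above can be sketched in plain NumPy. This is a minimal toy, not the lesson's notebook: the ratings matrix, learning rate, and factor count are all illustrative.

```python
import numpy as np

# Hypothetical toy data: 4 users x 5 movies, ratings 1-5 (0 = unrated).
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
mask = ratings > 0            # optimize only over observed ratings

n_factors = 3                 # the lesson uses 50; 3 keeps the toy small
rng = np.random.default_rng(0)
user_f = rng.normal(scale=0.1, size=(4, n_factors))   # user latent factors
movie_f = rng.normal(scale=0.1, size=(5, n_factors))  # movie latent factors

lr = 0.05
for _ in range(3000):
    pred = user_f @ movie_f.T         # matrix product = predicted ratings
    err = (pred - ratings) * mask     # error on observed entries only
    # Gradient descent on squared error, w.r.t. both factor matrices
    user_f -= lr * err @ movie_f
    movie_f -= lr * err.T @ user_f

final_err = np.abs((user_f @ movie_f.T - ratings))[mask].mean()
print(final_err)  # small: predictions now sit close to the real ratings
```

After training, the rows of `user_f` and `movie_f` are exactly the learned latent factors described above.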

In this lesson the movies have 50 latent factors, that is, 50 columns of values to uniquely describe any particular movie.

PCA

Principal Component Analysis (PCA) is mainly used to build an intuitive understanding of what is going on with the data, through visualization and analysis.

In our case we have 50-dimensional latent factors describing each movie, which is hard for us humans to get a proper picture of. So we reduce these 50 dimensions down to 3 and plot them on a graph, to see which movies belong to which dimension and get a proper look at each of them. That way, it is much easier for us to understand the latent factors that were calculated.

So, in other words, PCA extracts the factors most influential on the data out of many factors, or dimensions in our case. From 50 components, we go down to 3 major principal components to analyze our data.
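One standard way to do this reduction is PCA via the singular value decomposition. A minimal sketch, assuming random stand-in data rather than the lesson's actual movie factors:

```python
import numpy as np

# Hypothetical: 100 movies, each described by 50 latent factors.
rng = np.random.default_rng(1)
movie_factors = rng.normal(size=(100, 50))

# PCA via SVD: center the data, then project onto the top 3
# right singular vectors (the 3 principal components).
centered = movie_factors - movie_factors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
movies_3d = centered @ Vt[:3].T   # each movie is now a 3-D point to plot

print(movies_3d.shape)  # (100, 3)
```

The three resulting coordinates are the axes plotted in the lesson, where nearby movies share the dominant latent traits.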

Here, going from top to bottom, the top holds the most violent movies and the bottom the happiest. Similarly, the top right is more sci-fi and action, whereas the bottom left is more adventure and drama.

Keras Functional API

Up to now we have been coding with the Keras sequential API. We could keep following that approach, but given the nature of our task, it would be difficult to implement that way.

In this particular case of movie recommendations, we need to take two values as input, a movie id and a user id as a vector pair, and predict the rating of the movie.

This implementation is much easier with Keras Functional API.

What it does is, instead of specifying a stack of sequential layers where only the first layer takes an explicit input, we specify the input to each functional layer, such that the output of the previous layer is the input of the next.

As you can see, in this functional API we need to specify an Input first; then the output of the first layer is used as the input to the second layer ‘x’, which in turn is reused as the input for the next layer. The value of the same variable changes at each layer.
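A minimal sketch of a two-input functional model for the rating task, written against `tensorflow.keras`. The sizes (`n_users`, `n_movies`, `n_factors`) are illustrative, not the lesson's notebook values:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot
from tensorflow.keras.models import Model

n_users, n_movies, n_factors = 100, 200, 50

user_in = Input(shape=(1,))    # user id arrives as one input...
movie_in = Input(shape=(1,))   # ...movie id as a second input

# Each layer is called on the previous output, functional-API style
u = Flatten()(Embedding(n_users, n_factors)(user_in))    # user latent factors
m = Flatten()(Embedding(n_movies, n_factors)(movie_in))  # movie latent factors
rating = Dot(axes=1)([u, m])   # dot product = predicted rating

model = Model(inputs=[user_in, movie_in], outputs=rating)
model.compile(optimizer='adam', loss='mse')

pred = model.predict([np.array([[3]]), np.array([[7]])], verbose=0)
print(pred.shape)  # (1, 1)
```

Note how the two embedding branches only meet at the `Dot` layer; that merge of separate inputs is exactly what the sequential API cannot express.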

Depending on the nature of the task, we can use either kind of model, functional or sequential, as desired.

2. Natural Language Processing

All the steps we took in this course, from convolutional image models to a movie recommender system using collaborative filtering and the concept of embedding layers, lead to Natural Language Processing, or NLP in short.

Here, we analyze sentences from English or any other language and try to understand what they mean or what they tend to say.

The first example of NLP is sentiment analysis. Given any text, we need to determine whether it is a positive or a negative sentence.

To analyze any given word, we assign each word an id, so a training vector holds values such as [123, 12, 14, 15], where each integer represents a word. Using the embedding layers we talked about in our previous lesson, we retrieve the particular word for each of these ids. For example, 123 might represent ‘the’, 12 ‘and’, 14 ‘money’, and so on. Each training example is a sentence, so one sentence is represented as a list of these integers.
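Concretely, the id-to-word mapping is just a dictionary lookup. The ids below follow the example in the text; the full mapping is hypothetical, since real datasets ship their own word index:

```python
# Hypothetical vocabulary mapping (real datasets provide their own index)
index_to_word = {123: 'the', 12: 'and', 14: 'money', 15: 'was', 9: 'good'}

sentence_ids = [123, 14, 15, 9]   # one training example = one sentence
sentence = [index_to_word[i] for i in sentence_ids]
print(' '.join(sentence))  # the money was good
```

The network itself only ever sees `sentence_ids`; the embedding layer turns each id into its latent-factor vector.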

Now, to better understand each of these words, the words also have latent factors, just as the movies did.

Here ‘was’ is represented with 50 latent factors.

Similarly, latent factors of ‘and’.

Now, to analyze these words and calculate the sentiment, we first try a neural network with 1 hidden layer.

This gives accuracy close to the published academic results. Then Jeremy uses a convolutional neural network on the same dataset.

Since convolution works on structured data, and sentences are also structured, we get accuracy greater than the academic standard even after running only a few epochs.

One major difference compared to our previous convolutional layers is that it uses a 1D convolutional layer, as opposed to the 2D layers we used before.

That is because here the input is a list of one-dimensional vector values, instead of the 2D matrix of an image. So the max pooling and the filters are also one-dimensional. In other words, a filter slides through the vectors in one direction only, rather than across rows and columns as in our previous cases.
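The one-direction slide can be made concrete with a hand-rolled 1D convolution in NumPy. All shapes here are illustrative (10 words, 32-dimensional embeddings, one filter of width 3):

```python
import numpy as np

# Hypothetical shapes: a sentence of 10 words, each embedded in 32 dims
# (the "channels"); one filter of width 3 sliding along the word axis.
rng = np.random.default_rng(2)
sentence = rng.normal(size=(10, 32))   # (words, channels)
filt = rng.normal(size=(3, 32))        # (filter width, channels)

# 1D convolution: the filter slides over words in one direction only,
# covering all 32 channels at each position.
out = np.array([
    np.sum(sentence[i:i + 3] * filt)   # dot over width x channels
    for i in range(10 - 3 + 1)
])
print(out.shape)  # (8,)

# 1D max pooling with a window of 2 over the 8 convolution outputs
pooled = out.reshape(4, 2).max(axis=1)
print(pooled.shape)  # (4,)
```

A real Keras `Conv1D` layer does the same thing with many filters at once, but the sliding pattern is identical.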

Another important thing to notice is that, although we have one-dimensional vectors of values, we start with a channel size of 32. In the case of an image we would just have the RGB channels, that is, 3. We have 32 because, while 50 is typically used for sentences and words, we have a small dataset and Jeremy thought 32 might just work.

The other layers and the overall approach are similar to the previous convolutional models.

Pre-trained Model (Pre-trained Embeddings/Word Vectors)

Similar to our previous approach, it is always better to use a pre-trained model, here to better analyze the sentiment of the word vectors. Various such models are available on the internet. The one Jeremy uses is the GloVe model, specifically the ‘Gigaword 5’ version, trained on 6 billion word tokens with a 400k-word vocabulary.

Now, to better understand these trained words and how each word relates to the others, we use something called t-SNE, another dimensionality reduction approach. It reduces the 50 dimensions of the word vectors to a 2D plane, so that we can intuitively see where each trained word lies.

As you can see, there is a cluster of punctuation marks in the top right corner, which means the model knows they are similar to one another. Similarly, words like ‘military’, ‘iraq’ and ‘war’ are close to one another.
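"Close to one another" is usually measured with cosine similarity between the word vectors. A toy sketch with made-up 4-D vectors standing in for real 50-D GloVe vectors (the values are illustrative, not actual GloVe weights):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative vectors only; real GloVe vectors are 50-D (or larger)
war      = np.array([0.9, 0.8, 0.1, 0.0])
military = np.array([0.8, 0.9, 0.2, 0.1])
banana   = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine(war, military) > cosine(war, banana))  # True
```

Words that t-SNE plots near each other are exactly those with high cosine similarity in the original embedding space.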

3. Recurrent Neural Networks (RNNs)

In this lesson, Jeremy briefly goes through some concepts used in RNNs and how they are different from other neural models.

In our previous models, we specify the input at the start, and each layer processes what was fed at the beginning and gives an output.

In the case of RNNs, imagine you feed the word ‘I’ into layer 1. The result is fed into layer 2, but now imagine also giving the word ‘am’ directly to layer 2, instead of layer 1. Layer 2 now has two inputs: the processed ‘I’ from layer 1 and ‘am’ as a new input. These two values are processed in layer 2, and the result is fed into layer 3, which again receives a fresh input of its own, ‘Kshitiz’. Layer 3 has thereby processed ‘I’, ‘am’ and ‘Kshitiz’ altogether. This way of feeding a new input at every step is what recurrent networks do.
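The step described above is one shared update applied at every layer: combine the new word with everything seen so far. A minimal NumPy sketch, with hypothetical sizes (4-D word embeddings, 5-D hidden state) and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(3)
W_x = rng.normal(scale=0.5, size=(5, 4))   # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(5, 5))   # hidden-to-hidden weights
embed = {'I': rng.normal(size=4), 'am': rng.normal(size=4),
         'Kshitiz': rng.normal(size=4)}

h = np.zeros(5)                            # initial hidden state
for word in ['I', 'am', 'Kshitiz']:
    # Each step mixes the new word's embedding with the running state,
    # so after the loop h has "seen" the whole sequence in order.
    h = np.tanh(W_x @ embed[word] + W_h @ h)

print(h.shape)  # (5,)
```

In a trained RNN, `W_x` and `W_h` are learned by gradient descent rather than drawn at random, but the recurrence is the same.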

This approach has many advantages. First, the network learns the order in which words occur, that is, ‘I’ should be followed by ‘am’, and only then ‘Kshitiz’. Each layer has its own state, and the overall model understands where each input lies in the bigger picture. This leads to accurate predictions of sequences, which is perfect for our NLP example.

As the image above shows, with this approach RNNs can recognize the similarities and differences between sentences like ‘I went to Nepal in 2009’ and ‘In 2009, I went to Nepal’, which was previously very difficult to analyze.

Lesson 5 notes: http://wiki.fast.ai/index.php/Lesson_5_Notes

Lesson 5 Video: https://www.youtube.com/watch?v=qvRL74L81lg

In our next lesson we will cover more on RNNs and LSTMs (Long Short-Term Memory), a type of RNN.

See you there.

Next Lesson: Lesson 6
