Curriculum Learning in Deep Learning
A Technique You May Not Know.
By Ahmed Hamdi
Have you ever felt that modifying your deep learning model architecture isn’t enough? Well, let me tell you that you are right.
The learnings from this article are inspired by an Omdena Challenge on “Using Computer Vision to Detect Ethnicity in News and Videos and Improve Ethnicity Awareness”.
In our project, one of the collaborators suggested the use of Active Learning. It was the first time I had heard of the topic; I don’t even remember meeting it in any machine learning course that I took, so I started reading about it online. While searching, I discovered a whole new topic called Curriculum Learning. It had also never been mentioned to me before, but this one did pique my interest.
Even though we now have access to many different models with amazing architectures and great performance on benchmark datasets, the bitter truth is that this is not enough. Datasets come in all shapes and sizes, and sometimes the classes aren’t well balanced. They can also contain mislabeled samples, noise, and rare samples that we can’t easily point out. All of these reasons, and many more, can make a great model such as EfficientNet perform very poorly on your dataset. That is why we need training techniques.
Curriculum learning was introduced by Bengio et al. in 2009. In their paper, called “Curriculum Learning”, they observed that humans and animals learn much better when information is presented to them in a meaningful order. We don’t just throw a child into a library and expect them to come out with a Ph.D. in math, so why do we expect a model to do any better? I know, I know, models and children are not the same, but still, haven’t you ever gotten emotionally attached to your model at some point?
Getting back to our topic, Bengio et al. ordered the training samples from easy to hard before presenting them to the model. This gives the model the chance to learn the easier features first and then converge on the harder ones later. In their paper, they experimented with shape recognition. As you can see in the image below, they treated shapes with less variability as easy samples and shapes with more variability as hard ones. In their experiments, this ordering reduced the generalization error.
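To make the easy-to-hard idea concrete, here is a minimal sketch in plain Python. It is not the setup from the paper: the `curriculum_stages` helper, the toy shape list, and the variability scores are all made up for illustration, and I assume that lower variability means an easier sample.

```python
import random

def curriculum_stages(samples, difficulty, num_stages):
    """Yield growing training pools, easiest samples first."""
    ordered = sorted(samples, key=difficulty)
    for stage in range(1, num_stages + 1):
        # Ceiling division, so the last stage always covers the full dataset.
        cutoff = -(-len(ordered) * stage // num_stages)
        yield ordered[:cutoff]

# Toy data (hypothetical): each shape is (name, variability score).
shapes = [("ellipse", 0.9), ("square", 0.1), ("rectangle", 0.7),
          ("circle", 0.2), ("triangle", 0.8), ("equilateral", 0.3)]

stages = list(curriculum_stages(shapes, difficulty=lambda s: s[1], num_stages=3))
for i, pool in enumerate(stages, start=1):
    # In a real loop, mini-batches would be drawn from the current pool.
    batch = random.sample(pool, k=min(2, len(pool)))
    print(f"stage {i}: pool = {[name for name, _ in pool]}")
```

Each stage keeps the earlier easy samples in the pool and adds harder ones. Switching pools entirely between stages would be another reasonable design.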
The curriculum learning technique didn’t stop there. On the contrary, multiple variations have been published since then, and some of them stopped modifying the data in a scheduled way and started modifying the model itself in a scheduled way. To make things clearer, I have added two figures below: one showing the data-based curriculum learning training loop, and the other showing the model-based curriculum learning training loop.
If you don’t understand the figures now, no problem. Just keep reading, then come back at the end and you will find that they start to make sense.
We can break down the different curriculum learning variations into the following categories: Vanilla Curriculum Learning, Self-Paced Learning, Balanced Curriculum Learning, Self-Paced Curriculum Learning, Progressive Curriculum Learning, and Teacher-Student Curriculum Learning.
We will talk about each category except Vanilla CL. You may ask why we are leaving it out. The only reason is that we have already discussed it: Vanilla CL is the basic form of curriculum learning that Bengio et al. proposed in their paper. OK then… I think we are ready, so let’s start with…
Self-Paced Learning is, in some ways, not a curriculum learning variation. We could consider it the cousin of curriculum learning: both are based on the same concept of scheduling the samples before feeding them into the model. As I explained a few lines up, in curriculum learning we don’t show the model the whole dataset at once; we choose the easy samples first. But the question here is: how do we even tell the easy samples from the hard ones?
Bengio et al. decided that the basic shapes would be the easy ones, but maybe the model doesn’t see it that way. Maybe the model would have picked the bigger shapes as the easy ones.
M. Kumar et al. said in their paper, and I will use their wording: “Our self-paced learning strategy addresses the main challenge of curriculum learning, namely the lack of a readily computable measure of the easiness of a sample”. They decided to leave the choice to the model itself. Instead of giving the model a predefined easy-to-hard schedule that we chose, they let the model re-order the samples by itself.
You may ask how we can even know what the model considers an easy sample. Multiple papers have suggested different approaches, and we will discuss some of them. For now, M. Kumar et al. suggested deciding based on the prediction probability. For me that makes sense: when the model predicts a sample with a high probability, I can consider it a confident prediction, which suggests that the model finds this sample easy.
Another approach comes from Yong Jae Lee et al.’s paper “Learning the Easy Things First: Self-Paced Visual Category Discovery”, where they introduced an “easy” function. The function decides which samples are easy based on the features present in the images and orders them accordingly.
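Here is a rough sketch of letting the model pick its own easy samples from prediction confidence. The function names, the threshold values, and the per-epoch confidences are hypothetical. In practice the confidences would come from the model's predictions on the true class, and thresholding the per-sample loss is a common alternative.

```python
def select_easy(confidences, threshold):
    """Return indices of samples the model is currently confident about."""
    return [i for i, p in enumerate(confidences) if p >= threshold]

def self_paced_schedule(confidences_per_epoch, start=0.9, step=0.2, floor=0.0):
    """Lower the confidence threshold each epoch so harder samples are admitted."""
    threshold = start
    selected = []
    for confidences in confidences_per_epoch:
        selected.append(select_easy(confidences, threshold))
        threshold = max(floor, threshold - step)
    return selected

# Hypothetical confidences for 4 samples over 3 epochs.
epochs = [
    [0.95, 0.40, 0.75, 0.10],
    [0.97, 0.55, 0.80, 0.20],
    [0.99, 0.65, 0.90, 0.35],
]
selected = self_paced_schedule(epochs)
for epoch, indices in enumerate(selected, start=1):
    print(f"epoch {epoch}: train on samples {indices}")
```

The model trains only on the selected indices in each epoch, and the set grows as the threshold relaxes and the model becomes more confident.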
Balanced Learning: I know what I am saying now is a no-brainer for anyone reading this article, but your batches should be balanced across classes. If you didn’t know… well, now you know. Why should they be balanced? This way, you make sure your model won’t favor one class over another. Just as you don’t ask a mother to choose her favorite child, you shouldn’t put your model through this.
In the paper “A self-paced multiple-instance learning framework for co-saliency detection”, Dingwen Zhang et al. suggested that, in addition to the curriculum learning approach, the samples should also be diverse enough, which they enforced by choosing them from different parts of the image.
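A class-balanced batch can be sketched in a few lines of plain Python. The `balanced_batch` helper and the toy labels are my own illustration, and I assume that oversampling with replacement is acceptable when a class is too small.

```python
import random
from collections import defaultdict

def balanced_batch(labels, per_class, rng):
    """Return sample indices with exactly `per_class` items from every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    batch = []
    for indices in by_class.values():
        if len(indices) >= per_class:
            batch.extend(rng.sample(indices, per_class))
        else:
            # Oversample with replacement when the class is too small.
            batch.extend(rng.choices(indices, k=per_class))
    rng.shuffle(batch)
    return batch

# Toy imbalanced dataset (hypothetical): 8 cats, 2 dogs.
labels = ["cat"] * 8 + ["dog"] * 2
batch = balanced_batch(labels, per_class=3, rng=random.Random(0))
counts = {c: sum(labels[i] == c for i in batch) for c in set(labels)}
print(counts)
```

Even though cats outnumber dogs four to one in the data, every batch shows the model the same number of each.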
Self-Paced Curriculum Learning is, for me, the best of both worlds, and it is one of my favorite training techniques. As you could have guessed from the name, it combines the curriculum learning concept with the self-paced learning concept. It was introduced by Lu Jiang et al. in their paper “Self-Paced Curriculum Learning”. Their reasoning was that in curriculum learning the order of the training samples is fixed, so the order stays the same as time passes and the model improves. In self-paced learning, on the other hand, the model could overfit because we let it choose the data samples itself.
Their solution was to merge the two: they order the samples before training and then let the model re-order the data samples during training. This way, they gain the merits of both techniques.
Progressive curriculum learning also trains the model from easy to hard, but instead of having to find a way to separate the samples that are perceived as easy from those perceived as hard, we can treat all samples as equals and find a way to make the model think that these samples are easy or hard.
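One very simplified way to picture the merge is to blend a fixed prior difficulty with the model's current loss. This is only my illustration of the idea, not the formulation used by Lu Jiang et al.; the `spcl_order` helper, the `model_weight` parameter, and the numbers are all hypothetical.

```python
def spcl_order(prior_difficulty, current_loss, model_weight):
    """Order sample indices by a blend of prior and model-derived difficulty.

    model_weight = 0 gives the fixed curriculum order,
    model_weight = 1 gives a purely self-paced order.
    """
    scores = [(1 - model_weight) * p + model_weight * loss
              for p, loss in zip(prior_difficulty, current_loss)]
    return sorted(range(len(scores)), key=scores.__getitem__)

# Hypothetical values, both scaled to [0, 1].
prior = [0.1, 0.5, 0.9]   # difficulty we assigned before training
loss = [0.8, 0.2, 0.3]    # what the model currently finds hard

for w in (0.0, 0.5, 1.0):
    print(f"model_weight={w}: order = {spcl_order(prior, loss, w)}")
```

Increasing `model_weight` over the course of training hands more of the ordering decision to the model while the prior still guards the early epochs.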
Before giving you the paper example for progressive curriculum learning, let’s get back to one of the most important layers in any model, the dropout layer.
The dropout layer was introduced in 2014 by Nitish Srivastava et al. in their paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. The dropout layer works by randomly setting a fraction of the neurons’ outputs to 0 at each update during training. This helps prevent the model from overfitting on the training data by intentionally sabotaging the training progress.
This is OK until you think like Pietro Morerio et al. In their paper “Curriculum Dropout”, they argued that at the beginning of training the model’s weights are initialized randomly and the model is already confused enough by this, so by adding a dropout layer with a constant probability we do more harm than good at the start of training. They suggested a scheduled dropout layer, which starts with a dropout probability of 0 and makes its way up to a specified maximum value. If you look closely, you will find that they have applied curriculum learning concepts: with a dropout probability of 0, all the training samples look like easy samples, and as the dropout probability increases, it becomes harder for the model to recognize those samples, making them seem like hard samples… Truly amazing!
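A scheduled dropout probability can be sketched as a simple function of the training step. The exponential ramp below is one plausible shape for such a schedule; the function name and the `max_p` and `gamma` values are assumptions of mine, not settings taken from the paper.

```python
import math

def dropout_probability(step, max_p=0.5, gamma=0.001):
    """Dropout probability that starts at 0 and saturates at max_p."""
    return max_p * (1.0 - math.exp(-gamma * step))

# Inspect the schedule at a few training steps.
for step in (0, 500, 2000, 10000):
    print(f"step {step:>5}: p = {dropout_probability(step):.3f}")
```

In a training loop, you would recompute this value at each step and pass it to whatever dropout operation your framework provides.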
A major challenge in GANs was producing high-quality output images. In “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, published in 2018, Tero Karras et al. introduced a new way of applying progressive curriculum learning to GANs that helped improve the quality of the output images.
They started training on low-resolution images (considered easy samples) and, as training progressed, increased the image resolution while simultaneously adding layers to both the generator and the discriminator. This allowed them to speed up and stabilize the training of the network.
And our final paper in progressive curriculum learning is “Curriculum by Smoothing” published by Samarth Sinha et al. in 2020. I think by now it’s obvious that progressive curriculum learning is my favorite curriculum learning variation.
Getting back to the paper, Samarth Sinha et al. suggested applying a Gaussian filter to blur the feature maps in the model during training and gradually reducing the blurring effect. Why do this? In their paper, they argued that the random initialization of the model weights makes the feature maps too noisy. By applying the blurring filter, they can reduce the noise in the feature maps and keep only the most prominent features.
Later in training, as the model learns, the noise in the feature maps decreases, so we decrease the blurring effect at the same time.
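To see how the blur fades, here is a small one-dimensional sketch in plain Python. A real implementation would blur 2D feature maps inside the network; the helper names, the kernel radius, and the decay settings here are assumptions for illustration.

```python
import math

def gaussian_kernel_1d(sigma, radius=2):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def annealed_sigma(epoch, initial_sigma=1.0, decay=0.9, every=5):
    """Shrink sigma by `decay` every `every` epochs (assumed schedule)."""
    return initial_sigma * decay ** (epoch // every)

def blur(signal, kernel):
    """Convolve a 1D signal with the kernel, clamping indices at the edges."""
    radius, n = len(kernel) // 2, len(signal)
    return [sum(k * signal[min(max(i + j - radius, 0), n - 1)]
                for j, k in enumerate(kernel))
            for i in range(n)]

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]          # stand-in for a noisy feature map
early = blur(noisy, gaussian_kernel_1d(1.0))    # strong blur early in training
late = blur(noisy, gaussian_kernel_1d(0.3))     # weak blur later on
print("early spread:", max(early) - min(early))
print("late spread: ", max(late) - min(late))
```

With a large sigma the alternating pattern is flattened, and as sigma shrinks the filter lets more of the detail through.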
Teacher-Student Curriculum Learning separates the training into two different tasks: training the teacher and training the student. The teacher model is supposed to get feedback from the student model and set up the training schedule by learning the optimal learning parameters for the student. The student model is the final model we will have, and it is trained on the schedule set up by the teacher model. This was originally introduced as a reinforcement learning training technique but was then adapted to other tasks.
While reading papers that applied the teacher-student concept, I came across “On The Power of Curriculum Learning in Training Deep Networks”. In this paper, I was introduced to the pacing function. I don’t think this is the first paper to use one, but let us explain it here. A pacing function decides how long we train on each subset of the training set, that is, how much of the difficulty-ordered data the model can see at each training step.
This article may be biased toward computer vision, as this is my field of study, but after a short search I found that curriculum learning can be used in various other tasks, ranging from machine translation, question answering, and speech recognition to robotics.
In the end, I would like to say that in our field, feeling overwhelmed by information can be frustrating and can lead you to overwork yourself. Try to have a curriculum learning mindset: start with the easy tasks and you will eventually work your way up to the hard ones.