Curriculum learning

Noha Nekamiche
AIGuys
8 min read · May 12, 2023


The concept of curriculum learning in machine learning is based on ordering the training data in a specific way. The approach is inspired by how humans learn: training samples are introduced from simple to complex. This contrasts with the standard approach of randomly shuffling the data during training. By ordering the data this way, the model is expected to learn faster and generalize better.

These approaches come with two practical difficulties, however: the samples must be ranked from simple to complex, and a pacing function for introducing progressively harder data must be chosen.
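To make these two ingredients concrete, here is a minimal sketch of a data-level curriculum loop in Python. The linear pacing function, the toy string dataset, and the length-based difficulty measure are all illustrative choices of mine, not taken from any specific paper:

```python
import random

def linear_pacing(step, total_steps, start_frac=0.2):
    """Fraction of the easy-to-hard sorted data available at a given step."""
    return min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)

def curriculum_batches(samples, difficulty, total_steps, batch_size=2, seed=0):
    """Yield one mini-batch per step, drawn only from the easiest slice
    allowed by the pacing function at that step."""
    rng = random.Random(seed)
    ordered = sorted(samples, key=difficulty)  # rank samples from easy to hard
    for step in range(total_steps):
        cutoff = max(batch_size,
                     int(linear_pacing(step, total_steps) * len(ordered)))
        pool = ordered[:cutoff]                # only sufficiently easy samples
        yield [rng.choice(pool) for _ in range(batch_size)]

# Toy "dataset" of strings; difficulty is simply the string length.
data = ["a" * n for n in range(1, 11)]
batches = list(curriculum_batches(data, difficulty=len, total_steps=5))
```

In a real training loop, each yielded mini-batch would be passed to the optimizer; early batches contain only the easiest samples, and the pool grows as training advances.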

Through this article, we will explore how the literature has addressed these limitations and present a variety of curriculum learning approaches for different machine learning tasks.

Thus, this article is organized as follows:

  • Introduction
  • Definition of Curriculum Learning
  • Different Curriculum Learning Variants
  • Applications of Curriculum Learning
  • Conclusion
  • References

Introduction

Deep learning has reached the state of the art in a wide range of tasks, but the main focus has been on building deeper and deeper neural network architectures. Taking CNN models as an example, early models reached a top-5 error of 15.4% on ImageNet, and the more recent ResNet models reached a top-5 error of 3.6%.

However, if we analyze all these architectures, we find that they consume the training examples in random order and focus only on improving the model, not the way the data is presented.

Moreover, training is usually performed with some variant of mini-batch stochastic gradient descent, in which the samples of each mini-batch are chosen randomly.

Since neural networks were inspired by the human brain, it is also interesting to take inspiration from the way humans learn. When we start to learn new things, we all go from basic concepts to advanced ones; academic programs are structured the same way.

Curriculum Learning

A common starting point for defining curriculum learning (CL) is the definition of machine learning given by Mitchell in 1997:

Definition 1: A model M is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Building on this definition, CL can be defined as gradually increasing the complexity of the experience E during the training process; the early contributions used the CL paradigm exactly in this sense.

In contrast, other studies apply the CL paradigm to other components of machine learning: a series of methods progressively increase the modeling capacity of the model M by adding neural units, deblurring convolutional filters, or activating more units as the training process advances.

On the other hand, some methods apply the CL paradigm to the class of tasks T, by increasing the complexity of the tasks themselves.

In all machine learning models, we optimize an objective function in order to get better predictions. In the curriculum learning paradigm, this objective function plays the role of the performance measure P.

To summarize, CL approaches can be regrouped into two main frameworks, one at the data level and the other at the model level, represented as follows:

Figure 1: Data-level Curriculum Learning

Observing the two frameworks, we find that they share common parts: the curriculum scheduler and the performance measure P. The scheduler is responsible for deciding when to update the complexity of the data/model in order to get better performance. When CL is applied to the data, a difficulty criterion is added to order the data from easy to hard examples. After that, a selection method determines which examples to use for training at the current time. A curriculum over tasks is applied in the same way (see Figure 1).

When CL is applied to the model, however, no difficulty criterion is required, because this time we augment the complexity of the architecture/parameters of the model, which is achieved by the model-capacity curriculum (see Figure 2).

Figure 2: Model-level Curriculum Learning
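One simple way to realize such a model-capacity curriculum is to activate only a growing fraction of a layer's units as training proceeds. The sketch below is an illustrative construction of mine (the linear schedule and the 0/1 unit mask are assumptions, not a specific published method):

```python
def capacity_schedule(epoch, total_epochs, n_units, start_frac=0.25):
    """Number of active units at a given epoch (grows linearly to n_units)."""
    frac = start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1)
    return max(1, int(frac * n_units))

def active_mask(epoch, total_epochs, n_units):
    """0/1 mask to multiply into a layer's activations: early epochs use few
    units, and later epochs progressively activate the rest."""
    k = capacity_schedule(epoch, total_epochs, n_units)
    return [1.0 if i < k else 0.0 for i in range(n_units)]
```

Multiplying this mask into a layer's activations starts training with a reduced-capacity network and lets the scheduler, rather than a difficulty criterion, drive the curriculum.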

Different Curriculum Learning Variants

Vanilla CL

This strategy was first introduced by Bengio et al. (2009), who demonstrated that machine learning models perform better when the difficulty of the training samples is increased gradually. It relies on a purely rule-based criterion for sample selection.

Self-paced learning

It differs from vanilla CL in how the samples fed to the model are evaluated and ordered. In self-paced learning, the order is not known at the beginning but is computed according to the model's performance; therefore, the order can vary during the training process.
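A minimal sketch of the self-paced idea: at each epoch, only samples whose current loss falls below a threshold λ are selected, and λ grows so that harder samples are admitted over time. The per-sample losses and the λ schedule below are toy values for illustration:

```python
def self_paced_selection(losses, lam):
    """Indices of the samples the model currently finds easy (loss < lam)."""
    return [i for i, loss in enumerate(losses) if loss < lam]

# Toy per-sample losses from the current model, and a growing threshold.
losses = [0.1, 0.9, 0.4, 2.0, 0.05]
lambda_schedule = [0.5, 1.0, 2.5]
selected_per_epoch = [self_paced_selection(losses, lam)
                      for lam in lambda_schedule]
```

In practice the losses are recomputed after every epoch, which is exactly why the sample ordering can change as the model improves.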

Balanced curriculum

This category pays attention to the diversity of the samples being introduced to the model. While feeding samples from simple to hard, an extra constraint is added to balance the selected samples at each time t, for example constraints that ensure diversity across image regions or classes.
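As an illustration, a simple class-balancing constraint might take the k easiest samples per class rather than the k easiest overall. This selection rule is a hypothetical example of mine, shown only to make the idea concrete:

```python
from collections import defaultdict

def balanced_easy_subset(samples, k):
    """samples: list of (difficulty, label) pairs.
    Select the k easiest samples from each class, keeping classes balanced."""
    by_class = defaultdict(list)
    for difficulty, label in samples:
        by_class[label].append((difficulty, label))
    subset = []
    for items in by_class.values():
        subset.extend(sorted(items)[:k])  # k lowest-difficulty items per class
    return subset

samples = [(0.1, "cat"), (0.5, "cat"), (0.9, "cat"),
           (0.2, "dog"), (0.8, "dog")]
subset = balanced_easy_subset(samples, 2)
```

Without the per-class split, a purely easy-first selection could flood early training with one over-represented class.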

Self-paced curriculum learning

It is a paradigm in which predefined criteria and learning-based metrics are jointly used to define the training order of samples. It was first introduced by Jiang et al. in 2015 and applied to matrix factorization and multimedia event detection. It has since been used in other tasks, such as weakly-supervised object segmentation in videos and person re-identification.

Progressive CL

This category does not apply CL to the sample order; instead, it is designed as a progressive mutation of the model capacity or the task settings. It applies the curriculum concept to a connected task or to a specific part of the network.

An example of this category is the approach proposed by (Karras et al., 2018), which progressively grows the capacity of Generative Adversarial Networks to obtain high-quality results.

Teacher-student CL

In this category, the training is divided between two models: one that learns the principal task (the student) and an auxiliary model (the teacher) that determines the optimal learning parameters for the student.

The curriculum is thus applied via a network that imposes a policy on the student model, and it is the student that eventually provides the final inference.

Implicit CL

This refers to applying CL without explicitly building a curriculum. For example, Sinha et al. (2020) suggest gradually deblurring convolutional activation maps during training. This differs from a typical curriculum because it starts with a network of reduced learning capacity that becomes more complex as training proceeds.

Applications of Curriculum Learning

CL has been applied to different tasks in Natural Language Processing, Computer Vision, Reinforcement Learning, and robotics. In this section, we provide some examples.

Natural Language Processing

Machine translation is one of the best-known NLP tasks; it involves translating text from one language to another. CL can be used to train machine translation models by introducing easier translations first, for example short sentences or simple language structures, and then increasing the complexity of the translations.

Another example is language modeling, which tries to predict the next word in a sequence. Here too, CL can be applied by training the model on easy tasks first, such as predicting the next word in a short sentence, then progressively moving to longer sentences or paragraphs. In this manner, the model can learn more efficiently and effectively, improving its accuracy as well as its ability to generalize to new situations.
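For both translation and language modeling, a difficulty score could combine sentence length with word rarity. The scoring function below is a toy illustration of mine, not a published criterion:

```python
from collections import Counter

def sentence_difficulty(sentence, word_freq):
    """Longer sentences containing rarer words score as harder."""
    words = sentence.split()
    rarity = sum(1.0 / word_freq[w] for w in words)
    return len(words) + rarity

corpus = ["the cat sat", "the dog ran",
          "ontological hermeneutics perplexes the cat"]
word_freq = Counter(w for s in corpus for w in s.split())
easy_to_hard = sorted(corpus, key=lambda s: sentence_difficulty(s, word_freq))
```

Sorting the corpus by such a score gives exactly the easy-to-hard ordering that the curriculum scheduler then feeds to the model.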

Following the same logic, we can apply CL to sentiment analysis; Figure 3 shows an example of ordering the training data from easy cases to hard ones.

Figure 3: Examples from the SST-2 sentiment classification task.

Many other examples can be found via the links in the references.

Computer Vision

Object detection is one of the best-known uses of the CL paradigm in computer vision. The dataset is ordered starting from simple images that contain few objects with little variation in lighting, background, scale, etc., so that in early epochs the model learns to detect objects easily. As training advances, more complex images with multiple objects and more challenging variations are introduced (see Figure 4).

Figure 4: Ordering the data from easy to hard ones
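A common proxy for image difficulty in detection is simply the number of annotated objects. The sketch below uses a hypothetical annotation format (file name plus a list of bounding boxes) to show the ordering step:

```python
def detection_difficulty(annotation):
    """Use the number of annotated boxes as a rough difficulty proxy:
    images with fewer objects are treated as easier."""
    return len(annotation["boxes"])

images = [
    {"file": "crowd.jpg",  "boxes": [(0, 0, 10, 10)] * 5},
    {"file": "single.jpg", "boxes": [(0, 0, 10, 10)]},
    {"file": "pair.jpg",   "boxes": [(0, 0, 10, 10)] * 2},
]
easy_to_hard = sorted(images, key=detection_difficulty)
```

Richer proxies (object scale, occlusion, lighting variation) could be folded into the same scoring function without changing the ordering machinery.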

Image segmentation is another application of CL. In this task, the goal is to identify the boundaries between objects within an image and to assign each pixel a label based on its object category. Any of the standard detection and segmentation models, such as Faster R-CNN or Mask R-CNN, can be used.

By using the CL paradigm, we can improve the performance of computer vision models and reduce training time; and by gradually increasing the complexity of the training data, the models generalize better and become more robust in real-world applications.

Conclusion

To conclude, the curriculum learning paradigm has been applied successfully across computer vision, NLP, speech processing, and robotic interaction. It brings improvements in tasks ranging from image classification, object detection, and semantic segmentation to question answering and speech recognition.

However, the CL paradigm does not always yield better performance: in some cases it can degrade data diversity, if the sub-datasets obtained after ordering do not contain varied representations of the data. This leads to a suboptimal training process and, as a consequence, worse results.

On the other hand, model-level CL is not sufficiently explored: few articles apply the CL paradigm to the model, and most techniques use it from the data perspective. It is time to start investigating this area.

Moreover, the CL paradigm has also been combined with other learning paradigms, such as supervised learning, cross-domain adaptation, self-paced learning, semi-supervised learning, and reinforcement learning. However, the literature contains only a few works that combine it with unsupervised and self-supervised learning.

Thus, it would be interesting to use the CL paradigm for unsupervised learning, especially since no labeled data is available there: learning from a subset of easy samples at the beginning may offer a better starting point for optimizing an unsupervised model.

References

https://arxiv.org/abs/2101.10382


Noha Nekamiche — AI researcher & PhD student at CIAD LAB, UTBM