Why Continual Learning is the key towards Machine Intelligence

Vincenzo Lomonaco
Oct 4, 2017 · 11 min read
Brain circuits In Brainbow mice. Neurons randomly choose combinations of red, yellow and cyan fluorescent proteins, so that they each glow a particular color. This provides a way to distinguish neighboring neurons and visualize brain circuits. 2014. HM Dr. Katie Matho.

The last decade has marked a profound change in how we perceive and talk about AI. The concept of Machine Learning, once confined to a corner of AI, has now become so important that some people came up with the new term “Machine Intelligence” [1][2][3], as if to make clear the fundamental role Machine Learning plays in it and to further depart from older symbolic approaches.

Recent Deep Learning (DL) techniques have literally swept away previous AI approaches and have shown how effective learned functions can be at solving incredibly complex tasks involving high-level perception abilities.

Yet, since DL techniques have proven to shine only with a large number of labeled examples, the research community has now shifted its attention towards Unsupervised and Reinforcement Learning, both aiming to solve equivalently complex tasks but with little or no explicit supervision.

However, most of the DL research today is still carried out on tasks which would hardly lead to the more long-term vision of machines endowed with common sense and versatility.

In this story I’d like to shed some light on the paradigm of Continual/Lifelong Learning and why I think it is at least as important as the Unsupervised and Reinforcement Learning paradigms.

What is Continual Learning?

Continual Learning (CL) is built on the idea of learning continuously and adaptively about the external world, enabling the incremental development of ever more complex skills and knowledge.

In the context of Machine Learning it means being able to smoothly update the prediction model to take into account different tasks and data distributions, while still being able to re-use and retain useful knowledge and skills over time.

Hence, CL forces us to deal with a higher, more realistic time-scale where data (and tasks) become available only over time, we have no access to previous perception data, and it is imperative to build on top of previously learned knowledge.
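The setting just described can be sketched in a few lines of entirely illustrative NumPy (the toy tasks and all names are made up for this post, not taken from any library): batches become available only over time from a sequence of tasks, each batch is processed once and then discarded, and the model keeps building on its current parameters instead of being re-trained from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(task_id, n=64):
    """Simulate a stream: each task has a different input distribution."""
    x = rng.normal(loc=task_id, scale=1.0, size=(n, 2))
    y = x @ np.array([1.5, -0.5]) + task_id  # task-dependent target
    return x, y

# A linear model trained online with SGD.
w, b, lr = np.zeros(2), 0.0, 0.05

for task_id in (0, 1, 2):           # tasks become available only over time
    for step in range(500):
        x, y = make_batch(task_id)  # we never revisit old batches
        err = (x @ w + b) - y
        w -= lr * x.T @ err / len(x)
        b -= lr * err.mean()
        # the batch goes out of scope here: no replay buffer, no storage
```

Note how the outer loop never re-initializes `w` and `b`: each new task starts from whatever the model learned before, which is exactly the "build on top of previously learned knowledge" constraint.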

On the terminology

What I’ve described under the name of Continual Learning is a fast-emerging topic in AI which has often been branded as Lifelong Learning or Continuous Learning, and whose terminology is not well consolidated yet.

The term “Lifelong Learning” has been around for years in the AI community, but prevalently used in areas far from the field of Deep Learning. This is why many people would rather go for a more modern term like “Continual Learning” or “Continuous Learning”, targeting specifically Deep Learning algorithms.

I personally love “Continual Learning” (and used it in my papers), since it focuses on and makes explicit the idea of a never-ending learning process. The distinction with “Continuous” is subtle but important.

Even though current research focuses on rigid sequences of tasks where we actually stop learning at the end of each task [4], I find “Continual” would be much more appropriate in the long term, with the development of algorithms which can deal with a continuous stream of data.

On the other hand, the term “Continuous” may be too confusing in many contexts (especially in Reinforcement Learning), as it is often used as the opposite of “Discrete”. That’s why the DL community seems to be converging on the term “Continual” instead.

Wait a minute, but what’s wrong with the terms “Online” and “Incremental”? Like many other researchers, I see the term “Online”, opposed to “Offline/Batch”, as referring to the technical way of processing data in an algorithm rather than to a paradigm of learning [5].

The term “Incremental” instead, while still focusing on the idea of building knowledge incrementally, doesn’t really express the idea of adaptation, which sometimes also means tempering or erasing what has been previously learned [6].

Why Continual Learning?

Let’s step back for a moment and look at some definitions of intelligence given in the past by prominent researchers: Lloyd Humphreys in “The construct of general intelligence”, Reuven Feuerstein in “Dynamic assessments of cognitive modifiability”, and finally Sternberg & Salter in the “Handbook of Human Intelligence”, who concisely define intelligence as “goal-directed adaptive behavior”.

Wow! I really like the last one, very concise and to the point. Now, can you see what’s connecting all these definitions? It’s the idea of adaptation, the ability to mold our cognitive system to deal with ever-changing, demanding circumstances.

Yet, very little of this can be found in the current Deep Learning literature, where much of researchers’ focus has been devoted to solving more and more complicated problems, but in narrow and closed task domains.

Adaptation, while at the core of the definition of intelligence, has so far been left out of the game.

In the next paragraph we will talk more about adaptation, and why it’s an essential quality of any AI system facing the real world rather than unnatural benchmarking settings.

The second and most significant notion behind CL is scalability.

Scalability is one of the most important concepts in Computer Science and, once again, at the core of intelligence.

As we’ll see in the next paragraphs, in CL this idea forces us to think of intelligence at scale and to develop algorithms which can already deal with real-world computational and memory constraints.

If we want machines endowed with versatility and common sense, we had better make sure they can scale in terms of intelligence while staying bounded in terms of resources (computation/memory).

On Adaptability

Let’s now focus again on adaptation and why it is important for a Strong AI system. Nowadays, no matter which subfield or learning setting you are working in, you would go for a fixed, well-confined task and pick a function which can be trained to solve it.

This is amazing if you have a task which involves high-dimensional perception data, but it suddenly becomes less interesting when you want to tackle open-world problems where things keep changing over time.

Unless you assume that the universe can be constrained within a fixed, closed set of tasks, there’s no escape: you need to keep adapting.

CL for continual improvements

The simplest application of CL is in scenarios where the data distribution stays the same but the data keeps coming. This is the classical incremental learning setting.

You can think of a lot of applications where data keeps flowing and continually learning from it is really important to refine the prediction model and, in the end, improve the service offered.

If you think about it, very few problems (even very constrained and well-defined a priori) cannot benefit from a bunch of new data which comes only later in time.

CL for ever-changing scenarios

However, nowadays, for most commercial DL applications it’s OK to re-train the model from scratch with the accumulated data. The game becomes really interesting when the scenario keeps changing over time. This is where CL really shines and other techniques are unable to solve the problem.

Most of the time it’s very hard to collect a large and representative dataset a priori, and it can even be wrong to try when the semantics of the data keep changing over time (i.e. we are actually solving a different task).

For example, you can think of a Reinforcement Learning system in a complex environment in which the reward keeps changing based on a hidden variable we do not control (welcome to real life, LoL).
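As a toy illustration of this kind of non-stationarity (a purely synthetic supervised stand-in, not an actual RL setup): below, the target relation flips halfway through a stream. A model frozen after the first phase degrades badly, while one that keeps adapting online tracks the change.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = 1.0                          # hidden variable controlling the task
frozen_w, online_w, lr = None, 0.0, 0.05
frozen_errs, online_errs = [], []

for t in range(2000):
    if t == 1000:
        w_true = -1.0                 # the hidden variable flips the task
    x = rng.normal()
    y = w_true * x
    if t < 1000:
        online_w += lr * (y - online_w * x) * x   # plain online SGD
        frozen_w = online_w                       # frozen copy = phase-1 model
    else:
        frozen_errs.append((y - frozen_w * x) ** 2)
        online_errs.append((y - online_w * x) ** 2)
        online_w += lr * (y - online_w * x) * x   # only this model adapts
```

After the flip, the frozen model keeps predicting with the old relation and accumulates large errors, while the online learner recovers within a handful of steps.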

On Scalability

Now, how can we ensure that our cognitive system scales in terms of intelligence (while processing more and more data) while keeping its computational/memory requirements bounded, or at least growing slowly?

The core trick is to process data once and then get rid of it. As in biological systems, storing raw perception data (given its high dimensionality and noise rate) would be impossible to maintain and process cumulatively on a long time-scale!

So, you can imagine the AI system as an actual brain which filters perception data and retains only the most important information (Edge Computing people on fire here, LoL).
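The “process once, then discard” idea in its simplest possible form is keeping a compact running summary instead of the raw stream. As a minimal illustration (not tied to any specific CL method), Welford’s online algorithm tracks the mean and variance of a stream one sample at a time, in O(1) memory, no matter how much data flows through:

```python
class RunningStats:
    """Welford's online mean/variance: a compact summary of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        # x can now be thrown away: only three numbers are retained

    @property
    def variance(self):
        # population variance of everything seen so far
        return self.m2 / self.n if self.n > 1 else 0.0
```

A CL model does something analogous at a much higher level: it compresses the stream into parameters instead of summary statistics, but likewise never needs the raw data back.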

At this point, some of you may think: “why not just store all the data?”

Well, IDC published a white paper this year arguing that by 2025 (less than 8 years away) the data generation rate will grow from 16 ZB per year (zettabytes, or a trillion gigabytes) today to 160 ZB, and that we will be able to store only between 3% and 12% of it. You read that right. Data will have to be processed on the fly or be lost forever, because storage technology can’t keep up with a data production rate which is the result of many exponentials combined.

Data creation by type each year. Since the early 2000s we’ve talked a lot about Big Data; well, let’s put that in perspective!

Hence, in the end, CL is not only about reducing the computational burden (avoiding re-training our model from scratch each time new data becomes available): it may be the only way of learning at all, since most of the time we won’t even be able to store the data!

CL is ideal for Unsupervised Streaming Perception data

With high-dimensional streaming (real-time) data (~25% of the Global Datasphere in 2025 [7]) the problem appears even clearer, since it would be simply impossible to keep the data in memory and re-train the entire model from scratch as soon as a new piece of data becomes available.

Of course, in a supervised setting it can be very hard to couple real-time perception data with labels (yet feasible with temporally coherent data, as Neurala showed), but what if we are in an unsupervised setting? Well, CL becomes the perfect buddy to pair with!

CL enables Multimodal-Multitask Learning

Now, what if we don’t have a single stream of perception data but many of them coming from different sensors (with different input modalities) and at the same time we want to solve multiple tasks (welcome to the real-world again)?

Łukasz Kaiser et al. from Google Brain this year [8] came up with a single model which was able to learn very different tasks in very different domains, with many input modalities (albeit with a huge static training set).

However, this beautiful prediction model would be practically impossible to use in a real-world context with current DL techniques, since updating it would require re-training the entire model from scratch (good luck with that) as soon as a new piece of data is available.

Yet, Multimodal and Multitask Learning are essential steps towards Machine Intelligence since, in my view, they are the only way of endowing machines with common sense and basic, implicit “reasoning” skills.

Let’s make an example. In the picture below you can see a very famous and funny error made by an image captioning system [9] based on DL techniques:

One of the mistakes made by the Multimodal Recurrent Neural Network proposed in 2015 by Karpathy & Fei-Fei [4]

So, in this case the model, trained on a set of <image, caption> pairs, has wrongly identified the toothbrush as a baseball bat. But why do we humans laugh at this error? Because it’s obvious that a child of that age wouldn’t be able to hold a baseball bat, and that, as a matter of perspective, a baseball bat can’t be that small.

All these inferences, which can be seen as a simple form of reasoning, are also what we call common sense. But what if the same system, other than just captioning images, were also trained to estimate the age of a person in a picture and the weight/size of each object in a scene? Well, in that case, disambiguating the toothbrush from the baseball bat would become much easier, right? Because the co-occurrence of a very young boy holding that much weight is much less frequent!

Of course, for more complex tasks, multiple input modalities are also needed: think of disambiguating types of birds based on visual as well as auditory cues.

So, in the end, Multimodal and Multitask Learning can be really what makes our AI agents smarter, but it is CL which essentially enables the asynchronous, alternating training of such tasks, updating the model only on the real-time data available from one or more streaming sources at a given moment!

State-of-the-art & Future Challenges

While not yet at its peak, CL has been getting more and more attention in the Deep Learning community, and in the last two years very good contributions have come out ([10][11][12][13]).

I plan to cover a good part of them in a series here on Medium on CL, but let’s summarize what we know so far.


  1. Contrasting catastrophic forgetting is possible in many ways, not only through careful hyper-parametrization or regularization techniques.
  2. We have already shown that CL can be used in complex problems across different domains and learning settings.
  3. Accuracy results (on some toy benchmarks) are impressively high, almost in line with techniques which have full access to previously encountered data.
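To make point 1 concrete, here is a bare-bones sketch of one regularization idea used to contrast forgetting: anchoring the weights near the previous task’s solution while learning a new one. This is a simplified, plain-L2 cousin of methods like EWC (no Fisher weighting), and the two linear “tasks” are made up purely for illustration.

```python
import numpy as np

def train(w, xs, ys, w_anchor=None, lam=0.0, lr=0.1, steps=300):
    """Fit y = xs @ w by gradient descent, optionally anchored to w_anchor."""
    for _ in range(steps):
        grad = xs.T @ (xs @ w - ys) / len(xs)
        if w_anchor is not None:
            grad += lam * (w - w_anchor)   # pull back toward old solution
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
x_a = rng.normal(size=(100, 2)); y_a = x_a @ np.array([1.0, 0.0])  # task A
x_b = rng.normal(size=(100, 2)); y_b = x_b @ np.array([1.0, 2.0])  # task B

w_a = train(np.zeros(2), x_a, y_a)                     # learn task A
w_free = train(w_a, x_b, y_b)                          # fine-tune, no anchor
w_anch = train(w_a, x_b, y_b, w_anchor=w_a, lam=1.0)   # anchored fine-tune

def mse(w, xs, ys):
    return float(np.mean((xs @ w - ys) ** 2))
```

The un-anchored model drifts all the way to task B’s solution and forgets task A; the anchored one lands between the two solutions, trading a bit of task B accuracy for much less forgetting on task A.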


  1. It’s not completely clear how to evaluate CL techniques, and a more formalized framework may be needed.
  2. We have pretty much focused on solving rigid sequences of (simple) tasks (on the scale of dozens), not on streaming perception data, nor on multimodal/multitask problems.
  3. It’s not clear how to behave once the capacity of the model saturates, nor how to selectively forget.

In our latest work [14], recently accepted @ CoRL2017, we tackle the 2nd problem, providing a dataset and benchmark, CORe50, specifically designed for continual object recognition, where temporally coherent visual perception data becomes available in (a lot of) small batches.

CORe50 official Homepage.

If you are intrigued by the latest research on CL, take a look at the collaborative wiki and open community continualai.com I’m currently maintaining or join us on slack! :-)

Even though we are still very far from solving the problem, I’m confident that in a few years Deep Learning techniques will be able to smoothly and continually learn from streaming multimodal perception data, leading to a new generation of AI agents which will unlock thousands of new applications and services, opening the path towards Machine Intelligence.

I hope you enjoyed this post, and I’m looking forward to hearing from you in the comment section! Thank you for your attention, and remember to like it or share it! :-)
I’d also like to give a special thanks to my fellow PhD student Francesco Gavazzo for the useful discussions and his great suggestions which led to this story!

If you’d like to see more posts about my work on AI, follow me on Medium and on my socials: LinkedIn, Twitter and Facebook!
If you want to get in touch or you just want to know more about me and my research, visit my website vincenzolomonaco.com or leave a comment below! :-)



Vincenzo Lomonaco

Written by

AI & Continual Learning Post-Doc @ Unibo | Co-Founder & President of ContinualAI.org | http://vincenzolomonaco.com


A Non-profit Research Organization and Open Community on Continual Learning for AI | Join us @ ContinualAI.org/join_us

