Why Continual Learning is the key towards Machine Intelligence
The last decade has marked a profound change in how we perceive and talk about Artificial Intelligence. The concept of learning, once confined to a corner of AI, has become so important that some people coined the new term “Machine Intelligence” to make clear the fundamental role Machine Learning plays in it and to depart further from older symbolic approaches.
Recent Deep Learning (DL) techniques have literally swept away previous AI approaches and have shown how beautiful, end-to-end differentiable functions can be learned to solve incredibly complex tasks involving high-level perception abilities.
Yet, since DL techniques have proven to shine only with a large number of labeled examples, the research community has now shifted its attention towards Unsupervised and Reinforcement Learning, both aiming to solve equally complex tasks with no explicit supervision (or as little as possible).
However, most DL research today is still carried out on specific, isolated tasks, which will hardly lead to a longer-term vision of Machine Intelligence endowed with common sense and versatility.
In this story I’d like to shed some light on the paradigm of Continual/Lifelong Learning and explain why I think it is at least as important as the Unsupervised and Reinforcement Learning paradigms.
What is Continual Learning?
Continual Learning (CL) is built on the idea of learning continuously and adaptively about the external world and enabling the autonomous incremental development of ever more complex skills and knowledge.
In the context of Machine Learning, it means being able to smoothly update the prediction model to take into account different tasks and data distributions while still being able to re-use and retain useful knowledge and skills over time.
Hence, CL is the only paradigm which forces us to deal with a longer and more realistic time-scale, where data (and tasks) become available only over time, we have no access to previous perception data, and it’s imperative to build on top of previously learned knowledge.
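To make the "update without losing what was learned" idea concrete, here is a minimal toy sketch, not a method taken from this post. It assumes a two-weight linear model, two synthetic tasks, and an importance-weighted quadratic penalty that pulls each weight back towards its value after the first task. All names (`sgd`, `anchor`, `importance`, `strength`) and the choice of penalty are illustrative assumptions. Note that the data of the first task is never revisited while learning the second one.

```python
import random

def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def sgd(w, data, lr=0.05, epochs=200, anchor=None, importance=None, strength=0.0):
    """Plain SGD on squared error; optionally adds a quadratic penalty that
    keeps important weights close to `anchor` (hypothetical CL regularizer)."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            for i in range(2):
                grad = 2 * err * x[i]
                if anchor is not None:
                    grad += 2 * strength * importance[i] * (w[i] - anchor[i])
                w[i] -= lr * grad
    return w

rng = random.Random(0)
us = [rng.uniform(-1.0, 1.0) for _ in range(50)]
task_a = [((u, 0.0), 2.0 * u) for u in us]  # only weight 0 matters for task A
task_b = [((u, u), 3.0 * u) for u in us]    # task B only constrains w0 + w1

# Learn task A, then summarize it with the weights and a per-weight importance
# (mean squared input); the task A samples are not used for training again.
w_a = sgd([0.0, 0.0], task_a)
importance = [sum(x[i] ** 2 for x, _ in task_a) / len(task_a) for i in range(2)]

w_naive = sgd(w_a, task_b)  # plain fine-tuning on task B
w_cl = sgd(w_a, task_b, anchor=w_a, importance=importance, strength=10.0)

print("task A error after naive fine-tuning:", mse(w_naive, task_a))
print("task A error with the penalty:       ", mse(w_cl, task_a))
print("task B error with the penalty:       ", mse(w_cl, task_b))
```

In this toy setting both variants fit the second task, but plain fine-tuning shifts the weight the first task relied on, while the penalized version moves only the weight the first task did not care about. Real problems are rarely this separable, so this is an illustration of the goal rather than a recipe.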
On the terminology
What I’ve described under the name of Continual Learning is a fast-emerging topic in AI which has often been branded as Lifelong Learning or Continuous Learning, and the terminology is not well consolidated yet.
The term “Lifelong Learning” has been around for years in the AI community, but mostly used in areas far away from the field of Deep Learning. This is why more people would go for a modern term like “Continuous” or “Continual Learning”, targeting Deep Learning algorithms specifically.
I personally love (and have used in my papers) “Continuous Learning”, since it makes explicit the idea of a smooth and continuous adaptation process that never stops. The distinction from Continual is subtle but important, as beautifully put by Oxford Dictionaries:
Both can mean roughly “without interruption” […] however, Continuous is much more prominent in this sense and, unlike Continual, can be used to refer to space as well as time […]. Continual, on the other hand, typically means ‘happening frequently, with intervals between’ […].
Even though current research focuses on rigid task-sequence problems where we actually stop learning at the end of each task, I find Continuous Learning would be much more appropriate in the long term, with the development of algorithms which can deal with a continuous stream of perception data like the one the real world provides.
On the other hand, the term “Continuous” can be confusing in many contexts (especially in Reinforcement Learning), since it is often used as the opposite of “Discrete”. That’s why the DL community seems to be converging on the term “Continual” instead.
Wait a minute, but what’s wrong with the terms “Online” and “Incremental Learning”? Like many other researchers, I see the term “Online”, as opposed to “Batch Learning”, as describing the technical way an algorithm processes data rather than a paradigm of learning.
The term “Incremental Learning” instead, while still focused on the idea of building knowledge incrementally, doesn’t really express the idea of adaptation, which sometimes also means tempering or erasing what has been previously learned.
Why Continual Learning?
Let’s step back for a moment and look at some definitions of intelligence given in the past by prominent researchers in the fields of Psychology and Learning. This quote is from Lloyd Humphreys in “The construct of general intelligence”:
“The resultant of the process of acquiring, storing in memory, retrieving, combining, comparing, and using in new contexts information and conceptual skills.”
And this is Reuven Feuerstein in “Dynamic assessments of cognitive modifiability”:
“The unique propensity of human beings to change or modify the structure of their cognitive functioning to adapt to the changing demands of a life situation.”
Let’s have a look at the last one from Sternberg & Salter in the “Handbook of Human Intelligence”:
“Goal-directed adaptive behavior.”
Wow! I really like the last one: very concise and to the point. Now, can you see what connects all these definitions? It’s the idea of adaptation, the ability to mold our cognitive system to deal with ever-changing, demanding circumstances.
Yet, very little of this can be found in the current Deep Learning literature, where much of the researchers’ focus has been devoted to solving more and more complicated problems, but in narrow and closed task domains.
Adaptation, while at the core of the definition of intelligence, has so far been left out of the game.
In the next paragraphs we will talk more about adaptation, and why it’s an essential quality of any AI system facing the real world rather than an unnatural benchmarking setting.
The second and most significant notion behind Continual Learning is scalability.
Scalability is one of the most important concepts in Computer Science and, once again, at the core of Intelligence.
As we’ll see in the next paragraphs, in CL this idea forces us to think about Intelligence and to develop algorithms which can deal with real-world computational and memory constraints from the start.
If we want machines which are endowed with versatility and common sense, we had better make sure they are scalable in terms of intelligence and stay sustainable in terms of resources (computation/memory).
Let’s now focus again on adaptation and why it is important for a Strong AI system. Nowadays, no matter whether you are working on Unsupervised or Reinforcement Learning, on Vision or NLP, you would pick a fixed, well-confined task and a function which can be trained to solve it.
This is amazing if you have an industrial/routine problem which involves (high-dimensional) perception data, but it suddenly becomes less interesting when you want to tackle open-world problems where things keep changing over time.
Unless you assume that the universe can be constrained to a finite number of variables you can process deterministically, there’s no escape: you need to keep adapting.
CL for continual improvements
The simplest application of CL is in scenarios where the data distributions stay the same but the data keeps coming. This is the classical scenario for an Incremental Learning system.
You can think of a lot of applications, like Recommendation or Anomaly Detection systems, where data keeps flowing and continually learning from it is really important to refine the prediction model and, in the end, improve the service offered.
If you think about it, very few problems (even very constrained ones, well defined a priori) cannot benefit from new data which arrives only later in time.
CL for ever-changing scenarios
However, nowadays, for most commercial DL applications it’s OK to re-train the model from scratch on the accumulated data. The game becomes really interesting when the scenario keeps changing over time. This is where Continual Learning really shines and other techniques are unable to solve the problem.
Most of the time it’s very hard to collect a large and representative dataset a priori, and such a dataset can even become wrong when the semantics of the data keep changing over time (i.e. we are actually solving a different task).
For example, think of a Reinforcement Learning system in a complex environment in which the reward keeps changing based on a hidden variable we do not control (welcome to real life, LoL).
Now, how can we ensure that our cognitive system can scale in terms of intelligence (while processing more and more data) while keeping computation and memory fixed, or at least sustainable?
The core trick is to process data once and then get rid of it. As in biological systems, perception data (given its high dimensionality and noise rate) would be impossible to store and process cumulatively on a long time scale!
So, you can imagine the AI system as an actual brain which filters perception data and retains only the most important information (Edge Computing people on fire here, LoL).
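As a rough illustration of "process data once and then get rid of it", the sketch below updates a tiny linear model on each incoming sample and never stores the sample. The stream, the model and all names (`OnlineLinearModel`, `sensor_stream`) are hypothetical stand-ins chosen for brevity, not anything used in the work discussed here.

```python
import random

class OnlineLinearModel:
    """Keeps only two parameters and a counter, however long the stream is."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr
        self.seen = 0

    def update(self, x, y):
        # One SGD step on the squared error of this single sample.
        err = (self.w * x + self.b) - y
        self.w -= self.lr * 2 * err * x
        self.b -= self.lr * 2 * err
        self.seen += 1

def sensor_stream(n, seed=0):
    """Synthetic stream (assumption): y = 3x + 1 plus a little Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.05)

model = OnlineLinearModel()
for x, y in sensor_stream(5000):
    model.update(x, y)  # the sample is used once and then dropped

print(model.seen, round(model.w, 2), round(model.b, 2))
```

The memory footprint of the learner is constant: what survives from the stream is only what has been distilled into the parameters.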
At this point, some of you may think: “Mmmh, Moore’s law is not over yet, and maybe it never will be... so, who cares about Continual Learning if computational power still doubles every couple of years?!”
Well, IDC published a white paper this year arguing that by 2025 (less than 8 years away) the data generation rate will grow from 16 ZB per year today (a zettabyte is a trillion gigabytes) to 160 ZB per year, and that we will be able to store only between 3% and 12% of it. You read that right. Data has to be processed on the fly or it will be lost forever, because storage technology can’t keep up with data production, which is the result of many exponentials combined together.
Hence, in the end, CL is not only about drastically reducing the computational burden (by avoiding re-training our model from scratch each time new data becomes available): it’s the only way of learning, since most of the time we won’t even be able to store the data!
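A quick back-of-the-envelope check of what the IDC figures quoted above imply. The only inputs are the numbers already given (16 ZB, 160 ZB, 3% to 12%); treating the horizon as exactly 8 years is a simplifying assumption, since the text only says "less than 8 years".

```python
generated_today_zb = 16.0   # ZB per year today, as quoted above
generated_2025_zb = 160.0   # ZB per year in 2025, as quoted above
years = 8                   # assumption: upper bound on the horizon
storable_low, storable_high = 0.03, 0.12

stored_low_zb = generated_2025_zb * storable_low    # about 4.8 ZB
stored_high_zb = generated_2025_zb * storable_high  # about 19.2 ZB
lost_low_zb = generated_2025_zb - stored_high_zb    # about 140.8 ZB
lost_high_zb = generated_2025_zb - stored_low_zb    # about 155.2 ZB

# Constant yearly growth factor that turns 16 ZB into 160 ZB in `years` years.
annual_growth = (generated_2025_zb / generated_today_zb) ** (1 / years)

print(f"storable per year: {stored_low_zb:.1f} to {stored_high_zb:.1f} ZB")
print(f"not storable:      {lost_low_zb:.1f} to {lost_high_zb:.1f} ZB")
print(f"implied growth:    about {100 * (annual_growth - 1):.0f}% per year")
```

In other words, under these figures somewhere between roughly 141 and 155 of the 160 ZB generated in a year could only ever be seen once.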
CL is ideal for Unsupervised Streaming Perception data
With high-dimensional streaming (real-time) data (~25% of the Global Datasphere in 2025) the problem appears even clearer, since it would be just impossible to keep the data in memory and re-train the entire model from scratch as soon as a new piece of data becomes available.
CL is ideal for streaming perception data since it embeds the idea of continually updating the model with the newly available data.
Of course, in a supervised setting it could be very hard to couple real-time perception data with labels (yet feasible with temporally coherent data, as Neurala showed), but what if we are in an Unsupervised/Reinforcement setting? Well, CL becomes the perfect buddy to pair with!
CL enables Multimodal-Multitask Learning
Now, what if we don’t have a single stream of perception data but many of them, coming from different sensors (with different input modalities), and at the same time we want to solve multiple tasks (welcome to the real world again)?
Łukasz Kaiser et al. from Google Brain this year came up with a single model which has been able to learn very different tasks in very different domains and with many input modalities (with a huge static training set).
However, this beautiful prediction model would be practically impossible to use in a real-world context with current DL techniques, since updating it would require re-training the entire model from scratch (good luck with that) as soon as a new piece of data is available from one of the many streaming input sources.
Yet, Multimodal and Multitask Learning are essential for strong Machine Intelligence since, in my view, they are the only way of endowing machines with common sense and basic, implicit “reasoning” skills.
Let’s take an example. In the picture below you can see a very famous and funny error made by an Automatic Image Captioning system based on DL techniques:
So, in this case the Multimodal RNN, trained on a set of <image, caption> pairs, has wrongly identified the toothbrush as a baseball bat. But why do we, as humans, laugh at this error? Because it’s obvious that a child of that age wouldn’t be able to hold a baseball bat and that, as a matter of perspective, a baseball bat can’t be that small.
All these inferences, which can be seen as a simple version of reasoning, are also what we call common sense. But what if the same system, besides captioning images, was also trained to estimate more precisely the age of a person in the picture and the weight/size of each object in a scene? Well, in that case, disambiguating the toothbrush from the baseball bat would have become much easier, right? Because the co-occurrences of a very young boy holding that weight are much less frequent!
Of course, for more complex tasks, multiple input modalities are also needed: for example, disambiguating types of birds based on visual but also auditory cues.
So, in the end, Multimodal/Multitask Learning may really be what makes our AI agents smarter, but only through Continual Learning, which essentially enables asynchronous, alternating training on such tasks, updating the model only on the real-time data available from one or more streaming sources at a particular moment!
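The asynchronous, alternating training described above can be sketched in a few lines. This is a deliberately tiny stand-in under stated assumptions: a linear model whose weights are shared by all tasks plus one task-specific bias per task, fed by two synthetic streams that interleave at random. The class, the stream names and the ground-truth values are all hypothetical.

```python
import random

class SharedModel:
    def __init__(self, n_features, lr=0.05):
        self.shared = [0.0] * n_features  # weights reused by every task
        self.heads = {}                   # one bias per task, created lazily
        self.lr = lr

    def predict(self, task, x):
        shared_part = sum(w * xi for w, xi in zip(self.shared, x))
        return shared_part + self.heads.get(task, 0.0)

    def update(self, task, x, y):
        # Touches the shared weights and only the head of the task that
        # produced this sample; the other heads are left as they are.
        err = self.predict(task, x) - y
        self.shared = [w - self.lr * 2 * err * xi for w, xi in zip(self.shared, x)]
        self.heads[task] = self.heads.get(task, 0.0) - self.lr * 2 * err

TRUE_SHARED = [1.5, -0.5]                  # synthetic ground truth (assumption)
TRUE_BIAS = {"task_a": 2.0, "task_b": -1.0}

def interleaved_streams(n, seed=0):
    """Yields (task, input, target) from whichever stream happens to fire."""
    rng = random.Random(seed)
    for _ in range(n):
        task = rng.choice(["task_a", "task_b"])
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = sum(w * xi for w, xi in zip(TRUE_SHARED, x)) + TRUE_BIAS[task]
        yield task, x, y

model = SharedModel(n_features=2)
for task, x, y in interleaved_streams(4000):
    model.update(task, x, y)  # train on whatever arrives, when it arrives

print([round(w, 3) for w in model.shared], model.heads)
```

Every sample, whichever task it belongs to, refines the shared weights, which is the (very simplified) sense in which tasks can help each other without ever being trained together on a static dataset.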
State-of-the-art & Future Challenges
While it has not exploded yet, Continual Learning has been getting more and more attention in the Deep Learning community, and in the last two years very good contributions have come out.
I plan to cover a good part of them in a series here on Medium on CL, but let’s summarize what we know so far.
- Counteracting Catastrophic Forgetting is possible in many ways, and not only through careful hyper-parameter tuning or regularization techniques.
- We have already proved that CL can be used in complex problems in Vision, both Supervised and with Reinforcement Learning.
- Accuracy results (on some toy benchmarks) are impressively high, almost in line with other techniques which have access to previously encountered data.
- It’s not completely clear how to evaluate CL techniques, and a more formalized framework may be needed.
- We have pretty much focused on solving a rigid sequence of (simple) tasks (on the scale of dozens) and not on streaming perception data, nor on multimodal/multitask problems.
- It’s not clear how to behave once the capacity of the model is saturated, nor how to selectively forget.
In our latest work, which has recently been accepted at CoRL 2017, we tackle the second problem, providing CORe50, a dataset and benchmark specifically designed for Continual Learning, where temporally coherent visual perception data becomes available in (a lot of) small batches.
Even though we are still very far from solving the problem, I’m confident that in a few years Deep Learning techniques will be able to smoothly and continually learn from streaming multimodal perception data, leading to a new generation of AI agents which will unlock thousands of new applications and services and open the path towards Strong Machine Intelligence.
I hope you enjoyed this post, and I’m looking forward to hearing from you in the comment section! Thank you for your attention, and remember to like it or share it! :-)
I’d also like to give special thanks to my fellow PhD student Francesco Gavazzo for the useful discussions and his great suggestions, which led to this story!
If you’d like to see more posts about my work on AI and Continual/Lifelong Deep Learning, follow me on Medium and on social media: LinkedIn, Twitter and Facebook!
If you want to get in touch or you just want to know more about me and my research, visit my website vincenzolomonaco.com or leave a comment below! :-)