Continual Learning for Production Systems

The new “Agile” in the Machine Learning Era

Vincenzo Lomonaco
ContinualAI
7 min read · Aug 24, 2019


Fig. 1: The Machine Learning Lifecycle. Image © [1].

Introduction

The Agile software development approach, popularized by the 2001 Manifesto for Agile Software Development, advocates adaptive planning, evolutionary development, early delivery, and continual improvement as key properties for providing a rapid and flexible response to the increasingly fast-changing demands of the market and its requirements.

As the linear waterfall models, which originated in the manufacturing and construction industries, proved unable to provide a competitive edge in the increasingly complex and fast-changing software world, Agile and Scrum have become the de facto standards for software development today.

But what happens as we move towards Software 2.0? In his 2017 blog post, Andrej Karpathy foresaw a fundamental shift in the software development world:

I sometimes see people refer to neural networks as just “another tool in your machine learning toolbox”. They have some pros and cons, they work here or there, and sometimes you can use them to win Kaggle competitions. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0.

Machine learning is now slowly conquering and pervasively transforming every industry, with more and more “end-to-end” predictive modules integrated into almost every product and service pipeline out there (see a couple of examples here).

However, current machine learning models are inefficiently trained from scratch and statically deployed at every iteration cycle, making the process somewhat similar to the purely sequential waterfall model (rigid and difficult to adapt).

In essence, the parallel is based on the idea that data are the new software requirements (they keep arriving and changing over time) and the training process is the “design & development” phase, resulting in the software product (our prediction function).

What if we could translate what we have learned in the last 50 years of Software 1.0 to Software 2.0?

Continual Learning as the Agile of Machine Learning

It turns out that we can! In the last few years we have witnessed tremendous progress in a sub-field of Machine Learning called “Continual Learning”, whose basic idea is to continually train our prediction models, incrementally, as new data (requirements) become available, with benefits much like those of the Agile methodology (a minimal sketch of the two training regimes follows the list below):

  • Efficiency: since the process is continuous, we don’t need to start from scratch every time, wasting an enormous amount of computation re-learning things we have already learned.
  • Adaptiveness: because the learning process is fast, efficient and flexible, we can provide adaptation and customization capabilities at unprecedented levels.
  • Scalability: the computational and memory overhead stays bounded (and low) throughout the entire product/service life-cycle, allowing us to scale in terms of intelligence while processing more and more data.
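To make the parallel concrete, here is a minimal, library-agnostic sketch of the two training regimes. All names (`model_factory`, `fit`, `partial_fit`, `deploy`) are placeholders rather than a specific API; `partial_fit` simply echoes the incremental-learning interface found in libraries such as scikit-learn.

```python
# Minimal sketch: placeholder names, not a specific library API.

def retrain_and_redeploy(model_factory, data_stream, deploy):
    """Waterfall-style ML: re-learn everything at every iteration cycle."""
    accumulated = []
    for new_batch in data_stream:       # one cycle, e.g. one day of new data
        accumulated.extend(new_batch)   # keep *all* the data seen so far
        model = model_factory()         # start from scratch...
        model.fit(accumulated)          # ...and re-learn everything
        deploy(model)                   # cost grows with the whole history

def continual_update(model, data_stream, deploy):
    """Continual learning: per-cycle cost depends only on the new batch."""
    for new_batch in data_stream:
        model.partial_fit(new_batch)    # incremental update on the new data only
        deploy(model)                   # bounded compute and memory per cycle
```

Of course, naively updating on the new data alone can degrade what was learned before (catastrophic forgetting); continual learning strategies exist precisely to keep that degradation in check while preserving the bounded-cost loop above.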

While continual learning is often thought of as just a nice property to explore for future Artificial General Intelligence (AGI) agents, with its practical use limited to applications on embedded compute platforms (with no cloud), in this post I argue that in the next few years it will pervasively become a must-have property of every ML system and will be extensively used in production environments as well.

The importance of Continual Training

Sometimes you hear people at conferences saying: “Aaah, ML systems are incredibly inefficient w.r.t. the brain!”, “ML approaches are incredibly data-hungry!”, “ML algorithms are just for HPC environments!”. Of course they are. In the context of vision, for example, it has been estimated that a child takes up to 3-5 years to develop a reasonably good vision system, and then keeps refining and adapting it to the surrounding environment for the rest of their life. Why should it be any different for machine learning systems?

We expect an ML system to be trainable in a matter of minutes and to learn a perfect model of the external world. But we should rather aim for a continual learning system that can build its prediction capabilities on top of what it has previously learned, making up for its previous biases and shortcomings and efficiently adapting to novel environmental conditions as new data become available.

Much as in Software 1.0, where after more than 50 years of experience in “software engineering” we acknowledged the impossibility of building a complex system with a purely linear development model, I argue we will soon come to the same realization for Software 2.0.

It turns out that some folks in the industry are already starting to acknowledge this shift, for example Google Play and other Google services with TensorFlow Extended:

Fig. 2: TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.

Probably at Tesla to some extent:

Fig. 3: Building the Software 2.0 Stack, Andrej Karpathy (Tesla).

And at many other companies providing MLaaS platforms such as Amazon SageMaker and IBM Watson, or at startups such as Neurala and Cogitai.

Why are so many companies now starting to invest in continual learning? Well, because it’s so much cheaper! Let’s take a look at the practical example below.

A simple example: Cutting down your AWS Bill by 45% or more

So, for the sake of simplicity, let’s say that you run a web company and you want to recognize the content of the images posted on your platform by its users.

Unfortunately, you don’t have all the data in advance; instead, you receive small batches of new images with labels (user tags, for example) at the end of every day (the iteration cycle), and you still want to adapt your prediction models as fast as possible to improve the user experience and recommend the best content on the platform.

By today’s standards this would mean re-training the whole ML model from scratch on all the accumulated data and re-deploying it in place of the old one. However, this is incredibly wasteful in terms of computation and memory, as you learn the same things all over again.
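For the image-tagging scenario above, a hedged PyTorch-style sketch of the alternative, updating the already-deployed classifier on today’s labelled images only and with a fixed epoch budget, might look like the following. The function name, hyper-parameters and tensor shapes are illustrative, and plain fine-tuning of this kind is prone to catastrophic forgetting, which dedicated strategies such as the one discussed below are designed to mitigate.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def daily_update(model, new_images, new_labels, epochs=4, lr=1e-3):
    """Fine-tune the deployed model on today's batch only (illustrative sketch).

    new_images: float tensor of shape (N, C, H, W); new_labels: long tensor of shape (N,).
    """
    loader = DataLoader(TensorDataset(new_images, new_labels),
                        batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):                     # fixed training budget per cycle
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
    return model                                # re-deploy in place of the old one
```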

What if we could just update it with the new images available, as in the sketch above? In our latest paper, “Fine-grained Continual Learning” (take the following numbers with a grain of salt, as they are not all part of the paper but are partly projected for the sake of this post), we show that a rather simple CL strategy, AR1*, assessed in a continual learning scenario of 391 training batches, can:

  • reduce the computation needed by an average of ~45% across the life-cycle: it starts with a 0% advantage over the re-train & re-deploy approach and ends with ~92% less computation at the 391st training batch. This takes into account that the re-train & re-deploy approach needs an increasing number of epochs (from 4 to 50) for every batch, while AR1* stays fixed at 4 epochs.
  • reduce the working memory overhead by an average of ~49% across the life-cycle: since we don’t need to keep in memory all the training data encountered so far, but just the data contained in the current training batch, the memory overhead can be reduced from 0% at the first training batch to ~99% at the 391st training batch w.r.t. the re-train & re-deploy strategy.

for a trade-off of just ~20 percentage points of accuracy lost at the end of the life-cycle of our object recognition system.

Fig. 4: Continual learning accuracy over three scenarios of increasing complexity with 79, 196 and 391 training batches. Each experiment was averaged over 10 runs. Colored areas represent the standard deviation of each curve. The accuracy of the cumulative upper bound, not reported for visual convenience, is ∼85%. Results in tabular form and more information are available at https://vlomonaco.github.io/core50.

As continual learning strategies become better and better at limiting the performance gap w.r.t. the inefficient re-train & re-deploy strategy (a.k.a. “Cumulative”), we can still save more than ~45% of the computation and ~49% of the memory usage throughout the life-cycle.

Furthermore, it is worth noting that in this simple example we considered only a limited number of incremental training batches (391), whereas a continual learning strategy really shows its advantages when the number of batches is much higher. This means that the ~45% and ~49% figures could grow further with the length of the life-cycle: the longer the life-cycle, the better the efficiency improvements.

To sum up, CL systems are not yet ready to take over from ML systems as we know them today (re-train & re-deploy). However, I think many mixed approaches would be very helpful in a number of real-world applications, offering a good trade-off between adaptation speed/efficiency and accuracy.

As we look at the future of Software 2.0, I cannot see any other way to efficiently patch your system, improve its features and adapt to the requirements of an increasingly fast-changing global market.

ContinualAI is an official non-profit research organization and the largest open community on Continual Learning for AI. Visit our official website: continualai.org to learn more about the organization and join us as a member, or consider supporting us with a small donation! Follow us on Facebook, Twitter, Instagram, Medium, Github, YouTube!
