Latent Replay for Real-Time Continual Learning at the Edge

Bringing State-of-the-art Continual Learning to Embedded Devices

Vincenzo Lomonaco
ContinualAI
7 min read · Mar 5, 2020


Fig.1: iCub Robotic Platforms and the CORe Android Application.

Artificial Intelligence (AI) technologies have the potential to radically transform the way we experience the world and connect with other people by making the objects around us “smarter”: gathering and processing information, making decisions, adapting to changes, and interacting with humans and other objects, rather than being simply hard-programmed in advance to execute very specific functions.

However, most “smart” devices today operate as mere gateways to remote computing and AI infrastructures; this limits usability and response speed, due to the latency of remote communication, and also raises potential privacy concerns.

Indeed, one of the key challenges faced by modern AI technologies such as gradient-based deep learning systems is their well-recognized limitation when it comes to continual learning, i.e. the ability to learn incrementally and online from non-stationary data streams, changing environments and tasks, together with their limited ability to generalize out-of-distribution and to perform transfer and meta-learning.

While some steps in this direction have recently been taken, training on the edge often remains unfeasible and is considered impractical. In fact, given the high demands in terms of memory and computation, the de facto industry standard today is to train models on powerful multi-GPU servers and deploy them to edge devices for inference only.

In our recent paper “Latent Replay for Real-Time Continual Learning” we focus on real-time CL and show that continual training with small non-i.i.d. batches (~300 images from a short video) is compatible with the limited computing power of CPU-only embedded devices and robotic platforms.

Native Rehearsal

Let’s start with the basics. In [1] it was shown that a very simple rehearsal implementation (hereafter denoted as “native rehearsal”), where for every training batch a random subset of the batch patterns is added to the external storage to replace an (equally random) subset of the external memory, is no less effective than more sophisticated approaches such as iCaRL.
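
As a rough illustration, here is a minimal Python sketch of such an update rule (note that `mem_size` and `n_replace` are free parameters of this sketch, not values prescribed in [1]):

```python
import random

def update_memory(memory, batch, mem_size, n_replace):
    """Native rehearsal: a random subset of the current training batch
    replaces an equally random subset of the external memory."""
    for pattern in random.sample(batch, min(n_replace, len(batch))):
        if len(memory) < mem_size:
            memory.append(pattern)  # memory not yet full: just add
        else:
            # overwrite a randomly chosen slot of the full memory
            memory[random.randrange(mem_size)] = pattern
    return memory
```

Calling `update_memory(memory, batch, 1500, h)` after every training batch keeps the memory bounded at 1500 patterns (the size used in the experiments below) while gradually rotating its content.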

Therefore, in our work, we started by extending CWR* and AR1* [2] with this trivial native rehearsal approach. In Fig. 2 you can see the learning trends of CWR* and AR1* for a MobileNetV1 trained with and without rehearsal on CORe50 NICv2–391.

It is evident that even a moderate external memory (~1.27% of the total training set) is very effective at improving the accuracy of both approaches and at reducing the gap with the cumulative upper bound (obtained by training the model on the entire training set), which for this model is ~85%.

Fig. 2: Comparison of CWR* and AR1* on CORe50 NICv2–391 with and without rehearsal (external memory size of 1500). Each experiment was averaged over 5 runs with different batch orderings: colored areas represent the standard deviation of each curve. The black dashed line denotes the reference accuracy of the cumulative upper bound.

Latent Replay

Even though in the CORe50 setting the extra memory required by native rehearsal is not an issue (we store only 30 patterns for each of the 50 classes), a constant refresh significantly increases the required amount of computation because of the extra forward and backward steps, making the resulting training too resource-demanding for real-time applications.

In deep neural networks the layers close to the input (often denoted as representation layers) usually perform low-level feature extraction and, after a proper pre-training on a large dataset (e.g., ImageNet), their weights are quite stable and reusable across applications. On the other hand, higher layers tend to extract class-specific discriminant features and their tuning is often important to maximize accuracy.

Fig.3: Architectural diagram of Latent Replay.

With latent replay (see Fig. 3) we denote an approach where, instead of maintaining copies of input patterns in the external memory in the form of raw data, we store the activation volumes at a given layer (denoted as the “latent replay layer”).

To keep the representation stable and the stored activations valid, we propose to slow down learning in all the layers below the latent replay layer and to leave the layers above free to learn at full pace.
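
In PyTorch-style pseudocode (a minimal sketch with a toy two-way split of the network around the latent replay layer; the actual implementation is based on Caffe and a MobileNetV1), this slow-down simply amounts to assigning different learning rates to the two groups of parameters:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the parts of the network below and above the
# latent replay layer.
lower_layers = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1))
upper_layers = nn.Sequential(nn.Flatten(), nn.Linear(16, 10))

# A much smaller learning rate below the replay layer keeps the stored
# activations (approximately) valid, while the layers above learn at
# full pace. Setting the lower rate to 0 freezes the representation.
optimizer = torch.optim.SGD([
    {"params": lower_layers.parameters(), "lr": 1e-4},
    {"params": upper_layers.parameters(), "lr": 1e-2},
], momentum=0.9)
```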

In the limit case where the lower layers are completely frozen (i.e., their learning rate is set to 0), latent replay is functionally equivalent to rehearsal from the input, but achieves a computational and storage saving (69x and 394x w.r.t. the cumulative strategy, respectively) thanks to the smaller fraction of patterns that need to flow forward and backward across the entire network and to the typical information compression that networks perform at higher layers.

In the general case where the representation layers are not completely frozen, the activations stored in the external memory suffer from an aging effect (i.e., as time passes they tend to deviate more and more from the activations that the same pattern would produce if fed forward from the input layer). However, if the training of these layers is sufficiently slow, the aging effect is not disruptive, since the external memory has enough time to be rejuvenated with fresh patterns.

When latent replay is implemented with mini-batch SGD training:

  • In the forward step, a concatenation is performed at the replay layer (on the mini-batch dimension) to join patterns coming from the input layer with activations coming from the external storage;
  • The backward step is stopped just before the replay layer for the replay patterns, as sketched below.
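
Continuing the toy split from the previous sketch, one such step might look as follows (`latent_mem` and `y_mem` are hypothetical tensors holding a sample of stored activations and their labels; again an illustration, not the paper’s Caffe code):

```python
def latent_replay_step(x_new, y_new, latent_mem, y_mem, criterion, optimizer):
    """One mini-batch step of latent replay training."""
    # Patterns from the current batch flow through the whole network...
    z_new = lower_layers(x_new)
    # ...and are concatenated, at the replay layer, with the stored
    # activations (on the mini-batch dimension). detach() ensures the
    # backward pass stops just before the replay layer for them.
    z = torch.cat([z_new, latent_mem.detach()], dim=0)
    y = torch.cat([y_new, y_mem], dim=0)

    loss = criterion(upper_layers(z), y)
    optimizer.zero_grad()
    loss.backward()  # replay patterns send no gradient below the replay layer
    optimizer.step()
    return loss.item()
```
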
Fig.4: Comparison with other CL strategies: accuracy results on the CORe50 NICv2–391 benchmark of CWR*, AR1*, DSLDA, iCaRL, and AR1*free (for different choices of the latent replay layer, reported in parentheses). More details about these experiments are reported in [2] and the code to reproduce them is available here.

Real-World Deployment on Embedded Devices

Now, let’s see where we can go with this simple yet effective idea. Below we report a demo of CORe: an Android app we developed for common Android smartphones. To the best of our knowledge, this is the first time a deep continual learning strategy has been shown to be effective on a smartphone device.

The app will be open-sourced upon publication of our paper, but the APK can already be downloaded here.

CORe App demo. The camera field of view is partially grayed to highlight the central area where the image is cropped, resized to 128x128 and passed to the CNN. The top three categories are returned for each image (closed set classification) and a green frame is placed around the icon of the most likely class. A training session is triggered by tapping the icon of one of the 10 existing classes or one of the (initially) five empty classes.

The app comes pre-trained with 10 classes (corresponding to the 10 CORe50 categories) and allows the user to:

  • continually train existing classes (by learning new objects/poses)
  • learn up to 5 brand new classes.

When the app is started, it runs in inference mode and classifies the framed objects at about 5 fps (CPU-only, with no hardware acceleration). When learning is triggered, a short 20-second video is acquired (at 5 fps) and the resulting 100 frames are used for continual training, which completes in less than 1 second after the end of the acquisition.

Behind the scenes, the app runs a customized version of Caffe cross-compiled for Android, using the same MobileNetV1 architecture introduced before, here initialized to work with 15 classes. Latent replay in this case is implemented at the “pool6” layer with an external memory of 500 patterns.

Low-level code is written in C++, while the app interface is in Java. A training session consists of 8 epochs of 5 iterations each, with a mini-batch size of 120 patterns: each mini-batch includes 20 original frames (from the current batch) and 100 replay patterns.

To speed up training, during the video acquisition a second thread immediately forwards the available frames through the CNN and caches their activations at the latent replay layer so that, when the acquisition is concluded, we can directly train the class-specific discriminative layers. Further details and precise timings of the different phases are provided here.
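
Putting the pieces together, here is a rough sketch of this two-phase pipeline, reusing the toy `lower_layers`/`upper_layers` split from before (a plain function stands in for the caching thread, `replay_mem` is assumed to be a list of (activation, label) pairs, and none of this is the app’s actual C++/Caffe code):

```python
import random
import torch

@torch.no_grad()
def cache_latents(frames):
    """Phase 1 (during acquisition): forward each incoming frame
    through the lower layers and cache its replay-layer activation."""
    return [lower_layers(f.unsqueeze(0)).squeeze(0) for f in frames]

def train_session(cached, labels, replay_mem, optimizer, criterion):
    """Phase 2 (after acquisition): 8 epochs x 5 iterations, each on a
    120-pattern mini-batch (20 cached frames + 100 replay patterns);
    only the layers above the replay layer receive gradients."""
    for _ in range(8 * 5):
        new_i = random.sample(range(len(cached)), 20)
        mem_i = random.sample(range(len(replay_mem)), 100)
        z = torch.stack([cached[i] for i in new_i] +
                        [replay_mem[i][0] for i in mem_i])
        y = torch.tensor([labels[i] for i in new_i] +
                         [replay_mem[i][1] for i in mem_i])
        loss = criterion(upper_layers(z), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```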

Take-Home Message

Latent replay is an efficient technique to continually learn new classes and new instances of known classes, even from small and non-i.i.d. batches.

State-of-the-art CL approaches such as AR1*, extended with latent replay, are able to learn efficiently while achieving an accuracy not far from the cumulative upper bound (within about 5% in some cases).

The computation-storage-accuracy trade-off can be tuned (through the choice of the latent replay layer) according to both the target application and the available resources, so that even edge devices with no GPUs can learn continually from short videos, as we demonstrated by developing an Android application.

We hope this work will spark new interest in continual learning at the edge: long thought impractical for deep learning models, we believe it may constitute the backbone of pervasive and distributed AI computing in the future.

Vincenzo Lomonaco, the author of this story.

If you’d like to see more posts on AI and Continual Learning, follow me on Medium or join ContinualAI.org: an open community of more than 700 researchers working together on this fascinating topic! (Join us on Slack today! 😄 🍻)
If you want to get in touch, visit my website vincenzolomonaco.com! 😃
