Tackling Reinforcement Learning with the Aurora OPU

A Use Case to Bring Down Your Energy Bill with Episodic Control Algorithms

Training a state-of-the-art artificial intelligence (AI) model can emit five times more CO2 than a car does over its lifetime [1]. Since reaching state-of-the-art results often dictates whether a paper is successful, researchers keep building ever more computationally hungry programs to reap higher scores. Just look at the consumer headlines about AI: beating humans at Go, at StarCraft II, at Dota 2. However polluting a car may be, it is at least useful. Can the same be said of the algorithms developed to play Go or strategy video games?

Against all odds, computers playing games might just be great allies in the fight against climate change: pun aside, games provide toy versions of complex real-life problems [2], and reinforcement learning (RL), the area of AI these examples mainly fall under, is a powerful tool for solving them.

Powerful indeed, but, as we said, hungry for a gargantuan amount of energy [3]. In this blog post we explore a way to make RL algorithms lighter and thriftier, taking a closer look at one algorithm in particular: model-free episodic control [4].

Hippocampal learning with ones and zeros

In RL, the environment is modeled as a Markov decision process: at each time step the agent observes a state, performs an action and receives a reward.
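
To make this concrete, here is a minimal sketch of that loop in Python using the classic OpenAI Gym API (assuming the Atari environments are installed); the random policy is just a placeholder for the episodic controller discussed below.

```python
import gym  # assumes gym and its Atari extras are installed

env = gym.make("MsPacman-v0")      # the game of Figure 3, chosen here for illustration
state = env.reset()                # observe the initial state
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()           # placeholder policy: act at random
    state, reward, done, _ = env.step(action)    # perform the action, receive a reward
    total_reward += reward

print(f"Episode return: {total_reward}")
```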

A key challenge in RL is to balance exploration and exploitation. Exploration is the strategy of blindly trying an action to see whether it might lead to better scores in the long run. Exploitation, instead, is the greedy reuse of actions that are already known to yield high rewards.

In model-free episodic control, exploitation means matching the currently observed state against a record of previous attempts and choosing the action that served us best in the past. We keep, for each encountered state-action pair, a value that reflects its potential to generate high rewards. To pick the right move, we look up the nearest neighbours of the observed state in our record and check which associated action proved the most rewarding.
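
Below is a minimal Python sketch of this logic, a simplification of the algorithm in [4] rather than the authors' implementation; the number of neighbours k, the exploration rate epsilon and the buffer layout are illustrative choices.

```python
import numpy as np

class EpisodicController:
    """Simplified MFEC-style controller: one (embedding, value) buffer per action."""

    def __init__(self, n_actions, k=11, epsilon=0.05):
        self.n_actions, self.k, self.epsilon = n_actions, k, epsilon
        self.keys = [[] for _ in range(n_actions)]    # stored state embeddings
        self.values = [[] for _ in range(n_actions)]  # returns observed for them

    def q_estimate(self, h, action):
        keys = np.asarray(self.keys[action])
        if len(keys) == 0:
            return np.inf                             # unexplored action: try it
        dists = np.linalg.norm(keys - h, axis=1)
        nearest = np.argsort(dists)[: self.k]         # k nearest recorded states
        return float(np.mean(np.asarray(self.values[action])[nearest]))

    def act(self, h):
        if np.random.rand() < self.epsilon:           # exploration
            return np.random.randint(self.n_actions)
        return int(np.argmax([self.q_estimate(h, a) for a in range(self.n_actions)]))

    def update(self, h, action, episodic_return):
        # the full algorithm keeps the max when a state is revisited and caps
        # the buffer size; here we simply append the observed return
        self.keys[action].append(h)
        self.values[action].append(episodic_return)
```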

The cumbersome essence of high dimensions, evaporated…

Weighing the differences between these observations is not the easiest thing to do. Let us forget the toy problem for a moment and think of Elon Musk's promise: imagine we wanted to run that same algorithm on the data provided by the 7 cameras, 12 ultrasonic sensors and the radar of a Tesla. Two issues arise: memory and computation requirements go through the roof!

There is a trick that solves both: we can embed the states into a low-dimensional space, approximately preserving their geometric properties, using a random map [5].

Figure 1. Random projection of 3-dimensional point clusters onto a plane (courtesy of I. Poli).
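
As an illustration, here is a small NumPy sketch of such a random projection on fake data; the dimensions (a flattened grayscale Atari frame projected to 256 features) are arbitrary choices meant only to show that pairwise distances are roughly preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 33_600, 256, 5                 # e.g. 210x160 grayscale frames -> 256 features
X = rng.random((n, d))                   # a few fake high-dimensional observations

# Gaussian random projection: no training, distances approximately preserved
# (Johnson-Lindenstrauss lemma [5])
P = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
Y = X @ P

print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))  # similar values
```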

Moreover, random projections make the algorithm more independent of fallible human intuition. We end up with a more universal method.

…under a beam of light

A LightOn Optical Processing Unit (OPU) can extract a custom number of features from very high-dimensional vectors at a constant, low power consumption. It thus fits well into the model-free episodic control algorithm as a surrogate for the traditional GPU-based linear random projection, whose cost soars with the size of the observations.
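
On real hardware this happens in a single optical pass, driven through LightOn's lightonml library on the LightOn Cloud. As a rough mental model, the OPU applies a fixed random matrix R to a binary input x and measures intensities, returning |Rx|²; the NumPy stand-in below only mimics that transform in software, so the function name and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_opu(x_bits, n_components=256):
    """Software stand-in for the OPU's random feature map, y = |R x|^2.

    On the device, R is fixed by the optics and the projection runs at a
    constant, low power draw whatever the input dimension; here we simply
    draw a complex Gaussian R on the CPU (in practice it would be drawn
    once and reused for every input).
    """
    d = x_bits.shape[1]
    R = rng.normal(size=(d, n_components)) + 1j * rng.normal(size=(d, n_components))
    return np.abs(x_bits @ R) ** 2

# the OPU takes binary inputs, so observations are thresholded first
frames = rng.random((4, 33_600))                                  # fake grayscale frames
bits = (frames > frames.mean(axis=1, keepdims=True)).astype(np.uint8)
features = simulated_opu(bits)
print(features.shape)                                             # (4, 256)
```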

Figure 2. Functional block diagram of MFEC+OPU.
This is the result of a short run using a latest-generation Aurora OPU on the LightOn Cloud.

Relationship with convolutions

A typical way of extracting features from images is with convolutional neural networks (CNNs). Let's say we use only the "convolutional" part of the CNN, so as not to spend time training the classification layers of such a network. A simple architecture yields similar results. However, processing the observations with a CNN consumes far more energy per input image than the OPU does [6].
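
For concreteness, here is a hedged PyTorch sketch of what "keeping only the convolutional part" can look like: a small, frozen, untrained stack used purely as a fixed feature extractor. The architecture and input size are invented for illustration and are not the ones measured in [6] or used in our experiments.

```python
import torch
import torch.nn as nn

# a small untrained convolutional stack, used as a fixed feature extractor;
# no classification head is ever created, let alone trained
conv_features = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
)
conv_features.eval()
for p in conv_features.parameters():
    p.requires_grad_(False)                 # frozen: no backpropagation ever runs

with torch.no_grad():
    frame = torch.rand(1, 1, 84, 84)        # one preprocessed grayscale game frame
    h = conv_features(frame)
print(h.shape)                              # torch.Size([1, 2592])
```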

We can improve the algorithm by combining both approaches: a rough downscaling of the input images to smooth out useless details, followed by a random projection, yields the best results with little to no overhead. Conversely, we can make case-specific enhancements to the preprocessing, building a finer feature distiller on top of (actually, before) the generic framework that is the random projection.
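
As a sketch of the first option, here is what such a preprocessing pipeline could look like on 210x160 RGB Atari frames; the block size and output dimension are arbitrary values chosen for illustration, not the settings of our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

BLOCK, OUT_DIM = 4, 64
SMALL_SIZE = (210 // BLOCK) * (160 // BLOCK)       # pixels left after downscaling
# fixed random projection, drawn once and reused for every frame
P = rng.normal(scale=1.0 / np.sqrt(OUT_DIM), size=(SMALL_SIZE, OUT_DIM))

def preprocess(frame):
    """Crude pipeline: grayscale -> block-average downscaling -> random projection."""
    gray = frame.mean(axis=-1)                      # drop colour, smooth fine detail
    h, w = (210 // BLOCK) * BLOCK, (160 // BLOCK) * BLOCK
    small = gray[:h, :w].reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK).mean(axis=(1, 3))
    return small.ravel() @ P                        # low-dimensional state embedding

embedding = preprocess(rng.random((210, 160, 3)))   # a fake RGB Atari frame
print(embedding.shape)                               # (64,)
```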

Conclusion

Model-free episodic control is no panacea for RL. It is actually a rather modest algorithm, but a good starting point for understanding the success of neuro-inspired episodic control methods [7], which have been shown to outperform deep RL algorithms at least in the first stage of learning (see Figure 3). We therefore have the opportunity to devise robust AIs, possibly with the aid of imitation learning to combine episodic control with another agent, leveraging the properties of random projections and light-based computing to address one of deep learning's major flaws: using a sample-efficient algorithm early on, such as episodic control, reduces the data hunger often associated with pure RL techniques, which in turn brings down the electricity bill.

Figure 3. Learning curves of Ms. Pacman for different RL algorithms (taken from [7]).

Have a look at the GitHub repository to see the implementation details and reproduce our results. LightOn supports research through the LightOn Cloud for Research program, which gives you free credits to speed up your computations. Apply here!

About Us

LightOn is a hardware company that develops new optical processors that considerably speed up Machine Learning computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud! 🌈

Follow us on Twitter at @LightOnIO, subscribe to our newsletter and/or register to our workshop series. We live stream, so you can join from anywhere. 🌍

The author

Martin Graive, Intern in the Machine Learning Team at LightOn AI Research from July to December 2019.

Acknowledgements

Thanks to Victoire Louis and Iacopo Poli for reviewing this blog post.

References

[1] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and policy considerations for deep learning in NLP." arXiv preprint arXiv:1906.02243 (2019).

[2] Risi, Sebastian, and Mike Preuss. "From Chess and Atari to StarCraft and Beyond: How Game AI is Driving the World of AI." KI - Künstliche Intelligenz 34.1 (2020): 7–17.

[3] Schwartz, R., J. Dodge, and N. A. Smith. "Green AI." arXiv preprint arXiv:1907.10597 (2019).

[4] Blundell, Charles, et al. "Model-free episodic control." arXiv preprint arXiv:1606.04460 (2016).

[5] Johnson, William B., and Joram Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space." Contemporary Mathematics 26 (1984): 189–206.

[6] Lacoste, Alexandre, et al. "Quantifying the Carbon Emissions of Machine Learning." arXiv preprint arXiv:1910.09700 (2019).

[7] Pritzel, Alexander, et al. "Neural episodic control." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

