The Learning Robot

Anton Chernov
Published in Apache MXNet · 8 min read · Mar 29, 2019


“You know, it’s moments like these when I realize what a superhero I am.”
Tony Stark [1]

Photo by Uwe Niklas for the Embedded World 2019 Daily Magazine

Until recently, developers in the embedded space didn't have much opportunity to be exposed to machine learning due to the enduring perception of strict hardware limitations. It seemed that power-hungry algorithms like deep neural networks required cloud-based high-end processors or beefy graphics cards, combined with a lot of prior scientific knowledge. The reality is that embedded devices already have all the necessary power, and most of the required tools are free and open source, with extensive tutorials and documentation.

Applications such as offline translation, text-to-speech and speech-to-text, computer vision, autonomous driving, and federated learning benefit greatly from having their workloads processed directly on the device for efficiency, resiliency, and privacy reasons.

Project Inception

First prototype

Since ancient times, as early as Homer, the Greeks imagined robotic servants, like the first robot to walk the earth: a bronze giant called Talos [3].

Because a deep learning workshop was postponed at the last minute, Thomas Delteil, an ML scientist from the AWS Deep Engine team, had a couple of days to kill in the Berlin office. With the help of a discarded Raspberry Pi and a SainSmart 6-Axis Desktop Robotic Arm with a camera mounted on it, we assembled a prototype that would:

  1. Detect faces in the camera frames using an MTCNN deep neural network
  2. Actuate the motors of the robot to center the camera on the highest-confidence detected face
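The centering step above can be sketched as a simple proportional controller; the gain value and the pan/tilt sign convention here are illustrative assumptions, and in the real prototype the face coordinates came from the MTCNN detector:

```python
# A minimal sketch of the face-centering loop. The proportional gain
# and the pan/tilt convention are assumptions for illustration; the
# face detector (MTCNN) would supply `face_center` per frame.

def centering_step(face_center, frame_size, gain=0.1):
    """Return (pan, tilt) adjustments that move the camera toward the face.

    face_center: (x, y) pixel coordinates of the highest-confidence face.
    frame_size:  (width, height) of the camera frame.
    """
    cx, cy = frame_size[0] / 2, frame_size[1] / 2
    dx = face_center[0] - cx  # positive: face is right of center
    dy = face_center[1] - cy  # positive: face is below center
    return (gain * dx, gain * dy)

# Face at (400, 240) in a 640x480 frame: pan right, no tilt needed.
pan, tilt = centering_step((400, 240), (640, 480))
```

Repeating this small correction on every frame nudges the camera until the face sits in the middle of the image.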

Not quite Talos, but still. On the Raspberry Pi the system ran no faster than 2 frames per second (FPS); an upgraded version with an Nvidia Jetson TX2 was presented at the Amazon Research Days demos in Seattle at a much snappier 20 FPS.

The idea of using robotic arms for cool deep learning demos was rolling!


After attending CppCon 2018, I got in touch with Lars Koenig and Michele Rossi from the Qt Company; we immediately clicked and decided to partner on a demo for the Embedded World exhibition. While exploring and drafting ideas, and reading Peter Norvig's "we don't need to duplicate humans" [4], we had a few amusing moments reviewing some YouTube videos in this regard.

The drawing board

We wanted to make the demonstration as interactive as possible. It should contain something new that would surprise and delight the users.

Robotic arms are quite common nowadays and are used in a variety of applications; however, they feel unnatural to control. Historically, all kinds of apparatus have been used to convey instructions to a robot: joysticks, controllers, other devices, and yes, even typing mysterious commands into a black screen is still used extensively. For this demonstration, we thought it would be a good idea to try using natural hand movements for control.

We shaped the idea into a challenge — people would compete for points gained by performing a task to climb up the leaderboard. The challenge? Use two arms to control two robotic arms and their grips in order to move books from the side of the stage into a basket.

Would you participate in the great robot arm challenge?


We would use a pose-estimation Deep Neural Network (DNN) to estimate wrist positions from the camera input and control the positions of the robots. In parallel, we would crop the palm areas and use another DNN to classify them as OPEN or CLOSED to control the grips.
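The first half of that pipeline boils down to mapping a wrist position in camera pixels onto the robot's reachable workspace. Here is a toy sketch of that mapping; the linear transform and the coordinate ranges are illustrative assumptions, not the calibration the demo actually used:

```python
# A toy sketch of mapping pose-estimation output to robot positions.
# In the real demo the wrist coordinates came from the pose DNN and
# the workspace bounds from the robot setup; the values here are
# made up for illustration.

def wrist_to_robot(wrist_px, frame_size, workspace):
    """Map a wrist position in camera pixels to robot workspace coords.

    wrist_px:   (x, y) wrist position in pixels.
    frame_size: (width, height) of the camera frame.
    workspace:  ((x_min, x_max), (y_min, y_max)) of the reachable area.
    """
    nx = wrist_px[0] / frame_size[0]  # normalize to [0, 1]
    ny = wrist_px[1] / frame_size[1]
    (x0, x1), (y0, y1) = workspace
    return (x0 + nx * (x1 - x0), y0 + ny * (y1 - y0))

# A wrist in the middle of a 640x480 frame maps to the workspace center.
pos = wrist_to_robot((320, 240), (640, 480), ((-0.2, 0.2), (0.0, 0.4)))
```

The grip side works the same way: the classifier's OPEN/CLOSED label is translated directly into an open-gripper or close-gripper command.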

For the deep neural networks we used Apache MXNet (incubating), a flexible and efficient framework for deep learning. It is cross-platform and available on cloud instances, desktops, and embedded devices. A big differentiator for MXNet is its support for a multitude of languages and its ability to seamlessly combine declarative and imperative programming to enjoy the best of both worlds [2].


In order to use MXNet on a Jetson or another embedded device like the Raspberry Pi, one needs to compile it first. This can be done via dockerized cross-compilation scripts or on the device itself.

The advantage of cross compilation is that any machine can be used for the build, for example an AWS c5d.18xlarge with 72 CPU cores, which makes compilation fast and fun. Alternatively, without cross compilation, one can build natively on ARM-based instances like the a1.4xlarge with 16 cores; that's how we initially compiled Qt for the Jetson. Compiling on the device itself is usually slow, for the obvious reason of limited resources, but it's a robust last resort if the other methods don't work.


For the pose estimation, after a few attempts with different neural networks, we finally went with an implementation based on the paper Simple Baselines for Human Pose Estimation and Tracking [5] (pre-trained models are available in the latest GluonCV release).

The team gets their pose estimated

For the hand pose, we used transfer learning and fine-tuned a pre-trained ResNet18 image classification network.

Big thanks to the Vancouver AWS team members for contributing more than 5000 palm pictures

During prototyping we spun up an AWS P3 instance with a powerful GPU so as not to be constrained on computing resources. Afterward we moved the models onto the Jetson device and started working them out to lose the "fat" gained on the EC2 instances.

The MXNet Model Server from AWS Labs wraps MXNet in a web endpoint running on the device, exposing a REST-like API for model calls. Both models, pose estimation and gesture classification, ran at the same time, in parallel, allowing the load to be balanced effectively.
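Issuing the two model calls concurrently can be sketched roughly as below. The stub functions stand in for the actual HTTP requests to the Model Server, and their endpoint roles and return shapes are assumptions for illustration:

```python
# A sketch of querying the two models in parallel, as the demo did
# against the MXNet Model Server's REST-like API. The stubs below
# stand in for the real HTTP calls; their payloads are made up.

from concurrent.futures import ThreadPoolExecutor

def call_pose_model(frame):
    # In the demo this would POST `frame` to the pose-estimation endpoint.
    return {"wrists": [(320, 240), (100, 200)]}

def call_grip_model(palm_crop):
    # ...and this would POST a palm crop to the gesture classifier.
    return {"state": "OPEN"}

def infer(frame, palm_crop):
    """Run both models concurrently and gather their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pose_future = pool.submit(call_pose_model, frame)
        grip_future = pool.submit(call_grip_model, palm_crop)
        return pose_future.result(), grip_future.result()

pose, grip = infer(frame=b"...", palm_crop=b"...")
```

Since the calls go to independent endpoints, neither model blocks the other, which keeps the end-to-end frame latency close to that of the slower model alone.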

The arms

The only thing our arm could hold at the beginning was a cap, and only because it wasn't moving.

The Niryo One arm that we chose, not only for its beauty, is a Kickstarter project that was released just recently. The Niryo SDK wasn't really set up to have two arms working in collaboration, so we had to take some additional steps and implement fixes for the movement controls.

A nice trick during development was to have EC2 instances loaded with the Niryo stack in simulation mode. That way we didn’t need the physical robots to make progress and were able to iterate fast and parallelise the work.

Once we were able to move the robots, I immediately wanted to try moving both arms at the same time. Unfortunately, the coordinates I gave were slightly under the table surface, into which the arms moved at high speed and with great enthusiasm. It probably would have been an entertaining element at the show as well, but we decided that we needed emergency stop buttons, just in case. Later we integrated them into the table at the show.

Emergency stop
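Besides a physical emergency stop, incidents like the under-the-table crash can be cheaply guarded against in software by clamping commanded targets to the workspace. A minimal sketch; the table height and safety margin are illustrative values, not the demo's actual calibration:

```python
# A software guard complementing the physical emergency stop: clamp
# every commanded target so the arm never dips below the table.
# TABLE_Z and MARGIN are assumed values for illustration.

TABLE_Z = 0.0    # table surface height in robot coordinates (assumed)
MARGIN = 0.02    # keep the gripper at least 2 cm above the surface

def safe_target(x, y, z):
    """Clamp a commanded target above the table surface."""
    return (x, y, max(z, TABLE_Z + MARGIN))

# A target 5 cm below the surface is lifted to the safety margin.
target = safe_target(0.1, 0.2, -0.05)
```

Validating targets before they reach the motion controller catches bad coordinates at the source, while the hardware button remains the last line of defense.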

Looking back, the biggest mistake, in my opinion, was the decision to run all the network setup through SSH tunnels. All communication was hardcoded to localhost. At the development stage it was fun to switch between backends by tunneling to a different endpoint. At the show it turned into a kafkaesque nightmare when the network went down and everything had to be reconfigured. Tunneling between nodes across multiple tmux sessions, like a blind mole in the caves of SSH, will need some time before it can be remembered fondly.

…and action!

In most cases the demo performed well. People were interested in seeing moving robots and participating in an interactive challenge. Sometimes, depending on the person and their background, the grip-state classifier was somewhat unstable. It is surprising how different a closed palm (a fist) can look from person to person. One can probably blame the bias of the few hundred palms that were painstakingly photographed in our office to collect the training data.
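One common way to steady a jittery per-frame classifier is to smooth its output over time, for example with a majority vote over the last few frames. A sketch of the idea; the window size is an illustrative choice, not something the demo necessarily used:

```python
# Majority-vote smoothing over recent frames: a single misclassified
# frame cannot flip the grip state. Window size is an assumed value.

from collections import Counter, deque

class SmoothedGrip:
    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, label):
        """Record the latest per-frame label and return the majority vote."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]

grip = SmoothedGrip(window=5)
labels = ["OPEN", "OPEN", "CLOSED", "OPEN", "OPEN"]
votes = [grip.update(label) for label in labels]
# A single noisy CLOSED frame does not flip the smoothed state.
```

The trade-off is a little extra latency before a genuine state change is reflected, which for a gripper is usually preferable to spurious open/close flapping.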

Since points needed to be entered manually, some people cheated their way up the leaderboard to unreachable heights when the technicians weren't paying close attention. But competitiveness is only human, and there were others who kept the fight at the top of the leaderboard dynamic.

The leaderboard showing ninjas got to the top somehow

All participants were presented with a 3D-printed MXNet key tag to remind them of what can help open previously closed doors.


I want to thank all the great people who participated in this demo and helped make it a success.

My closest collaborators:

  • The robotics centurion Pavel Danchenko without whom nothing would have moved at all (including me).
  • The ML magician Thomas Delteil who cooked all ML models, sometimes at night due to a 9-hour time difference.

My colleagues:

  • Jeremy Wyatt for providing a lot of useful hints.
  • Can Erdogan for sharing his robotics experience.
  • Per Goncalves da Silva and Stanislav Tsukrov for their great help setting up the development environment and helping out with the robot communication.
  • Gavin Bell and Tim Januschowski for their patience.
  • Steffen Rochel for his general support.
  • Ralf Herbrich for his wise guidance on organisational issues.
  • Cyrus Vahid for his enthusiasm and the initial idea.
  • Silke Goedereis and Robert Belle for their help with public relations.
  • Aaron Markham for editorial help.
  • Amanda Lowe for her help with the exhibition.

The Qt Company and in particular:

  • Lars Koenig and Michele Rossi for believing in the partnership from the beginning and taking on the whole organisation of the actual presentation.
  • Artem Sidyakin for implementing the UI and helping with technical Qt related questions.
  • Diana de Sousa for designing the beautiful screens.
  • Salla Venäläinen for great organizational help.
  • Santtu Ahonen for making the 3D model of the key-tags and his help on the show.

And others:

  • Edouard Renard for his great help on Niryo robots.
  • Aravinth Panch and MotionLab.Berlin for 3D printing the giveaway key-tags.
  • Big thanks to all palm holders who helped train the gripper-state classifier.

A diverse team engaged towards the same goal

What’s next?

With this demo we have shown that MXNet can be used effectively on embedded devices for ML-enabled applications. The use case, if developed further, could be expanded from robot teleoperation to reinforcement learning and imitation learning. I think we could teach robots effectively in a convenient, human-friendly way.

If you want to get involved, there are plenty of resources for MXNet and deep learning in general to start with:

  • The main website with links to tutorials, docs, installation instructions, etc.
  • Dive into Deep Learning: An interactive deep learning book with code, math, and discussions.
  • GluonCV: a Deep Learning Toolkit for Computer Vision.
  • GluonNLP: Natural Language Processing made easy.
  • Keras-MXNet: if you happen to build your models in Keras, you can try boosting performance by just switching the backend to MXNet.
  • Regular user group meetings over video call.
  • Collaborative meetups.

References:

[1] TV spot for Iron Man 3
[2] Deep Learning — The Straight Dope
[3] Mayor, A. (2018). Gods and Robots: Myths, Machines, and Ancient Dreams of Technology. Princeton University Press, New Jersey.
[4] Artificial Intelligence Pioneers: Peter Norvig, Google
[5] Xiao, B., Wu, H., & Wei, Y. (2018). Simple Baselines for Human Pose Estimation and Tracking. The European Conference on Computer Vision (ECCV), pp. 466–481.
