The evolution of intelligence in robots: Part 2

Simon Kalouche
Oct 22 · 14 min read

Part 1 of this post outlined some of the challenges we face in bringing intelligent machines to the real world. At the heart of these challenges is the cost-to-value ratio robots struggle with. This post builds on Part 1 by surveying recent machine learning, robotics, and computer vision research that makes robots more versatile in the face of the diverse real world.

Supervised Learning

The most successful type of deep learning in industry is supervised learning, but using this type of learning comes at a cost. The crux of supervised learning is the need for labelled data, which is usually laborious and expensive to collect. Services like Scale and Amazon’s Mechanical Turk exist to make collecting annotations a bit easier by offering outsourced and crowdsourced labor. However, determining ways to have the data labelled is only the first step.

There are many design choices to be made when determining the type of label for any task. Some options include motion trajectories, key poses, key points, and object poses to name a few.


Joint vs Cartesian Space

If you decide to use motion trajectories, should the trajectories be labelled in joint space (the position of the robot’s joints) or Cartesian space (the position of the robot’s gripper or tool)?

Joint space enables end-to-end learning, which can exploit redundant degrees of freedom to minimize energy expenditure or avoid collisions. However, joint space learning is more data hungry because the neural networks are forced to learn unnecessary things like inverse kinematics and low-level motor controllers, both of which already have conventional solutions that run fast and work well without deep learning. Joint space learning is also robot specific since each robot has a different kinematic configuration of joints. Thus, joint space policies are not as easily transferable across platforms as Cartesian-space policies.
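To make the "conventional solutions" point concrete, here is a minimal sketch of closed-form forward and inverse kinematics for a two-link planar arm. The arm, its link lengths, and the function names are all hypothetical and chosen only for illustration; real manipulators have more joints and usually rely on a dedicated kinematics library.

```python
import math

# Hypothetical link lengths (meters) for an illustrative two-link planar arm.
LINK_1, LINK_2 = 0.4, 0.3

def forward_kinematics(q1, q2):
    """Joint space -> Cartesian space: joint angles (rad) to gripper (x, y)."""
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y

def inverse_kinematics(x, y):
    """Cartesian space -> joint space, returning one of the two elbow solutions."""
    d2 = x * x + y * y
    cos_q2 = (d2 - LINK_1 ** 2 - LINK_2 ** 2) / (2 * LINK_1 * LINK_2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target is outside the arm's reachable workspace")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(LINK_2 * math.sin(q2),
                                       LINK_1 + LINK_2 * math.cos(q2))
    return q1, q2

# A Cartesian-space policy only has to output (x, y); this solver does the rest.
q1, q2 = inverse_kinematics(0.5, 0.2)
x_check, y_check = forward_kinematics(q1, q2)
```

A policy that outputs gripper positions could be paired with a different solver on a different arm, which is one way to read the transferability argument above.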


Regression vs Classification

The next step is deciding how to frame the learning problem. When our neural network makes predictions on how to act, it can regress to poses and velocities, or it can classify sampled actions as good or bad. Regression is continuous in action space allowing for more precise motion, but it suffers from learning the average of multiple valid actions, which may actually be an invalid action. Meanwhile, classification can handle multiple correct action modalities but is discretized and requires sampling and evaluating many possible actions which increases computation time.
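The averaging failure is easy to reproduce in a toy setting. The sketch below assumes a made-up one-dimensional steering task in which demonstrations swerve either left or right around an obstacle; the validity check and the scoring function are stand-ins for a real environment and a learned classifier.

```python
# Toy 1-D steering task: an obstacle sits straight ahead, so demonstrations
# steer either hard left (-1.0) or hard right (+1.0). Both modes are valid.
demonstrations = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0]

def is_valid(action):
    """Hypothetical check: steering close to 0 drives into the obstacle."""
    return abs(action) >= 0.5

def regress_action(demos):
    """The constant prediction that minimizes mean squared error is the mean."""
    return sum(demos) / len(demos)

def classify_action(candidates, demos):
    """Score every sampled candidate (here by closeness to a demonstrated
    mode, standing in for a learned classifier) and keep the best one."""
    def score(action):
        return -min(abs(action - d) for d in demos)
    return max(candidates, key=score)

candidates = [i / 10 for i in range(-10, 11)]  # discretized action samples
regressed = regress_action(demonstrations)      # 0.0, the invalid average
classified = classify_action(candidates, demonstrations)
```

The regression output lands between the two modes and hits the obstacle, while the classification route picks one valid mode at the cost of scoring all 21 sampled candidates.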

End-to-end vs Pipelined Learning

As part of framing the learning problem we can choose to use end-to-end models where a convolutional neural network (CNN) ingests pixels from an image and — like a true black box — returns low-level joint commands, thus managing the entire robotics pipeline. Alternatively, we can use a pipelined approach where the robot has distinct modules for each processing step: sensor pre-processing, sensor fusion, perception, policy, motion planning, kinematics solvers and motor control. In the pipelined approach a CNN can be isolated to perform the perception part while other modules perform the remaining computations.

While neither is right or wrong, the real-world successes in robotics all use the pipelined approach today.

This is because the pipelined approach simplifies the learning problem which makes the total system more robust. It minimizes the chance of encountering weird bugs arising from a lack of transparency into how the black-box neural network makes its decisions.

The argument for end-to-end learning is that it can be more generalizable in its ability to use the same software and algorithms across many different scenarios and tasks. An end-to-end model might work well for low-cost, low performance robots where sensors are noisy, actuators aren’t precise and computational power is limited.

With the pipelined approach, we can train a robot to perform robustly on a single challenging task using relatively little data and a decent amount of custom engineering. Alternatively, with the end-to-end approach, we can create a more generalizable model that can learn many tasks from the same software but requires significantly more data.

Below are some research directions which could eventually lead to neural networks performing increasingly more of the robotics stack.


Given the chicken and egg problem robots face, it is unclear if collecting the ImageNet analog for robots today is tractable. However, by leveraging modern GPU-accelerated and parallelized simulators, we can generate ‘years’ of synthetic robot experience in a matter of hours and at very low cost. In addition to being fast and cheap, simulators can offer the advantage of automatic data annotation eliminating the need for manual human labelling.

The big challenge with synthetic data is that simulators (Unity, Gazebo, MuJoCo, PyBullet, V-REP) and their physics engines (PhysX, Bullet, etc.) do not accurately represent the visual and physical complexities of the real world. Simulators struggle to precisely model things like friction, stiction, gear train dynamics, contact models, and the behavior of deformable materials under loads.

Similarly, synthetic visual data captured by rendering virtual scenes belongs to a different data distribution (i.e. it looks different) than images captured from real cameras. This distribution mismatch complicates simulation-to-real (sim2real) policy transfer, where models trained in simulation fail to work equally well on robots in the real world.

The robot model shown below is trained to perform acrobatic maneuvers in simulation. The learned behavior hasn’t yet been shown to transfer to a real robot.


While the recent Boston Dynamics videos might suggest otherwise, their robots accomplish these maneuvers using well-tuned controllers built on top of physics models in calibrated environments — not machine learning. Boston Dynamics has proven that mastering traditional control, state estimation, inverse dynamics and trajectory optimization (MPC, LQR, etc.) in a pipelined approach still trumps machine learning in robotic control … for now!

To summarize, synthetic data collected from virtual sensors in a simulation doesn’t look exactly like data collected from real sensors. This discrepancy leads to a fundamental challenge known as the ‘transfer problem’ or ‘reality gap’ where policies learned on simulated data are overfitting to the parameters of the simulator rather than generalizing to the real world.

Domain Randomization

One technique called domain randomization can help bridge the ‘reality gap’. By training a policy on millions of simulated episodes and adding randomized jitter or noise to the parameters of the simulated environment, we can prevent overfitting to any one set of visual and dynamics parameters. This forces the network to focus on learning the details important to completing the task rather than learning to cheat the simplified rules of the simulator.

The goal during sim2real transfer is to run a policy trained in simulation in the real-world with the expectation that, to the trained network, the real-world will look like just another instantiation of one of the many randomly simulated environments on which it has been trained.

This training method allows the network to perform well on both simulated and real observations while only being trained on the former.
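As a rough illustration of domain randomization, the sketch below draws a fresh set of simulator parameters for every training episode. The parameter names and ranges are hypothetical placeholders, not values from any particular simulator or paper.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float
    object_mass_kg: float
    light_intensity: float
    camera_jitter_deg: float

# Hypothetical (low, high) bounds for each randomized parameter.
RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "light_intensity": (0.3, 1.0),
    "camera_jitter_deg": (-5.0, 5.0),
}

def sample_params(rng):
    """Draw one randomized environment configuration."""
    return SimParams(**{name: rng.uniform(low, high)
                        for name, (low, high) in RANGES.items()})

def randomized_episodes(num_episodes, seed=0):
    """Yield a new parameter set per episode so that no single simulator
    configuration is seen often enough to overfit to."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        yield sample_params(rng)

for params in randomized_episodes(3):
    print(params)  # in practice: reset the simulator with these values
```

The hope is that the real world ends up looking like one more draw from these ranges.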

Building on vanilla domain randomization, a paper from OpenAI introduced Automatic Domain Randomization (ADR), which has two key benefits for learning. Instead of manually defining the bounds of the randomization for each jittered parameter, ADR starts with a narrow distribution of parameter values and automatically widens that distribution only when the learned model can perform the task well on its current distribution of randomized simulated data. This 1) forces the model to gradually learn complex tasks across a wide distribution (easier than non-gradual learning) and 2) doesn’t require engineers to manually tune the bounds of the randomizations, which is unintuitive and unscalable.

The authors of GraspGAN take a different approach to crossing the reality gap. Instead of training on psychedelic-looking, randomly simulated data, they train a GAN. The GAN’s generator, a neural network, ingests simple images from a simulator and learns to generate images that the discriminator, another neural network, cannot distinguish from real camera images. Using this method, they can create a large synthetic dataset that ‘looks’ more real. The visuomotor policy can then be trained on this real-looking (but still fake) data.
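The widening rule can be sketched in a few lines. This is a deliberately simplified version that only widens and uses a single parameter; the threshold, step size, and the list of success rates are made-up values, and the published method has more moving parts than shown here.

```python
def adr_step(bounds, success_rate, threshold=0.8, step=0.1):
    """Widen the (low, high) randomization range by `step` on each side,
    but only if the policy already succeeds often enough on the current range."""
    low, high = bounds
    if success_rate >= threshold:
        return (low - step, high + step)
    return bounds

# Start with a narrow range around a nominal friction coefficient of 1.0.
bounds = (0.95, 1.05)
history = [bounds]

# Hypothetical success rates measured after successive rounds of training.
for success_rate in [0.9, 0.5, 0.85, 0.95]:
    bounds = adr_step(bounds, success_rate)
    history.append(bounds)
```

In this run the range stays put after the 0.5 round and widens after the other three, so the difficulty grows only as fast as the policy can keep up.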

The left image shows the simulated data from a virtual camera. The middle image shows the output of GraspGAN which takes as input the simulated image and generates an image that looks like it was taken from a real camera. [source]

More recently RCAN, or Randomized-to-Canonical Adaptation Network, achieves the same 86% grasping success on unseen objects as the original QT-Opt model but does so using less than 1% of the original 580,000 training examples collected on a real robot. The RCAN method crosses the reality gap by using a pixel-to-pixel conditional GAN (cGAN) which learns to transform domain-randomized synthetic images into their equivalent non-randomized, canonical versions. This in turn allows real images to also be translated into simplified canonical images. The grasping policy, QT-Opt, is then fed these simplified canonical-style images and does not have to deal with diverse visual inputs.


While sim2real methods have made significant progress in the past few years, there is a very important distinction to make about the ‘reality gap’. In academia ‘real’ is often not defined strictly. Most of the time academics define ‘real’ as anything that works on a real robot as opposed to a simulated one. This means that ‘real’ usually ends up being defined as a highly simplified lab environment, with a dumbed-down version of a task that only needs to work 40% of the time. In industry ‘real’ is defined as the full task, operating in a complex and dynamic environment, that needs to work 99.9% of the time. Thus, the true reality gap isn’t sim2real but rather lab2industry.

Hybrid Sim2Real Learning

An approach that can help bridge the gap from simulation and simplified lab environments to industry is to combine learning in simulation with learning in the real world. Simulated data can be used for tasks in which sim-to-real works well and real data can be used where simulated data falls short.

One of the first sim-to-real demonstrations for learning complex legged behavior does just this. The authors train an end-to-end controller for legged locomotion. Their key insight is to train the actuator’s torque controller, which contains extremely difficult dynamics to accurately simulate, on real robot data, and use the simulator to learn the position controller, which can be easily simulated with a kinematic model.

For some tasks though, training on synthetic data at all just doesn’t work well enough and so we must obtain and train on real data.

Real-world Reinforcement and Self-Supervised Learning

A key distinction between imitation learning and reinforcement learning is the strength of the ‘reward’ signal. Imitation learning uses expert demonstrations, which primarily encode how a task should be done as opposed to ways a task should not be done. Reinforcement learning uses both good and bad examples, but because there are usually a lot more wrong ways to do something than right ways, the trial-and-error data collection process is very inefficient.

Determining the success (i.e. reward) of each of thousands of possible actions can be challenging to automate. Rewards are typically hand-engineered or shaped for a specific task and it isn’t trivial to build a single highly generalizable reward function that works across many different tasks.

One type of RL, model-based RL, first learns a model which approximates the underlying system dynamics and then uses this model for planning or to subsequently train a policy. Model-based learning enables modifying the goal of the robot within an environment since the dynamics of that environment are understood.
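Here is a minimal sketch of that idea on a toy one-dimensional system. The dynamics, the hidden gain, and the one-step random-shooting planner are all assumptions made for illustration; real robots need far richer models and planners.

```python
import random

TRUE_GAIN = 0.5  # property of the "real" system; the learner never reads this

def real_step(x, u):
    """The real system: position x moves by TRUE_GAIN * u."""
    return x + TRUE_GAIN * u

def collect_transitions(num_steps, rng):
    """Gather (x, u, x_next) tuples by applying random actions."""
    data, x = [], 0.0
    for _ in range(num_steps):
        u = rng.uniform(-1.0, 1.0)
        x_next = real_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_gain(transitions):
    """Least-squares estimate of b in the model x_next = x + b * u."""
    num = sum(u * (x_next - x) for x, u, x_next in transitions)
    den = sum(u * u for _, u, _ in transitions)
    return num / den

def plan_action(x, goal, gain, rng, num_samples=200):
    """One-step random shooting: sample actions, predict each outcome with
    the learned model, and keep the action predicted to land nearest the goal."""
    candidates = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    return min(candidates, key=lambda u: abs(x + gain * u - goal))

def run_to_goal(goal, gain, rng, steps=30):
    x = 0.0
    for _ in range(steps):
        x = real_step(x, plan_action(x, goal, gain, rng))
    return x

rng = random.Random(0)
gain_hat = fit_gain(collect_transitions(50, rng))
final_a = run_to_goal(1.0, gain_hat, rng)   # first goal
final_b = run_to_goal(-2.0, gain_hat, rng)  # new goal, same learned model
```

Switching the goal from 1.0 to -2.0 needs no new data or retraining because only the planner's objective changes, which is the flexibility described above.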

Another type, model-free RL, directly learns a policy without an explicit representation of the system dynamics at play. While more generalizable, model-free RL has the classic trade-off of also being more data hungry.

While promising, robotic reinforcement learning hasn’t really seen much success in industrial applications. There are several reasons why.

First, RL in the real-world is time consuming because, unlike in a simulator, the number of experiments can’t be parallelized with respect to time and hardware. Additionally, real robot RL requires a significant engineering effort to ensure things don’t break during data collection. This constraint naturally leads to less data diversity since robots can only explore within an engineered environment rather than any and every environment. Less data diversity usually means the robot can only perform well in environments similar to the one it was trained on which kind of defeats the purpose.


For a more detailed summary on the challenges of RL see Alex Irpan’s blog post.

Teleoperation and Imitation Learning

While self-supervised learning can be used to collect a lot of real robot experience via trial and error, these methods are data inefficient because they require exploring vast state and action spaces. Additionally, defining good and bad actions (i.e. reward shaping) can be unintuitive and result in unintended behaviors (see the boat controller below).


The search space for RL can be reduced significantly by sampling from a much narrower distribution centered around human demonstrations, but acquiring ample human demonstrations requires intuitive methods of controlling robots. A good way to collect human demonstrations for robots is through teleoperation.
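A toy experiment shows how much a demonstration can narrow the search. Everything here is hypothetical: a one-dimensional action, a hidden target value, and a single imperfect demonstration used as the center of a Gaussian exploration distribution.

```python
import random

TARGET = 0.7      # hypothetical action that solves the task (unknown to the robot)
TOLERANCE = 0.05  # how close an action must be to count as a success
DEMO = 0.65       # a human demonstration: near the target, but not exact

def count_successes(sample_action, num_trials, rng):
    """Count how many sampled actions land within TOLERANCE of the target."""
    hits = 0
    for _ in range(num_trials):
        if abs(sample_action(rng) - TARGET) <= TOLERANCE:
            hits += 1
    return hits

def explore_uniform(rng):
    """Uninformed exploration over the whole action range."""
    return rng.uniform(-1.0, 1.0)

def explore_near_demo(rng):
    """Exploration from a narrow Gaussian centered on the demonstration."""
    return rng.gauss(DEMO, 0.1)

uniform_hits = count_successes(explore_uniform, 1000, random.Random(0))
demo_hits = count_successes(explore_near_demo, 1000, random.Random(0))
```

In expectation about 5% of the uniform samples succeed, versus roughly a third of the samples drawn around the demonstration, and the gap would widen quickly in higher-dimensional action spaces.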

There are various interfaces for teleoperating robots, ranging from complex haptic controllers with virtual reality headsets to hardware minimalistic systems, like mobile devices and marker-less pose estimation. More dexterity typically means more hardware and cost, but a crowd-sourceable interface should utilize ubiquitous hardware. In an ideal scenario robot data collection can be a byproduct of useful, crowd-sourced telerobotic labor.

While crowdsourced teleoperation is conceptually appealing, one practical challenge with remote teleoperation is handling latency. Controlling a robot 1 mile away with 20 ms latency would be a completely different experience compared to controlling the Mars rover located 34 million miles away with 7 minutes of latency.
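The speed of light sets a floor on these numbers. The back-of-the-envelope calculation below uses an approximate value for the speed of light and ignores relays, queuing, and processing, all of which add delay on top in a real link.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # approximate value in vacuum

def one_way_delay_s(distance_miles):
    """Minimum one-way signal travel time over a given distance."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_S

mars_one_way_min = one_way_delay_s(34e6) / 60   # about 3 minutes
mars_round_trip_min = 2 * mars_one_way_min      # about 6 minutes
nearby_one_way_ms = one_way_delay_s(1) * 1e3    # about 0.005 ms

print(f"Mars, one way:    {mars_one_way_min:.1f} min")
print(f"Mars, round trip: {mars_round_trip_min:.1f} min")
print(f"1 mile, one way:  {nearby_one_way_ms:.4f} ms")
```

For the nearby robot, propagation is a negligible part of a 20 ms budget, so that latency is mostly networking and processing and can be engineered down. For Mars, minutes of delay are unavoidable physics.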


One way to sidestep latency altogether is to use asynchronous control, which is what NASA does. The trade-off is a sacrifice in fine temporal motor control and the ability to react quickly, which is why the teleoperators at the DRC couldn’t save their robots from falling.


3rd Person Imitation Learning

Moving beyond teleoperation altogether, it would be convenient to teach robots in the same way humans learn: by watching others. The challenge in doing this with robots is the radical domain shift in learning a skill that is demonstrated with a different morphology than that of the robot. A human arm and hand look and move differently than a robot arm. Asking the robot to perform the same task with a different ‘body’ only adds to the learning complexity. The advantage of 3rd person imitation learning is that we can train robotic policies from thousands of already existing YouTube videos of people doing stuff.

Learning from Play

Learning from play combines the advantages of simulation and teleoperation. The idea is to have humans play by teleoperating virtual robots to interact within virtual environments. The distinction between play data and task demonstration data is that play is more interested in understanding an environment than in achieving any single resulting outcome state. This trait allows for efficient exploration of various ways to interact with and exploit an environment.


The motivation for learning from play is to enable robots to navigate and interact with environments by learning how to accurately decode reusable latent plans rather than learning the prohibitively large encoding for the full state-action space. Additionally, the properties of play as compared to expert demonstrations allow learned policies to be more robust to perturbations and recover from failures more effectively.

What’s Next?

The holy grail of machine learning and artificial intelligence is something people refer to as AGI or Artificial General Intelligence. It essentially refers to a level of intelligence in machines that matches or exceeds a human’s ability to learn new skills on the fly with just one or a few demonstrations of any new task.

There is a lot of hype surrounding this term and while it’s not yet clear if and how it will be achieved, there is a loose consensus that our algorithms need to be pushed towards generalizability through hierarchy. Transfer learning, meta-learning, and Neural Task Graphs are steps in the right direction. These methods train models on a subset of all tasks with the goal of creating a model that easily transfers or generalizes to any new task by sharing previously learned low-level skills like grasping, pushing, and moving. Nearly all manipulation tasks require some sequential combination of these low-level skills. Thus, any task can be represented as a composition of low-level skills coordinated in time and space by a higher-level controller. This hierarchical framework enables transfer and generalizability because models need not be retrained on large datasets for every new task. Rather, the same low-level skills are shared, and the higher-level composition of these skills, which makes up most tasks, is the focus of learning for each new task.
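The composition idea can be sketched with hand-written skills on a symbolic toy world. In a learned system the skills would be policies and the high-level controller would itself be learned; here both are hard-coded, and every name and state field is hypothetical.

```python
# Low-level skills shared by every task. Each takes a world state and one
# argument, and returns a new state without mutating the old one.
def move(state, location):
    new_state = dict(state, gripper=location)
    held = state["holding"]
    if held is not None:  # a held object travels with the gripper
        new_state["objects"] = dict(state["objects"], **{held: location})
    return new_state

def grasp(state, obj):
    if state["objects"][obj] != state["gripper"]:
        raise ValueError(f"gripper is not at {obj}")
    return dict(state, holding=obj)

def release(state, _unused=None):
    return dict(state, holding=None)

SKILLS = {"move": move, "grasp": grasp, "release": release}

# High-level controllers only decide which skills to run and in what order.
def pick_and_place_plan(state, obj, destination):
    return [("move", state["objects"][obj]), ("grasp", obj),
            ("move", destination), ("release", None)]

def clear_table_plan(state, bin_location):
    plan = []
    for obj in state["objects"]:
        plan += pick_and_place_plan(state, obj, bin_location)
    return plan

def execute(state, plan):
    for skill_name, arg in plan:
        state = SKILLS[skill_name](state, arg)
    return state

start = {"gripper": "home", "holding": None,
         "objects": {"cup": "table", "box": "shelf"}}
final = execute(start, clear_table_plan(start, "bin"))
```

Adding a new task here means writing a new plan function, while the three skills stay untouched, which mirrors the claim that only the higher-level composition needs to be learned per task.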


While not exhaustive, I’ve outlined a few of the most prominent challenges roboticists face in pursuit of developing more intelligent robots that can help us outside of factories. More importantly I’ve outlined some of the potential avenues of research I’m bullish on. For those of you who want to take a deeper dive into robot learning here is a recent survey paper that cites 440 other publications!

My Utopian view of automation

The ultimate goal of all these robots is not to automate away all of our jobs. Instead, automation should free humans from the trillions of hours we spend each year on monotonous manual labor that could otherwise be refocused towards solving our seemingly endless list of more impactful problems like curing disease, solving global warming, scaling clean energy and food production and making humans an interplanetary species. To me, that sounds better than using people — and the many superpowers we possess — to pick and place things all day, every day.

About Me

I’m a robotics PhD student in Stanford’s Vision and Learning Lab and the Founder of Nimble, a startup building a robotic hive mind to automate complex tasks in warehouses.

Thanks to my awesome friends and colleagues Chip Huyen, Evan Ackerman, Andrey Kurenkov, and Jordan Dawson for their feedback and suggestions on this post!

Twitter: @simonkalouche

Data Driven Investor

from confusion to clarity, not insanity
