The evolution of intelligence in robots: Part 1

Simon Kalouche
Oct 22, 2019 · 11 min read

When we think of the Skynet scenario, videos of increasingly nimble anthropomorphic machines from Boston Dynamics awe, or terrify, us. Just a couple of years ago these humanoids developed an ability to get up when we knocked them down. Then, they began doing parkour with finesse around our man-made obstacles. Now, they’re elegantly outperforming most humans in gymnastics. It all tells a deceptive story of the rapid evolution of intelligence in robots. In reality, these robots are still far from possessing the intelligence to fold our laundry, let alone become our overlords.

There’s a lot of hype playing into the robot takeover narrative. The purpose of this blog post is to present some exciting breakthroughs in robotics research while separating fact from fiction.

Robots have been around and widely used in manufacturing since the 1960s. While we’ve been calling them robots for all this time, a more fitting name would be ‘reprogrammable motion machines’. They are explicitly programmed to repeat trajectories exactly the same way, every time. They lack the intelligence to self-adapt if their environment or task changes even slightly.

Now fast forward to today: almost nothing has changed. Nearly all deployed robot arms are still not intelligent and are confined to highly structured manufacturing environments. Yet, if we look beyond the walls of factories, the world is full of monotonous tasks across logistics, delivery, farming, construction, and transportation that are ripe for automation. Many of these labor-intensive jobs are not yet automated because they inherently bear an enormous amount of variability, the Achilles’ heel of robots.

One example of a repetitive yet enormously varying task is ‘picking and packing’ in eCommerce fulfillment centers. This job requires proper handling of millions of different products — all with varying sizes, shapes, weights, colors, textures, stiffnesses and fragilities. There isn’t a one-grasp-fits-all solution to handle any object. As humans, we take our innate ability to grasp, assemble, disassemble, reorient, fold, pack and generally manipulate any object for granted. For robots, this is very hard.


In addition to lacking general intelligence, robots are still very expensive. The mainstream arms from UR, Kuka, Franka, Yaskawa, Fanuc, and ABB start at $20k and can easily cost over $100k.

The inability to handle variability, along with a high price tag, makes it difficult to justify the economics of most robotic applications, which is why many robotics startups fail. If you replace a burger flipper in a fast-food restaurant with a robot, you are not replacing one employee. You’re replacing a lot less. One minute a person can be flipping burgers; the next they can be making fries, wiping tables, cleaning bathrooms or taking orders. Replacing a small fraction of a minimum-wage employee is not financially compelling, especially given the cost and practical complexity of implementing such a piece of technology. A lot of robot applications suffer from this dilemma.

If your goal is to build a valuable robot application and deploy it successfully in the real world today, then my recommendation is to consider the following:

1. The price you charge your customer should be a fraction of the total cost of labor you are replacing over some reasonable period of time (usually no more than 2 years). Alternatively, the demand for the labor your robot can perform should be extremely high, to the point where the pool of available human labor is not willing or able to provide the total labor needed.

2. Robots aren’t people. Retro-fitting an environment designed for humans with robots will ALWAYS be less optimal, in the long-run, than designing the environment around the robot’s capabilities. Robots love structure, so give them as much structure as possible so long as it does not impose unreasonable cost or additional labor on the customer. In the same vein, we should not design robots to exactly mimic things we find in nature. Just because humans do things a certain way doesn’t mean there isn’t a simpler, more optimal solution enabled by modern engineering. See Rubik’s cube example…

3. Build a solution that works really well for one task. In most applications the customer expects a solution, not a piece of the solution they then need to stitch together with other technologies to make an actual solution. In addition, this solution is usually expected to work 99.9% of the time. As good as 95% success (in a simplified lab scenario) might sound in academia, that won’t cut it in industry. Following the standard learning curve and Pareto principle, the last 5% is the hardest to obtain and almost always comes down to engineering edge cases versus fundamental research. Focus on one whole product and deliver it super reliably.

4. Choose a task that robots are capable of reliably performing within a few years. If you can’t fully automate the task with 99.9% reliability today, then narrow the scope of the task, add more structure, or use teleoperators to handle edge cases. As long as the teleoperator-to-robot ratio is economical, teleoperators can give robots the dexterity they need today while helping train them to become increasingly intelligent and fully autonomous over time. Nimble, Phantom Auto, Kiwi, and others do this.
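
The arithmetic behind points 1, 3 and 4 can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names, the 50% price fraction, and every number passed in are assumptions made up for the example, not figures from a real deployment.

```python
def max_viable_price(annual_labor_cost, fraction_replaced,
                     payback_years=2.0, price_fraction=0.5):
    """Upper bound on price: a fraction of the labor cost actually replaced.

    fraction_replaced and price_fraction are hypothetical knobs. A burger
    flipping robot, for instance, may replace only part of one employee.
    """
    labor_replaced = annual_labor_cost * fraction_replaced * payback_years
    return labor_replaced * price_fraction


def expected_failures(attempts, success_rate):
    """Expected number of failed attempts at a given per-attempt success rate."""
    return attempts * (1.0 - success_rate)


def teleop_cost_per_robot(operator_annual_cost, robots_per_operator):
    """Annual teleoperation cost carried by each robot in the fleet."""
    return operator_annual_cost / robots_per_operator


if __name__ == "__main__":
    # Illustrative inputs only.
    print(max_viable_price(30_000, 0.25))
    print(expected_failures(10_000, 0.95), expected_failures(10_000, 0.999))
    print(teleop_cost_per_robot(40_000, 10))
```

Under these made-up inputs, 10,000 attempts at 95% success leave roughly 500 failures for someone to clean up, versus roughly 10 at 99.9%, which is why the lab number rarely survives contact with a customer.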

Boston Dynamics, the most advanced robotics company in the world, has struggled to successfully commercialize its robots because of their high cost and unclear value. While their videos amaze us all, at the end of the day two metrics, cost and value, determine the adoption of any technology. Robots, no matter how cool, are no different.

This blog post discusses the progress being made to improve robots’ cost:value ratio (what I’ll call the Jetsons ratio) by orders of magnitude so that we can continue to deploy more robots in the real world in increasingly challenging and useful tasks.

At the heart of the currently terrible Jetsons ratio is a classic chicken and egg problem. Robots are expensive because they aren’t yet mass produced. They aren’t yet mass produced because they don’t yet offer real value to the common consumer. They don’t yet offer value to the common consumer because they aren’t yet intelligent. They aren’t yet intelligent because there isn’t a large-scale dataset on which to train them. There isn’t a large-scale dataset because robots aren’t mass produced. We’re back to the beginning.

However, there is an out to this vicious cycle and it’s through riding the success waves of the commercial drone industry and deep learning research.

Recent market factors like the explosion of the consumer drone and scooter industries have justified the mass production of robotics-grade motors and electronics. While not specifically designed for use in articulated robot limbs, these components have catalyzed the emergence of a new generation of low-cost robots.


Research from the legged robot community produced a new low-cost but high-performance actuator called the quasi-direct-drive actuator. Robots like GOAT, Minitaur, and MIT’s Mini Cheetah have been designed to balance the inherent trade-offs between force control (required for safe interaction with humans), high torque density (required for interacting with household objects with a reasonably sized robot), mechanical robustness, and low cost. Similarly, the $5,000 Blue arms utilize the same quasi-direct-drive actuation scheme to enable a capable, force-controlled manipulator at low cost.

The secret to these super low-cost robots was retro-fitting mass-produced ‘drone’ motors with custom drive electronics, low-cost magnetic encoders, single-stage transmissions and advanced field-oriented control. This combination forged the path for high-performance robotic actuators at a tenth the cost of traditional robotics drives from vendors like Maxon, AMC, Elmo, Harmonic Drive, etc. This innovation will be an inflection point in the coming robot revolution.

Deep learning, while not the answer to all our problems, offers the promise of liberating robots from manufacturing into use-cases with significantly more variability. Instead of programming robots explicitly for any and every scenario, deep learning, while data hungry, leverages experience to learn control strategies that can adapt to new scenarios on the fly without explicit instruction.

Unlike most deep learning applications, which perform visual understanding and reasoning, robots need to be able to act in response to their perceived environment. Doing so requires a precise spatiotemporal understanding of the world. This requires significantly more data than using a neural network to determine if a picture contains a dog, cat, or airplane.

The lack of a diverse, large-scale robot dataset is at the heart of our chicken and egg problem, and there isn’t much of a consensus on how to collect a widely useful one. Unlike video, audio, pictures, and text, which are abundant online and in everyday life, robot data is scarce. Collecting data on real robots is time-consuming, potentially dangerous, and expensive.

With a deployed fleet of low-cost robots, learning is certainly scalable though. Each robot can learn from every other robot’s collective experience so that every new robot deployed in a distributed network need not be retrained — a true robot hive mind.
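
As a toy illustration of the hive mind idea, the sketch below pools grasp outcomes from several robots so that a newly deployed robot can start from the fleet’s experience instead of from scratch. The FleetMemory class, its methods, and the strategy names are hypothetical and invented for this example; a real system would more likely share training data or network weights than simple counts.

```python
from collections import defaultdict


class FleetMemory:
    """Hypothetical shared store of grasp outcomes pooled across robots."""

    def __init__(self):
        # strategy name -> [successes, attempts], summed over the whole fleet
        self._stats = defaultdict(lambda: [0, 0])

    def upload(self, robot_id, strategy, success):
        """Record one attempt; robot_id only documents where the data came from."""
        stats = self._stats[strategy]
        stats[0] += int(success)
        stats[1] += 1

    def best_strategy(self):
        """Return the strategy with the highest pooled success rate."""
        return max(self._stats,
                   key=lambda s: self._stats[s][0] / self._stats[s][1])


if __name__ == "__main__":
    memory = FleetMemory()
    memory.upload("robot_a", "pinch", True)
    memory.upload("robot_b", "pinch", False)
    memory.upload("robot_b", "suction", True)
    # A brand-new robot_c can query the pooled experience immediately.
    print(memory.best_strategy())
```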


Before deploying a fleet of low-cost robots, we first need to answer 4 questions to ensure that we can properly learn from the collected robot experiences.

In the context of creating a personal, in-home robot we would want the same robot to work equally well in my home and yours. But the problem is every home is radically different. Different furniture, different lighting, different flooring and layouts, different door knobs and appliances … different everything. How do we determine the appropriate sample size and diversity of data to be collected so that robots don’t overfit to the set of homes on which they were trained and instead can generalize to open doors or clean up rooms equivalently well in any home?


Do we need visual data from cameras; 3D point cloud data from depth imagers or LIDAR; trajectory and motion data from encoders; tactile, proprioceptive or haptic data from load cells, soft artificial skin sensors, or the GelSight sensor; or some complex combination thereof?

Different tasks most likely require different sensing modalities. Grasping an item may only require cameras but reorienting the object once it’s been grasped may be done more efficiently with some form of haptic information.

The same question exists for self-driving cars. On one hand, Elon Musk claims that fully self-driving vehicles can be achieved without expensive LIDARs. On the other hand, many top AI researchers and other self-driving car companies like Waymo disagree. There isn’t a strong consensus on the minimal set of sensors needed.


Possibly the most difficult question is determining how to inform robots if their actions during each experience are successful so that they can learn which actions lead to success and which lead to failure. Robots can be engineered in ways to do this automatically via self-supervision or reinforcement learning. Alternatively, humans can manually provide demonstrations or annotations indicating successful versus failed actions. The hard part is determining a generalizable framework for defining rewards such that each task doesn’t require its own list of finely tuned conditions or demonstrations in order to classify behaviors as good or bad.
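
One simple form of self-supervision for grasping is to let the robot label its own attempts: after closing the gripper and lifting, check whether something is still holding the fingers apart. The sketch below assumes a parallel-jaw gripper that reports its finger gap in meters; the function name and the 2 mm threshold are hypothetical.

```python
def grasp_reward(finger_gap_after_lift_m, min_gap_m=0.002):
    """Sparse self-supervised reward for one grasp attempt.

    Returns 1.0 if an object still holds the fingers apart after lifting,
    and 0.0 if the gripper closed on (almost) nothing.
    """
    return 1.0 if finger_gap_after_lift_m > min_gap_m else 0.0


if __name__ == "__main__":
    print(grasp_reward(0.03))   # fingers held 3 cm apart by an object
    print(grasp_reward(0.0))    # fingers fully closed, nothing grasped
```

Note that this check only makes sense for grasping. Opening a door or folding a shirt would each need a different hand-written condition, which is exactly the generalization problem described above.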

Do we learn end-to-end models, or is it smarter to utilize human intelligence to frame tasks in ways that simplify what the neural network must learn? For example, with grasping we can try to learn end-to-end 6-DoF grasp poses directly from raw pixel inputs. Alternatively, we can frame grasping as an image segmentation problem, using fully convolutional networks to classify each pixel as a good or bad place to grasp the object and then, as a post-process, using surface normals computed from depth images to obtain the full 6-DoF grasp pose. The latter method is more data efficient since the problem is simplified and the action space is reduced from 6 dimensions to 2. The classic trade-off of this simplified approach is that the method no longer generalizes to many different manipulation tasks: each subsequent task will require significant re-engineering and data collection.
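
The post-processing step in the segmentation framing can be sketched roughly as follows. The code assumes a per-pixel grasp quality map (the network output, here just a nested list), an aligned depth image in meters with at least 2 rows and 2 columns, and pinhole camera intrinsics fx, fy, cx, cy. The function name is hypothetical, the surface normal is a crude finite-difference estimate, and the roll about the approach axis is left free, as it would be for a suction cup.

```python
import math


def grasp_from_heatmap(quality, depth, fx, fy, cx, cy):
    """Turn a 2-D pixel choice into a 3-D position plus approach direction."""
    rows, cols = len(quality), len(quality[0])
    # 2-D action: the pixel with the highest predicted grasp quality.
    v, u = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: quality[rc[0]][rc[1]])
    z = depth[v][u]
    # Pinhole deprojection of the pixel into a 3-D point (camera frame).
    position = ((u - cx) * z / fx, (v - cy) * z / fy, z)

    # Depth slope per pixel (central difference, one-sided at the border).
    u0, u1 = max(u - 1, 0), min(u + 1, cols - 1)
    v0, v1 = max(v - 1, 0), min(v + 1, rows - 1)
    dz_du = (depth[v][u1] - depth[v][u0]) / (u1 - u0)
    dz_dv = (depth[v1][u] - depth[v0][u]) / (v1 - v0)
    # One pixel spans roughly z/fx meters, so rescale the slope to meters.
    # The approach direction points into the surface, against its normal.
    nx, ny, nz = -dz_du * fx / z, -dz_dv * fy / z, 1.0
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    approach = (nx / norm, ny / norm, nz / norm)
    return position, approach


if __name__ == "__main__":
    quality = [[0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.9, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
    flat = [[0.5] * 4 for _ in range(3)]  # a flat surface 0.5 m away
    print(grasp_from_heatmap(quality, flat, 100.0, 100.0, 2.0, 1.0))
```

The network only has to answer a 2-D question (which pixel?); geometry supplies the rest of the pose. That is the source of the data efficiency, and also of the lost generality.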


Similarly, OpenAI trained a high-degree-of-freedom Shadow Hand to solve a Rubik’s cube. However, instead of training the neural network to ‘just solve the cube’ in an end-to-end fashion, they broke the problem down into many sub-problems using a pipelined approach. To actually solve the cube’s puzzle they used a conventional cube-solving algorithm, Kociemba’s algorithm. Instead of using a camera to determine the state and orientation of the cube through visual perception, as humans do, they retrofitted the cube with a variety of internal sensors. Instead of learning to generally manipulate any face of the cube, they used human intuition about which face the hand was best at rotating (the top face) and constrained the solution moves to only rotate the top face. In this way they took a complex problem like solving a Rubik’s cube and significantly simplified the learning task by narrowing the scope to simply learning 1) how to rotate just the top face of the cube and 2) how to reorient any face to be the top face of the cube. They left the rest to conventional engineering and non-learned algorithms.
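
To see how little the learned policy has to cover under this decomposition, here is a minimal sketch of the translation layer. It is not OpenAI’s code: the function and primitive names are hypothetical. It takes a solution in standard face-turn notation (for example "R U2 F'", the kind of string a conventional solver produces) and rewrites it using only the two learned skills. Faces are identified by their fixed center pieces, so no orientation tracking is needed.

```python
def to_primitives(solution, top="U"):
    """Rewrite a face-turn sequence using only two primitives.

    ("reorient_to_top", face): bring the named face to the top.
    ("rotate_top", n): turn the top face by n clockwise quarter turns
    (n = -1 means one counter-clockwise quarter turn).
    """
    turns = {"": 1, "2": 2, "'": -1}
    primitives = []
    for move in solution.split():
        face, suffix = move[0], move[1:]
        if face != top:
            # The face to turn is not on top yet, so reorient the cube first.
            primitives.append(("reorient_to_top", face))
            top = face
        primitives.append(("rotate_top", turns[suffix]))
    return primitives


if __name__ == "__main__":
    for step in to_primitives("R U2 F' F"):
        print(step)
```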

While manually engineering subgoals makes the learned skills less generalizable, it makes the task at hand easier to learn and the performance more reliable. We’ll talk more about the end-to-end vs pipelined approaches in Part 2.

There are many system architectures whose design choices have inherent trade-offs and implications for how much data you need and how well your eventual model can generalize to different robots, different tasks and different environments. Knowing what to have the neural network learn and what kind of data it should learn from can significantly reduce the amount of data needed.

Part 2 of this post discusses clever, state-of-the-art research directions for making progress on each of the 4 questions outlined in this post!

About Me

I’m a robotics PhD student in Stanford’s Vision and Learning Lab and the Founder of Nimble, a startup building the robotic hive mind to automate complex tasks in warehouses.

Thanks to my awesome friends and colleagues Chip Huyen, Evan Ackerman, Andrey Kurenkov, and Jordan Dawson for their feedback and suggestions on this post!

Twitter: @simonkalouche
