(Picking Up)² Utensils with Reinforcement Learning

Albert Wu
Dexai Robotics
Nov 19, 2021
The utensil changing setup on Alfred.

Versatile utensil usage is a key feature of Alfred, our beloved sous chef robot at Dexai Robotics. To prepare diverse recipes (and be food-safe certified by NSF), Alfred learns on its own how to pick up utensils from ingredient containers. This blog post gives a glimpse into how Alfred picks up utensils autonomously.

To the best of our knowledge, Alfred is the first production robotic system to deploy a reinforcement learning policy for physical tasks.

Why is picking up utensils difficult?

For those familiar with the robotics literature, the utensil-change task may sound similar to classic peg-in-hole insertion. That is partially true. Fundamentally, picking up a utensil is indeed a peg-in-hole problem. However, the small details make it much more difficult.

Alfred works in a highly dynamic, uncontrolled environment — the kitchen. Therefore, it faces unique challenges that traditional robotic applications do not encounter. Here are a few key features that make utensil change particularly challenging:

  1. The utensils have a complex mating geometry
    Alfred’s utensils are designed such that they can be actuated with a mechanical driveshaft. This way, Alfred can perform more complicated food manipulation, such as opening and closing a pair of tongs. However, the presence of a driveshaft also means it’s more difficult to mount the utensil correctly.
  2. The utensils can move
    Food safety regulations require the utensils to remain in contact with the ingredients when not actively in use. As a result, the utensil is not always resting in the same place in the bin, and both the utensil and Alfred are moving parts in the utensil-change task (albeit the utensil is passive). An inaccurate action from Alfred might push the utensil away. In contrast, in a classic peg-in-hole insertion task only one of the peg and the hole moves, and a suboptimal action does not make the rest of the task any harder.
  3. The location of the utensil is unknown
    Computer vision is the only way Alfred can locate the utensil; no ground-truth utensil location is available otherwise. Measuring tool locations to the required millimeter precision using computer vision alone is especially challenging. In traditional industrial settings, the locations of the objects are often known exactly.
  4. There is no direct force/torque sensing
    To keep the system affordable, Alfred does not have a built-in end effector force/torque sensor, a prohibitively expensive sensor on which many peg-in-hole insertion methods rely. Alfred does not need that :)
The different utensils Alfred works with.

Sensing the utensils: perception and proprioception

Perception

To see where the tool is, Alfred uses an RGB-D camera to observe tags engraved on the top of each utensil. From the tag observations, Alfred estimates where the tool is in the world, giving ᵂXᵀ, the transformation from the world frame to the tool frame.

Utensils & tags, as seen by Alfred’s camera.
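
In practice, estimating ᵂXᵀ amounts to chaining rigid-body transforms: the camera pose in the world, the detected tag pose in the camera frame, and a fixed tag-to-tool offset. The sketch below is our own illustration using 4×4 homogeneous matrices; the function name, frame names, and the assumption of a calibrated camera extrinsic are hypothetical, not Dexai's actual pipeline.

```python
import numpy as np

def world_from_tool(W_X_C: np.ndarray,
                    C_X_tag: np.ndarray,
                    tag_X_T: np.ndarray) -> np.ndarray:
    """Estimate the tool pose in the world by chaining 4x4 homogeneous transforms.

    W_X_C   -- world-from-camera, from extrinsic calibration (assumed known)
    C_X_tag -- camera-from-tag, from detecting the engraved tag in the RGB-D image
    tag_X_T -- tag-from-tool, a fixed offset known from the utensil design
    """
    return W_X_C @ C_X_tag @ tag_X_T
```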

Proprioception

Proprioception is the sense of body position and movement (Wikipedia). Alfred has real-time proprioception through its joint encoders. This gives the pose of the robot end effector (where the tool attaches) in the world, ᵂXᴱ.

Combining the two

The transformation from the robot end effector to the tool is then a simple product of the two: ᴱXᵀ = (ᵂXᴱ)⁻¹ ᵂXᵀ.
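
In code, with 4×4 homogeneous matrices, this composition is a one-liner. A minimal sketch (the function name and variable naming are ours):

```python
import numpy as np

def tool_in_end_effector(W_X_E: np.ndarray, W_X_T: np.ndarray) -> np.ndarray:
    """E_X_T = (W_X_E)^-1 @ W_X_T: the tool pose relative to the end effector.

    W_X_E -- end-effector pose in the world, from proprioception (joint encoders)
    W_X_T -- tool pose in the world, from perception (tag detection)
    """
    return np.linalg.inv(W_X_E) @ W_X_T
```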

Feeling the utensil: tactile feedback

Alfred uses the joint torque sensors to estimate how much force and torque are acting on the robot end effector. Since the utensil is the only object with which Alfred comes into contact, all external forces and torques on Alfred come from this contact interaction. The external wrench F (force and torque) is reflected in the measured joint torques τ through the Jacobian J: τ = Jᵀ F.

However, the story doesn't end there. With 7 joints, Alfred has 7 degrees of freedom (dof), while the end-effector wrench F in the 3D world has only 6 dof. J is therefore not square, its inverse is not unique, and we cannot recover F just by "inverting" J. Fortunately, the physically correct F can be obtained with the dynamically consistent Jacobian inverse J̄. This gives us the mapping from τ to F: F = J̄ᵀ τ.
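
For readers who want the algebra, here is a minimal numerical sketch of that mapping using Khatib's dynamically consistent generalized inverse. The joint-space inertia matrix M and the function name are our own additions for illustration; the post does not spell out Alfred's implementation.

```python
import numpy as np

def external_wrench(tau: np.ndarray, J: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Estimate the external wrench F on the end effector from joint torques.

    tau -- external joint torques, shape (7,)
    J   -- end-effector Jacobian, shape (6, 7), so that tau = J.T @ F
    M   -- joint-space inertia matrix, shape (7, 7)
    """
    M_inv = np.linalg.inv(M)
    Lambda = np.linalg.inv(J @ M_inv @ J.T)  # 6x6 operational-space inertia
    J_bar = M_inv @ J.T @ Lambda             # 7x6 dynamically consistent inverse
    return J_bar.T @ tau                     # 6-vector wrench: [force; torque]
```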

And that’s it! With a bit of rigid-body dynamics, Alfred can see and feel the utensil.

Learning to change utensils

To make Alfred learn to change utensils from these sensory inputs, we use reinforcement learning directly on hardware. During training, Alfred first performs a utensil change using its current policy — what Alfred currently thinks is the best way to change a utensil. Alfred then records the outcome and updates the policy based on it. The entire training process is autonomous. After enough attempts, Alfred learns how to change utensils!

Alfred working late into the night to learn utensil changing.
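
The loop itself is conceptually simple. The sketch below shows the episodic, on-hardware structure described above; the policy interface, reward, and update rule are placeholders, since the post does not disclose the specific algorithm Alfred uses.

```python
def train_on_hardware(policy, robot, num_attempts: int = 500):
    """Autonomous training loop: attempt, record the outcome, update the policy.

    `policy` and `robot` are hypothetical interfaces: the policy maps an
    observation to an action and can be updated from outcomes; the robot
    exposes its sensing and the utensil-change primitive.
    """
    for _ in range(num_attempts):
        observation = robot.observe()               # perception + proprioception
        action = policy.act(observation)            # current best guess
        success = robot.attempt_utensil_change(action)
        reward = 1.0 if success else 0.0            # placeholder reward
        policy.update(observation, action, reward)  # learn from the outcome
    return policy
```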

While reinforcement learning algorithms have been around for a long time, applications on physical systems have only started emerging in recent years. Reinforcement learning algorithms tend to have poor data efficiency, and the time required to train a physical system is often prohibitive. A common workaround is to train in simulation instead. But even "high-fidelity" simulations struggle to approximate the real world, especially in contact-rich scenarios like utensil changing, where the contact dynamics are particularly hard to simulate. The result is policies that work in simulation but not in the real world. Consequently, reinforcement learning has been largely restricted to robots in research labs — until now.

With our formulation of the utensil-change task, Alfred is able to overcome the data efficiency challenge and become the first of its kind — a production robot that learns to perform physical tasks.

Results

Training curve

The following figure shows how Alfred's performance progresses over time. The failure rate decreases as training goes on and Alfred learns to change utensils.

Utensil change failure rate over intervals of attempts. From left to right, the height of the i-th dark blue bar represents the failure rate in the interval [(i−1)Δt, iΔt). As training progressed (increasing i), the failure rate eventually dropped. The spikes in failure rate occurred as the algorithm explored different strategies, which ultimately led to a better policy.

Reinforcement learning in action

The following video shows Alfred attempting to change utensils before and after training. Before training, Alfred is more or less blindly guessing what it should do. After training, Alfred performs the task much more smoothly and quickly.

Video comparison of Alfred changing utensils before and after training.

What’s next?

Learning how to change utensils is just a start. With Dexai’s fleet of robots preparing food every day, we have lots of data for learning kitchen tasks. What would you like Alfred to learn next?
