Human Learning Journal: How to teach a robot to dance?

Valentin Kindschi
impactIA
Dec 15, 2020 · 6 min read

This weekly journal reports my ups and downs as a machine learning and robotics intern at the impactIA Foundation. Try to have fun reading it, but know that I have little fun writing it. You have been warned.

In my previous articles, I wrote several times about our dancing robot, Dai. However, I never explicitly explained how it is possible for a machine to learn how to dance… until now. By reading the following paragraphs, you will understand how robots decide what to do, how they learn from their mistakes and, more importantly, how to teach them to dance.

Usually in robotics, when we want to make a machine intelligent and let it make decisions dynamically, we use reinforcement learning algorithms. This kind of algorithm works a lot like parenting. Let me explain: the goal is to teach a young robot, freshly built, how to behave. At first, we let it do what it wants, which is mostly random movements, because it needs to discover what it is capable of. At the same time, we give it some guidance by offering a positive reward when it does something good and a negative reward when it does something that we consider bad. With time, the young robot, which loves rewards, will learn to perform the actions that yield the most reward and will avoid the ones that aren’t profitable. It’s that simple. To illustrate this process, here is a nice video showing how it was applied to learn to play hide and seek:

In practice, to implement a reinforcement learning algorithm inside a robot, we model the world and the possible actions using a Markov Decision Process (MDP):

Markov Decision Process

An MDP is a mathematical framework that divides time into discrete steps and the world into discrete states. A state can group many parameters, such as: the robot’s position in a room, the energy left in its batteries, the attractiveness of a pose, etc. Moreover, at each step, our robot has to decide what to do from a finite set of available actions, for example: “go left”, “go faster”, “lift this box”, “get down”, etc. Each action leads our robot from its current state to another one, and this state transition is very important because it defines the reward the robot will get: if the next state is a better situation, the robot gets a better reward; on the contrary, if the next state is worse, the reward won’t be as good. Here is an example of an MDP diagram for a single state transition:

Simple Markov Decision Process diagram
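If you prefer code to diagrams, here is a minimal Python sketch of such a one-step MDP. The states, actions and reward values are invented for illustration and simply mirror the spirit of the diagram above.

```python
# A tiny, hypothetical Markov Decision Process (values invented for illustration).
# Each (state, action) pair maps to (next_state, reward).
mdp = {
    (0, "A"): (1, 0.2),   # action A gives a small reward
    (0, "B"): (2, 1.0),   # action B gives a larger reward, so a trained robot prefers it
}

state = 0
for action in ("A", "B"):
    next_state, reward = mdp[(state, action)]
    print(f"state {state} --{action}--> state {next_state}, reward {reward}")
```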

On the diagram above, a robot in state 0 will most probably choose action B because it has the largest reward. However, at the beginning of the training, the robot does not know how much reward it will get for performing each action. Therefore, during training (which is usually done in simulation to avoid breaking hardware), the robot builds its own “map of expected reward” and updates it after each action with the reward it actually receives. Its goal is to find a trade-off between exploring new states and exploiting the rewards it already knows to be good.
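To make this “map of expected reward” more concrete, here is a minimal sketch of a tabular Q-learning update with an epsilon-greedy exploration rule, which is one classical way of implementing this trade-off. The learning parameters and the example transition are invented for illustration and are not taken from a real robot.

```python
import random
from collections import defaultdict

actions = ["A", "B"]
Q = defaultdict(float)                 # the "map of expected reward": Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

def choose_action(state):
    """Explore a random action with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Nudge the expected reward towards what the robot actually experienced."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One made-up transition: from state 0, the chosen action earned a reward of 1.0.
a = choose_action(0)
update(0, a, reward=1.0, next_state=2)
print(dict(Q))
```

After enough of these updates, picking the action with the highest expected reward in each state is exactly the policy discussed below.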

In the end, after many iterations, the robot has optimized its own policy, which gives the action to take in any given state to get the best reward. There are many algorithms to optimize this policy, depending on the number of states and actions and on how the rewards are modelled. In fact, in the previous example we imagined that there is a finite set of actions and states. However, in reality there are infinitely many states and actions, and it is not always possible to simplify the model with discrete ones.

How to teach a robot to dance?

Our dancing robot Dai

Now that we have the tools to understand the basic principles of reinforcement learning, let’s get back to our initial question: how to teach a robot to dance? If we take the example of Dai, the goal was to give it the minimum guidance to let it express itself as freely as possible. During Dai’s previous exhibition, its major constraint was the pink square on the ground, which it could not leave, as you can see in the picture above. Let’s see how to use a Markov Decision Process to teach Dai how to dance.

The first thing to do is to define Dai’s states and actions. To do so, it is necessary to understand its hardware, and more specifically its motors: as you can see in the picture, Dai possesses six legs, which are in fact six wheels, mounted on three long sliders. Therefore, a state was defined to take into account the speed of each wheel motor, the speed and position of the slider motors, plus Dai’s position in space. The same logic is used to define the actions, which are to increase or decrease the speed of each motor.
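To give an idea of what this could look like in code, here is a rough sketch of such a state and action set. The field names and the action encoding are my own guesses based on the description above, not Dai’s actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DaiState:
    # Hypothetical state fields inferred from the description above.
    wheel_speeds: List[float]      # current speed of each of the six wheel motors
    slider_speeds: List[float]     # current speed of the three slider motors
    slider_positions: List[float]  # current position of each slider
    x: float                       # Dai's position on the floor
    y: float

# Each action nudges one motor's speed up or down.
NUM_MOTORS = 6 + 3  # six wheel motors plus three slider motors
actions = [(motor, delta) for motor in range(NUM_MOTORS) for delta in (-1.0, +1.0)]
print(len(actions))  # 18 possible actions
```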

Now that the states and actions are defined, a rewarding method must be found. As we want Dai to dance, it receives a better reward when it moves and when it makes broad movements with its sliders. This way, Dai is discouraged from standing still. In addition, to prevent it from leaving its square, it receives a negative reward when its position is outside this area.
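A reward of this kind could look roughly like the sketch below. The weights and the penalty are invented for illustration; they are not the values actually used for Dai.

```python
def reward(wheel_speeds, slider_speeds, inside_square):
    """Hypothetical reward: encourage movement and broad slider motion,
    punish leaving the pink square. Weights are invented for illustration."""
    if not inside_square:
        return -10.0                                  # strong penalty for leaving the square
    movement = sum(abs(v) for v in wheel_speeds)      # reward for moving around
    sliding = sum(abs(v) for v in slider_speeds)      # reward for broad slider movements
    return 0.1 * movement + 0.5 * sliding

# Example: moving inside the square earns a positive reward.
print(reward([1.0] * 6, [0.5] * 3, inside_square=True))
```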

From there, we can start Dai and let it explore the available states and actions, creating its own policy. Of course, in the beginning it might go outside the square and it might not move a lot. But the more time passes, the better it understands where it is allowed to stay and which actions to perform to get the best rewards.

Seeing Dai dance is very interesting because we can recognize some patterns, or repeated dance moves, that it seems to like because they give good rewards. Moreover, as its policy is stored between performances, it never starts from nothing, and each time it performs it invents new moves or reinvents old ones.

It should be noted that, as we wanted Dai to be as creative as possible, we tried to have a very simple reward system, which leaves it a lot of freedom. We wanted to see Dai’s vision of dance. However, in a different project we could imagine using more constraining rewards, making a robot mimic classical dance or hip-hop by giving large rewards when specific types of movements are performed.

Conclusion

To conclude, I presented in a simple way the basic working principle of reinforcement learning in robotics. The words I used in this article are the ones used in scientific research, and I invite you to shine at your Christmas dinner by explaining how a policy links states and actions to optimize rewards.

If we listen to the media, robots can be scary. The only way not to be afraid of them is to understand how they work and how they think. You might not be able to design a robot chef to help you cook your turkey, but now that you can understand your vacuum cleaner, I am sure that you will stop thinking that it will devour you when you are asleep.

An EPF engineer in robotics, I am currently working on robotics and AI development at impactIA in Geneva, Switzerland.