How to train your Robot Arm?

Training a 6 axis robot arm using Unity ML-Agents

Raju K
XRPractices

The previous generation of robots was algorithm-driven. Robot arms in particular use “Inverse Kinematics” algorithms to calculate how to move the arm to a desired location. Modern robots are increasingly AI-driven. In this article we are going to see how to train a robot arm controller using ML so that it learns inverse kinematics on its own.

The challenge of ML for robotics is the hardware: it comes at a cost and demands meticulous setup and plenty of space. It therefore becomes really important, and the need of the day, to train ML models for robotics in a simulated environment. Unity has recently released the ML-Agents package, which does just that.

If you are new to Unity and Unity development, I recommend reading the “Let’s get started with Unity” articles by Neel.

Most ML training modules are written in Python using the vast variety of ML frameworks available. Unity ML-Agents does use Python and TensorFlow under the hood, but it provides a level of abstraction for Unity developers: with ML-Agents you can train your models without writing a single line of Python code.

The first step is to import the “ML-Agents” package using Unity’s Package Manager (Window → Package Manager). Install the latest version of ML-Agents; the version used in this article is 1.0.3. If the version changes, the configuration parameters, method names and their parameters may change too.

If you are using ML-Agents for the first time, you may also need to set up Anaconda. We are going to need the Python 3.7 version, so download and set up Anaconda in your environment accordingly.

Simulation Setup:

Robot arms are characterised by their DoF (Degrees of Freedom) configuration; typically they come in 4 DoF or 6 DoF settings. Each DoF is one of two types: Bend or Rotate. For simplicity of representation and learning, I have selected a 6 DoF model that has 4 Bends, 1 Rotate, and 1 end effector (gripper). Each DoF is driven by a motor with a planetary (or other) gear system. A DoF’s movement always lies in the range of 0 to 360 degrees, and some DoFs may be restricted further, say to -90 to 90 degrees, so that the movement does not extend beyond physical boundaries. Let’s call each DoF an “Axis” going forward.

Here’s our model's axis setup:

Simulation scene setup in Unity

Axis 1: the bottom-most axis; can rotate 0 to 360 degrees [Rotate]

Axis 2: the first bend axis; range -90 to 90 degrees [Bend]

Axis 3: the second bend axis; range -120 to 120 degrees [Bend]

Axis 4: the third bend axis; range -90 to 90 degrees [Bend]

Axis 5: the fourth bend axis; range -90 to 90 degrees [Bend]

Axis 6: the gripper; it has only 2 states: Open and Close

Training Scope:

The ML training scope for this exercise is to make the robot arm reach the target component from any state. The arm should not go below the ground. The target component is always placed above the ground and within the reach of the robotic arm.

AI Output Specification:

Given the location of a target component, the trained model should generate an array of actions, where each action represents the degrees of rotation to apply to each axis in order to reach the target component. That is 5 actions in total: 1 Rotate and 4 Bends.

It is worth noting that in the ML/AI space everything is represented in normalised form, meaning the actions will be floating point values ranging between -1 and 1. During training and during deployment of the AI we need to scale these values to degrees.
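That scaling is just a multiplication by the half-range of each axis; a minimal sketch (the multipliers match the ones used later in OnActionReceived):

// Sketch: scaling a normalised action value (-1..1) to degrees.
// A full-turn rotate axis uses a 180 multiplier, a +/-90 bend axis uses 90.
float ToDegrees(float normalisedAction, float halfRange)
{
    return Mathf.Clamp(normalisedAction, -1f, 1f) * halfRange;
}

// Usage: ToDegrees(0.5f, 180f) returns 90 degrees on the rotate axis.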

Scene Setup:

The scope has 2 goals. The first is to keep the robot arm gripper above the ground at all times so that it doesn’t collide with the ground and cause accidents. For the given simulation setup there is a good chance that the gripper will hit the ground. To train the controller to avoid the ground, we need to set up a ground plane and attach a Rigidbody and a mesh collider to it. The robot arm’s gripper should have a trigger collider to detect collisions with the ground plane. Not only the gripper but the other parts of the arm should also stay clear of the ground, so every axis other than axis 1 gets a collider to detect collisions with the ground plane during training.

The second goal is to make the gripper reach the target component. For that, the target component should be given a rigid body and a collider. The ground plane, robot arm and target component need to be tagged with unique names so that we can give positive or negative rewards to the AI when the objects collide with each other.
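This tagging and collider setup is normally done in the Unity Inspector; purely as an illustration, the equivalent runtime setup for the ground plane would look something like this (a sketch, not part of the project scripts; the “Ground” tag is the one checked later by the gripper script):

// Illustrative sketch only: in practice these components are added in the Inspector.
void SetupGroundPlane(GameObject groundPlane)
{
    groundPlane.tag = "Ground";                       // unique tag used for penalty checks
    var body = groundPlane.AddComponent<Rigidbody>();
    body.isKinematic = true;                          // the ground never moves
    groundPlane.AddComponent<MeshCollider>();         // collider for the arm parts to hit
}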

Trigger collider setup in the gripper

Markov Decision Process:

Unity’s ML-Agents training framework is built on the “Markov Decision Process”, often referred to as MDP, so you need to model your problem statement as an MDP. An MDP for an ML agent works in the following manner. For each time step:

  1. The agent sees a state
  2. The agent takes actions
  3. The agent receives a reward
  4. Repeat from #1
Markov Decision Process (Image Courtesy: Quora.com)

The following methods in the ML-Agents framework represent the above steps:

  1. CollectObservations — the states that the agent observes
  2. OnActionReceived — receives the actions generated by the agent
  3. AddReward — the reward or penalty associated with the last action

The training sessions are organised into episodes. An episode will run

  1. for a fixed maximum number of time steps, or
  2. until the goal is reached — the success criterion; in our case, the arm successfully reaching the target, or
  3. until “Game Over” — a state from which there is no recovery

The way MDP-based training works is that the ML continuously tries to maximise the cumulative reward it collects before the end of the episode.
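Mapped onto ML-Agents, that loop boils down to an Agent subclass with a handful of overrides. The full implementations follow in the next sections; the skeleton looks roughly like this:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class RobotControllerAgent : Agent
{
    public override void OnEpisodeBegin() { /* reset the arm and the target */ }

    public override void CollectObservations(VectorSensor sensor) { /* 1. the agent sees a state */ }

    public override void OnActionReceived(float[] vectorAction)
    {
        /* 2. apply the actions in the simulation */
        /* 3. AddReward(...) and, when appropriate, EndEpisode() */
    }
}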

Rewarding Scheme:

For each time step, we need to give back a reward for the actions generated by the ML agent. The reward can be positive or negative (a penalty). When the ML model exhibits the desired behaviour, we give it a positive reward; when it deviates from the desired behaviour, we give it a penalty. We give a hefty penalty only when the model gets into a non-recoverable state, like hitting the ground, and a hefty reward only when the arm reaches the target. The reward range should be kept between -1 (hefty penalty) and +1 (hefty reward). Keep the rewarding scheme simple: too much penalty or too much reward keeps the ML from learning what we are actually trying to achieve.

For our robot arm simulation case, I came up with the following reward scheme

  1. When the arm hits the ground — Hefty Penalty (-1) and end episode
  2. When the arm reaches the target — Hefty Reward (+1) and end episode
  3. When the arm moves closer to the target — Marginal reward (the improvement in distance as the reward)
  4. When the arm moves away from the target — Marginal penalty (the increase in distance as the penalty)

Writing the ML Agent:

Here, the robot controller is the agent that needs to be trained. So we attach a script to the robot prefab and name it “RobotControllerAgent.cs”. An ML agent should extend the “Agent” class, so we replace the default MonoBehaviour base class with Agent in the script and start overriding the Agent methods as follows.

public override void Initialize()
{
    ResetAllAxis();
    MoveToSafeRandomPosition();
    if (!trainingMode) MaxStep = 0;
}

The first method to override is “Initialize”. In our case, we reset all axis positions and rotations to their defaults, then rotate each axis to a random pose such that the gripper stays above the ground. Outside training mode, MaxStep is set to 0 so the agent keeps running without an episode limit.
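ResetAllAxis and MoveToSafeRandomPosition are project helpers that are not listed in this article; a minimal sketch of what they do (field names such as armAxes and angles are taken from the other listings, the ranges are assumptions):

private float[] angles = new float[5];   // normalised angle per axis, also set in OnActionReceived

private void ResetAllAxis()
{
    // Put every axis back to its default rotation
    foreach (var axis in armAxes)
        axis.transform.localRotation = Quaternion.identity;
}

private void MoveToSafeRandomPosition()
{
    // Randomise each axis within a reduced range so the gripper starts above the ground
    for (int i = 0; i < angles.Length; i++)
    {
        angles[i] = Random.Range(-0.5f, 0.5f);    // normalised value, scaled to degrees as in OnActionReceived
        armAxes[i].transform.localRotation = Quaternion.AngleAxis(
            angles[i] * 90f, armAxes[i].GetComponent<Axis>().rotationAxis);
    }
}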

The next method to override is “OnEpisodeBegin”. ML-Agent training is executed in episodes; here an episode is set to 5000 steps. “OnEpisodeBegin” is the place where you clean up the old episode and initialise a new one. In our case, we reset the axis rotations, randomise them again and place the target component at a random location.

public override void OnEpisodeBegin()
{
    if (trainingMode)
        ResetAllAxis();

    MoveToSafeRandomPosition();
    UpdateNearestComponent();
}
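UpdateNearestComponent is another helper that is not listed in the article; it picks the target the gripper should reach and records the starting distance used by the reward calculation below (nearestComponent, beginDistance and prevBest are fields referenced in OnActionReceived; targetComponents is an assumed list of candidate targets):

private void UpdateNearestComponent()
{
    // Find the closest target component to the gripper (end effector)
    float best = float.MaxValue;
    foreach (var candidate in targetComponents)
    {
        float d = Vector3.Distance(endEffector.transform.position, candidate.transform.position);
        if (d < best)
        {
            best = d;
            nearestComponent = candidate;
        }
    }

    // Remember the starting distance; OnActionReceived rewards improvements on it
    beginDistance = best;
    prevBest = beginDistance;
}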

The key override is “OnActionReceived”. This method is called by the ML agent during the prediction stage, and you can correlate it to the output nodes of a neural network model. The ML sends its predicted actions; we receive those actions, apply them in the simulation and send back a reward or penalty to the ML.

public override void OnActionReceived(float[] vectorAction)
{
    angles = vectorAction;

    // Translate the floating point actions into degrees of rotation for each axis
    armAxes[0].transform.localRotation = Quaternion.AngleAxis(angles[0] * 180f, armAxes[0].GetComponent<Axis>().rotationAxis);
    armAxes[1].transform.localRotation = Quaternion.AngleAxis(angles[1] * 90f, armAxes[1].GetComponent<Axis>().rotationAxis);
    armAxes[2].transform.localRotation = Quaternion.AngleAxis(angles[2] * 180f, armAxes[2].GetComponent<Axis>().rotationAxis);
    armAxes[3].transform.localRotation = Quaternion.AngleAxis(angles[3] * 90f, armAxes[3].GetComponent<Axis>().rotationAxis);
    armAxes[4].transform.localRotation = Quaternion.AngleAxis(angles[4] * 90f, armAxes[4].GetComponent<Axis>().rotationAxis);

    float distance = Vector3.Distance(endEffector.transform.TransformPoint(Vector3.zero), nearestComponent.transform.position);
    float diff = beginDistance - distance;
    if (distance > prevBest)
    {
        // Penalty if the arm moves away from its closest position to the target
        AddReward(prevBest - distance);
    }
    else
    {
        // Reward if the arm moves closer to the target
        AddReward(diff);
        prevBest = distance;
    }
}
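The Axis component referenced above is a small script attached to each axis GameObject. All the agent needs from it is the local axis to rotate around, so a minimal version (a sketch; the project script may hold more data) would be:

public class Axis : MonoBehaviour
{
    // Local axis this joint rotates around,
    // e.g. Vector3.up for the rotate axis and Vector3.right for a bend axis.
    public Vector3 rotationAxis = Vector3.up;
}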

We give a hefty penalty when the robot arm colliders or the gripper collide with the ground.

public void GroundHitPenalty()
{
    AddReward(-1f);
    EndEpisode(); // non-recoverable state, so end the episode
}

The ground plane will have the following script attached so that whenever an arm part collides with the ground, it calls the agent’s GroundHitPenalty:

public class PenaltyColliders : MonoBehaviour
{
    public RobotControllerAgent parentAgent;

    private void OnTriggerEnter(Collider other)
    {
        if (other.transform.CompareTag("RobotInternal"))
        {
            if (parentAgent != null)
                parentAgent.GroundHitPenalty();
        }
    }
}

So when do we generate a reward? When the gripper’s trigger enters the target component’s collider, we generate a reward back to the ML. The agent has a JackpotReward method for this. If the gripper is properly aligned with the up vector of the target, we give it a bonus too.

public void JackpotReward(Collider other)
{
    if (other.transform.CompareTag("Components"))
    {
        float bonus = 1f * (int)Mathf.Clamp01(Vector3.Dot(nearestComponent.transform.up.normalized, endEffector.transform.up.normalized));
        float reward = (1f + bonus);
        if (float.IsInfinity(reward)) return;
        AddReward(reward);
        EndEpisode();
    }
}

This jackpot reward is called from the gripper’s trigger:

public class EndEffector : MonoBehaviour
{
    public RobotControllerAgent parentAgent;

    private void OnTriggerEnter(Collider other)
    {
        if (other.transform.CompareTag("Components"))
        {
            if (parentAgent != null)
                parentAgent.JackpotReward(other);
        }
        else if (other.transform.CompareTag("Ground"))
        {
            if (parentAgent != null)
                parentAgent.GroundHitPenalty();
        }
    }
}

The last method to override from Agent is “CollectObservations”. These are the state values that we want the ML to associate with its actions; in a traditional NN problem they would be the input nodes.

public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(angles);
    sensor.AddObservation(transform.position.normalized);
    sensor.AddObservation(nearestComponent.transform.position.normalized);
    sensor.AddObservation(endEffector.transform.TransformPoint(Vector3.zero).normalized);
    Vector3 toComponent = (nearestComponent.transform.position - endEffector.transform.TransformPoint(Vector3.zero));
    sensor.AddObservation(toComponent.normalized);
    sensor.AddObservation(Vector3.Distance(nearestComponent.transform.position, endEffector.transform.TransformPoint(Vector3.zero)));
    sensor.AddObservation(StepCount / 5000f); // float division, so episode progress is normalised to 0..1
}

Once the required scripts are ready, we add the “Behavior Parameters” script to the agent and set its configuration, along with the agent configuration values, in the Inspector.

In the Behavior Parameters:

Vector Observation → Space Size — the number of state values that the ML observes. Note that a Vector3 observation counts as 3 elements and an array counts as its length; for the CollectObservations above that is 5 (axis angles) + 3 + 3 + 3 + 3 (positions and direction) + 1 (distance) + 1 (step count), i.e. a space size of 19.

Vector Action → Space Size — the number of predicted actions that we expect from the AI; 5 in our case (1 Rotate and 4 Bends).

You also need to add a “Decision Requester” component to the game object and set the Decision Period to 5. This means we ask the agent to produce a new action only once every decision period. This helps the ML learn better: the same set of actions and rewards is fed in 5 times, giving the node optimiser function a chance to adjust its weights.

Trainer Configuration YAML:

Once the scene setup and scripts are done, the next step is to alter the trainer_config.yaml that gets generated when you import ML-Agents into your project.

behaviors:
  default:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 1280
      learning_rate_schedule: linear
      learning_rate: 3.0e-4
    network_settings:
      hidden_units: 128
      normalize: false
      num_layers: 3
      vis_encode_type: simple
      memory:
        memory_size: 128
        sequence_length: 128
    max_steps: 5.0e5
    time_horizon: 64
    summary_freq: 10000
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
  Robotarm:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 1280
    network_settings:
      hidden_units: 128
      num_layers: 3
    max_steps: 5.0e6
    time_horizon: 128

There are 2 sections, default and Robotarm. Please note that the name “Robotarm” here should match the name given as the “Behavior Name” in the Behavior Parameters.

The key configuration values are:

batch_size, hidden_units — both set to 128; hidden_units is the number of nodes per hidden layer in the neural network

num_layers — the number of hidden layers in the NN, set to 3

max_steps — the total number of training steps; I set it to 5 million for our case

The Robot Farm:

Before we begin the training, we set up a robot farm with multiple agents in the scene; training several copies of the arm in parallel speeds up experience collection. In our simulation, we set up a total of 12 agents.

Robot Farm

Training:

To begin the training, we need to start the ML-Agents Python server. If mlagents is not already installed, follow the steps below:

cd <Project Folder>
conda create -n ml-agents-1.0 python=3.7
conda activate ml-agents-1.0
pip install mlagents

The above step is a one-time process. If you have already done it, the following commands will start the Python server:

cd <Project Folder>
conda activate ml-agents-1.0
mlagents-learn ./trainer_config.yaml --run-id ra_01
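Each run is identified by its --run-id (ra_01 here). If a training run gets interrupted, the mlagents-learn CLI in this version also accepts a --resume flag to continue the same run:

mlagents-learn ./trainer_config.yaml --run-id ra_01 --resume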

Once the Python server has started, you can hit the Play button in Unity. The agents in the scene will now run the simulation and communicate with the Python server at each time step. In the beginning you may see that the robot arm movements are erratic, but as time goes on you will see them get better and better. Here’s how it may look at the beginning of training.

Erratic arm movement at the beginning of training

Cobra Pose:

We may need to do a bit of trial and error to arrive at the right YAML configuration values. I started with a batch_size and hidden_units value of 64, but that setup had trouble learning: with 64 nodes per layer the training got stuck in a loop which I call the “Cobra Pose”. The GIF below shows it.

Cobra pose infinite loop caused by lack of nodes in NN

So I had to increase the number of nodes per layer, the batch_size and num_layers. The values that worked were 128 for batch_size and hidden_units, with num_layers set to 3. This gave the model enough nodes to capture the scenarios, it learned better, and it never got into the cobra pose again.

The progression might look like the following during training:

Training Progression

TensorBoard:

The text output of the training progress doesn’t offer much clarity. As indicated earlier, ML-Agents uses TensorFlow under the hood, so you can use the TensorBoard utility to monitor the training progression. Here are the steps to start the TensorBoard server:

cd <Project Folder>
conda activate ml-agents-1.0
tensorboard --logdir results

Once the TensorBoard server is initialised, you can point your browser to http://localhost:6006 to see a visual representation of your training progression. For a 5 million step training run, my training progression looks like the following in TensorBoard.

Successful Learning:

Towards the end, a successful training session should look like the one below; the arm progressively manages to reach the target in every episode.

Arm successfully reaching to target towards end of training

Using the learned Model (AI):

The training session will terminate automatically after the max_steps value specified in the YAML. The trained NN will be persisted under

<Project Folder>/results/<run-id>/<behaviour name>.nn

In our simulation case, it is found in

<Project Folder>/results/ra_01/Robotarm.nn

Copy the “Robotarm.nn” file and paste it inside the “Assets/NNModels” folder of the project.

Create a new scene and add the robot arm prefab to it. Set the Model value in the Inspector to the copied NN file, and set the training mode flag to false.

Once that is done, hit the Play button. You will see the arm moving to the target location.

Thanks for reading till the end. This entire project is available as open source on GitHub: https://github.com/rkandas/RobotArmMLAgentUnity.

Add claps and leave me a comment if you want to see more articles in this ML-agents space.
