Building smart robots using AI + ROS: Part 1

Welcome. An intro to RL and OpenAI Gym.

karthic Rao
Kredo.ai Engineering
Dec 8, 2017


Motivation for writing a blog series on AI + the Robot Operating System:

The Robot Operating System (ROS) is a flexible framework for writing robot software. It is a collection of tools, libraries and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms.

ROS is used to create applications for a physical robot without depending on the actual machine, thus saving cost and time. These applications can later be transferred onto the physical robot without modification.

The decision-making capability of robots can be aided with AI. When a robot agent has to learn optimal strategies in a high-dimensional state space, it is often impractical to generate sufficient training data through real-world experiments. These are the kinds of tasks where reinforcement learning excels. For many other perception tasks, we can use deep learning.

We can train a robot using reinforcement learning or deep learning in simulation, with the potential of then transferring the learned behavior to a real robot. Today this technique is widely used in training drones, autonomous vehicles, robotic arms, warehouse robots, and more. There has never been a better time to take a deep plunge into the area.

I noticed that material on using AI to train robots in ROS is scarce on the internet, so I decided to write a series of articles on the topic.

The articles in the series will be categorized into 3 parts:

1. Focused on building an intuitive insight into Reinforcement Learning (RL), Deep Reinforcement Learning (DeepRL), and Deep Learning (DL).

2. Focused solely on the fundamentals of ROS.

3. Training robot agents using RL, DeepRL, and DL techniques in ROS simulation.

This article is an introduction to reinforcement learning and OpenAI Gym.

Reinforcement learning, in simple terms, is a mathematical approach in which an agent interacts with an environment by taking actions, trying to maximize an accumulated reward.

An agent in its current state (St) takes an action (At), to which the environment reacts by returning a new state (St+1) and a reward (Rt+1). Given the updated state and reward, the agent chooses the next action, and the loop repeats until the environment is solved or terminated. In the process, the agent is expected to learn the optimal set of actions to take to achieve the goal.

From David Silver’s Intro to Reinforcement learning.
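To make this interaction loop concrete, here is a minimal runnable sketch in Python. The 1-D "walk to a goal" environment and the random "agent" below are made up purely to illustrate the state/action/reward cycle described above; they are not part of Gym or ROS.

import random

# A toy environment invented for illustration only (not Gym or ROS).
# The agent starts at position 0 and must reach position 5.
class WalkEnvironment:
    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                     # initial state S0

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.position += action                  # transition to the next state
        done = (self.position == self.goal)
        reward = 1 if done else 0                # reward for reaching the goal
        return self.position, reward, done

env = WalkEnvironment()
state = env.reset()
for t in range(500):                             # cap the episode length
    action = random.choice([-1, +1])             # a naive "agent" picking actions at random
    state, reward, done = env.step(action)
    if done:
        print(f"Reached the goal in {t + 1} steps with reward {reward}")
        break

A real agent would use the rewards it collects to improve its choice of actions over time, instead of picking them at random.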

But what are these environments?

It depends! For a drone being trained to maneuver, the open space where it is trained is its environment. Its velocity, location, and fuel level could define its state. For an agent trained to play chess, the chessboard is its environment.

What do you actually mean when you say that the agent interacts with the environment?

The environment should allow the agent to take an action in any given state, which results in a transition to a new state and a reward for moving to that new state.

What are these rewards? Why should the environment return a reward for moving to a new state?

Rewards are the signals by which we communicate to the agent what it should ultimately accomplish; they are the means by which the agent understands the goal. The agent also uses the reward values to figure out the optimal way to achieve that goal.

What are some important differences between reinforcement learning and supervised learning?

In the case of reinforcement learning, the reward is usually delayed. For example, in a game of chess the reward for winning is obtained only after the series of moves that led to the positive outcome. This is different from the supervised learning setting, where feedback is immediately available in the form of labels.

Also, the action taken at a given point in time has an impact on future observations. For example, the direction in which the robot decides to move affects its future view of the world.

What are some of the challenging scenarios where reinforcement learning thrives at figuring out the optimal strategy?

  • Rewards are delayed: a mouse following lumps of cheese may be led into a trap, so immediate rewards don't necessarily matter much.
  • The outcome of an action is uncertain: you may not know how the environment responds to a given action. Consider the stock market, where the same actions/investments may not lead to similar outcomes.
  • You may not be able to perfectly sense the state of the world: in many scenarios the observation of the environment may not capture all the influential parameters.
  • The reward may be stochastic, drawn from a probability distribution.
  • The environment may change during the learning process.

In order to train an agent using RL, one needs an environment the agent can interact with, where the state of the agent is well defined, actions can be taken, and rewards are obtained, right?

Yes, that’s true.

Before I apply RL algorithms to real-world problems, I need an environment where I can learn, try, test, and experiment with RL techniques! What should I do?

Well, OpenAI Gym is here to the rescue!

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. The open-source gym library gives you access to a standardized set of environments.

Gym is a collection of environments designed for testing and developing reinforcement learning algorithms. It saves the overhead of having to build an environment to try out RL techniques. Gym is written in Python. There is also an online leaderboard for people to compare results and code.

Here are some details about the gym library.

import gym

This imports the OpenAI Gym Python module.

gym.make("cartpole-v0")

This creates an instance of the CartPole environment (note that the environment name is case sensitive). https://gym.openai.com/envs has the list of all available environments.

Here is the description of the CartPole environment and its goal: a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing or reducing the cart's velocity.

At any point in time, the state of the system in the CartPole environment is defined by 4 variables: the cart's position, the cart's velocity, the pole's angle, and the pole's angular velocity.

There are only 2 possible actions that can be performed: pushing the cart to the left or pushing it to the right.

So the goal of the task is to look at the state and decide on one of the two actions in order to balance the pole.

Information about any given environment can be obtained from env.observation_space; various methods on this object help you understand the state space better. Similarly, methods on the env.action_space object help you understand the action space. These are very helpful, since the state representation and the possible actions differ with each environment.
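For example, assuming the classic gym API in use at the time of writing, the CartPole spaces can be inspected like this:

import gym

env = gym.make("CartPole-v0")

# The observation space is a Box of 4 continuous values:
# cart position, cart velocity, pole angle, pole angular velocity.
print(env.observation_space)        # Box(4,)
print(env.observation_space.high)   # upper bound of each state variable
print(env.observation_space.low)    # lower bound of each state variable

# The action space is Discrete(2): 0 pushes the cart to the left, 1 pushes it to the right.
print(env.action_space)             # Discrete(2)
print(env.action_space.n)           # number of possible actions -> 2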

One needs to call the reset function before running an episode:

env.reset()

Now, here is how to perform an action and obtain the new state and the reward associated with taking that action:

state, reward, done, info = env.step(action)

Let's run an episode through the environment, picking a random action at each step, and see how long we can hold the pole upright.
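A minimal sketch of such a random-action episode, assuming the classic gym step API shown above, could look like this:

import gym

env = gym.make("CartPole-v0")

state = env.reset()                     # start a new episode
done = False
total_reward = 0
steps = 0

while not done:
    env.render()                        # optional: visualize the cart and pole
    action = env.action_space.sample()  # pick a random action (0 or 1)
    state, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1

print(f"Episode finished after {steps} steps with total reward {total_reward}")
env.close()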

Since the actions are random, we won't be able to balance the pole for long.

That's it for now! In the next blog I'll discuss the intuition behind Q-values and use Q-learning to solve a simple OpenAI Gym environment. Thank you for reading. Feel free to comment and share your opinion.

Additional Resources:
