Deep Reinforcement Learning Demystified (Episode 0)

I recently became interested in learning more about deep reinforcement learning. Recently, big news headlines were made as deep reinforcement learning was used to build a compter program that mastered different Atari games, and the AlphaGo program could beat the a top human of the Go game. My main resources of learning are the UC Berkeley deep reinforcement learning course, and Yantex Practical RL course lectures as well as their recommended reading materials.

Over the past century, computer scientists have pursued the goal of building intelligent machines that can percieve the surrounding world and make rationale and smart decisions just like humans do. In the last decade, the progress rate is acccelereted by virtue of the huge investment and attention to AI by both academic researchers and industry giants companies. This trend is expected to continue and grow even further in the future. Intelligent machines will penetrate more aspects of our daily live during the upcoming years. For example, it is estimated that by 2020 there will be 10million self-driving cars on the roads.

The goal of this series of articles, is to provide an introduction about “Deep Reinforcement Learning” which is an important approach to building intelligent machines. I will cover the basics of reinforcement learning from the ground up as well as providing practical examples to illustrate how to implement various RL algorithms. No prior knowledge about reinforcement learning or deep learning are required to follow these articles, but any previous experience about machine learning will be useful. I will provide code examples in Python and will be using OpenAI gym for evaluting our algorithms.

(Episode 0)

This first article (episode 0) aims to describe what Reinforcement Learning is and provide examples for where it can be used. I also cover the essential terminologies for RL and provide a quick tutorial about OpenAI gym as I am to use it in the programming exercises.

What is reinforcement learning ?

Reinforcement learning is a category of machine learning and it is best understood as If we have an agent that interacts with an environment such that it can observe the environment state and perform actions. Upon doing actions, the environment state changes into a new state and the agent recieves a reward (or penalty). Reinforcement learning aims at making this agent learn from his experience of interactions with environment so that it chooses the best actions that maximizes the sum of rewards it receives from the environment.

Source: Nervana Deep Reinforcement learning with OpenAI gym.

Reinforcement learning is not something entirely new, In fact most of methods in reinforcement learning were invented decades ago. However, it recently gained a lot of interest when it was combined with deep learning (thus deep reinforcement learning was born !) and making benefit of the large scale computation power to achieve successes such as playing Atari games from raw screen pixels and mastering the Go game.

Here are an example for a typical reinforcement learning problems:

Playing chess: in this problem, the agent is a computer program that plays chess, and the environment is the chess board status (i.e. pieces on board and their locations). The agent (computer player) observes the board state and makes a move. As a result of its moves, the board state changes and eventually the game will end and the agent will either win (recieve a reward) or lose (recieve a penalty). Reinforcement learning here can be used by an agent to improve its game playing skill. Such that the computer program that initially plays poorly after playing A LOT of chess rounds, it will become able to plan well and choose the moves that lead to winning the game.

Notice that in RL, the agent initially does not know what actions lead to win / lose but it has to do ( exploration) by trying random actions and then it will memorize the effect of actions it made. Exploration helps the agent to learn more about the world it is interacting with. After enough exploration has been made, the agent may choose one of a good action it has explored before; this is know as exploitation. In reinforcement learning, there is always a tradeoff between whether the agent should re-use one of its good actions or try another new action (hoping it will lead to better result). This trade-off is known as exploration-exploitation dilemma.

Markov Decision Processes

Reinforcement learning problems are mathematically described using a framework called Markov decision processes (MDPs). MDPs are extended version of Markov Chain which adds decisions and rewards elements to it. The word Markov here refers to that Markovian property which means that future state is independent from any previous states history given the current state and action. This means that current state encapsulates all what is needed to decide the future state when an input action is recieved. This is a reasonable assumption in many problems and it simplifies things a lot. For example, in chase game, the chess board configuration after a move is being made can be decided based on the current board configuration and the action being made now and we don’t need to worry about previous chess board configurations or past actions.

Formally, MDP is defined as a tuple of 5 items <S, A, P, R, 𝛾 > which are:

  • S: Set of observations. The agent observes the environment state as one item of this set.
  • A: Set of actions. The set of actions the agent can choose one from to interact with the environment.
  • P: P(s' | s, a) transition probability matrix. This models what next state s' will be after the agent makes action a while being in the current state s.
  • R: P(r | s, a) reward model that models what reward the agent will recieve when it performs action a when it is in state s.
  • 𝛾: discount factor. This factor is a numerical value between 0 and 1 that represents the relative importance between immediate and future rewards. I.e, If the agent has to select between two actions one of them will give it a high immediate reward immediately after performing the action but will lead into going to state from which the agents expects to get less future rewards than another state that can be reached after doing an action with less immediate reward ?

The goal of RL agent, is to solve the MDP by finding optimal policy which means finding the sequence of action it can make to maximize the total recieved reward.

OpenAI Gym

OpenAI gym ias an open-source python framework developed by OpenAI, the non-profit AI research company, as a toolkit for developing and evaluating reinforcement learning algorithms. It gives us a set of test problems, known as environments, that we can write RL algorithms to solve. This lets us able to dedicate more of our time to impelment and improve the learning algorithm rather than spending a lot of time simulating the environment. In addition, it provides a medium for people to compare and review the algorithms of others. The list of environments available in OpenAI gym can found here. Sounds cool right ?

Installing and Running OpenAI gym

For more detailed description about how to use and run OpenAI gym I suggest reading the official documentation page.

A minimal install of the gym can be done by:

git clone
cd gym
pip install -e . # minimal install

Now, after the gym has been installed you can instantiate and run an environment:

This above code snippet will first import the gym library. Then it creates an instance of the CartPole environment which is a classical problem in RL. In the cart-pole environment, it simulates an inverted pendulum mounted upwards on cart, the pendulum is intially vertical and your goal is to maintain it vertically balanced. The only way to control the pendulum is by choosing a horizontal direction for the cart to move (either to left or right).

The code above, runs the environment for 500 time steps, and it choses a random action to perform at each time. As a result, you see in the video below that the pole is not kept stable for long. The reward is measured by the number of steps before the pole becomes more than 15 degrees away from the vertical position. The longer you remain within this range, the bigger your total reward is.

OpenAI Gym documentation

Solving Cart-Pole by policy search

We will present a simple solution to solve the Cart-Pole problem based on random search. The idea is the following, assume your agent has a function that can be used to determine what action to perform (i.e. move the cart left or right ) based on the observation it receives from the environment. In reinforcement learning world, the function that maps observations to actions is known as the policy function. The goal of the agent learning process is to select the optimal policy which produces the maximum possible total reward.

To find such a policy, we first choose a represntation of our policy. In this problem the observation is a vector of 4 numbers and the action space consits of only two possible actions either 0 or 1. so we choose the policy as a parameters of a hyper-plane in 4 dimensional space such that points to left of the hyper-plane will result in action 0while points to its right will result in action 1. In other words, each policy is represented by 5 numbers w1, w2, w3, w4, b such that the policy decision to select next action is made according to:

if w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4  + b > 0:
cart moves to right
cart moves to left

Now, how to select a good values for the policy parameters ( w1, w2, w3, w4, b ) ? It turns out we can find a good value for these parameters just by random search. This method is not efficient when the search space for parameters becomes big. However, in this simple problem, we can find a good solution, if we start from random set of policy parameters values and evaluate each one of them and select the best one we have found.

The python program below shows an implementaion for this policy search approach.

get_random_policy(): a function that returns a random set of policy parameters.

run_episode(..,policy, t_max,..): a function that starts an episode of maximum length t_max timesteps and returns the total reward that an agent following the given policy has recieved.

policy_to_action(..,policy, obs) : a function that selects an action for the agent who follows the given policy will chose when it observes the state observation given by obs

Here is a video output for the agent performance after it has found the policy using the algorithm implemented above.

Solution of OpenAI gym CartPolve-v0 problem using policy search.

Next time, we will improve over this naive search solution to have a more intelligent solutions that can be used for solving more complex problems.


1- John Schulmann, Deep Reinforcmenet Learning Tutorial



4- Simple reinforcement learning to learn CartPole