Reinforcement Learning — Basic Understanding

Understanding the basic concepts behind Reinforcement Learning using value-based methods

Ioannis Anifantakis
Analytics Vidhya
17 min read · Sep 2, 2019


Introduction

Reinforcement Learning (RL) is a Machine Learning field that has gained much attention since 2015, after Google’s DeepMind team demonstrated self-taught agents learning to walk, DQN agents mastering Atari games, and AlphaGo beating professional human players at the game of Go.

Watch DeepMind walking agents
DeepMind learning to play Breakout

RL is the science behind self-taught software agents that explore a previously unknown world and eventually excel in it. It is incredible how you can drop an agent into a hostile and utterly unknown environment, where it only knows which actions it can perform and nothing more, and through trial and error it discovers policies that improve with each consecutive move, eventually reaching super-human performance.

Are we not all agents exploring our world since birth? Aren’t babies agents that are completely helpless at birth, with no understanding of their surroundings, and who use their brains to process memories and experiences, increase their success (reward) in the world, and progressively master specific tasks in life?

This article attempts to give you a basic understanding (leaving out the math), so you can start your journey into the science of Reinforcement Learning and explore this fantastic, continuously evolving field.

Biological Analogy to Reinforcement Learning

Reinforcement Learning is based on trial and error: we evaluate actions that differ from the ones we already believe work well.

Over time, after several failed attempts to produce beneficial variations to the agent’s state/action policy, we might discover one that is (literally) rewarding for the agent’s performance in the environment.

In biology, “this policy variation” is known as… “mutation”.

Life on earth is the living proof of what you can achieve with mutations when you have billions of years available to play with your genes.

Most mutations are harmful (even fatal), but every now and then a mutation turns out to be beneficial and leads to the improvement and evolution of a species.

It is thanks to such mutations that our DNA formed new branches, distinguishing us from prior ancestor species.

Occasional “exploration” in our genome (DNA mutations) is the building block of species evolution

Similarly, in RL, the evolutionary analogue is the agent’s gradual improvement in choosing which action to take in each situation (state).

We, together with all current and past organisms, are nothing but evolutionary, mutational branches of the very first organism: our universal common ancestor, a single ancient prokaryotic cell.

Just 1.5% of our genes make us human!

We share 98.5% of our DNA with chimpanzees, 3/4 with dogs, 1/2 with fruit flies and 1/3 with daffodils, and yet we are so different in every aspect!

If in biology DNA mutations dictate changes in gene expression, then equivalently in RL, Q-Table changes dictate changes in action selection.

Reinforcement Learning — Key Terms and Concepts

Scenario: Let’s assume you wake up in a foreign city. You exit your hotel, nobody speaks English, your cellphone has just run out of battery (so you cannot use maps), and you are starving. You need to get to a restaurant as soon as possible!

Environment: All city blocks are identical (grid world). The environment is what you are going to interact with.

RL in a nutshell: You start at some random state of the environment (a random block in our grid world), perform actions on the environment (walk to a new block in the grid), and the environment responds with a new state (the new block you end up at) and with some reward that indicates how beneficial (or not) your action was for the previous state (the block from which you decided to take that specific direction).

Actions: In this grid world, the entire set of actions (the action set) you can perform is [North, South, East, West]. Initially, you have no reason to favour any action over another. You don’t know, for your current state (the block you stand on), which action (e.g., going east or west) gets you closer to your goal (a restaurant).

You live in an unknown world. You are only given the available actions. It’s time to explore!

So you inevitably start exploring. You pick random actions (equiprobable policy) and wander around the city blocks until you end up at some restaurant.

Terminal State: So, after visiting several blocks, you end up at a restaurant. When you reach such a terminal state, the episode terminates, and you can now sit back and evaluate how to improve your strategy (policy) so that in future episodes you reach your goal more effectively.

Episode: An episode is your journey, each time, from the moment you are reborn into that world until you reach a terminal state. You might need thousands of episodes to truly understand the world’s dynamics, so you can progressively improve your strategy (policy) until you come up with the best one.

Reward: While wandering in the grid world, you get a negative reward for every block you walk through. Normal blocks give you a negative reward of -1, while crowded ones give you a negative reward of -3 because it takes longer to cross them. Reaching a restaurant grants you a positive reward of +10.
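To make the scenario concrete, here is a minimal sketch of such a grid world in Python (the grid layout and the movement mechanics are illustrative assumptions; only the reward values follow the text above):

```python
import random

# Illustrative grid world: each cell is "normal", "crowded", or "restaurant".
REWARDS = {"normal": -1, "crowded": -3, "restaurant": +10}
ACTIONS = ["N", "S", "E", "W"]

class GridWorld:
    def __init__(self, grid):
        self.grid = grid                      # 2D list of cell types
        self.rows, self.cols = len(grid), len(grid[0])

    def reset(self):
        # Start the episode at a random non-terminal block.
        while True:
            r, c = random.randrange(self.rows), random.randrange(self.cols)
            if self.grid[r][c] != "restaurant":
                return (r, c)

    def step(self, state, action):
        r, c = state
        dr, dc = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}[action]
        # Stay in place if the move would leave the city.
        r = min(max(r + dr, 0), self.rows - 1)
        c = min(max(c + dc, 0), self.cols - 1)
        cell = self.grid[r][c]
        reward = REWARDS[cell]
        done = cell == "restaurant"           # terminal state reached
        return (r, c), reward, done
```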

Your cumulative reward sums up all positive and negative rewards together for each episode. Negative rewards can be thought of as “penalties”.

The negative rewards in non-terminal states are necessary: if you only got a positive reward at the end, the agent would receive the same total reward no matter how long it wandered around the city, and so it would never actually learn to get there quickly.

So, by receiving negative rewards for as long as we have not reached the terminal state, we motivate the agent to find the fastest way out, figuring out how to minimise the penalties in order to increase the overall final cumulative reward.

Thus, increasing the Cumulative Reward should translate to solving the Reinforcement Learning problem!

This logic, by which we assume that increasing the total reward leads to the solution of the RL problem, is called the “Reward Hypothesis”.

Exploring an unknown world

The State: Every time you move to a new block, you explore what the environment (the world) allows you to see. Most of the time, that “view” of the world is quite limited. Usually, you cannot know what lies ten blocks ahead, so the state is the information the environment allows you to “see”, so that you can estimate what your next action should be.

The state can be thought of as your observation of the environment at a given world location. But the state is not only information about the world; it is also information about yourself (you are part of the world too).

For example, if you are “learning how to walk,” the state could be your location in the world, ground info, elevation etc. But the state would also include the directions of your joints, your velocity, and more.

So the State is your agent’s “senses” at a given time.

The state is the “view” that the environment allows you to perceive

Q-Table: To learn from these experiences, we need to keep a log of how rewarding (or not) each action is for every state.

The Q-Table can be visualized as a two-dimensional matrix, where rows reflect states, and columns reflect actions. The scores in this matrix represent how positively or negatively each action influences your goal for all possible states.

A simple Q-Table where two possible actions are graded for 4 possible states

So the next time you are at the same state, it might be useful to look at this log as an indicator of what action seems to work best for that state. That way, you will build your strategy (policy) which will dictate what is the best action to choose for each state.

This log contains the average of your expected total reward, from all your previous experiences, for each state/action pair of the table.
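In code, such a log can be nothing more than a two-dimensional array. Here is a minimal sketch with illustrative sizes:

```python
import numpy as np

n_states = 16        # e.g. a 4x4 grid flattened into 16 states (illustrative)
n_actions = 4        # N, S, E, W

# Rows = states, columns = actions; every entry starts at zero,
# i.e. no action is preferred over another yet.
q_table = np.zeros((n_states, n_actions))

# After some experience, q_table[state] holds our current estimate of the
# expected total reward for each action taken from that state.
best_action = int(np.argmax(q_table[5]))   # greedy choice for state 5
```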

Should we trust our Q-Table from the very beginning? No! Initially, the scores in your Q-Table don’t reflect good choices. This is because you started with the equiprobable policy (exploring by taking random actions and logging their performance to the Q-Table), and obviously these reward scores should not be considered accurate just yet.

So, initially, we should just explore by picking random actions and keep populating the Q-Table with the results. The more episodes we run, the more we can trust our Q-Table, progressively lowering the randomness of our choices and more frequently picking the Q-Table’s suggested action for our state (exploitation).

However, even after thousands of episodes, we should still take a random action now and then, to keep exploring and perhaps discover an action that turns out to be more beneficial for a given state.

By gradually lowering the exploration rate (but never zeroing it) after each episode, our Q-Table adjusts progressively, letting us converge to an optimal policy (π*) that leads us to the restaurant in the shortest possible time.

This is the epsilon-greedy policy. It says that, for the most part, we follow our Q-Table’s proposal and select the highest-rated action for each state, but we always allow a small chance of picking a random action instead. After we make such a random choice (instead of the Q-Table’s proposal), we then strictly follow the policy for all the remaining steps, to check whether this “detour from the policy” was beneficial or not.

We then record the improvement or worsening due to that “detour” for this state/action pair in our Q-Table, and keep going for maybe… a few thousand episodes! By then the Q-Table will hold much more accurate scores about how well each action does at each state.
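In code, ε-greedy action selection with a slowly decaying ε might look like this sketch (the decay schedule and the epsilon values are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Mostly follow the Q-Table, but explore with probability epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(q_table.shape[1])   # random action (explore)
    return int(np.argmax(q_table[state]))            # best-known action (exploit)

epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(10_000):
    # ... run one episode, selecting actions with epsilon_greedy(...) ...
    epsilon = max(epsilon_min, epsilon * decay)       # explore less over time, never zero
```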

The greedy policy always follows the directions of the Q-Table blindly, while the epsilon-greedy policy follows the Q-Table most of the time, but allows for a “random choice” now and then to see how it affects the agent’s score.

That means that the greedy policy (following the Q-Table without exploring) is not beneficial, as we will never explore any further (and thus stop learning), while the epsilon-greedy policy follows our policy most of the time, but occasionally picks a random action to check whether it helps the agent toward its goal.

A “greedy-policy” is like blind-walking — You need “epsilon-greedy policy” instead to allow for some exploration!

The epsilon-greedy policy does not lead the agent blindly (like the greedy policy does).

Epsilon-Greedy-Policy allows for random discoveries over the original policy, introducing new strategies that increase the cumulative reward and eventually lead to the optimal policy (π*).

State+Action → Reward+next State (SARS)

Don’t panic: this SARS thing is what we have been looking at all this time. It is more of a recap than a new concept!

As we have seen so far, the projection that the environment provides us (which includes our own mechanics, like our joints or velocity) is the state that comes to our senses.

For each state, we said there is a set of available actions.

By performing some action at a given state, we progress to a Next State, while at the same time we are given some positive or negative Reward, so we can evaluate how beneficial or not our action was at the previous state.

This is the “loop” by which the agent progresses in the environment until it reaches a terminal state.

  1. S1+A1 →R1+S2
  2. S2+A2 →R2+S3
  3. S3+A3 →R3+S4
  4. … all the way to the St (terminal state)

The sum of all rewards (R1+R2+R3+…+Rt) in each episode is the Cumulative Reward of that episode, which we want to be as high as possible.
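Putting the pieces together, one episode of this loop might look like the sketch below (it reuses the illustrative GridWorld, ACTIONS and epsilon_greedy helpers from the earlier snippets):

```python
def run_episode(env, q_table, epsilon):
    """Play one episode in the illustrative GridWorld and return its cumulative reward."""
    state = env.reset()
    cumulative_reward = 0.0
    done = False
    while not done:
        s = state[0] * env.cols + state[1]                           # flatten (row, col) into a Q-Table row
        action = epsilon_greedy(q_table, s, epsilon)                 # A_t
        next_state, reward, done = env.step(state, ACTIONS[action])  # S_{t+1}, R_{t+1}
        cumulative_reward += reward
        # ... here we would also update q_table[s, action] from this experience ...
        state = next_state
    return cumulative_reward
```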

Based on the current State, the Agent performs an “Action” — the environment returns the “Next State” and a “Reward”

State/Action/Reward: So, if we are learning to walk and our left foot is in front (that’s our state), and we act by moving our right foot to the front as well (our action), then we end up falling down (our new state) and we get a negative reward, so that we understand that moving both feet to the front affects our effort to learn how to walk negatively.

Thus, this experience for that given state (where the left foot is in front), even though it was negative, provided some valuable (what not to do) information.

So, let’s write down that bad score at our Q-Table for this state/action pair!

Episodic or Continuous?

So far what we have seen is an episodic task. Such tasks have a well-defined ending state (in our example the ending state was reaching a restaurant — the equivalent of exiting a maze).

Terminal States (ending states) might not always mean something good (like reaching our goal). In a self-driving car analogy, we could terminate our simulation by getting to our destination (positive final state reward), or by crashing (negative final state reward). In a game of chess, we could terminate by winning, or by losing, etc.

Some tasks, however, might not have an ending point, but continue forever. Those are known as continuous tasks.

Discount Rate γ

Now, we said that when we populate our Q-Table, we write down in each state/action cell our estimated total reward if we take that specific action at that state and then stick to our policy from then on.

This “estimated total reward” takes into account the rewards of all (expected) future moves: in practice, it is the average of the total rewards we have collected in past episodes after taking the same action at that state, recorded all the way to some terminal state.

But should an action we take now be influenced just as strongly by rewards that lie 500 states/actions in the future?

The discount rate γ exists to make closer rewards more meaningful for our immediate responses than rewards that lie further in the future.

It is meaningful to keep your focus mostly on the importance of immediate outcomes when deciding on an action, rather than on distant ones.

Example: If you are learning how to walk, it is more important to put your best effort into actions that keep you from falling now, rather than acting differently because the road 50 meters ahead will be slippery. Yes, since you know the road will be slippery in the near future, your policy should include a “slow down”, but what matters most right now is avoiding that crack in the pavement, so you keep your balance and don’t immediately fall.

So the discount rate γ ensures that the further we look into the future, the less those future rewards matter (so we keep most of our focus on the current situation).
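As a small sketch, the discounted return simply weighs each future reward by one extra factor of γ, so distant rewards contribute less and less:

```python
def discounted_return(rewards, gamma=0.9):
    """G = R1 + gamma*R2 + gamma^2*R3 + ...  (later rewards count less)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([-1, -1, -3, 10], gamma=0.9))   # the distant +10 contributes only 0.9^3 * 10 = 7.29
```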

However, some simple episodic tasks, like reaching the restaurant don’t require discount rate γ to be applied.

So where does this Discount Rate γ really apply?

The discount rate gamma (γ), is something you will always apply to continuous tasks, but not necessarily to episodic tasks.

As we know, with our q-table, we log the expected total reward for all actions at every state.

That means that for most episodic tasks it is possible to simply average the total rewards from each state/action and log them in our Q-Table. But logging the total reward from that state on, all the way to the end, automatically implies that we reach some terminal state St; otherwise there would be no defined “total reward”.

But in continuous tasks (tasks that never end), we cannot keep such a log, because if the future is infinite there is no “total reward” (there is never an “end”). Here the discount rate γ is used as a decay factor, by which we weigh future expected rewards less and less until they vanish. We use that to our benefit to “see” only up to some point in the future and be able to deal with continuous tasks, rather than looking at an infinite future and thus freezing our agent from the very beginning.
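A quick numerical sketch of why this works: because γ &lt; 1, the weights γ^t shrink geometrically, so even an endless stream of rewards adds up to a finite discounted value:

```python
gamma = 0.9
# Receiving a constant reward of 1 forever: 1 + gamma + gamma^2 + ...
# converges to the closed form 1 / (1 - gamma) = 10, because gamma < 1.
print(sum(gamma ** t for t in range(1000)))   # ~10.0 (the remaining tail terms are vanishingly small)
```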

As we said, it is possible to use discount rate γ in episodic tasks, but many times we can just do without it, especially for simple tasks that don’t involve many coinciding states.

Markov Decision Process (MDP)

The States, Actions, Rewards, their mechanics (known as One-Step Dynamics), together with the discount rate (γ) define a Markov Decision Process (MDP).

The agent is provided only with the set of available actions and the discount rate γ. All the remaining parts of the MDP are up to the agent to discover via exploration. So a well-built Q-Table provides a good estimate that gets the agent closer to understanding what the underlying (unknown) MDP is.

A Finite MDP is the probabilistic (stochastic) equivalent of Finite State Automata (or Finite State Machines) — in case you come from Computer Science field.

MDP Diagram of a Recycling Robot (*Source: Udacity)

In the example above we see the MDP diagram of a recycling robot. We are given two possible battery states [high, low] and an action set of [search, wait, recharge]. In this “stochastic” model we also see with what probability each action leads to each next state, and with what reward.

For example, in the MDP diagram above, at the high state, searching has a 70% (0.7) chance of keeping the battery at the same high state with a reward of 4, and a 30% (0.3) chance of dropping the robot’s battery to the low state, also with a reward of 4.
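If it helps, the part of the diagram described above can be written down as one-step dynamics, for example as a dictionary mapping each (state, action) pair to a list of (probability, next state, reward) outcomes. Only the transitions quoted in the text are filled in below; the rest of the diagram is omitted:

```python
# (state, action) -> [(probability, next_state, reward), ...]
one_step_dynamics = {
    ("high", "search"): [
        (0.7, "high", 4),   # battery stays high, reward 4
        (0.3, "low",  4),   # battery drops to low, reward 4
    ],
    # ... remaining (state, action) pairs of the diagram omitted ...
}
```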

One very important observation about MDPs is that your choice of action at the current state does not depend on previous states and actions!

Look at the MDP above! Selecting the action “search” at the state “high” is not affected by any previous states or actions. What happens next depends only on the current state and the one-step dynamics, which give the probability of each outcome for that action.

Deep Reinforcement Learning (DRL)

Is that all? Definitely not! We have only touched on some of the fundamental concepts of reinforcement learning (intentionally avoiding mathematics). There is plenty more to see (including math) before we can even start talking about Deep Reinforcement Learning.

However, many may wonder where neural networks fit into RL. After all, we can populate a Q-Table with a state/action log and use it to choose actions depending on our current state. So why bother adding extra complexity by involving neural networks in the first place?

Well, usually the tasks are far more complex. This means either a Q-Table that is unrealistically large, or a problem that cannot be expressed with a classic Q-Table at all!

Imagine you are playing tennis. The ball doesn’t arrive just to your right or your left, so it cannot be represented by a simple [left, right] state. It can arrive anywhere between your right and your left, and from top to bottom. This means we are dealing with real numbers that can take any floating-point value (real numbers have infinite decimals) to represent each state, turning this into an “infinite” Q-Table (which simply cannot be built). So in “continuous spaces” things fall apart!

One fix could be to discretize our continuous-space world into imaginary squares, in order to turn this into a grid-world problem that can be dealt with using a Q-Table.

Discretization: Turn continuous data into grid data so we can approach the problem via a Q-Table
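A minimal sketch of what such discretization could look like, mapping a continuous ball position onto an imaginary 10x10 grid (the coordinate ranges and bin counts are illustrative assumptions):

```python
import numpy as np

# Split the continuous court into a 10x10 imaginary grid.
x_bins = np.linspace(-1.0, 1.0, 11)   # 10 cells between -1 and 1
y_bins = np.linspace(-1.0, 1.0, 11)

def discretize(x, y):
    """Map a continuous (x, y) ball position to a discrete grid cell."""
    col = np.clip(np.digitize(x, x_bins) - 1, 0, 9)
    row = np.clip(np.digitize(y, y_bins) - 1, 0, 9)
    return int(row), int(col)

print(discretize(0.15, -0.72))   # (1, 5)
```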

However, this is not always feasible, and, as mentioned before, even if we perform discretization, it might produce unrealistically large Q-Tables, which need an equally unrealistic number of episodes to effectively fill up the log entries for each state/action pair.

So here we could approach such a problem using function approximation to map states to actions.

Deep Neural Networks are defined as Universal Function Approximators.

To see the big picture (if you come from a Deep Learning background): with Neural Networks (NN), in image classification problems we turn pixels into labels.

Well, in the case of Reinforcement Learning, the idea is to turn states to actions.

So the main idea is to create something that approximates a Q-Table over an infinite, real-valued set of states, rather than a discrete set of states.

In the Neural Network, we have: “Input=State” and “Output=Action”

Since our neural network weights (θ) are randomly initialized, they can be thought of as an instance of an equiprobable policy. Our policy reflects this θ weight configuration, so we call it π(θ). Thus, by changing the θ weights to some new configuration θ′, the new policy is called π(θ′), and so on.

In Deep Learning, classification answers “what” something is, while regression answers “how much” of something there is.

We also said that in our neural network a “state” goes in (input) and an action comes out (output).

We can consider a DQN as a regression model, which takes a state as input and tries to approximate a reward score for each possible action as output (much like a classic Q-Table). This is why it typically uses an error function such as MSE (Mean Squared Error), which we commonly use in regression networks; if you have worked with NNs, you know what I am talking about.
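For readers coming from a deep-learning background, here is a minimal sketch of such a network, assuming PyTorch as the framework (the layer sizes and learning rate are illustrative assumptions, not a prescribed architecture): a state vector goes in, one estimated score per possible action comes out, and training uses an MSE loss.

```python
import torch
import torch.nn as nn

n_state_features = 8     # size of the state vector (illustrative)
n_actions = 4            # N, S, E, W

# State goes in, one estimated score per action comes out (a "neural Q-Table row").
q_network = nn.Sequential(
    nn.Linear(n_state_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),        # raw scores, one per action
)

loss_fn = nn.MSELoss()               # regression-style loss on the action scores
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)

state = torch.rand(1, n_state_features)
action_scores = q_network(state)
best_action = int(action_scores.argmax(dim=1))   # pick the highest-scored action
```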

But even though we use regression techniques to train this neural network, since the outputs represent scores for discrete actions and we favor the action with the highest score, the end result resembles classification (much like selecting the highest-scored action from a Q-Table for a given state).

The reason for all this clarification is that people sometimes wonder why a problem that looks like classification uses an MSE loss. Now you know you are not really classifying; you are simply making two steps in one: getting the Q-Table action values for a state and favoring the topmost action (and, by the way, don’t forget that sometimes you have to choose randomly rather than pick the topmost action, to favor exploration).

It is important to note that this approach, where the NN responds with scores over the most-favored actions, applies when we have a discrete set of possible actions and we estimate how much we favor a single action over the others based on the highest action score (so we simulate the per-state action scores of a Q-Table).

So if we can move North, South, East, or West, we may have a softmax output layer of size 4, and we see which one of the 4 possible actions gets the highest score.

However, if our output action involves multiple changes (like moving 10 robot joints simultaneously, by 10 different magnitudes, as a single action response), we no longer deal with action probabilities, so our output layer could consist of 10 tanh functions (for example), providing a magnitude of change for each of the 10 available joints.
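As a sketch, the two kinds of output layers described above might look like this (the output sizes follow the examples in the text; the hidden-layer size is an illustrative assumption):

```python
import torch.nn as nn

hidden = 64   # size of the last hidden layer (illustrative)

# Discrete actions: one score per action, softmax to compare them (pick the highest).
discrete_head = nn.Sequential(nn.Linear(hidden, 4), nn.Softmax(dim=1))

# Continuous actions: one magnitude per joint, tanh keeps each value in [-1, 1].
continuous_head = nn.Sequential(nn.Linear(hidden, 10), nn.Tanh())
```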

DRL in a nutshell…

In Deep Reinforcement Learning, you represent the Q-Table with a neural network. You keep refining your estimates of how actions at your current state affect the world by altering the weights of the neural network (gradient descent).

Thus, through function approximation, we try to predict the best-fitting action for a given state and so simulate a Q-Table.

The big difference from classic offline classification training, where the training dataset is provided to you, is that with this online training it is YOU who builds the dataset with every step of your exploration, as your own actions produce the results you then train on. Such a shifting dataset can cause obvious issues for a neural network, which are dealt with through various techniques.

Value-Based vs Policy-Based Methods

As the subheading states, this article took a value-based approach.

And yes, up to this point we have been looking at value-based methods: estimating the optimal policy (π*) by using a Q-Table (either directly, or via a deep-learning approximation).

It is also possible to search for such a policy without any Q-Table reference to state/value functions, by using policy-based methods, which directly change the weights of the neural network and look at how the expected reward improves.

Then, using Gradient Ascent (yes, ascent, not a typo!) to adjust the neural network weights toward the policy that seems to produce a better estimated overall reward is another way to go.

Of course, policy-based methods deserve an article of their own, so we will not look into them further here.

Just keep in mind that what we have seen so far is a very early approach to understanding the value-based methods, which make use of a q-table.

If that just baked your noodle, know that some crazy science combines value-based with policy-based methods into what is known as Actor-Critic methods.
