Let’s Talk Reinforcement Learning — The Fundamentals — Part 2

Vigneshwaran Shankaran · Published in The Startup · 7 min read · Sep 1, 2020

This is a continuation of the article Let’s talk Reinforcement Learning — The Fundamentals — Part 1. You can read this article without Part 1 as long as you recognize the terms below.

See if you can recognize these terms: action, value, reward, k-bandit problem, exploitation vs exploration tradeoff, action selection, epsilon-greedy, upper-confidence bound. If you know at least five of these terms, you are good to go. If you are not sure, don't worry: you can always read Let’s talk Reinforcement Learning — The Fundamentals — Part 1 and come back.

Photo by Scott Carroll on Unsplash

In Part 1 we covered the basics: rewards and estimating action values. The K-armed bandit problem (the humanoid doctor) gave us an intuition of what reinforcement learning is, but it is not enough to tackle real-world problems. Let us retire the humanoid doctor and introduce a new example: the story of a deer.

Fun Fact: Deer adore fruits and nuts. They love pecans, hickory nuts, and beechnuts in addition to acorns. A couple of their favorite fruits are apples, blueberries, blackberries, and persimmons.[1]

Deer love to eat pecans. Our deer is in a “road not taken” situation: the left path has grass and the right path has pecans. The right path offers its favorite food, so the reward there is higher; the deer satiates its hunger and gets a feast. The left path is also decent, since it still satiates its hunger. If this were a k-bandit problem, we would have chosen the right path. Now see the fate of the deer: it takes the right path and starts eating pecans, but within seconds a lion starts chasing it and eventually kills it. Oopsie. Let us not allow that, and let’s teach the deer reinforcement learning using Markov Decision Processes.

The Markov Decision Process (MDP) gives us a way of estimating the rewards we may receive in the future. A bandit deer would have chosen the right path. To make the deer take the left path, we must consider not just the immediate reward of an action but the state it leads to: with each action taken, the problem moves into a new state, with its own rewards from that point in time onward.

Now you can see the two sequences that are possible from the initial state, i.e., the point where the deer has to choose a path. Technically, the agent considers the possible sequences of future states at each discrete time step and selects the action that leads to the best one.

Just like bandits, the outcomes of an MDP are stochastic, so probability theory comes to our rescue. With the transition dynamics function p(next state, reward | state, action), we can describe the joint probability of the next state and reward given the current state and action. Notice that the next state and reward depend only on the current state and action; this is called the Markov property. As a side note, unlike here, in Natural Language Processing previous states matter a lot, and there are techniques such as Long Short-Term Memory cells and Transformers to deal with them.
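To make the transition dynamics concrete, here is a minimal Python sketch of a p(next state, reward | state, action) table for the deer story. The state names ("fork", "grass_patch", "pecan_patch", "chased_by_lion"), the rewards, and the probabilities are illustrative assumptions of mine, not values from the article.

```python
from typing import Dict, List, Tuple

State, Action = str, str

# p[(s, a)] -> list of (next_state, reward, probability)
p: Dict[Tuple[State, Action], List[Tuple[State, float, float]]] = {
    ("fork", "go_left"):  [("grass_patch", 1.0, 1.0)],
    ("fork", "go_right"): [("pecan_patch", 2.0, 0.2),
                           ("chased_by_lion", -10.0, 0.8)],
}

def dynamics(s: State, a: Action) -> List[Tuple[State, float, float]]:
    """Joint distribution over (next state, reward) given only (s, a).
    Nothing before the current state matters -- the Markov property."""
    return p[(s, a)]

for s_next, r, prob in dynamics("fork", "go_right"):
    print(f"p({s_next}, r={r} | fork, go_right) = {prob}")
```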

The important part of an MDP is modeling the environment with all possible states, actions, and rewards. This is done in the form of a discrete graph, and such graphs are easy to implement as vectors and matrices. From the story of the deer, the most important inference is that we want to maximize the sum of rewards from a time step onward, not just the immediate reward.
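As a rough sketch of how such a graph can be stored as vectors and matrices, here is the same toy MDP written as one transition-probability matrix and one expected-reward vector per action. The state ordering and all numbers are again my own illustrative assumptions.

```python
import numpy as np

states = ["fork", "grass_patch", "pecan_patch", "chased_by_lion"]

# P[a][i, j] = probability of moving from state i to state j under action a
# (terminal states are made absorbing for simplicity)
P = {
    "go_left":  np.array([[0.0, 1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0, 1.0]]),
    "go_right": np.array([[0.0, 0.0, 0.2, 0.8],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0, 1.0]]),
}

# r[a][i] = expected immediate reward for taking action a in state i
r = {
    "go_left":  np.array([1.0, 0.0, 0.0, 0.0]),
    "go_right": np.array([0.2 * 2.0 + 0.8 * -10.0, 0.0, 0.0, 0.0]),
}

print(P["go_right"][0], r["go_right"][0])  # dynamics from the "fork" state
```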

The sum of the rewards received after time step t is called the return, Gt = R(t+1) + R(t+2) + … + R(T). Note that Gt is a random variable: a single action can lead to many possible states because the MDP is stochastic. This is why we maximize the expectation of Gt rather than any particular sum of rewards.
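Here is a tiny sketch of that idea: a single rollout's return is random, so the agent compares actions by their expected return. The outcome probabilities and returns are the same illustrative assumptions used above.

```python
# (probability, return) pairs per action -- illustrative numbers only
outcomes = {
    "go_left":  [(1.0, 1.0)],
    "go_right": [(0.2, 2.0), (0.8, -10.0)],
}

def expected_return(action):
    """E[Gt | action] under the toy model."""
    return sum(prob * g for prob, g in outcomes[action])

for action in outcomes:
    print(action, "-> E[Gt] =", expected_return(action))
# go_left  -> E[Gt] = 1.0
# go_right -> E[Gt] = -7.6   (the lion makes the right path worse on average)
```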

The agent breaks the stream of experience into episodes, each ending at a terminal point. Consider the deer: the whole process of the deer trying to reach its food is an episode, and the actual consumption (or the lion) is the terminal point. This is called an episodic task. Framing tasks as episodic settles what happens after the agent-environment interaction ends: when an episode terminates, the agent resets to a start state.
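A minimal sketch of that episodic loop, using a toy DeerEnv class I made up for illustration: each episode runs until a terminal state, then the environment resets to the start state.

```python
import random

class DeerEnv:
    """Toy stand-in environment; states, rewards, and odds are assumptions."""

    def reset(self):
        self.state = "fork"
        return self.state

    def step(self, action):
        # returns (next_state, reward, terminal)
        if action == "go_left":
            return "grass_patch", 1.0, True
        if random.random() < 0.2:
            return "pecan_patch", 2.0, True
        return "chased_by_lion", -10.0, True

env = DeerEnv()
for episode in range(3):
    state, done, G = env.reset(), False, 0.0
    while not done:                       # one episode
        action = random.choice(["go_left", "go_right"])
        state, reward, done = env.step(action)
        G += reward
    print(f"episode {episode}: return G = {G}")  # then reset and start over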

I now want to introduce a very important topic that you may want to dive deeper into: the reward hypothesis. “Michael Littman calls this the reinforcement learning hypothesis. That name seems appropriate because it is a distinctive feature of reinforcement learning that it takes this hypothesis seriously. Markov decision processes involve rewards, but only with the onset of reinforcement learning has reward maximization been put forth seriously as a reasonable model of a complete intelligent agent analogous to a human being.”[2]

What does this mean? Let me try to put it in words. Think of an air conditioner: what should the reward be? The temperature, or the cost of electricity? Now think of the stock market; here we have a pretty solid reward, the currency itself. So it is not easy to define rewards in many RL settings, yet our brain somehow does it effortlessly. Think of a very rare objective. Let's fix the objective to be Kim Jong Un winning the Nobel Peace Prize. I know it can't happen, but let us consider an agent in Kim's position. If we are rigid with the rewards, say +1 for the Nobel Prize and 0 otherwise, how can we monitor the agent's progress? It is wiser to split the goal up and reward the agent with a small value, say +0.001, for acts like keeping the people of North Korea happy, which may lead to the ultimate goal. The point I'm struggling to convey is the selection of rewards: how you do it, and how well it turns out to work.

We discussed episodic tasks, but there are also cases where the agent-environment interaction has no episodes or terminal points. An automatic air conditioner is a good example: the agent has to constantly monitor the environment and adjust the temperature to meet the user's needs. These never-ending tasks are called continuing tasks.

The problem here is that Gt would be a sum of infinitely many rewards, and we need it to be finite for maximization to make sense. Fortunately, discounting helps us deal with this: we discount future rewards with a factor γ, where 0 ≤ γ < 1, so that Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + …
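A quick sketch of computing such a discounted return for a sampled reward sequence; the reward values and γ below are arbitrary illustrative choices.

```python
def discounted_return(rewards, gamma=0.9):
    """Gt = sum over k of gamma**k * R(t+k+1), for a finite sample of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.9))  # 3.439
```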

The powers of gamma reduce the impact of rewards that lie further in the future. Logically this makes sense: you usually value immediate rewards more than future ones, just as money now is worth more than money later in currency exchange. Still, the sum looks infinite, right? Let's pull a mathematical trick here.

Let us assume that Rmax is the maximum reward the agent can receive. We upper-bound Gt by replacing every reward with Rmax, giving Gt ≤ Rmax · (1 + γ + γ² + …). This geometric series sums to 1 / (1 − γ) when 0 ≤ γ < 1, so Gt ≤ Rmax / (1 − γ). Yay, we did it: Gt is finite.
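As a quick numeric sanity check of that bound (with arbitrary values of Rmax and γ of my own choosing), the partial sums of Rmax·γ^k do approach Rmax / (1 − γ):

```python
R_max, gamma = 5.0, 0.9

closed_form = R_max / (1 - gamma)                        # 50.0
partial_sum = sum(R_max * gamma ** k for k in range(1000))

print(closed_form, partial_sum)  # both ~50.0, so Gt is bounded and finite
```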

I hope that, with all the mathematics we discussed, the deer will make a decision that is good in the long run.

Takeaway

i) Actions influence not only immediate rewards but also future states, future actions, and future rewards.

ii) The goal of an RL agent is to maximize the expected return after time step t, not just the immediate reward.

iii) Episodic tasks end at a terminal point, while continuing tasks run indefinitely and need discounting to keep the return finite.

Reference

[1] Naomi Millburn “What do deer love to eat?”

[2] Richard Sutton “The reward hypothesis”
