A First Look at Reinforcement Learning

Cheng Xi Tsou
Published in Nerd For Tech · Jun 27, 2021


One of the types of learning we hear about in Machine Learning is Reinforcement Learning, where an agent learns to achieve a goal in an environment, known or unknown, through reward and punishment. Unlike Supervised and Unsupervised learning, Reinforcement learning does not require a pre-existing dataset; the agent generates its own experience by interacting with the environment. My class CS4100 briefly touched on this learning method, so I wanted to explore a bit further. A lot of the applications of Reinforcement learning are in games or in complex, computationally expensive real-world problems, so it is hard to find something to “meaningfully” apply it to. Nonetheless, this is still a super interesting topic, where an agent can build a policy that maximizes a reward function without knowing its environment beforehand. I wanted to explore a high-level view of reinforcement learning, so I’ll be playing around with OpenAI’s gym library, which provides prebuilt environments with defined observation and action spaces and reward values assigned to different states.

A simple task I tackled first was the FrozenLake game. The game is simple: the environment is a 4 by 4 tiled grid where the player starts at the S square and navigates to the G square. Falling into a hole (an H square) means game over. The implementation of this game in the gym library has a simple reward function: 1 if the agent reaches the goal, 0 if it doesn’t. The action space is just the four cardinal directions, and the observation space is the 16 possible squares a player could be in. The game also has a catch: the floor is slippery, and there’s a chance you move in a random direction on every step. However, for the sake of seeing whether the learned policy works, I’ll be turning slippery mode off.

Non-slippery frozen lake
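Here’s a minimal sketch of how that environment can be set up and inspected (the environment id and keyword may differ slightly depending on your gym version):

```python
import gym

# Deterministic FrozenLake: turn the slippery floor off.
# Newer gym releases rename this environment to FrozenLake-v1.
env = gym.make('FrozenLake-v0', is_slippery=False)

print(env.observation_space.n)  # 16 squares the agent can be in
print(env.action_space.n)       # 4 actions: left, down, right, up
```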

So how does an agent build its policy? There are many different ways to implement Reinforcement learning, such as value or policy iteration, solving the MDP directly, and TD-learning, but I will be exploring Q-learning.

In Q-learning, an agent wants to maximize a ‘Q’ value, a value mapped to each particular state and action pair. At the start, a Q table is initialized whose size is action space × observation space. The agent plays the game over many episodes and builds up the Q table along the way.
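For FrozenLake, that table is tiny. A minimal sketch (the variable names here are my own):

```python
import numpy as np

n_states = env.observation_space.n   # 16 for FrozenLake
n_actions = env.action_space.n       # 4 for FrozenLake

# One row per state, one column per action, all Q values start at zero.
Q = np.zeros((n_states, n_actions))
```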

Firstly, the agent chooses a move. The strategy I’ll be using is the epsilon-greedy algorithm: with some probability epsilon, the agent chooses a random action; otherwise, it chooses the action with the highest Q value. Epsilon balances exploration vs. exploitation, and usually there’s an epsilon decay factor so the chance of exploration goes down as the number of iterations increases.

Epsilon greedy algorithm
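In code, the idea looks roughly like this (a sketch that reuses the env and Q table from above; the helper name and decay numbers are just for illustration):

```python
import random
import numpy as np

def choose_action(Q, state, epsilon, env):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return env.action_space.sample()   # explore: pick a random action
    return int(np.argmax(Q[state]))        # exploit: pick the best known action

# After each episode, decay epsilon so exploration fades over time, e.g.:
# epsilon = max(0.01, epsilon * 0.995)
```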

After the agent chooses an action, we use the equation below so the agent can “learn”. In the equation, max_a Q(S_t+1, a) is the Q value of the best action in the next state after the agent executes its action, and Q(S_t, A_t) is the Q value of the current state-action pair. The Q value for the next state is multiplied by gamma, a discount factor, since it is one time step away; future rewards are worth less than immediate reward. R_t+1 is the reward received after executing the action. Together, the reward plus the discounted future Q value form the target: what the agent thinks the current Q value should be. The difference between this target and the current Q value is multiplied by alpha, the learning rate, to nudge the current Q value toward the “true” Q value, which the agent learns over time.

Update equation
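That is the standard update Q(S_t, A_t) ← Q(S_t, A_t) + α[R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t)]. One update step might look like this in code (the alpha and gamma values are illustrative):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update, matching the equation above."""
    # For terminal states, Q[next_state] stays all-zero, so the target reduces to the reward.
    target = reward + gamma * np.max(Q[next_state])   # R_{t+1} + gamma * max_a Q(S_{t+1}, a)
    Q[state, action] += alpha * (target - Q[state, action])
```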

After letting the agent learn for 10,000 episodes, we have a policy that chooses the best action for each state. Note that FrozenLake without slippery floors is deterministic, so it is also possible to solve with dynamic programming.

Our policy after 10000 iterations. Each letter corresponds to an action: Down, Up, Left, Right.
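That policy grid can be read straight off the Q table, roughly like this (a sketch; the letters in the figure are just a pretty-printed version of the argmax):

```python
import numpy as np

# The greedy policy is the best action in each row of the Q table.
policy = np.argmax(Q, axis=1)
print(policy.reshape(4, 4))   # view it on the 4x4 grid
```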

Plotting the reward over episodes gives us a very weird-looking graph, since the reward is binary: either you reach the goal or you don’t. You can see that the puzzle is consistently solved after around 4,000 episodes.

While all this is simple and computationally cheap, the FrozenLake problem only requires a Q table of size 4 × 16 = 64. What if we wanted to use Q-learning in a more complex environment? Let’s take the CartPole problem as an example.

OpenAI gym’s CartPole-v1

The CartPole problem is a game where a pole is balanced on a cart that can be pushed to the left or right. The only possible actions are applying a force of +1 or -1. However, the observation space is continuous: a given state is made up of the cart position, cart velocity, pole angle, and pole angular velocity. A Q table would not work in this case. This is where Deep Q-learning comes in.
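You can see the problem by inspecting the environment’s spaces (again a sketch; exact bounds depend on the gym version):

```python
import gym

env = gym.make('CartPole-v1')

print(env.action_space)        # Discrete(2): push left or push right
print(env.observation_space)   # Box(4,): cart position, cart velocity,
                               #          pole angle, pole angular velocity
# The observations are continuous floats, so we can't enumerate rows of a Q table.
```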

In Deep Q-learning, instead of a lookup table Q(s, a), we think of Q(s, a) as a function approximated by a neural network. If we review the Q update equation, it’s similar to how a neural network learns. The “target”, reward plus discounted future utility, is what the Q value should be; Q(S_t, A_t) is what the model currently thinks the Q value is. The difference between them is the loss, and we slowly update the weights of the network with a learning rate. We fit the network by supplying states and target Q values, and when we use it to predict, the network returns a Q value for each action in the action space.

Update equation
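A minimal sketch of such a Q-network in Keras (the layer sizes and learning rate here are illustrative, not necessarily the exact ones from the notebook linked at the end):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_size=4, n_actions=2):
    """Q-network: a state goes in, one Q value per action comes out."""
    model = Sequential([
        Dense(24, activation='relu', input_shape=(state_size,)),
        Dense(24, activation='relu'),
        Dense(n_actions, activation='linear'),   # raw Q values, no squashing
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model
```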

The reward system for the CartPole game is simple: for each time step that the pole stays upright, you get +1. After trying to implement Deep Q-learning for the CartPole game, I was getting an average of around 10 points despite running the algorithm for 100 episodes. In fact, after a while, the epsilon decay causes epsilon to approach 0, so the Q function converged to a sub-optimal policy. The problem was that because we were fitting the network only on the latest state, the model would slowly “forget” about past obstacles it had overcome and go back to square one. Although the model does learn how to get past a certain obstacle, later states will not encounter that obstacle, so the network’s weights drift away from the solution to it.

A solution for this is experience replay. We keep “memories” of past states and actions and replay those memories to keep the model informed of its past. One implementation of experience replay is to keep an array of (state, action, reward, new_state, isOver) tuples and fit the network on a random batch drawn from it instead of just the latest transition. Using this method, the agent achieved an average score of 430.26 over 50 plays when testing out the policy.
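A simple (unoptimized) sketch of what that buffer and replay step could look like, assuming states are numpy arrays of shape (4,) and `model` is the Q-network from above:

```python
import random
from collections import deque
import numpy as np

memory = deque(maxlen=2000)   # ring buffer of past transitions; the size is a tunable choice

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(model, batch_size=32, gamma=0.99):
    """Fit the network on a random batch of past transitions, not just the latest one."""
    if len(memory) < batch_size:
        return
    for state, action, reward, next_state, done in random.sample(memory, batch_size):
        target = reward
        if not done:
            target += gamma * np.max(model.predict(next_state[np.newaxis], verbose=0)[0])
        q_values = model.predict(state[np.newaxis], verbose=0)  # current predictions
        q_values[0][action] = target                            # only the taken action gets the new target
        model.fit(state[np.newaxis], q_values, epochs=1, verbose=0)
```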

Agent training over 150 episodes

Let’s compare this to random play, which averages a score of 21.48.

Agent playing with random action strategy

Let’s compare this to human play. I averaged a score of 30.8 over 45 play-throughs, and unlike random play, a human player “learns” over time.

Human playing

If we look at the rate of learning for human play versus Deep Q-learning, we can observe that the machine learns faster! Obviously, the benchmark for human play is only one player, me, so it isn’t an accurate measure of human performance. Let’s see if we can improve our Reinforcement learning algorithm. A few things we could tune are the batch size, to allow for longer “memory”, the reward system, the hyperparameters of our neural network, and probably more, but here we are going to change the reward system. Instead of a simple +1 for each time step, we will scale the reward by the pole angle. CartPole-v1 bounds the pole-angle observation to roughly [-0.418, 0.418] radians, so we will let reward = 1 - |a/0.418|, with a being the pole’s angle in radians, to give more weight to keeping the pole upright (a code sketch of this tweak follows the results below). Here are the results:

Agent playing with new reward system

Looks like much better performance. Notice how after 10 episodes, the score never drops below 100, compared to the original strategy, where the score still occasionally plummets after 120 episodes. I hypothesize that this is because the pole is incentivized to stay at 0 radians, so there is less room for the pole to fall, whereas the old reward system still gives the same reward if the pole is about to fall but within bounds. Testing this policy over 50 plays, we averaged a score of 489.88, nearly 60 points higher than our average with the old system.
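For reference, here’s roughly what that reward tweak looks like inside the training loop (assuming the classic 4-tuple step API and that the pole angle is the third element of the observation):

```python
# Replace the flat +1 with a reward that scales with how upright the pole is.
obs, reward, done, info = env.step(action)   # classic gym step API (4-tuple)
pole_angle = obs[2]                          # observation: [position, velocity, angle, angular velocity]
reward = 1 - abs(pole_angle) / 0.418         # ~1 when upright, ~0 near the angle bound
```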

Overall, this was a fun first look at reinforcement learning. In future posts, I might look at some of the more complex environments such as the Atari games in OpenAI’s gym, or try to write an environment myself. When I was browsing the gym’s environments, I found a Box2D section with a bunch of scenarios simulated with the Box2D physics library. This was a surprise, as I worked with Box2D when I was creating my RPGConcept (see my posts from 2018)! Maybe I’ll make my own environment in Java. Ultimately, I want to be able to do something in Minecraft using Malmo, an AI experimentation platform built on top of Minecraft by Microsoft.

Code: https://github.com/chengxi600/RLStuff/blob/master/Q%20Learning/QLearning_DQN.ipynb
