Solving the FrozenLake environment from OpenAI gym using Value Iteration
So I was trying to learn about Reinforcement Learning, and then I came across this thing called ‘Value Iteration’. I really couldn’t wrap my head around it. It was very difficult for me to understand how it worked and how it could help an agent find the optimal policy. Then I got an idea.
What better way to understand “Value Iteration” than to use it to solve some game or environment? Thus I began my journey to find a game simple enough to solve. And then I stumbled upon the FrozenLake environment from OpenAI.
Let me explain the game/environment first.
There are 64 states in the game. The agent starts from S (S for Start) and our goal is to get to G (G for Goal). So just go? Nope. It’s a slippery surface. The F’s and the H’s in between are pretty weird stuff. F means Frozen surface: you can walk on it. But H means Hole: if you fall into an H, BOOM, GAME OVER, and you start from S again. So just walk across the F’s, dodging the H’s, to reach the G, right? Nope. There’s more. Since this is a “Frozen” Lake, if you choose to go in a certain direction, there is only a 1/3 (about 33.3%) chance that the agent will actually go in that direction. I mean, the movement of the agent is uncertain and only partially depends on the chosen direction. So you won’t always move in the direction you intend. For a more detailed explanation of FrozenLake8x8, Click Here.
Understanding OpenAI gym
Okay, now that we understand the game, how do we load the environment from OpenAI? For that we will use OpenAI’s Python library ‘gym’.
You can have a look at the environment using env.render() where the red highlight shows the current state of the agent.
env.action_space.sample() chooses an action randomly from all the possible actions, and env.step(action) takes a step according to the given action. Here we can see the action was ‘Right’, so the agent went right from S to F. (This may not always be the case: since the movement of the agent is uncertain, when the action is ‘Right’ the agent might go Down or Up instead.)
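To make the step interface concrete, here is a tiny pure-Python sketch (my own illustration, not gym’s actual source) of how a slippery step can be sampled from transition tuples like the ones gym stores internally:

```python
import random

# Illustrative transitions for one action from the start state S (state 0):
# each tuple is (probability, next_state, reward, done). The numbers are
# made up for illustration, not read out of gym.
transitions = [
    (1 / 3, 1, 0.0, False),  # moved right as intended
    (1 / 3, 8, 0.0, False),  # slipped down instead
    (1 / 3, 0, 0.0, False),  # slipped up, bumped the wall, stayed put
]

def sample_step(transitions):
    """Pick one transition according to its probability, like env.step does."""
    probs = [t[0] for t in transitions]
    prob, next_state, reward, done = random.choices(transitions, weights=probs)[0]
    return next_state, reward, done

next_state, reward, done = sample_step(transitions)
print(next_state, reward, done)
```

So even a greedy “always go Right” plan only works out about a third of the time per step, which is why we need a proper policy.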
The Interesting Part
Okay that was the easy part. Now comes the difficult part or should I say interesting part. How does the agent navigate this slippery Lake and get to the Goal without falling in a Hole?
Let’s do this step by step. First let’s code the “Value Iteration” function.
So in value iteration the story goes like this: for a particular state, we first calculate the state-action values for all the possible actions from that state, and then update the value function of that state with the greatest of those state-action values. This is different from the evaluation step of “Policy Iteration”, where we calculate the expected state-action value under the current policy instead of the maximum. Value Iteration terminates when the difference between all the new state values and the old state values is negligibly small.
Code Code Code
Below is the code I used for the value iteration function.
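The exact code from the post lives in the GitHub repo linked at the end; here is a minimal sketch of the same idea, written against the env.nS / env.nA / env.P interface explained next (names like value_iteration and theta are my own choices):

```python
def value_iteration(env, discount_factor=0.9, theta=1e-8, max_iterations=10_000):
    """Back up the best one-step return for every state until values stop changing."""
    V = [0.0] * env.nS
    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.nS):
            # State-action value for every action from s, then keep the best one.
            q = [
                sum(prob * (reward + discount_factor * V[next_state])
                    for prob, next_state, reward, done in env.P[s][a])
                for a in range(env.nA)
            ]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # values barely moved: we have converged
            break
    return V
```

Note this sketch updates V in place during the sweep (Gauss–Seidel style), which still converges and usually does so a little faster than keeping a separate copy of the old values.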
The majority of the code is easily understandable. Let me explain the non-intuitive parts.
env.nS and env.nA give the total number of states and actions respectively. But the most interesting is env.P; env.P outputs a dictionary like this. Here 0 in env.P is the first state of the environment.
Here, as you can guess, the keys of the dictionary (0, 1, 2, 3) are the actions we can take from state 0. Each action maps to a list, where each element of the list is a tuple of (probability of the transition, next state, reward, done) — done=True if the next state is a Hole or the Goal, otherwise done=False. So env.P maps every state of the environment to a dictionary of its possible actions, and each action to the list of transitions it can cause: the probability of landing in each next state, that next state, the reward, and whether the game terminates there.
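As a concrete example of that structure, here is roughly what env.P[0] looks like on the slippery 8x8 lake (the layout of the tuples follows gym’s convention; the specific next states are my own reconstruction for illustration):

```python
# Roughly what env.P[0] looks like. Keys 0-3 are the actions
# (Left, Down, Right, Up in gym's convention), and each maps to a list of
# (probability, next_state, reward, done) tuples: the intended direction
# plus the two perpendicular slips, each with probability 1/3.
P_state_0 = {
    0: [(1 / 3, 0, 0.0, False), (1 / 3, 0, 0.0, False), (1 / 3, 8, 0.0, False)],  # Left
    1: [(1 / 3, 0, 0.0, False), (1 / 3, 8, 0.0, False), (1 / 3, 1, 0.0, False)],  # Down
    2: [(1 / 3, 8, 0.0, False), (1 / 3, 1, 0.0, False), (1 / 3, 0, 0.0, False)],  # Right
    3: [(1 / 3, 1, 0.0, False), (1 / 3, 0, 0.0, False), (1 / 3, 0, 0.0, False)],  # Up
}

for action, transitions in P_state_0.items():
    for prob, next_state, reward, done in transitions:
        print(action, prob, next_state, reward, done)
```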
Here you can see that 54 is a Hole, so done=True. Likewise 63 is the Goal, so done=True.
So, as per the Value Iteration formula, we iterate through all these actions and calculate the state-action value using the formula:
prob * (reward + discount_factor * V[next_state])
where all of these quantities are provided by env.P. Then we update the value function of the state with the highest state-action value. We sweep through all 64 states of the environment until the difference between the new state values and the old state values after an iteration is negligibly small, or until we have crossed the maximum number of iterations.
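To see the backup formula in action, here is the state-action value of a single (state, action) pair worked out by hand (the transition tuples and the current value estimates are made-up numbers):

```python
discount_factor = 0.9

# Made-up current value estimates for three states.
V = {0: 0.1, 1: 0.2, 8: 0.3}

# Made-up slippery transitions for one action: (prob, next_state, reward, done).
transitions = [
    (1 / 3, 1, 0.0, False),
    (1 / 3, 8, 0.0, False),
    (1 / 3, 0, 0.0, False),
]

# Q(s, a) = sum over outcomes of prob * (reward + discount_factor * V[next_state])
#         = (1/3) * 0.9 * (0.2 + 0.3 + 0.1) ≈ 0.18
q = sum(prob * (reward + discount_factor * V[next_state])
        for prob, next_state, reward, done in transitions)
print(q)
```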
Extracting the Policy from the Value Function
Now that we have the value function of all the states, our next step is to extract the policy from the Value Function.
We do this using a similar technique: for a particular state we calculate the state-action values of all the possible actions from that state, and choose the action with the highest state-action value.
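A minimal sketch of that extraction step, again written against the env.P interface (the function name is my own):

```python
def extract_policy(env, V, discount_factor=0.9):
    """For each state, pick the action whose one-step backup is largest."""
    policy = [0] * env.nS
    for s in range(env.nS):
        # Same backup as in value iteration, but we keep the argmax, not the max.
        q = [
            sum(prob * (reward + discount_factor * V[next_state])
                for prob, next_state, reward, done in env.P[s][a])
            for a in range(env.nA)
        ]
        policy[s] = max(range(env.nA), key=lambda a: q[a])
    return policy
```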
So did we reach the Goal or fall into a Hole?
Finally! Now that we have the policy we can follow that policy and see if our agent reaches the goal or falls in a hole.
We run the agent for 1000 episodes and calculate the average number of steps it took to get to the Goal. We also count how many times it failed to reach the Goal and fell into a Hole. Running the function gives us the answer.
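The evaluation loop can be sketched like this, assuming the classic gym API where env.step returns a 4-tuple; run_policy and the tiny stub environment are my own illustrations, not code from the post:

```python
def run_policy(env, policy, episodes=1000):
    """Follow the policy; return (success_rate, avg_steps_per_successful_episode)."""
    successes, success_steps = 0, 0
    for _ in range(episodes):
        state = env.reset()
        steps, done, reward = 0, False, 0.0
        while not done:
            state, reward, done, _info = env.step(policy[state])
            steps += 1
        if reward > 0:  # reaching the Goal gives reward 1; a Hole gives 0
            successes += 1
            success_steps += steps
    avg_steps = success_steps / successes if successes else float("inf")
    return successes / episodes, avg_steps

# A deterministic one-step stub environment, just to show the call shape.
class StubEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = 1
        return self.state, 1.0, True, {}

rate, avg = run_policy(StubEnv(), policy=[0, 0], episodes=10)
print(rate, avg)  # 1.0 1.0
```

On the real FrozenLake8x8 environment the success rate will be well below 100% even with the optimal policy, because the slippery transitions can always push the agent into a Hole.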
I think the agent did pretty good :)
You can also check out FrozenLake-v0, which is a smaller version with only 16 states, and see the average number of steps the agent takes to get to the goal there. For my full code solving the FrozenLake8x8 environment, go to my GitHub repo here: https://github.com/realdiganta/solving_openai/tree/master/FrozenLake8x8
Also, as I continue my journey into the exciting land of Reinforcement Learning, I will be solving more OpenAI environments in the near future. So stay tuned for more.