Value Iteration & Policy Iteration

Yigit Can Turk
Published in Analytics Vidhya
7 min read · Feb 25, 2022

Although AI is not a new topic, new algorithms, faster computers, and heavy investment in the field have boosted the popularity of artificial intelligence-related jobs today. This draws many people to artificial intelligence, whether they work in IT or not, because some people are naturally drawn to whatever is popular and looks fancy. That is not a bad thing in itself, but it can cause misunderstandings. For example, for some people artificial intelligence is just machine learning, so whenever a daily conversation turns to AI, they talk about regression, classification, and so on. Since machine learning-related jobs are so popular, they search for the topic and find the algorithms taught in almost every beginners' course.

Artificial intelligence is something much bigger than machine learning: it is not just regression models and not just Google’s PageRank algorithm. It also does not have to be data-oriented; some techniques do not use data as the very first instrument of the process. To me, that is one of the best things about AI, because it is not only statistics or math but also carries a good deal of computer science. You can extract information from a data set, but you can also extract other information from new or existing techniques. For instance, you can use a graph representation as an AI technique and build an autonomous system that calculates the similarity between sentences in a document and classifies them into different topic classes. You do not need the CSV file that some people are used to seeing, and you do not look for a hidden pattern in an input data set. The point is that we need a broad perspective when we talk about artificial intelligence.

What is an Agent?

Let’s assume you are responsible for deciding how to handle a task that may end in success or failure, and you have no information about it (you have never done anything similar before). If there is no one around to ask, the first thing you do is take an action, big or small, and then observe its outcome. For example, in a cold room your task is to get warm and there is a stove in the room. Action 1 is touching the stove to warm your cold hands. You fail the task: that is not what you wanted, and now your hands are badly burned. So the outcome of Action 1 is very bad for you, but at least you now have one action in your action list. Now it is time for Action 2: you stand near the stove and wait to warm up. The result of Action 2 is a success. You are warm and happy, and you know that Action 2 is much better than Action 1. So whenever you have to choose between them, you would choose Action 2 without any doubt.

In this scenario, you are the agent. You made decisions (Action) and observed the outcome of each decision (Reward, positive or negative). After the first two actions, you have two actions in your to-do list and you know their rewards (Observation).

Figure 1: Action, reward, and observation

You are the agent, and the technique you use to find an optimal way of getting warm is Reinforcement Learning. In this technique, an agent gathers experience about the environment it is placed in. It memorizes each place, action, and reward, and it can run iterations to update the outcome of each step (in some scenarios you gain something while losing something else, so your current step may affect multiple later steps; since that means there is a dependency, you may need to update your observation values).
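To make the action-reward-observation loop concrete, here is a minimal Python sketch of the stove scenario. The WarmUpEnv class, its two actions, and the reward numbers are invented for illustration only; they are not from any library and not from the code linked at the end of this article.

```python
class WarmUpEnv:
    """Toy environment for the stove example (hypothetical, illustration only)."""
    ACTIONS = ["touch_stove", "stand_near_stove"]

    def step(self, action):
        # Touching the stove hurts; standing near it warms you up.
        if action == "touch_stove":
            return "hands burned", -10   # (observation, reward)
        return "warmed up", +10

env = WarmUpEnv()
experience = {}                          # the agent's memory: action -> reward
for action in WarmUpEnv.ACTIONS:
    observation, reward = env.step(action)
    experience[action] = reward
    print(action, "->", observation, "reward:", reward)

# Next time, the agent picks the action with the best known reward.
print("Best known action:", max(experience, key=experience.get))
```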

What is a Policy?

Now assume you are a maze runner moving between blocks, and from each block you can move up, right, down, or left. You are currently in Block 1, your starting point, and you have to step to another block because you want to escape the maze. Your behavior at this point is your policy for your current block. Your policy might be one of ↑, →, ↓ or ←.

Figure 2: First state of the maze
Figure 3: Policy of each block in the maze for the runner (or agent)
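In code, a policy can be represented simply as a mapping from each block (state) to the arrow the agent follows there. The block names and arrows below are placeholders for the maze in Figures 2 and 3, not the actual values from them.

```python
# A policy maps every state (maze block) to exactly one action (arrow).
policy = {
    "block_1": "→",
    "block_2": "↑",
    "block_3": "↑",
    "block_4": "←",
}

current_block = "block_1"
print("From", current_block, "the policy says: move", policy[current_block])
```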

What is an Iteration?

You can see some arrows in the maze represented in Figure 3. You may ask how these arrows were set, since this task calls for an automatic agent, and automatic agents do not have human senses; they need a mathematical explanation for their decisions.

Behind these policies, each block of the maze has a block score. The agent creates its policy for its current state by checking the values of the states (blocks) around it and tries to move to the neighboring block that has the highest value among the blocks one step away. (The selection rule may differ in some studies, but this is the one generally assumed.) The agent creates the policy ↑ for the second block, which has the state value -15, because the biggest number among its one-step neighbors is -14, so it wants to move up.

Figure 4: Values of states in the maze
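Here is a rough sketch of how the arrow for one block can be derived: the agent looks at the value of the block one step away in each direction and points toward the largest one. The grid values below are illustrative and only reuse the -15/-14 example from the text; they do not reproduce Figure 4 exactly.

```python
# Illustrative state values on a small grid; None marks a wall.
values = [
    [-14, -12, -10],
    [-15, None, -8],
    [-16, -13, -9],
]

def greedy_arrow(row, col):
    """Pick the arrow pointing to the neighboring block with the highest value."""
    moves = {"↑": (-1, 0), "→": (0, 1), "↓": (1, 0), "←": (0, -1)}
    best_arrow, best_value = None, float("-inf")
    for arrow, (dr, dc) in moves.items():
        r, c = row + dr, col + dc
        in_grid = 0 <= r < len(values) and 0 <= c < len(values[0])
        if in_grid and values[r][c] is not None and values[r][c] > best_value:
            best_arrow, best_value = arrow, values[r][c]
    return best_arrow

print(greedy_arrow(1, 0))  # the block with value -15 points ↑ toward -14
```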

An iteration is the calculation of all block values in the maze at time T1, so we can call this the first iteration. The agent then performs another iteration at time T2, but this time it uses its previous block values, because they are its observations from the task, and it creates a connection between its experience and its new observations.

When do iterations end?

As the title of this article suggests, there are two iteration types: Value Iteration and Policy Iteration. In each iteration, block values and policies are updated, and sometimes policies change because block values change.

In Policy Iteration, iterations keep running as long as at least one policy was updated during the last iteration. If no policy is updated, iteration stops, regardless of whether block values changed.

In Value Iteration, a limit is set on how much the state values may change between the previous and the last iteration. In almost every scenario the state values keep changing a little, so without a stopping rule the loop could, in principle, run indefinitely. A convergence check is therefore used to determine whether a considerable change between old and new values occurred. Whenever convergence is detected, the iteration is terminated.
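The two stopping rules can be sketched side by side on a tiny example. The three-state corridor, the step cost of -1, and the epsilon threshold below are assumptions made for illustration; the policy-evaluation step of full Policy Iteration is deliberately skipped so the focus stays on when each loop stops.

```python
# Tiny 3-state corridor: state 2 is the exit (value 0), every move costs -1.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
STEP_COST = -1.0

def next_state(s, a):
    if s == 2:                                  # the exit is absorbing
        return 2
    return max(0, s - 1) if a == "left" else min(2, s + 1)

def backup(s, a, values):
    """One-step lookahead: step cost plus the value of the resulting state."""
    if s == 2:
        return 0.0
    return STEP_COST + values[next_state(s, a)]

# Value Iteration: stop when the largest value change drops below epsilon.
epsilon = 1e-6
values = [0.0, 0.0, 0.0]
while True:
    new_values = [max(backup(s, a, values) for a in ACTIONS) for s in STATES]
    if max(abs(n - o) for n, o in zip(new_values, values)) < epsilon:
        break
    values = new_values

# Policy Iteration: stop when no block changes its policy.
policy = {s: "left" for s in STATES}
while True:
    # (full policy evaluation omitted; the converged values are reused for brevity)
    new_policy = {s: max(ACTIONS, key=lambda a: backup(s, a, values)) for s in STATES}
    if new_policy == policy:
        break
    policy = new_policy

print("values:", values)
print("policy:", policy)
```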

Figure 5: Iteration

How to calculate State Values in an Iteration?

There are several methods for calculating a state value. One of the best known is the Bellman Equation, shown in Figure 6. In the equation, V_{i+1}(s) represents the value of state s after the current iteration i. For each direction (up, right, down, left), the value of the state in that direction is multiplied by the probability that the agent actually moves there (the agent's movement is stochastic; for example, with probability 0.8 it moves in the intended direction and with probability 0.1 it slips to each side). Then a punishment value is added to the direction score. This prevents the agent from stretching out its path: if you punish the agent at every step, it hurries to find an optimal path in the shortest time possible.

You calculate each direction score, take the maximum of these values as the new value of the state, and create your policy so that the arrow points toward the direction with the maximum score.

Figure 6: The Bellman Equation
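Below is a hedged sketch of one Bellman backup for a single block, following the description above. In symbols, a common (undiscounted) form of the update is V_{i+1}(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + V_i(s') ]. The neighbor values, the 0.8/0.1 slip probabilities, and the punishment of -1 are assumptions for illustration, not the exact numbers behind Figure 6.

```python
# Illustrative state values of the four neighboring blocks (assumed numbers).
values = {"up": -14, "right": -20, "down": -18, "left": -16}
PUNISHMENT = -1.0                 # living cost added to every move (assumed)
P_INTENDED, P_SLIP = 0.8, 0.1

# For each intended direction, the agent may slip to the two perpendicular sides.
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def direction_score(direction):
    """Expected value of trying to move in one direction, plus the punishment."""
    side_a, side_b = SIDES[direction]
    expected = (P_INTENDED * values[direction]
                + P_SLIP * values[side_a]
                + P_SLIP * values[side_b])
    return expected + PUNISHMENT

scores = {d: direction_score(d) for d in values}
new_value = max(scores.values())             # new value of the current state
new_policy = max(scores, key=scores.get)     # arrow points to the best direction
print(scores)
print("new state value:", new_value, "policy:", new_policy)
```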

This article was written to give general knowledge about the Value Iteration and Policy Iteration approaches in Reinforcement Learning. If you are interested in the topic and want to go deeper, I recommend searching for more specific material online.

You can reach my Python code, in which I solve a maze problem using both Value Iteration and Policy Iteration, via the link below.

https://github.com/yesyigitcan/AI-Value-Iteration-Policy-Iteration-and-Q-Learning
