Solving a Dynamic Programming Problem with Reinforcement Learning

Deeksha Aggarwal
8 min read · Sep 9, 2019


Dynamic Programming is all about solving a big recursive problem by dividing it into sub-problems. Wherever we see a recursive solution that has repeated calls for the same inputs, we can optimize it with Dynamic Programming. The idea is simply to store the results of the sub-problems so that we do not have to re-compute them when they are needed later. Making a computer learn on its own, without being explicitly programmed, is termed Machine Learning. So the question arises: how can we solve a big problem optimally by combining the ideas of machine learning and dynamic programming? The answer is Q-Learning, one of the great reinforcement learning algorithms.

In this blog post I will show how a dynamic programming problem can be solved optimally using reinforcement learning, and implement Q-learning in Python from scratch. Some basic understanding of reinforcement learning is required as a prerequisite, such as what states, actions, policies and rewards are.

Contents:

  1. Problem Statement
  2. Solution by using Dynamic Programming
  3. Solution by using Reinforcement Learning
  4. Results
  5. Conclusion and GitHub Link for code

1. Problem Statement

Consider an environment where a person has to reach the end of a path consisting of N skippable walls. The person can skip a maximum of K walls at once. The goal is to calculate the minimum number of jumps required to reach the end from the starting position. The necessary condition is to land exactly on the end position and not beyond it; if the person does jump beyond the end position, he has to start again from the starting position.

Naive Recursive Approach
A naive approach is to recursively call for all the positions reachable from the first position. The minimum number of jumps to reach the end from the first position can be calculated using the minimum number of jumps needed to reach the end from each of the positions reachable from the first.

minJumps(start, end) = Min ( minJumps(k, end) ), for all k reachable from start
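Below is a minimal Python sketch of this recursion (not the post's original code); the function name min_jumps and the convention of positions 0…end with at most k walls skipped per jump are illustrative assumptions:

```python
def min_jumps(position, end, k):
    """Minimum jumps to land exactly on `end` from `position`,
    skipping at most `k` walls per jump (naive recursion)."""
    if position == end:
        return 0
    best = float("inf")
    # Try every jump length that does not overshoot the end.
    for jump in range(1, min(k, end - position) + 1):
        best = min(best, 1 + min_jumps(position + jump, end, k))
    return best

print(min_jumps(0, 9, 4))  # 3 jumps, e.g. 4 + 4 + 1
```

The same sub-calls, such as min_jumps(3, 9, k), are evaluated over and over, which is exactly what dynamic programming removes.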

2. Solution by using Dynamic Programming

We can see that there are overlapping sub-problems. For example, minJumps(3, 9) will be called twice, since position[3] is reachable from both position[1] and position[2]. So this problem has both properties of Dynamic Programming: optimal substructure and overlapping sub-problems.

In this method, we build a jumps[] array from left to right such that jumps[i] indicates the minimum number of jumps needed to reach position[i] from position[0]. Finally, we return jumps[n-1].
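A bottom-up sketch of this table-filling approach in Python (again an illustrative version, not the original gist):

```python
def min_jumps_dp(n, k):
    """jumps[i] = minimum number of jumps to reach position i from position 0,
    skipping at most k walls per jump (bottom-up DP)."""
    INF = float("inf")
    jumps = [INF] * n
    jumps[0] = 0
    for i in range(1, n):
        # Position i is reachable from any position j with i - k <= j < i.
        for j in range(max(0, i - k), i):
            if jumps[j] != INF:
                jumps[i] = min(jumps[i], jumps[j] + 1)
    return jumps[n - 1]

print(min_jumps_dp(10, 4))  # 3: e.g. jumps of 4, 4 and 1 walls
```

Each sub-problem is solved once and reused, so the running time drops from exponential to O(n·k).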

3. Solution by Reinforcement Learning

A typical reinforcement learning loop. Source: Wikipedia

Given above is a typical RL setup. The main aim of the agent is to maximize a certain numeric reward by performing a particular sequence of actions in the environment while travelling through different states, where the next state and the reward depend entirely on the action the agent took. The biggest challenge in RL is the absence of supervision (labelled data) to guide the agent: it must explore and learn on its own. The agent starts by randomly performing actions, observes the reward each action brings, and learns to predict the best possible action when faced with a similar state of the environment. In our case, we can define the reinforcement learning set-up as below:

Environment: The path consisting of N walls.
Agent: The person who has to jump over the walls.
Interpreter: Observes whether the end position has been reached or not.
States: The distance between the end and the current position of the person after each jump.
Actions: The number of walls jumped at once, up to a maximum of K walls. For example:
Action 1: Skip 1 wall.
Action 2: Skip 2 walls.
Action 3: Skip 3 walls.
Action 4: Skip 4 walls.
Reward: For every jump a cost of -1 is levied, and on reaching the end position a reward of 0 is given.

Q-learning is a model-free reinforcement learning algorithm in which a table of Q-values is maintained for every state–action pair and updated from the rewards received. It acts as a reference table for our agent to select the best action based on the Q-value. The sample Q-table below gives an idea of how the data is structured:

A sample Q-table. Source: Wikipedia

Hence, Q-learning is an RL technique in which we try to approximate the action-value function Q(s, a), which drives the action-selection policy for any sequence of environment states and, once it converges, gives us the maximum reward possible.

The exploration-vs-exploitation problem arises when our model tends to stick to the same actions while learning. We introduce ɛ, which decides the randomness of actions, and gradually decay its value to reduce the randomness as training progresses, so that the agent eventually exploits the rewarding actions it has found. In our case the initial ɛ value is 0.9 and the decay rate is 0.9998 per episode.

The discount factor γ is a parameter which decides how far into the future our model looks while taking an action, i.e. how much future rewards count towards the value of the current action. Thus, γ indirectly addresses the credit assignment problem. In our case, with γ = 0.95, the model learned that stray jumps inhibit its ability to jump well in the future.

A few additional parameters that we will be using later are given below:
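The original snippet is not reproduced here, so the block below is a sketch of plausible parameter definitions; the ɛ, decay and γ values come from the text, while the remaining names and values (EPISODES, LEARNING_RATE, etc.) are assumptions reused by the later sketches:

```python
N_WALLS = 10          # number of states: distances 0..9 from the end
MAX_SKIPS = 4         # K, the maximum number of walls skippable in one jump
EPISODES = 25_000     # number of training episodes (assumed)
LEARNING_RATE = 0.1   # alpha in the Q-update (assumed)
DISCOUNT = 0.95       # gamma, the discount factor
epsilon = 0.9         # initial exploration rate
EPS_DECAY = 0.9998    # epsilon is multiplied by this after every episode
MOVE_PENALTY = -1     # cost of every jump
END_REWARD = 0        # reward for landing exactly on the end position
```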

The next step is to build the environment for our agent. In the code below, the Utils class constructor places the agent at the starting position and defines two functions, action() and move(). The move() function makes the agent jump forward by k steps, where k is decided by the action() function, which calls move() based on the agent's action choice. For example, if the action choice is 2, the agent will jump 2 walls in the forward direction. The Q-table, with 10 states and 4 actions, is initialized with random values.
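A minimal sketch of such an environment class; the class and method names follow the post's description, but the bodies (including the reset-on-overshoot rule from the problem statement) are assumptions:

```python
import numpy as np

class Utils:
    def __init__(self):
        self.position = 0                 # agent starts at the beginning of the path

    def action(self, choice):
        # Action `choice` in {0, 1, 2, 3} means "skip choice + 1 walls".
        self.move(choice + 1)

    def move(self, k):
        self.position += k                # jump forward by k walls
        if self.position > N_WALLS - 1:   # overshot the end: back to the start
            self.position = 0

# Q-table: one row per state (remaining distance), one column per action.
q_table = np.random.uniform(low=-2, high=0, size=(N_WALLS, MAX_SKIPS))
```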

The next and final step is to train the agent. For each episode, we make the agent start from the start position and reach the end by taking jumps of at most k walls. The value of k changes randomly for each episode, between 1 and 4 (MAX_SKIPS). For example, if k = 3, the agent can jump 1, 2 or 3 walls only. This is done to make the agent robust and make it learn well in different scenarios.

For each iteration/step of an episode, the agent interacts with the environment and gathers rewards. These rewards are stored in a list named "episode_reward", which gets appended to the main total-reward list "episode_rewards" at the end of each episode. The agent takes an action based on the ɛ value: it chooses the action that has the highest Q-value for the current state with probability 1-ɛ (exploitation), and a random action with probability ɛ (exploration), as sketched below.
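As a sketch, the ɛ-greedy choice can be wrapped in a small helper (choose_action is a hypothetical name, reused in the training loop later); only the first k columns of the Q-table are considered because only k actions are allowed in that episode:

```python
import numpy as np

def choose_action(q_table, obs, k, epsilon):
    """Epsilon-greedy selection among the k actions allowed in this episode."""
    if np.random.random() > epsilon:
        return int(np.argmax(q_table[obs, :k]))  # exploit: action with the highest Q-value
    return np.random.randint(0, k)               # explore: random allowed action
```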

Here "obs" is the current state and "new_obs" is the next state, which is nothing but the remaining distance between the current position and the end. Based on the current position of the agent, the reward is given (either END_REWARD or MOVE_PENALTY).

Now, to update the Q-values, we use the Q-learning update formula:

Q(s, a) ← Q(s, a) + α * [ r + γ * max over a' of Q(s', a') − Q(s, a) ]

which can be rearranged into the simplified form used in the code:

new Q(s, a) = (1 − α) * Q(s, a) + α * [ r + γ * max_future_q ]

Here s is the current state ("obs"), a is the chosen action, r is the reward, s' is the next state ("new_obs") and α is the learning rate.

The max_future_q is grabbed after we've performed our action, and we then update our previous Q-value based partially on the best Q-value of the next state ("new_obs").

Over time, once we've reached the objective, this reward value gets slowly back-propagated through the table, one step at a time, per episode.
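Putting the pieces together, a training loop consistent with the description above could look like the sketch below; it reuses the assumed Utils class, choose_action helper and parameters from the earlier snippets, and stores the summed reward of each episode:

```python
import numpy as np

episode_rewards = []
for episode in range(EPISODES):
    env = Utils()                                # agent back at the start position
    k = np.random.randint(1, MAX_SKIPS + 1)      # this episode allows jumps of 1..k walls
    episode_reward = []
    done = False
    while not done:
        obs = N_WALLS - 1 - env.position         # state: remaining distance to the end
        action = choose_action(q_table, obs, k, epsilon)
        env.action(action)
        new_obs = N_WALLS - 1 - env.position     # next state after the jump

        if new_obs == 0:                         # landed exactly on the end position
            reward, done = END_REWARD, True
        else:
            reward = MOVE_PENALTY

        if done:
            q_table[obs, action] = reward        # terminal state: no future value
        else:
            # Q-update: blend the old value with reward + discounted best future value.
            max_future_q = np.max(q_table[new_obs])
            current_q = q_table[obs, action]
            q_table[obs, action] = (1 - LEARNING_RATE) * current_q + \
                LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        episode_reward.append(reward)

    episode_rewards.append(sum(episode_reward))
    epsilon *= EPS_DECAY                         # reduce exploration after every episode
```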

4. Results

Code snippet for plotting the moving average per 30 episodes.
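A minimal version of such a snippet, assuming matplotlib and the episode_rewards list built in the training loop above:

```python
import numpy as np
import matplotlib.pyplot as plt

SHOW_EVERY = 30  # window size for the moving average

# Moving average of the summed episode rewards over a 30-episode window.
moving_avg = np.convolve(episode_rewards, np.ones(SHOW_EVERY) / SHOW_EVERY, mode="valid")

plt.plot(range(len(moving_avg)), moving_avg)
plt.xlabel("Episode")
plt.ylabel(f"Mean reward over {SHOW_EVERY} episodes")
plt.show()
```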

For evaluation purposes, the moving average of rewards was calculated per 30 episodes, and the graph below shows the convergence over time.

The plot shows the increase in average reward with episodes.
The plot shows the increase in average reward as the epsilon value decreases.

5. Conclusion and GitHub Link for code

Dynamic Programming relies on two properties of a complex problem: optimal substructure and overlapping sub-problems. Markov Decision Processes satisfy both of these properties:

  • The Bellman Equation gives a recursive decomposition which tells us how to break the optimal value function into two pieces: the optimal behaviour for one step, and then the optimal value of the remaining steps.
  • Q-Learning stores and reuses solutions: the Q-table acts as a cache of all the useful information about the MDP, telling us the optimal reward we can get from each state onward.

Hence, we can conclude that RL can solve DP problems optimally, given a well-designed environment in the form of a Markov decision process. One thing to keep in mind is that as the number of states increases or the environment gets more complex, more computational power is needed to explore all the states and update the Q-values/Q-table.

Below is the GitHub link for the above code.

Thanks for reading. Please share, comment and clap if you like the post.

For freelancing, or to get B.Tech or M.Tech major or minor project ideas and code, contact my team at youraiprojects@gmail.com. We would love to help you out.
