How to find the shortest path using reinforcement learning

Sara Ghibaudo
Betacom
Oct 11, 2021

Introduction

Welcome to the second article of a series covering reinforcement learning (RL) and its applications. Today, we will look at a more complex application of this Artificial Intelligence technique.

Let’s get started!

Theoretical background

Before looking at a real application of reinforcement learning, it is worth recalling some theoretical definitions. If you are not interested in the technical details, you can skip this section.

This technique studies how an agent, much like a human, learns from experience through interaction with the environment. In particular, the environment receives as input the agent’s state s and the chosen action a, and it returns the reward r of that action and a new state s’. Moreover, at every step, the agent chooses an action according to the policy 𝞹: 𝞹(a|s) = P(Aₜ=a | Sₜ=s).

It is possible to identify three phases of reinforcement learning. In the first one, called prediction, we want to estimate the state-value function v(s) = E(Gₜ | Sₜ=s) and the action-value function q(s,a) = r + v(s’), where r is the reward of the action a and s’ is the arrival state. We obtain these estimates by using the Monte Carlo method, sampling experience while following the policy 𝞹. In the second phase, improvement, we want to find a policy 𝞹’ better than 𝞹. Finally, in the control phase, we are interested in the optimal state-value function v*(s) = max v(s) and in the optimal action-value function q*(s,a) = max q(s,a), where the maximum is computed over all the policies 𝞹. In the end, we can define the optimal policy as the one achieving the maximum state-value or action-value function.
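To make the improvement step concrete, here is a minimal sketch (not taken from the article) of how a greedy policy can be extracted from an action-value estimate; the function name greedy_policy and the dictionary layout are assumptions chosen for illustration.

def greedy_policy(q, actions):
    # q: dictionary mapping (state, action) pairs to estimated action-values
    # actions: dictionary mapping each state to its available actions
    # For every state, pick the action with the highest estimated value.
    return {state: max(available, key=lambda a: q[(state, a)])
            for state, available in actions.items()}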

Let’s focus on one application.

How to find the shortest path

The first application of RL that I want to show is a generalization of the maze-exit problem that we studied in the previous article (if you missed it, check it out: https://medium.com/betacom/an-introduction-to-reinforcement-learning-b749abd1f281). In particular, we are interested in finding the best path from a starting point A to an arrival point B, so we are looking for the shortest trajectory that links the two points. But, just as the maze contains obstacles in the form of walls, in this situation there are elements unknown to the agent that can affect his journey. For example, consider a drive in Italy from Turin to Naples, where the driver can encounter traffic lights or queues of cars that slow down the trip. There can also be roadworks that force him to change direction, or he can join a highway, which is a faster route. On another trip, he may go to an island, where he has to take a ferry boat and pay for it. As you can imagine, many unknown events can occur during a trip, and the best way to face them is to repeat the same task many times, learning to avoid the worst roads. So, each time the agent repeats the travel, he starts to understand which streets are busy and which have roadworks, and he replaces those parts of the trip with alternatives. In the end, he may discover that the best path is not the most “direct” one, but a longer one that avoids traffic and roadworks. Let’s take an example by looking at the following graph.

As you can see, the image shows the links between Italian cities: two-way arrows, like the one between Turin and Florence, and unidirectional arrows, such as the one between Turin and Milan, which represent one-way streets. Moreover, some of them carry special symbols: traffic lights, roadworks signs, ferry boats, and highways. Let’s see their effects. Traffic lights slow down the journey; roadworks force the agent to go back to the city he came from and change direction; highways speed up the journey; and the ferry boats, which are used to reach and leave Palermo, are expensive. Note that the one between Palermo and Turin has a red cross on it: this means that it does not run.

Before proceeding with the example, let’s introduce the definition of a graph. In this article, we use a directed graph G, defined by a set of vertices V, also called nodes (the cities in our case), and a set of oriented edges (the arrows, i.e. the roads).

Let’s enter into the details of this application.

Firstly, we define the agent as the driver and the environment as the road network. The agent’s state s is the city where he is located at each time t, and the actions are the directions in which he can move; we can identify each action by the name of the arrival city. Looking at the figure, we can see that the available actions differ from city to city. For example, if the agent is in Turin he can go to Milan, Venice, Bologne, or Florence, but he cannot select the direction that leads to Palermo or any other city. Instead, if he is in Florence he can go to Turin, Palermo, or Rome.
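For illustration, the available actions can be stored in a Python dictionary mapping each city to the cities reachable from it (in effect, the adjacency list of the directed graph G). Only the two cities whose outgoing roads are described in the text are sketched here; the full dictionary has to be read off the figure.

available_actions = {
    'Turin': ['Milan', 'Venice', 'Bologne', 'Florence'],
    'Florence': ['Turin', 'Palermo', 'Rome'],
    # ... one entry per city, following the arrows in the figure
}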

At this point, we have to define the rewards for each action. Suppose that, if the agent follows a road without obstacles, he receives a reward of -2: in this way, he is pushed to find the shortest path to his final destination. If he meets a traffic light he gets -5, since he has to wait a long time and wants to avoid them, while if he takes the highway he gets -1, because he can go faster. If there are roadworks, or if the ferry boat is not available, the environment returns a reward of -10 and the next state is set equal to the original one. So, as we already stated, the roadworks and the unavailable ferry have the same effect as the walls in the maze. Moreover, if the agent chooses to take the ferry boat, he has to pay, and so he receives a reward of -3. Eventually, when the driver reaches his destination, he gains +100.
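As a sketch, the rewards can live in a dictionary keyed by (state, action) pairs. The values below follow the rules just described, but which specific roads carry traffic lights, highways, and so on has to be read from the figure, so these entries are only illustrative.

rewards = {
    ('Turin', 'Milan'): -2,       # ordinary road
    ('Milan', 'Bologne'): -10,    # roadworks: the agent is sent back
    ('Florence', 'Palermo'): -3,  # ferry boat to Palermo: it costs money
    ('Rome', 'Naples'): 100,      # reaching the final destination
    # ... one entry per arrow in the figure
}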

In the following image, we can see the rewards for each action. Of course, the agent cannot access this information, since the environment is unknown to him. Now, suppose that he goes from Turin to Naples following the yellow path: Turin-Milan-Bologne (Attention! There are roadworks and the driver comes back to Milan)-Venice-Bologne-Florence-Palermo-Rome-Naples.

In this particular case, the agent chooses very bad actions: he encounters roadworks and a traffic light and takes the ferry twice, but he also uses a highway, so, in the end, his return is -1 - 2*3 - 3*2 - 5 - 10 + 100 = 72. It is small, so, in the next iterations, he will choose different actions to improve it. If we observe the graph, we can easily see that the best path is Turin-Florence-Rome-Naples, where the return is +94, a much higher result. So, even if there are highways that speed up the trip, the best solution is the one that does not use them, and, at the beginning, the agent has to choose an action with a lower immediate reward, avoiding the highway.
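These two returns are easy to verify with a couple of lines of Python; the breakdown of the yellow path follows the formula above, while the +94 of the best path corresponds to three ordinary roads plus the +100 for reaching Naples.

# Yellow path: one highway, three ordinary roads, two ferries,
# one traffic light, one roadworks block, plus the arrival reward.
print(-1 - 2*3 - 3*2 - 5 - 10 + 100)  # 72

# Best path Turin-Florence-Rome-Naples: three ordinary roads plus the arrival reward.
print(-2*3 + 100)  # 94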

Let’s see how to implement this example.

Before looking at the main function, we have to introduce eps_greedy_action, which selects an ε-greedy action from a given state: with probability ε, the agent picks a random action, and with probability 1-ε (provided a greedy action has already been defined for that state) he picks the greedy one.

import numpy as np

def eps_greedy_action(proposed_action, eps, actions, state):
    # With probability 1 - eps, and only if a greedy action has already been
    # proposed for this state, exploit it; otherwise explore by picking a
    # random action among those available from the current state.
    p = np.random.random()
    if p < (1 - eps) and proposed_action is not None:
        return proposed_action
    else:
        return np.random.choice(actions[state])

After that, we have to define all the states, the available actions from each of them, and their rewards. Then we can run a function called trip, which simulates one episode from Turin to Naples, collecting the chosen actions and the action-value of each (state, action) pair.
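The trip function below also expects a few bookkeeping structures to exist already: the list of sampled returns, the value estimates, and the current greedy action for each state. The article does not show this setup, so the initialization below is only a plausible sketch consistent with the code that follows.

states = ['Turin', 'Milan', 'Venice', 'Bologne', 'Florence',
          'Palermo', 'Rome', 'Naples']

proposed_action = {s: None for s in states}   # no greedy action known yet
returns = {s: [] for s in states}             # sampled returns per state
V = {s: 0 for s in states}                    # state-value estimates
Q = {(s, a): 0 for s in states                # action-value estimates
     for a in available_actions.get(s, [])}
# Note: available_actions and rewards (sketched earlier) need one entry
# for every city and every arrow in the figure.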

def trip(actions, rewards, proposed_action, eps, returns, V, Q):
    state = 'Turin'
    history = []                     # (state, action) pairs visited in this episode
    while state != 'Naples':
        action = eps_greedy_action(proposed_action[state], eps, actions, state)
        history.append((state, action))
        if rewards[(state, action)] != -10:
            new_state = action       # the move succeeds: the new state is the arrival city
        else:
            new_state = state        # roadworks or missing ferry: stay where we are
        state = new_state
    G = 0
    for (state, action) in reversed(history):
        G = G + rewards[(state, action)]
        t = history.index((state, action))
        if state not in [h[0] for h in history][:t]:
            # first-visit Monte Carlo: record the return and update the state value
            returns[state].append(G)
            V[state] = np.average(returns[state])
        if rewards[(state, action)] != -10:
            new_state = action
        else:
            new_state = state
        Q[(state, action)] = V[new_state] + rewards[(state, action)]
    return Q

In the previous function, we set the starting point to Turin and initialize an empty list called history that will contain the visited (state, action) pairs. After that, the agent enters the while loop and keeps selecting actions until he reaches Naples. Each action is chosen using the function eps_greedy_action and, at the end of the loop body, we move to the new state. Then, we compute the return G incrementally, going backwards through the episode, and append its value to the list of returns of each state on its first visit. The state-value of each visited state is obtained by averaging its returns, as shown in the pseudocode in the previous article (https://medium.com/betacom/an-introduction-to-reinforcement-learning-b749abd1f281). Finally, we define the action-value of the (state, action) pairs visited in this episode by updating the dictionary Q, which the function returns.

Once we have defined this function, we just need to repeat it many times.

for t in range(10000):
    Q = trip(available_actions, rewards, proposed_action, 0.7, returns, V, Q)
    for state in states:
        # the new greedy action of each state is the one with the largest action-value
        actions_rewards = {}
        for a in available_actions[state]:
            actions_rewards[a] = Q[(state, a)]
        proposed_action[state] = max_dict(actions_rewards)[0]

Thanks to this for loop, we repeat the trip function 10,000 times. After each iteration, we update proposed_action for every state to the action with the greatest action-value. In this way, in the next iteration, we will select either this greedy action or a random one. In this example, ε is set to 0.7, so the agent explores with probability 0.7 and takes the greedy action with probability 0.3; this value can be modified to balance exploration and exploitation, as explained in the previous article. With these hyper-parameters, the final trip could be Turin-Florence-Rome-Naples which, as already stated, is the best one. Therefore, the traveller learns to avoid roadworks and the unavailable ferry, as desired.
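One detail: the loop above relies on a helper called max_dict, which is not shown in the article. A minimal implementation consistent with its usage (returning the key with the largest value together with the value itself, so that [0] gives the best action) could be:

def max_dict(d):
    # Return the (key, value) pair with the largest value in the dictionary d.
    best_key = max(d, key=d.get)
    return best_key, d[best_key]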

Conclusion

In this second article on reinforcement learning, we have seen an application and an implementation of this method. It can be generalized to even more complex situations, like the travelling salesman problem, which we will analyze in the next article.

See you next time!
