Travelling salesman problem with reinforcement learning

Sara Ghibaudo
Betacom
Oct 25, 2021

Introduction

Welcome to the third article of the series covering reinforcement learning (RL) and its applications. Today, we will discuss another application of this Artificial Intelligence technique.

Let’s get started!

Theoretical background

Before looking at a real application of reinforcement learning, it is important to recall some theoretical definitions; if you are not interested in the technical details, you can skip this section.

This technique mimics the way humans learn from experience through interaction with the environment. In particular, at every step the environment receives as input the agent’s state s and the chosen action a, and it returns the reward r of that action and a new state s’. Moreover, the agent chooses each action according to the policy 𝞹: 𝞹(a|s) = P(Aₜ=a | Sₜ=s).

It is possible to identify three phases of reinforcement learning. In the first one, called prediction, we want to compute the state-value function v(s) = E(Gₜ | Sₜ=s) and the action-value function q(s,a) = r + v(s’), where r is the reward of the action a and s’ is the arrival state. We estimate these two quantities with the Monte Carlo method, sampling experience by following the policy 𝞹. In the second phase, improvement, we want to find a policy 𝞹’ better than 𝞹. Finally, in the control phase, we are interested in the optimal state-value function v*(s) = max v(s) and in the optimal action-value function q*(s,a) = max q(s,a), where the maximum is taken over all policies 𝞹. In the end, we can define the optimal policy as the one achieving the maximum state-value or action-value function.
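To make these quantities concrete, here is a minimal Python sketch (all names are illustrative assumptions, not part of the original code) of how prediction and improvement fit together, using the simplified relation q(s,a) = r + v(s’) adopted above and the convention, used later in this article, that an action is identified with its arrival state:

# Minimal sketch: V maps each state to its estimated value and
# rewards maps (state, action) pairs to the reward r. Both are assumed
# to already exist; the names are purely illustrative.
def action_value(state, action, V, rewards):
    # q(s, a) = r + v(s'), where the arrival state s' is identified with the action
    return rewards[(state, action)] + V[action]

def improved_policy(state, available_actions, V, rewards):
    # Policy improvement: pick the action with the highest action-value
    return max(available_actions[state],
               key=lambda a: action_value(state, a, V, rewards))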

For more technical details please check this article.

Now, let’s focus on one application.

Travelling salesman problem

In this article, we are interested in the travelling salesman problem, which is similar to the one analyzed in the previous publication. In particular, we want to find the best way to visit all of a graph’s vertices and come back to the starting one, looking for the shortest trajectory. In this application, the graph’s vertices are the cities, the agent is the salesman and the environment is everything else. During the journey, the driver has to pay attention to unknown elements that can affect his trip, such as traffic lights, highways or queues of cars. So, in order to reach the goal in the best way, he tries different pathways, so that he can understand which roads are the worst and avoid them.

Let’s take an example by looking at the following graph.

This image shows the links between Italian cities: two-way arrows, like the one between Turin and Florence, and unidirectional arrows, such as the one between Turin and Milan, which represent one-way streets. Moreover, some of them carry special symbols: traffic lights, roadworks signs, ferry boats, and highways. Let’s see their effects. Traffic lights slow down the journey, while roadworks force the agent to go back to the city he came from and change direction. The highway speeds up the journey, and the ferry boats, which are used to leave from and arrive in Palermo, are expensive. Note that the ferry between Palermo and Turin has a red cross on it, which means that it does not leave.

Now, let’s suppose that the salesman starts from Turin: he aims to visit all the cities and then come back to it. We can employ different methods to solve this problem, and one of them is reinforcement learning. To apply it, we have to define some quantities. First, we take the agent to be the salesman and the environment to be the Italian roads. The agent’s state s is the city where he is located at time t, and the actions are the directions in which he can move; we can denote each action with the name of the arrival city. Looking at the figure, we can see that the available actions differ from city to city. For example, if the agent is in Turin he can go to Milan, Venice, Bologna, or Florence, but he cannot select the direction that leads directly to Palermo or any other city. A possible encoding of states and actions is sketched below.
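As a concrete but purely illustrative sketch, the states and the available actions could be stored in plain Python structures. Only the connections explicitly mentioned in the text are listed here; the full dictionary would contain one entry per arrow in the figure.

# Hypothetical, partial encoding of the graph in the figure.
# Turin is listed first because it is the starting city.
states = ['Turin', 'Milan', 'Venice', 'Bologna', 'Florence',
          'Rome', 'Naples', 'Palermo']

available_actions = {
    'Turin':    ['Milan', 'Venice', 'Bologna', 'Florence'],
    'Florence': ['Turin', 'Rome'],     # two-way arrow with Turin
    'Venice':   ['Bologna', 'Rome'],
    # ... and so on, one list per city, following the arrows in the figure
}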

At this point, we have to define the reward of each action. Suppose that, if the agent follows a road without obstacles, he receives a reward of -2: in this way, he is pushed to find the shortest path back to the starting point. If he meets a traffic light, he gets -5, since he has to wait a long time and wants to avoid them, while if he takes the highway he gets -1, because he can go faster. If there are roadworks, or if the ferry boat is not available, the environment returns a reward of -10 and the next state is set equal to the current one; so roadworks and the non-leaving ferry have the same effect as the walls in the maze introduced in the first article on reinforcement learning. Moreover, if the agent chooses to take the ferry boat he has to pay, and so he receives a reward of -3. Finally, the reward given when he comes back to the first city is +200. A possible encoding of these rewards is sketched below.
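In the same spirit, the rewards could live in a dictionary keyed by (state, action) pairs, which is exactly how the function trip below looks them up. This is again a hypothetical, partial sketch (which arrow carries which symbol is only fully visible in the figure); the final +200 is handled separately by trip, as we will see.

# Hypothetical, partial reward table, keyed by (state, action):
# -2 plain road, -5 traffic light, -1 highway, -3 ferry,
# -10 roadworks or non-leaving ferry (the agent is sent back).
rewards = {
    ('Turin', 'Milan'):    -1,    # highway (assumed for illustration)
    ('Milan', 'Venice'):   -2,    # plain road (assumed for illustration)
    ('Venice', 'Rome'):    -10,   # roadworks: the agent is sent back to Venice
    ('Naples', 'Palermo'): -3,    # ferry to Palermo
    ('Palermo', 'Turin'):  -10,   # ferry not leaving: back to Palermo
    # ... one entry per arrow in the figure
}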

This problem is similar to the one analyzed in the last article, where we wanted to select the best path from a starting city to a final one. You might think of solving the current problem by joining two such solutions, but that is not the right thing to do. Indeed, if, for example, the agent goes from Turin to Naples through Florence and Rome and comes back along the same path, he does not visit all the cities before returning to the source. This problem therefore differs both from the previous one and from the maze: the episode does not end when the salesman arrives in a particular state, because he has to visit all the other cities before concluding it.

In the image below, we can see one possible solution highlighted in red.

In this case, the route is Turin-Milan-Venice-Rome (attention! there are roadworks and the salesman is sent back to Venice)-Venice-Bologna-Rome-Naples-Palermo-Florence-Turin. So the return is -1 -2*5 -3*2 -10 +200 = 173, which is quite a good result. We can imagine that the agent arrives at this solution after trying different ones. For example, suppose that in a past episode he reached Palermo following the same path but, instead of going directly to Florence, he chose the direction of Turin. There the ferry boat does not leave, so he loses 10 and is sent back to Palermo; he then changes direction and goes first to Florence and then to Turin, so the return is 163, a worse one. Another situation could be the following: from Palermo he follows the path Rome-Florence-Turin, so the return is 171, very similar to the previous 173 but smaller, hence worse. Moreover, Rome has already been visited, so he does not need to return to the capital.
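As a quick sanity check on this sum, we can reproduce the same accounting (one highway, five plain roads, two ferry crossings, one roadworks penalty, plus the final +200) in a couple of lines:

# Rewards collected along the highlighted route, grouped by type.
route_rewards = [-1] + [-2] * 5 + [-3] * 2 + [-10] + [200]
print(sum(route_rewards))   # 173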

Let’s look at the implementation of this example.

Before going through the details, we have to introduce the function eps_greedy_action, which implements ε-greedy action selection for a given state. With probability 1-ε, and if a greedy action has already been defined for that state, it returns the greedy action; otherwise, it returns a random one among the available actions.

import numpy as np

def eps_greedy_action(proposed_action, eps, actions, state):
    # With probability 1 - eps (and if a greedy action exists for this state),
    # return the greedy action; otherwise pick a random available action.
    p = np.random.random()
    if p < (1 - eps) and proposed_action is not None:
        return proposed_action
    else:
        return np.random.choice(actions[state])

At this point, we define all the possible states, the actions available from each of them, and their rewards, along the lines sketched above. After that, we define the function trip, which samples an episode and updates the state-value and action-value estimates of the visited states and actions.

def trip(states, available_actions, rewards, proposed_action,
         eps, returns, V, Q):
    start = 'Turin'
    end = 'Turin'
    state = start
    history = []                       # visited (state, action) pairs
    non_visited_cities = states[1:]    # every city except the starting one
    # Sample one episode: keep moving until all cities have been visited
    # and the salesman is back in Turin.
    while state != end or non_visited_cities != []:
        action = eps_greedy_action(proposed_action[state], eps,
                                   available_actions, state)
        history.append((state, action))
        if rewards[(state, action)] != -10:
            new_state = action         # actions are named after the arrival city
        else:
            new_state = state          # roadworks / non-leaving ferry: sent back
        state = new_state
        if new_state in non_visited_cities:
            non_visited_cities.remove(new_state)
    # Walk the episode backwards, accumulating the return G
    # (the final +200 for coming back to Turin is already included).
    G = 200
    for (state, action) in list(reversed(history))[1:]:
        G = G + rewards[(state, action)]
        t = history.index((state, action))
        if state not in [h[0] for h in history][:t]:   # first-visit check
            returns[state].append(G)
            V[state] = np.average(returns[state])
        Q[(state, action)] = V[action if rewards[(state, action)] != -10
                               else state] + rewards[(state, action)]
    return Q

This function sets both the starting and the ending point to Turin and initializes an empty list called history that will contain the visited (state, action) pairs. Moreover, it creates a list called non_visited_cities that contains all the cities except Turin, since that is the first one visited. After that, the agent enters the while loop and keeps selecting actions until he has visited all the cities and returned to Turin; in this way, we build an episode. The actions are chosen using the function eps_greedy_action and, at each step of the loop, trip determines the new state. Then, it computes the return G, initialized to 200 (the reward the agent gets when he returns to Turin), and appends the incremental value of G to the returns of each visited state. It computes the state-value of each of them by averaging its returns, as shown in the pseudocode in the first article. Finally, it defines the action-value of the state-action pairs visited in this episode by updating the dictionary Q, which is returned.

Once we have defined this function, we have to iterate it many times.
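Before the iterations start, the learning structures passed to trip have to be initialized. A minimal sketch, assuming the states, available_actions and rewards structures sketched earlier (in their complete form):

# Hypothetical initialization of the structures used during learning.
proposed_action = {s: None for s in states}        # no greedy action yet
returns = {s: [] for s in states}                  # sampled returns per state
V = {s: 0 for s in states}                         # state-value estimates
Q = {(s, a): 0 for s in states                     # action-value estimates
     for a in available_actions[s]}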

for t in range(10000):
    Q = trip(states, available_actions, rewards, proposed_action, 0.7,
             returns, V, Q)
    # Policy improvement: for every city, make the action with the highest
    # action-value the new greedy (proposed) action.
    for state in states:
        actions_rewards = {}
        for a in available_actions[state]:
            actions_rewards[a] = Q[(state, a)]
        proposed_action[state] = max_dict(actions_rewards)[0]

Thanks to this for loop, we repeat the function trip 10,000 times. At each iteration, the loop updates proposed_action for every state, setting it to the action with the greatest action-value. In this way, at the next iteration, the agent will select this action (the greedy one) or a random one, following eps_greedy_action. In this example, ε is set to 0.7, so the greedy action is chosen with probability 0.3; this value can be modified to balance exploration and exploitation.
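The helper max_dict comes from the first article of the series and is not repeated here; a minimal sketch of what it is expected to do, namely return the key with the largest value together with that value, could be:

def max_dict(d):
    # Return (key, value) for the entry of d with the largest value.
    best_key = max(d, key=d.get)
    return best_key, d[best_key]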

In contrast with the first example, it is difficult to identify the best path by hand, but we can expect that, by increasing the number of iterations, the chosen trip will improve. In the end, the salesman learns the best path by avoiding roadworks and non-leaving ferries and by visiting each city just once, finding the fastest route that passes through all of them.
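Once the loop has finished, we can read the learned route directly from proposed_action by following the greedy action from Turin until every city has been visited and the salesman is back at the start. A minimal sketch, with a safety cap in case the learned policy still contains loops:

# Follow the learned greedy policy from Turin and print the resulting route.
state = 'Turin'
route = [state]
to_visit = set(states[1:])
for _ in range(100):                               # safety cap on the number of steps
    if not to_visit and state == 'Turin' and len(route) > 1:
        break
    action = proposed_action[state]
    # Roadworks and non-leaving ferries (-10) send the agent back, as in trip.
    state = action if rewards[(state, action)] != -10 else state
    route.append(state)
    to_visit.discard(state)
print(' -> '.join(route))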

Conclusion

In this third article on reinforcement learning, we discussed one of the many interesting uses of this method. Indeed, in the literature there are a lot of other applications, for example in games and in industrial production planning. It is therefore a very powerful tool but, of course, there are situations in which it is not the optimal solution: for example, when we know the underlying model and there are few available actions, reinforcement learning is not a good choice.

See you next time!
