Shortest Path and Travelling Salesman Problems in Optimization perspective

Dr. Tri Basuki Kurniawan

Published in

TheLorry Data, Tech & Product

10 min readApr 2, 2021

An optimization exploration and journey using python programming

What is the Shortest Path Problem?

Shortest Path Problem (SPP) is classical problems in combinatorial optimization with various theory and practice applications. Given a directed graph G=(V, E) with node-set V of cardinality n, edge set E of cardinality m and costs c(e) ∈ R for all edges e ∈ E, the single-source single-sink version of the sum shortest path problem finds a path from the source s to sink t, (s, t) which minimizes the sum of the edge costs.

In graph theory, the shortest path problem is finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized.

Source: https://en.wikipedia.org/wiki/Shortest_path_problem#:~:text=In%20graph%20theory%2C%20the%20shortest,its%20constituent%20edges%20is%20minimized.

For example, the shortest path from A to F for the weighted directed graph shows on the left is A-C-E-D-F which has the most minimal cost.
We can see here, the path A-B-D-F has cost 25 andA-B-C-E-D-F has cost 27, while the path A-C-E-D-F has only cost 20.

What is different with the Travelling Salesman Problem?

Another similar problem is the Travelling Salesman Problem (TSP). It solves the problem when a salesperson needs to visit each node (or city) only once and find the minimum cost path. TSP share a similar problem with SPP, but in TSP, all nodes or cities should be visited exactly once, whereas, in SPP, we can choose only the necessary node, not all nodes.

In TheLorry, as a logistic company, we need to send the parcel to our customers. After we translate our customer‘s address into geolocation and we do the clustering (the article describe the process, can be read from here), we continue with the process to find the shortest path (actually, we use TSP, since we need to deliver every parcel to all customers) for each driver.

Optimization Perspective

SPP and TSP can be considered a nondeterministic polynomial (NP) hard problem and combinatorial problem in an optimisation perspective. There are many possible solutions in a search space for that problem. For example, from the graph we use previously, there are three (3) possible solutions (since we use a directed graph).

Another example is if we assume to have a graph consisting of 5 nodes, as collection {A, B, C, D, E}. We need to find the shortest path from every node and end to every node as well, but not travel to itself, and our graph is undirected (means we can travel from each node to another node). We will have many possible solutions as a subset from a parent collection. It can be calculated using a factorial function (without using repetition):

n! = 5! = 5 x 4 x 3 x 2 x 1 = 120

There are 120 possible solutions.
Imagine if we have a graph with 10 nodes.

10! = 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 3,638,800

More than 3.6 millions possible. How if we have a graph with 20 nodes.

20! = 20 x 19 x ... x 5 x 4 x 3 x 2 x 1 = 2,432,902,008,177,000,000

How, if we have a graph of more than 20? It is possible if we check every possible solution by manual? It is impossible.
How about checking using a computer? It is still almost impossible or needs too much time to solve and find the best solution.

So, how optimization doing the job? The optimization is not to check all possible solutions; instead, only try to test or check a few good solutions and try to find other better solutions. This process is continued until some stopping criteria are met.

Another aspect is, some optimization algorithm has a mechanism called exploration and exploitation.
In the beginning process, the algorithm is trying to find many search space areas to find a good enough solution (called exploration). Then the algorithm will exploit the current solution more to find a better solution. This mechanism will be successfully applied if the algorithm consists of multi-agents that collaborate and work together.

Let’s show the code

You can download the example data which we use in this tutorial from here.

To begin our journey, please open your new Jupiter notebook and start type this code:

# TSP processingimport pandas as pd
from haversine import haversine, Unitfrom ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcpimport time
import warnings
warnings.filterwarnings('ignore')

First, we need to import pandas since we use dataframe intensively in our program. Then we import the haversine module to calculate the distance between two points. In the next module, we import OR-Tools as our TSP solver (please refer to this link to get more information about OR-Tools for TSP in python).
In the other modules, we need to import time and warning and then set all warning to ‘ignore’.

Ok, next we continue with our code. We need to define two functions to use OR-Tools,

# need to create data model first
def create_data_model(df):
    """Stores the data for the problem."""
    data = {}
    depot_lat = 2.971718
    depot_lng = 101.608376    _data = df[['lat', 'lng']]
    
    _data.loc[-1] = [depot_lng, depot_lat]  # adding a row
    _data.index = _data.index + 1           # shifting index
    _data.sort_index(inplace=True)    results = []
    for i in range(len(_data)):
        lat = _data.iloc[i]['lat']
        lng = _data.iloc[i]['lng']
        from_node = (lat, lng)
        result = []
        for j in range(len(_data)):
            lat = _data.iloc[j]['lat']
            lng = _data.iloc[j]['lng']
            to_node = (lat, lng)
            if j == 0:
                distance = 0
            else:
                distance = round(haversine(from_node, to_node, \ 
                    unit=Unit.METERS))
            result.append(distance)
        results.append(result)
    
    data['distance_matrix'] = results
    data['num_vehicles'] = 1
    data['depot'] = 0
    return data

In this process, we create our data model. First, we add information about the depot, that the route will be started, and put it on the first location position. In the second step, we create a distance matrix consists distance for each point to another point, and we save that variable into data['distance_matrix] = results. We use haversinefunction from haversine’s module.
Another information was we need to define is num_vehicles and depot We set to 1 and 0, respectively, which means we only used one (1) vehicle for our route and started from the first data as the depot.

Next, we type the Running_TSP function, as shown below:

# running TSP process
def Running_TSP(_df, cluster_number):
    start_time = time.time()
    
    """Entry point of the program."""
    # Instantiate the data problem.
    data = create_data_model(_df)    # Create the routing index manager.
    manager = pywrapcp.RoutingIndexManager(len( \
        data['distance_matrix']), data['num_vehicles'], \
        data['depot'])    # Create Routing Model.
    routing = pywrapcp.RoutingModel(manager)    def distance_callback(from_index, to_index):
        """Returns the distance between the two nodes."""
        # Convert from routing variable Index to distance matrix \  
             NodeIndex.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['distance_matrix'][from_node][to_node]    transit_callback_index = routing.RegisterTransitCallback( \   
        distance_callback)    # Define cost of each arc.
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)    # Setting first solution heuristic.
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)    # Solve the problem.
    solution = routing.SolveWithParameters(search_parameters)# Print solution on console.
    if solution:
        """Prints solution on console."""
        index = routing.Start(0)
        plan_output = 'Route for vehicle {}:\n'.format(cluster_ \  
            number)
        route_distance = 0
        idx = 0
        _cluster = []
        _order = [0] * len(_df)
    
        while not routing.IsEnd(index):
            node_index = manager.IndexToNode(index)
            plan_output += ' {} ->'.format(node_index)
            previous_index = index
            index = solution.Value(routing.NextVar(index))
            route_distance += routing.GetArcCostForVehicle( \ 
                previous_index, index, 0)
            
            _cluster.append(cluster_number)
             if node_index > 0:
                _order[node_index - 1] = idx
            idx += 1
            
        plan_output += " {}\n".format(manager.IndexToNode(index))
        plan_output += "Route distance: {} km\n".format( \ 
            route_distance/1000)
        plan_output += "Running_TSP --- {} seconds ---\n".format( \  
            (time.time() - start_time))
        print(plan_output)
        
        _df['order'] = _order
        return route_distance

In this function, first, we accepted two (2) parameters, named _df and cluster_number as our processed data and for the identity of our clustering, respectively. As optional, we took note to calculate the time we needed for each TSP processing time.

Next, we call create_data_model function to create the model data, and we save it into data.

data = create_data_model(_df)

After that, we need to create the routing index manager and create a routing model by using this statement:

    # Create the routing index manager.
    manager = pywrapcp.RoutingIndexManager(len( \
        data['distance_matrix']), data['num_vehicles'], \
        data['depot'])    # Create Routing Model.
    routing = pywrapcp.RoutingModel(manager)

By create a manager, we used data model as parameters, and then we create our routing model called routing. Later, we register a transit callback function, distance_callbak function into our routing object, and then define each arc's cost in our problem.

The next step is to set our algorithm to use the first solution strategy, which is PATH_CHEAPEST_ARC constant, which means just getting the minimum cost found in the first process.
The last process was running the algorithm and try to get the solution.

    # Setting first solution heuristic.
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)    # Solve the problem.
    solution = routing.SolveWithParameters(search_parameters)

The rest of the code checks if the solution is successfully obtained by our algorithm and then prints it. We construct the output by adding route node by node, and put the route of order into a _order variable and then put it back into our _df.

_df['order'] = _order

Then, we print the cost (or distance) for that route in km and the processing time for this process, as well.

Ok, we continue our code with the main process, consist of reading our data from csv file and then do pre-processing before we call running_TSP function, as shown below:

# -------------------------------------------------------
#
### This code to show the TSP from ORTools can be use ###
#
# -------------------------------------------------------file_url = 'dataset/geolocation_result.csv'
_file = file_url.strip(".csv")
file_result = _file + '_tsp.csv'
data = pd.read_csv(file_url)_job = data.groupby('cluster')['no'].count().sort_values( \ 
    ascending=[False])
print(_job)pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', -1)  # or 199total_distance = 0
start_time_tsp = time.time()
for nc in range(len(_job)):
    _clusterid = _job.index[nc]
    
    _filter = (data['cluster'] == _clusterid)    
    _dataCluster = data.loc[_filter]
    
    route_distance = Running_TSP(_dataCluster, _clusterid)
    total_distance += route_distance / 1000
    
    for i in range(len(_dataCluster)):
        idx = _dataCluster.index[i]
        value = _dataCluster.iloc[i]
        data.loc[idx, 'order'] = int(value['order'])print("All TSP--- %s seconds ---" % (time.time() - start_time_tsp))
print("Total Distance: %s km" % total_distance)data['order'] = data['order'].astype(int)
data.to_csv(file_result)

In the code above, first, we read our data from geolocation_result.csv in datasetfolder. We put that reading data into data variable. We also define a file name to save our result later.

Later, we group by our data based on cluster field, and we count our no field for each cluster, then we print it.

_job = data.groupby('cluster')['no'].count().sort_values( \ 
    ascending=[False])
print(_job)

The result is shown below:

Later, for each clustering, we filter that data only for the current clustering number (clusterid) and then we save it into _dataCluster variable.

After that, we call running_TSP function and passing _dataCluster and clusterid variable as the parameters. Then the function will return route_distance value as a total distance of TSP result. We sum all route_distance and convert it into km and save it into total_distance variable.

Later, we need to convert _dataCluster order position into dataframe data and convert its value into an integer.

In the last step, we print all TSP processing time and total distance for all TSP route. Finally, we save the route order of TSP into a new csv result file.

For the last code, we need to visualize our TSP result in the map. Please type the code as shown below:

import folium
from folium.features import DivIconcolors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred', 'lightred', \
     'beige', 'darkblue', 'darkgreen', 'cadetblue', 'darkpurple', \
     'pink', 'lightblue', 'lightgreen', 'gray', 'black', 'lightgray', 'red', 'blue', 'green', 'purple', 'orange', 'darkred', 'lightred', \
     'beige', 'darkblue', 'darkgreen', 'cadetblue', 'darkpurple', \
     'pink', 'lightblue', 'lightgreen', 'gray', 'black', 'lightgray' ]lat = data.iloc[10]['lat']
lng = data.iloc[10]['lng']
map = folium.Map(location=[lng, lat], zoom_start=14)for _, row in data.iterrows():
    folium.CircleMarker(location=[row["lng"], row["lat"]], 
        radius=12, weight=2, fill=True, fill_color=colors[
        int(row["cluster"])], \
        color=colors[int(row["cluster"])]).add_to(map)
    
    folium.Marker(location=[row["lng"], row["lat"]], 
        icon=DivIcon(icon_size=(150,36), icon_anchor=(5,10),
        html='<div style="font-size: 10pt; color : {}">{}</div>' \
        .format(colors[int(row['cluster'])], int(row['order'])))).
        add_to(map)map

First, we import the folium module and DivIcon feature to use later in our code. Then, we need to define some different colour for each cluster.
To put the map in the initial position, we should retrieve a position from our data, first and then we set the map object with that location. We set zoom_start parameter to 14.

Later, we set the marker in the map for each location data from our dataframe data and set the colour based on their clustering number. In the second statement, we draw DivIcon to numbering the marker with their order number.

The result of our code is shown here.

Summary

In this tutorial, we explore how to deal with geolocation data and then we do the TSP for each cluster to find the shortest path for each route in that cluster.

We show the code to do the pre-processing, and then step by step, we give a tutorial on how to use the OR-Tools module to find the shortest path. Later, we show the code to save the result into csv and then we show code to visualize the route result into the map.

We hope this tutorial explains how to deal with geolocation data and do the TSP to find the shortest path in the route optimization problem.