Programmatically finding distances between ports using Pub. 151 and A*

Published in

Qwyk TechIntel

5 min readJun 1, 2018

At Qwyk we aim to provide customers with all relevant data around shipping cargo across the globe. One of these data points is the Carbon footprint of a given shipment or container, and to calculate the footprint, we need distance information.

While airplanes take a ‘relatively’ direct path (they follow airways but a flight from Amsterdam to Hong Kong’s path is only about 6% longer than the great circle distance (GCD) between those two points,) ocean going vessels take paths significantly longer than the GCD, even when going direct between two ports, because of the way our continents and shorelines are arranged.

Using Pub 151. and A* we’ve come up with an implementation that allows us to calculate a relatively accurate distance between two ports.

Pub. 151 and A*

Pub. 151 Junctions and Ports reachable from Aberdeen, Scotland

Pub. 151, or ‘Distances between Ports’ is a Maritime Safety Information publication by the United States National Geospatial-Intelligence Agency providing tables that allow for the calculation of distance in International Nautical Miles between two significant ports. It uses 25 so-called Junctions, as well as a table of ports/junctions with each of their neighboring junctions and ports, including the distance to each of these. Using this information, one can use manual path-finding to find the route between two ports over a number of junctions and calculate the distance.

The A* search algorithm is a pathfinding/graph traversal algorithm used to plot an efficient direct path between points (nodes.)

Readying the Pub. 151 Data

While Pub. 151 is a PDF document, we were lucky to find a github repo that had already converted much of this data into a JSON format. Unfortunately, in the process it seemed that consistency between naming of the junctions, the header ports and the Junction Points and Ports of these had gone somewhat lost, and we had to deal with destinations that had a condition, such as:

Istanbul, Turkey (south of Greece), 1,393
Istanbul, Turkey (via Corinth Canal), 1,376

In these cases we elected to always take the shortest distance and drop the longer one(s).

After cleaning up the data, and loading it into MongoDB we ended up with two collections looking roughly like this:

pub151_nodes, which holds information about each node

{
 “id” : “1007”,
 “key” : “aberdeenscotland”,
 “name” : “Aberdeen, Scotland”,
 “location” : {
    “type” : “Point”,
    “coordinates” : [
       -2.075,
       57.141666666666666
    ]
 },
 “locode” : “GBABD”
}In this collection we also mapped the UN Locode of each port, since we’ll be using those for our lookups.

pub151_edges, which holds all nodes, its neighbours, and its destinations, essentially a collection of edges for each node

{
 “node_id” : “1007”,
 “neighbours” : {
    “7” : 715,
    “10” : 406,
    “12” : 106,
    “19” : 416
 },
 “destinations” : {
    “1042” : 376,
    “1056” : 465,
    “1158” : 309,
    “1172” : 767,
    “1187” : 435,
    “1373” : 277,
    “1379” : 401,
    “1385” : 380,
    […]This is the same data for Aberdeen as in the first image in this post, however encoded with Node IDs for the neighbours and destinations. The keys represent the ID of each neighbouring node, and the value their respective distance from the current node.

The A* implementation (using Python 3.5)

To implement the A* algorithm to search through the data we prepared, I used the astar package available on PyPi as well as the haversine package. While we could’ve written the implementation of A* and the haversine function ourselves, those packages are there to be used.

In order to implement the astar package, one inherits from the AStar class provided and overrides 4 functions: neighbors(node), distance_between(node1,node2), heuristic_cost_estimate(current_node, goal_node) and is_goal_reached(current_node, goal_node)

The neighbors function is very straightforward, we use the node_id to get the right data from our edges collection and return the neighbours dictionary (the junctions)

def neighbors(self, node):
   edge = [n for n in self.edges if n[‘node_id’] == node][0]
   return edge[‘neighbours’] if ‘neighbours’ in edge else []

(These examples will omit pretty much all error handling, collection checks, etc. for clarity)

Similarly, the distance_between function allows us to find the edge using the first_node, and return the distance from the neighbors dictionary using the second_node id. This distance is the actual weight for the path, meaning it will optimize for the shortest overall distance and not just find any random path between the start- and end-node. (From my admittedly limited understanding the main difference between A* and Dijkstra’s shortest weighted path algo is that Dijkstra’s doesn’t implement a heuristic cost function, which we’ll talk about next)

def distance_between(self, n1, n2):
    n1 = [n for n in self.edges if n[‘node_id’] == n1][0]
    return n1[‘neighbours’][n2]

The heuristic cost estimate and goal reached are a little bit more complex.

A* uses the heuristics cost estimate to weigh the next path it searches using that heuristic, which we can customize. Just like we, when trying to find the route from Aberdeen to Dubai, mentally assert that we should probably go via Port Said and take the Suez Canal, and not the long way around using the Panama Canal or around South Africa via the Cape of Good Hope, we can tell A* which next node is likely to get it to the destination first. We do this by simply calculating the distance between each of the neighbors and goal node, hence the need for the Haversine function. While this isn’t perfect, it makes the algorithm directional along the direct great circle distance, longer distance equals a higher cost etc.

def heuristic_cost_estimate(self, current, goal):   if current in self.stored_heuristics:   
      return self.stored_heuristics[current]   current = [n for n in self.nodes if n[‘id’] == current][0]   if current[‘location’] is None:
      return Infinite   current_coords = current[‘location’][‘coordinates’]   distance = haversine(
      self.destination_node[‘location’][‘coordinates’], 
      current_coords
   )   self.stored_heuristics[current] = distance   return distance

We store the distance between each node and the goal in a dictionary so we don’t need to evaluate it each time: just once, and after that can look it up.

The goal reached function is less work, for each node, we simply evaluate whether the goal exists in the nodes list of destinations, meaning that our goal node can be reached directly from the current junction node:

def is_goal_reached(self, current, goal):   current_edge = [e for e in self.edges 
                   if e[‘node_id’] == current][0]   return goal in current_edge[‘destinations’]

Output

And as the result, we run the graph and are returned with an array of Node ID’s that we can use to resolve those nodes and their distances.

>>> db = MongoClient(‘********’).masterdata
>>> origin = “GBABD”
>>> destination = “PHCEB”
>>> graph = Graph(origin, destination, db)
>>> graph.find_route()
[‘1007’, ‘7’, ‘20’, ‘13’, ‘18’, ‘1270’]

Final output of a A* search between Aberdeen and Cebu

Missing Stuff

The Pub. 151 has some edge cases which don’t work along the normal path finding, For example, for anything departing out of Bremen, the documentation says to simply add 34nm to Bremerhaven distances. We still need to encode this and apply in the path finding.