Tackling the Traveling Salesman Problem with Graph Neural Networks

Michael Atkin
Stanford CS224W GraphML Tutorials
12 min read · May 15, 2023

By Senem Isik and Michael Atkin as part of the Stanford CS224W course project.

The Traveling Salesman Problem is a classic problem in computer science with a variety of real-world applications in logistics and genetics. In this article, we show how GNNs can be used to tackle the problem.

Have you ever dreamed of embarking on a road trip across the United States? Picture yourself cruising down Route 66, exploring national parks, and stopping at roadside attractions along the way. But with so many incredible destinations to choose from, planning a road trip can feel overwhelming. How can you see as much as possible without backtracking or zigzagging between destinations? This scenario resembles the famous traveling salesman problem that has been a thorn in the side of computer scientists for decades.

A TSP tour that visits a city in each of the 48 contiguous U.S. states.

Enter graph neural networks, a cutting-edge machine learning technique that has shown promise in tackling this very challenge. In this article, we’ll explore how this technology can help you optimize your route so you can see all the sights you’ve dreamed of.

Here’s an overview of what’s in this post:

  • Description of the Traveling Salesman Problem
  • Overview of graph neural networks
  • Application of graph neural networks to the Traveling Salesman Problem

Here’s a link to our code implementation:

Traveling Salesman Problem

The traveling salesman problem (TSP) is a classic problem in computer science that involves finding the shortest possible route that a salesman can take to visit a given set of cities and return to the starting point. The problem is often illustrated as a set of points on a map representing cities, with the distances between each pair of cities known. We are looking for routes that visit each city exactly once and return to the starting point — we call these routes “tours”. The objective is to find the shortest possible tour.
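To make the objective concrete, here is a tiny brute-force reference implementation in Python (for illustration only: it enumerates all (n-1)! tours, so it is only usable for a handful of cities; the function names are ours):

import itertools
import math

def tour_length(tour, coords):
    # total length of a closed tour (returns to the starting city)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def brute_force_tsp(coords):
    # try every ordering of the remaining cities and keep the shortest closed tour
    n = len(coords)
    best = min(itertools.permutations(range(1, n)),
               key=lambda perm: tour_length((0,) + perm, coords))
    return (0,) + best

For example, brute_force_tsp([(0, 0), (0, 1), (1, 1), (1, 0)]) returns the perimeter tour (0, 1, 2, 3).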

The TSP has a variety of real-world applications. For example, a delivery driver may need to visit a set of locations to drop off packages and want to find the most efficient route to minimize travel time and fuel costs [1]. Similarly, a salesperson may need to visit a set of potential clients and want to minimize travel time while maximizing the number of visits per day. The TSP is also relevant in genetics, where it is used to analyze DNA sequencing data and determine the most likely order in which genes occur.

Developing new methods for solving the TSP is essential, as it remains intractable for large input sizes. The TSP is NP-hard: no polynomial-time algorithm for it is known, and none exists unless P = NP. In practice, solving instances to proven optimality becomes extremely expensive once they grow beyond a few thousand nodes. State-of-the-art TSP solvers combine linear programming with handcrafted heuristics to find approximate solutions, but even their execution times become prohibitively long for problem instances with more than tens of thousands of nodes [2].

Source: XKCD

Graph Neural Networks

Graph neural networks (GNNs) are a type of neural network that operates on graphs. GNNs can be viewed as a generalization of traditional neural networks: traditional architectures operate on inputs with a fixed, regular structure, which can be seen as special kinds of graphs, whereas GNNs can operate on arbitrary graphs. To illustrate this, consider your favorite language models, such as ChatGPT or PaLM. These models take as input a sequence of tokens — in other words, a path graph! Similarly, vision models take as input a grid of pixels — a grid graph! GNNs can operate on all of these kinds of graphs, and more.

Left: path graph. Middle: grid graph. Right: unstructured graph. A GNN can handle all 3 of these! (source)

GNNs are a good fit for our problem because the TSP is naturally represented as a graph. We treat the cities as nodes and connect every pair of them, giving a fully-connected graph in which each edge carries a single positive-valued feature: the distance between the two cities it connects. We can then pass this graph representation of the problem directly into the GNN as input.
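As a sketch of what this looks like in PyG (assuming 2D Euclidean coordinates; the helper name is ours):

import torch
from torch_geometric.data import Data

def tsp_instance_to_graph(coords: torch.Tensor) -> Data:
    # coords: [n, 2] tensor of city coordinates
    n = coords.size(0)
    # fully-connected directed edge index, excluding self-loops
    row, col = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    mask = row != col
    edge_index = torch.stack([row[mask], col[mask]], dim=0)
    # one positive scalar feature per edge: the Euclidean distance
    dist = (coords[edge_index[0]] - coords[edge_index[1]]).norm(dim=-1, keepdim=True)
    return Data(x=coords, edge_index=edge_index, edge_attr=dist)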

Solvers for other NP-hard problems have seen efficiency and accuracy gains from incorporating neural methods [3] [4], so there is clear potential for deep learning to improve on existing solvers. Discovering a scalable solution for the TSP on large graphs would be revolutionary for fields such as logistics and genetics.

Lastly, it has been shown that GNNs can learn to mimic dynamic programming [5]. Dynamic programming is the backbone of solvers for combinatorial optimization problems such as the TSP, so GNNs are a natural fit for the problem area.

GNN layers

Let’s go into a bit more detail about how a GNN works. At a high level, a GNN takes a graph as input, which is represented as a set of nodes and edges. The goal of the GNN is to learn a function that maps each node to a vector embedding that captures its structural and relational information.

Source: CS224W, Lecture 3

Each GNN layer consists of the following steps:

  • Message computation: we apply a linear transformation to the embedding of each node. These are the “messages” each node sends to its neighbors.
  • Aggregation: each node receives messages from each of its neighbors. These messages are combined via a permutation-invariant function such as mean, max, or sum.
  • Update: The update operation takes the aggregated message and updates the node representation. Many models will incorporate an additional linear transformation and a residual connection at this stage.

Most GNNs will stack multiple GNN layers. When this is the case, the embedding produced by the previous layer is passed in as input to the next layer. This allows node information to propagate multiple hops across the graph, rather than just among neighboring nodes.

Visualization of stacked GNN layers. (source)
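To make the three steps concrete, here is a minimal sketch of a custom layer built on PyG’s MessagePassing base class (the layer name and linear transformations are our own illustrative choices):

import torch
from torch_geometric.nn import MessagePassing

class SimpleGNNLayer(MessagePassing):
    """A minimal layer illustrating the message / aggregation / update steps."""
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr="sum")                     # aggregation: sum over neighbors
        self.msg_lin = torch.nn.Linear(in_dim, out_dim)  # message computation
        self.upd_lin = torch.nn.Linear(in_dim, out_dim)  # transformation of the node itself

    def message(self, x_j):
        # x_j holds the embeddings of each node's neighbors
        return self.msg_lin(x_j)

    def forward(self, x, edge_index):
        aggregated = self.propagate(edge_index, x=x)     # runs message, then aggregation
        return aggregated + self.upd_lin(x)              # update: combine with the node's own features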

This basic structure captures the key insight behind GNNs — the representation of a node should be influenced by the representations of its neighboring nodes. However, the particulars of each step (message, aggregation, update) are up to you! There is a wide variety of architectures, all of which use different message, aggregation, and update functions. Some of the most popular are GCN, GraphSAGE, and GAT. All of these architectures, and many more, are implemented in PyG, so feel free to experiment!
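For instance, these built-in layers can be instantiated in a few lines (the dimensions here are placeholders):

from torch_geometric.nn import GCNConv, SAGEConv, GATConv

gcn = GCNConv(in_channels=2, out_channels=64)
sage = SAGEConv(in_channels=2, out_channels=64)
gat = GATConv(in_channels=2, out_channels=64, heads=4)
# each is applied the same way: new_x = layer(x, edge_index)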

GNNs can be used out-of-the-box to produce expressive node embeddings — however, in many cases, GNNs are used as part of a larger architecture. This allows the node representations to be used for downstream tasks, such as node classification, link prediction, or graph clustering.

Applications of GNNs to TSP

Dataset

For each TSP graph, we sample n nodes uniformly from the unit square [0, 1]². In order to produce a label for the data, we calculate the optimal tour for each graph using the Concorde TSP Solver [2]. Here’s a visualization of a single training example:

Visualization of a single training example. The blue nodes denote the cities, and the red edges denote the optimal tour.
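Generating the random instances is straightforward; a sketch is below, with helper names of our own and the Concorde call omitted:

import torch

def sample_tsp_instance(n: int) -> torch.Tensor:
    # n city coordinates drawn uniformly from the unit square [0, 1]^2
    return torch.rand(n, 2)

def tour_to_edge_labels(tour: list, n: int) -> torch.Tensor:
    # binary label for every node pair (i, j): 1 if the optimal tour uses that edge
    labels = torch.zeros(n, n)
    for a, b in zip(tour, tour[1:] + tour[:1]):
        labels[a, b] = 1.0
        labels[b, a] = 1.0  # tours are undirected, so mark both directions
    return labels

coords = sample_tsp_instance(10)
# the optimal tour itself comes from the Concorde TSP Solver [2] (not shown here)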

Graph Transformer

We define each node’s input feature as its (x, y) coordinates. After this initialization, we use one layer of a Graph Transformer [6] to obtain the node embeddings according to the following rule:

$$\mathbf{x}_i' = \mathbf{W}_1 \mathbf{x}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \mathbf{W}_2 \mathbf{x}_j$$

where the attention coefficients are defined as:

$$\alpha_{i,j} = \mathrm{softmax}\left(\frac{(\mathbf{W}_3 \mathbf{x}_i)^\top (\mathbf{W}_4 \mathbf{x}_j)}{\sqrt{d}}\right)$$

with $d$ the dimension of each attention head. We set the dimensions of each weight matrix to (hidden_dimension, 2), where the default value for hidden_dimension is 300.

This model is fully implemented in PyG! Let’s go through each step of this GNN layer as implemented in the source code:

1. Message computation: this step involves computing the attention coefficients and is implemented in the message() function, defined as:

def message(self, query_i: Tensor, key_j: Tensor, value_j: Tensor,
            edge_attr: OptTensor, index: Tensor, ptr: OptTensor,
            size_i: Optional[int]) -> Tensor:

For each pair of nodes i and j connected by an edge, the attention coefficient is calculated via:

alpha = (query_i * key_j).sum(dim=-1) / math.sqrt(self.out_channels)  # scaled dot-product score
alpha = softmax(alpha, index, ptr, size_i)  # normalize over each node's neighbors
self._alpha = alpha  # cached so attention weights can optionally be returned
alpha = F.dropout(alpha, p=self.dropout, training=self.training)

After computing attention, we compute the messages, then scale each message by its corresponding attention coefficient.

out = value_j                               # the message is the value vector of the source node
out = out * alpha.view(-1, self.heads, 1)   # scaled by its attention coefficient
return out

2. Aggregation: This step is part of the forward() function defined as:

def forward(self, x: Union[Tensor, PairTensor], edge_index: Adj,
            edge_attr: OptTensor = None, return_attention_weights=None):

The aggregation step is carried out by the propagate() function call. The propagate function is implemented by PyG, and encapsulates the message-passing process. It does so by calling three functions: 1) message, 2) aggregate, and 3) update. We have already defined the message function above. For aggregation, we use the sum operation and for update, we pass the message directly. Before passing the node embeddings into the propagate function call, we apply linear transformations to get the queries, keys, and values for the attention computation.

H, C = self.heads, self.out_channels
query = self.lin_query(x[1]).view(-1, H, C)  # queries from target nodes
key = self.lin_key(x[0]).view(-1, H, C)      # keys from source nodes
value = self.lin_value(x[0]).view(-1, H, C)  # values from source nodes

out = self.propagate(edge_index, query=query, key=key, value=value,
                     edge_attr=edge_attr, size=None)

3. Update: this step is implemented within forward(). We update the embedding of the node with a residual connection:

x_r = self.lin_skip(x[1])  # linear transformation of the original node features
out = out + x_r            # residual (skip) connection

For further details, refer to the source code. Note that it’s useful to understand the inner workings of GNNs, but in practice, PyG abstracts away the details for you!
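For reference, using the layer in practice looks roughly like this (a sketch with illustrative dimensions: 10 cities and hidden_dimension = 300):

import torch
from torch_geometric.nn import TransformerConv

# fully-connected edge index over 10 cities (both directions, no self-loops)
edge_index = torch.combinations(torch.arange(10), r=2).t()
edge_index = torch.cat([edge_index, edge_index.flip(0)], dim=1)

x = torch.rand(10, 2)                                    # (x, y) coordinates of 10 cities
conv = TransformerConv(in_channels=2, out_channels=300)  # hidden_dimension = 300
node_emb = conv(x, edge_index)                           # shape [10, 300]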

We opted to use a Graph Transformer because it uses an attention mechanism, which has shown exceptional results across fields [7][8]. Graph Attention Networks [9] also utilize an attention mechanism. However, we opted for Graph Transformers, as Shi et al. [6] found that they slightly outperform GAT on most tasks.

Residual Gated GCN

At this stage, we produce input edge embeddings. For each edge, we obtain its embedding via a linear transformation on the distance between the two nodes it connects.

Next, we feed the node embeddings produced by the Graph Transformer and the edge features into a Residual Gated GCN [10]. We choose to use a Residual Gated GCN in particular because it is a simple, scalable, and well-studied method. Additionally, other researchers have achieved competitive results applying the model to the TSP [11]. One layer of Residual Gated GCN is shown by the equations below:

One layer of a Residual Gated GCN. (source)
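In equation form, the layer can be sketched as follows (our transcription of the formulation used by Joshi et al. [11]):

$$x_i^{\ell+1} = x_i^{\ell} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(W_1^{\ell} x_i^{\ell} + \sum_{j \sim i} \eta_{ij}^{\ell} \odot W_2^{\ell} x_j^{\ell}\Big)\Big), \qquad \eta_{ij}^{\ell} = \frac{\sigma(e_{ij}^{\ell})}{\sum_{j' \sim i} \sigma(e_{ij'}^{\ell}) + \varepsilon}$$

$$e_{ij}^{\ell+1} = e_{ij}^{\ell} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(W_3^{\ell} e_{ij}^{\ell} + W_4^{\ell} x_i^{\ell} + W_5^{\ell} x_j^{\ell}\Big)\Big)$$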

where each W represents a linear layer, ReLU is the rectified linear unit, BN stands for batch normalization, σ is the sigmoid function, ⊙ denotes elementwise multiplication, and the gates η_ij, computed from the edge features, weight each neighbor’s contribution. In our implementation, we include 5 of these layers for solving TSP instances with 10 nodes. We recommend scaling the number of layers as the input graph becomes larger.

The key variation from standard GNNs is that we also update edge embeddings. This is essential because our ultimate goal is to predict which edges participate in the tour. Otherwise, this is a fairly standard architecture. (In fact, PyG has a Residual Gated GCN layer implemented for node embeddings!) Batch normalization and ReLU have shown strong results across a variety of deep learning disciplines.
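For comparison, PyG’s node-only layer can be dropped in like this (a sketch; it updates node embeddings but does not return updated edge embeddings):

from torch_geometric.nn import ResGatedGraphConv

conv = ResGatedGraphConv(in_channels=300, out_channels=300)
node_emb = conv(node_emb, edge_index)  # updates node embeddings only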

The major difference between Graph Transformers and Residual Gated GCNs is that the Graph Transformer only updates the node embeddings, while Residual Gated GCNs update both the node embeddings and the edge embeddings. We experimented with both architectures on their own, but got our strongest empirical results by combining them.

Post-processing

After processing all of the GNN layers, we apply an MLP on the edge embeddings to produce a scalar for each edge. We then convert these scalars into probabilities using a softmax layer. Our model output is a probability heatmap over each edge in the graph, representing the likelihood that each edge participates in the optimal tour. Finally, we use the familiar deep learning recipe: calculate the loss by comparing the model output to the ground-truth solution (we use binary cross-entropy), then backpropagate the loss through all model parameters.

Diagram of our model architecture. (source)
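A minimal sketch of this prediction head and loss is shown below; for simplicity it uses a per-edge sigmoid, which pairs naturally with binary cross-entropy and is equivalent to a two-class softmax per edge, and edge_emb and edge_labels are illustrative placeholders:

import torch
import torch.nn.functional as F

num_edges, hidden_dimension = 90, 300
edge_emb = torch.rand(num_edges, hidden_dimension)       # stand-in for the GNN's final edge embeddings
edge_labels = torch.randint(0, 2, (num_edges,)).float()  # 1 if the edge is in the optimal tour

edge_mlp = torch.nn.Sequential(
    torch.nn.Linear(hidden_dimension, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

logits = edge_mlp(edge_emb).squeeze(-1)  # one scalar per edge
probs = torch.sigmoid(logits)            # probability that the edge is in the tour
loss = F.binary_cross_entropy_with_logits(logits, edge_labels)
loss.backward()                          # backpropagate through the model parameters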

Note that our model output is just probabilities for each edge — this is not yet a valid tour! In order to make it a valid tour, we need to implement a search procedure to select the edges that will occur in our solution. One could just greedily select the edges with the highest probability. However, we opt to use beam search, an algorithm that explores several partial solutions in order to find the best one. This is motivated by the observation that for many combinatorial optimization problems, combining deep learning and search leads to strong results [12].
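For intuition, the greedy baseline can be sketched as follows (heatmap and greedy_decode are our names); beam search generalizes this by keeping the top-k partial tours at every step instead of only the single best one:

import torch

def greedy_decode(heatmap: torch.Tensor) -> list:
    # heatmap: [n, n] matrix of edge probabilities produced by the model
    n = heatmap.size(0)
    tour, visited = [0], {0}
    while len(tour) < n:
        scores = heatmap[tour[-1]].clone()
        scores[list(visited)] = -1.0  # never revisit a city
        nxt = int(scores.argmax())
        tour.append(nxt)
        visited.add(nxt)
    return tour + [0]                 # close the tour by returning to the start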

Results

The standard metric for evaluating the quality of the TSP tour is the optimality gap. This is the ratio between the tour length predicted by the model and the optimal tour length.

However, this metric would evaluate the model output only after beam search has been applied. As noted by Joshi et al. [11], a strong post-processing search method can obscure the deficiencies of the model itself. We would like to evaluate the quality of the GNN layers, not the search method. As such, we directly compare the probability heatmap produced by the model to the ground-truth solution using binary cross-entropy, i.e., we evaluate the model on the test loss.

Here are our loss metrics:

The training is somewhat unstable, but produces strong results by epoch 25. In our experimentation, we found that training instability persisted regardless of which architectures or hyperparameters we used. Joshi et al. observed similar behavior [11]. A promising avenue of research is exploring what it is about the problem structure that leads to training instability, and how to mitigate it.

Insights

Here are some visualizations of the model’s predicted tours:

We can visually observe that our model is able to determine the optimal tour!

The main limitation of our model is that it relies on labeled data, which we obtain from a TSP solver. If we wish to surpass the current capabilities of TSP solvers, we cannot rely on labeled training data.

A promising extension is switching from supervised learning to reinforcement learning. We can define the reward signal in terms of the predicted tour length (e.g., its negative), encouraging the model to find shorter and shorter tours. We can then train on unlabeled TSP graphs, without relying on a solver. In any case, GNNs will continue to be an essential part of the model, since they are the framework best suited to capturing the graph structure of the problem.
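As a rough illustration, a REINFORCE-style loss for this setup could look like the following sketch, assuming tours are sampled from the model’s own distribution and log_probs holds the summed log-probabilities of the sampled decisions (all names here are hypothetical):

import torch

def reinforce_loss(tour_lengths: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
    # tour_lengths: [batch] lengths of tours sampled from the model (to be minimized)
    # log_probs:    [batch] summed log-probabilities of the sampled decisions
    baseline = tour_lengths.mean()                  # simple baseline for variance reduction
    advantage = tour_lengths - baseline
    return (advantage.detach() * log_probs).mean()  # gradient pushes probability toward shorter tours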

We hope you enjoyed! If this article sparked your interest in GNNs, there are a variety of other resources available. Professor Jure Leskovec’s lectures are publicly available on YouTube. If you would like to dive into use cases, PyG has extensive documentation and tutorials.

References

[1] Jan Karel Lenstra and A. H. G. Rinnooy Kan. Some simple applications of the travelling salesman problem. Journal of the Operational Research Society, 26(4):717–733, 1975.

[2] David Applegate, Robert Bixby, Vašek Chvátal, and William Cook. Implementing the Dantzig-Fulkerson-Johnson algorithm for large traveling salesman problems. Mathematical Programming, 97:91–153, 2003.

[3] Iddo Drori, Anant Kharkar, William R. Sickinger, Brandon Kates, Qiang Ma, Suwen Ge, Eden Dolev, Brenda Dietrich, David P. Williamson, and Madeleine Udell. Learning to solve combinatorial optimization problems on real-world graphs in linear time. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 19–24. IEEE, 2020.

[4] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. Advances in Neural Information Processing Systems, 31, 2018.

[5] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? arXiv preprint arXiv:1905.13211, 2019.

[6] Yunsheng Shi et al. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020.

[7] Ashish Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[8] Alexey Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[9] Petar Veličković et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[10] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.

[11] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.

[12] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
