Tutorial on Graph Neural Networks for Computer Vision and Beyond (Part 1)

Boris Knyazev
Aug 4 · 16 min read
A figure from (Bruna et al., ICLR, 2014) depicting an MNIST image on the 3D sphere. While it’s hard to adapt Convolutional Networks to classify spherical data, Graph Networks can naturally handle it. This is a toy example, but similar tasks arise in many real applications.

In many practical cases, it is actually you who gets to decide what the nodes and edges in a graph are.

Two undirected graphs with 5 and 6 nodes. The order of nodes is arbitrary.

1. Why can graphs be useful?

A figure from (Antonakos et al., CVPR, 2015) showing a face represented as a graph of landmarks. This is an interesting approach, but in many cases it is not a sufficient facial representation, since much can be inferred from face texture, which convolutional networks capture well. In contrast, reasoning over 3D meshes of a face looks like a more sensible approach than 2D landmarks (Ranjan et al., ECCV, 2018).

2. Why is it difficult to define convolution on graphs?

2.1. Why is convolution useful?

“Chevrolet Vega” according to Google Image Search.

Ideally, our goal is to develop a model that is as flexible as Graph Neural Nets and can digest and learn from any data, but at the same time we want to control (regularize) factors of this flexibility by turning on/off certain priors.

2.2. Convolution on images in terms of graphs

An image from the MNIST dataset on the left and an example of its graph representation on the right. Darker and larger nodes on the right correspond to higher pixel intensities. The figure on the right is inspired by Figure 5 in (Fey et al., CVPR, 2018).
Examples of regular 2D and 3D grids. Images are defined on 2D grids and videos are on 3D grids.
Example of a 3×3 filter on a regular 2D grid with arbitrary weights w on the left and an edge detector on the right.
2 steps of 2D convolution on a regular grid. If we don’t apply padding, there will be 4 steps in total, so the result will be a 2×2 image. To make the resulting image larger, we need to apply padding. See a comprehensive guide to convolution in deep learning here.
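The step count in the caption above can be checked with a minimal sketch. It assumes a 4×4 input and a 3×3 filter (sizes inferred from the "4 steps" and "2×2" figures, not stated explicitly); the helper name `conv2d_valid` and the random values are illustrative only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation, as used in deep learning)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # One "step": elementwise product of the filter with the patch under it, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((4, 4))    # assumed input size
kernel = rng.random((3, 3))   # arbitrary filter weights w

result = conv2d_valid(image, kernel)
print(result.shape)           # (2, 2): the filter fits in 4 positions

padded = np.pad(image, 1)     # zero padding of one pixel on each side
result_padded = conv2d_valid(padded, kernel)
print(result_padded.shape)    # (4, 4): padding keeps the output as large as the input
```

With one pixel of zero padding, the same filter fits in 16 positions, which is why padding makes the resulting image larger.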
Regular 28×28 grid (left) and an image on that grid (right).
A 28×28 filter (left) and the result of 2D convolution of this filter with the image of digit 7 (right).

Nodes are a set, and any permutation of this set does not change it. Therefore, the aggregation operator applied to the nodes should be permutation-invariant.
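To make the permutation-invariance requirement concrete, here is a small sketch. The feature values and the particular permutation are made up for illustration. Sum, mean and max give the same result for any node order, while a weighting tied to node position does not.

```python
import numpy as np

# 5 neighbor nodes with 3 features each (hypothetical values)
neighbors = np.arange(15, dtype=float).reshape(5, 3)
perm = np.array([2, 0, 4, 1, 3])      # an arbitrary reordering of the same nodes
shuffled = neighbors[perm]

# Permutation-invariant aggregators: the result ignores node order
sum_same = np.allclose(neighbors.sum(axis=0), shuffled.sum(axis=0))
mean_same = np.allclose(neighbors.mean(axis=0), shuffled.mean(axis=0))
max_same = np.allclose(neighbors.max(axis=0), shuffled.max(axis=0))

# Counter-example: weights attached to positions, as in a regular grid filter
position_weights = np.arange(1, 6, dtype=float).reshape(-1, 1)
weighted = (position_weights * neighbors).sum(axis=0)
weighted_shuffled = (position_weights * shuffled).sum(axis=0)
order_dependent = not np.allclose(weighted, weighted_shuffled)

print(sum_same, mean_same, max_same, order_dependent)
```

The counter-example mirrors what a regular image filter does: each weight belongs to a fixed position, which only makes sense when the nodes have a canonical order.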

Illustration of “convolution on graphs” of node features X with filter W centered at node 1 (dark blue).

3. What makes a neural network a graph neural network?

Fully-connected layer with learnable weights W. “Fully-connected” means that each output value in X⁽ˡ⁺¹⁾ depends on, or is “connected to”, all inputs X⁽ˡ⁾. Typically, although not always, we add a bias term to the output.
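As a rough sketch of the layer described in this caption, assuming the usual form X⁽ˡ⁺¹⁾ = X⁽ˡ⁾W + b (the sizes `N`, `C`, `F` and the random values are hypothetical), the following shows that perturbing a single input feature changes every output of that sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, F = 6, 4, 3                  # hypothetical sizes: samples, input features, output features
X = rng.standard_normal((N, C))    # inputs X^(l)
W = rng.standard_normal((C, F))    # learnable weights
b = np.zeros(F)                    # optional bias term

X_next = X @ W + b                 # outputs X^(l+1), shape (N, F)

# Perturb one input feature of sample 0 and see which outputs change
X_perturbed = X.copy()
X_perturbed[0, 0] += 1.0
diff = (X_perturbed @ W + b) - X_next

print(X_next.shape)                # (6, 3)
print(np.count_nonzero(diff[0]))   # all F outputs of sample 0 change
print(np.count_nonzero(diff[1:]))  # other samples are unaffected
```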
Example of a graph and its adjacency matrix. The order of nodes we defined in both cases is random, while the graph is still the same.
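The point about node order can be illustrated with a short sketch. The edge list and the permutation below are hypothetical, not the graph from the figure. Relabeling the nodes produces a different-looking adjacency matrix, P A Pᵀ, that still describes the same graph.

```python
import numpy as np

# Hypothetical undirected graph with 5 nodes
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
N = 5
A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1          # undirected: the matrix is symmetric

perm = np.array([3, 0, 4, 1, 2])   # an arbitrary new node order
P = np.eye(N, dtype=int)[perm]     # permutation matrix
A_perm = P @ A @ P.T               # adjacency matrix under the new order

print(np.array_equal(A, A_perm))   # the two matrices look different
print(sorted(A.sum(axis=1)), sorted(A_perm.sum(axis=1)))  # same node degrees
```

Properties that do not depend on labels, such as the number of edges or the sorted node degrees, are identical for both matrices.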
import numpy as np
from scipy.spatial.distance import cdist
img_size = 28 # MNIST image width and height
col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size
dist = cdist(coord, coord) # see figure below on the left
sigma = 0.2 * np.pi # width of a Gaussian
# Note: dist is not squared here, so this is an exponential kernel rather than a strict Gaussian
A = np.exp(-dist / sigma ** 2)  # see figure below in the middle
Adjacency matrix (NxN) in the form of distances (left) and closeness (middle) between all pairs of nodes. (right) A subgraph with 16 neighboring pixels corresponding to the adjacency matrix in the middle. Since it’s a complete subgraph, it’s also called a “clique”.
Graph neural layer with adjacency matrix A, input/output features X and learnable weights W.
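A minimal sketch of such a layer, assuming the common form X⁽ˡ⁺¹⁾ = A X⁽ˡ⁾ W (the caption names the ingredients but not the exact formula). The adjacency matrix is built the same way as in the snippet above, on a small 4×4 grid to keep things readable; the feature sizes and random values are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

img_size = 4  # small grid for illustration (MNIST would use 28)
col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size
sigma = 0.2 * np.pi
A = np.exp(-cdist(coord, coord) / sigma ** 2)   # (N, N) closeness between pixels

rng = np.random.default_rng(0)
N = img_size ** 2
X = rng.random((N, 1))            # one feature (pixel intensity) per node
W = rng.standard_normal((1, 8))   # hypothetical: 1 input feature -> 8 output features

X_next = A @ X @ W                # aggregate over neighbors, then transform features
print(X_next.shape)               # (16, 8)

# With A set to the identity, no neighbors are mixed in and the layer
# reduces to a fully-connected layer applied to each node separately.
fc_only = np.eye(N) @ X @ W
print(np.allclose(fc_only, X @ W))
```

The only difference from a fully-connected layer is the multiplication by A, which mixes each node's features with those of its neighbors.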
2D visualization of a filter used in a graph neural network and its effect on the image.
import torch.nn as nn  # using PyTorch
nn.Sequential(nn.Linear(4, 64),  # map coordinates to a hidden layer
              nn.ReLU(),  # nonlinearity
              nn.Linear(64, 1),  # map hidden representation to edge
              nn.Tanh())  # squash edge values to [-1, 1]

To make GNNs work better on regular graphs, like images, we need to apply a bunch of tricks. For example, instead of using a predefined Gaussian filter, we can learn to predict an edge between any pair of pixels.
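One way to read "predict an edge between any pair of pixels" is sketched below in plain NumPy, so that the shapes are easy to inspect. It assumes the 4 inputs of the small network above are the concatenated (x, y) coordinates of two pixels. The weights are random and untrained, and every name is illustrative, so this is not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
img_size = 4                                  # small grid: N*N pairs grow quickly
col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size  # (N, 2)
N = coord.shape[0]

# All ordered pairs (i, j): concatenate coordinates of node i and node j -> (N*N, 4)
pairs = np.concatenate(
    (np.repeat(coord, N, axis=0), np.tile(coord, (N, 1))), axis=1)

# Untrained weights with the same shapes as Linear(4, 64) and Linear(64, 1)
W1, b1 = rng.standard_normal((4, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 1)) * 0.1, np.zeros(1)

hidden = np.maximum(pairs @ W1 + b1, 0)       # ReLU nonlinearity
A_pred = np.tanh(hidden @ W2 + b2).reshape(N, N)  # edge values in [-1, 1]

print(pairs.shape, A_pred.shape)              # (256, 4) (16, 16)
```

The resulting `A_pred` plays the role of the adjacency matrix, but nothing in this sketch forces it to be symmetric; in practice the weights would be trained together with the rest of the network.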

2D filter of a graph neural network centered at the red point. Averaging (left, accuracy 92.24%), learned based on coordinates (middle, accuracy 91.05%), learned based on coordinates with some tricks (right, accuracy 92.39%).

