# Introduction to Graph Neural Networks: An Illustrated Guide

14 min readMay 17

--

Hi everyone! Welcome to the illustrated guide about Graph Neural Networks (GNNs). My name is B. Scarleth Gutierrez, I have a master's degree in AI and work as a Machine Learning Engineer š©š»š¤.

In this post, we will start by looking into the basic concepts of graphs like how a scenario can be modeled with a graph and which kind of problems can be solved. We will also compare the differences between regular neural networks and the GNNs followed by an explanation of the propagation process. Then, we will review the General Framework of GNNs and finally, there will be a section with the maths behind the general framework.

In summary, the topics are divided as follows:

1- Basics of Graphs

2- Graph Neural Networks (GNNs)

3- Propagation in GNNs

4- The General Framework of GNNs

5- The Maths

# 1- Basics of Graphs

Before jumping into the mechanisms of the Graph Neural Networks, we will start by refreshing some basics on graphs.

First of all, graphs are non-euclidean data structures used to simulate data from complex real-world scenarios where regular structures such as images, audio, or sequential text might fail. Some of these complex scenarios include brain networks, chemical components, and the classic example of social networks.

Think about a social network like Instagram or Facebook. You can consider your own account as an entity or in simple words, as a group of data. This group of data contains different information like your profile picture, your list of friends, your likes and dislikes, etc. Your social media friends, although they might have similar information to yours, represent other entities or groups of data with specific profile configurations. Also, whenever you add another person as a new friend, it is as if you are creating a new link between the other personĀ“s group of data and your own group of data.

To this point I would say it sounds kind of intuitive but, can you think about a way to model this scenario using, for example, text? You should also keep in mind how different your own profile might be including the number of friends you have, than the one of letās say, a famous singer. A second thing to consider is the variety of relationships between profiles, there is not just the status of āfriendshipā but you can also have different configurations to set how much content people can see from your profile. And here is where graphs come into the game, they allow us to easily scale up data and handle relationships, among other interesting things. Therefore, social network scenarios are regularly modeled as graphs. So, the next question would be how do we create a graph?

In the literature, we define a graph G as G = {N, E}, this means G contains a set of nodes N that represent the elements of the graph and a set of edges E, which determine the relationships between the nodes. Besides the topological information, nodes, and edges can also be associated with features to indicate the characteristics of the entities and the relationships they represent. For example, a molecule of water can be modeled as a graph with three nodes: one for the oxygen and two for the hydrogen. Each node can include information related to the electric charge or the diameter of the atom. On the other hand, edge features can be used to distinguish between strong or weak bonds of the atoms they connect.

Also, treating the molecule as a graph allows us to exploit the natural structure of the molecule without having to transform the sample into a different data type like an image or text. This is useful to better capture fine-grained relationships within the nodes and the different graph samples.

Another example is the brain network, which can be produced by associating certain regions of the brain with nodes and linking these nodes through the electrical signals they share. In reality, we can use any relevant information to determine when an edge should be added between the nodes. So, in this example, we could also connect nodes if they belong to brain regions that are relevant for certain processes like recognizing the face of a friend, learning a new word, etc.

As you can imagine, there are different types of problems we can solve with graphs depending on the part of the graph that is relevant. There are node-level, link-level, and graph-level tasks. For example, going back to the example of the molecule, a model can learn to detect what kind of bond should be present between atoms (link level task). In the example of the social network, we could use the information from the complete graph to distinguish between the profiles of politicians and athletes (graph-level task).

Anyway, I guess we have seen enough examples to understand or at least give us an idea of how graph structures can be used. All right? Now, letĀ“s introduce graph neural networks and their main advantages.

# 2- Graph Neural Networks (GNNs)

Graph Neural Networks (GNNs) are a special type of neural network architecture specifically designed to work with graphs š

So far so good, right?

If graphs are not images nor text, then it makes sense to use something different than convolutions (e.g. CNNs) or recurrent neural networks (e.g. LSTMs, RNNs, GRU) to process this type of data. But donāt take my word as a ground truth yet, let me bring to the table some important features of graphs. Think once again about the example of the social network, is the graph of friends from person A be same as the one from person B? What if we compare those two graphs against, letĀ“s sayā¦ another hundred graphs, will they have exactly the same number of nodes? No, they will be very different from each other. Thatās why we say graphs are non-structure data representations, they allow unfixed sizes and forms, even under the same context like the graph of friends.

Images, on the other hand, if coming from the same source are more likely to share length, size, resolution, etc. so they fit better than graphs to apply convolutions. Another important feature of graphs is that they are order-invariant, this means the order of the nodes is not relevant to represent the overall graph. With images we have the opposite behavior, meaning that if we process a pixel in a different order than its original position, the image wonĀ“t look the same anymore. Well, maybe Iām exaggerating a little bit here, but I hope the point itās clear, reordering the pixels of an image directly affects the data it represents. So, instead of processing a graph with a CNN which expects data to be fixed as in an image, we can apply a GNN to learn not only from the features but also from the structural information that can vary from sample to sample.

# 3- Propagation in GNNs

But how does a GNN learn? Now we are getting closer to the most interesting part, we just need to understand one more concept before discussing how the magic works. Have you heard about propagation? I believe you might be familiar with the propagation of a regular or fully connected neural network, if not, donāt worry we just need to get the basic idea.

During propagation of regular neural networks, the data enters through the neurons of the input layers and is forward processed until reaching the last layer of the network. We could say there is a hierarchy in the propagation process defined by the sequence of layers that receive and send data to the next layers, say layer 1 passes its outputs to layer 2, layer 2 to layer 3ā¦ and so on. Well, In the case of graph neural networks, as well as in other types of networks, propagation is performed without having the same sense of hierarchy between neurons. Instead, the propagation occurs simultaneously in all the nodes of the graph and there is no initial node or initial layer of propagation.

The general idea is to take advantage of the connections between nodes to create representations from multiple sub-structures of the graph as if we were to focus on understanding certain sections individually. This is done by updating the information of each node over a pre-defined number of times, in literature this is usually stated as the number of propagation layers K (Note: donāt confuse layers in GNN propagation with layers in a regular neural network). The new value of a node, will be influenced by itself and also by the nodes it shares an edge with, so its neighbors.

And how do we know how many times we need to propagate data through the GNN? Well, the answer is ā¦ (drumrolls please š„ ) we donāt know for sure. This is a hyperparameter we should play around with when training a model, but there are of course some tricks and tips to keep in mind.

Letās see some examples using the graph from image 2, imagine what would happen if the data is propagating just one time. In such a case, we will have a couple of tiny sub-graphs, but because of the size of the original graph, one iteration might not provide enough information to recognize patterns and identify important features. With one iteration, only the nodes that are directly in the neighborhood of the central node will affect its value, so the rest of the nodes will be out of the scope of this new representation. On the other hand, if propagation occurs for a high number of times, say ten for this example, nodes grow into containing data of the whole graph. Is this good? mmmā¦Nop, no really.

Having a representation of the complete graph in every single node wonāt tell us how the nodes and the sub-structures of the graph interact with each other. Therefore, choosing an appropriate number of layers is fundamental for the learning process of the model. Ideally, we would expect to balance between covering enough area in the neighborhood without including most of the nodes of the graph on it. In literature an accepted range of propagation layers is from 2 to 7, usually, no more layers are needed.

And finally, letās get into details about the General Framework of Graph Neural Networks.

# 4- The General Framework of GNNs

In the previous section, we briefly discussed the influence of the neighborhood of a node on the new value it will take during propagation. This is actually the core part of how GNNs learn, we just need to add some more details to get the overall picture.

Formally, this process of sharing data between nodes is done in two steps known as Aggregation and Combination. During the first step, the information from all the nodes that share an edge with the central node is collected and transformed to create a representation of the neighborhood. Then, in the second step, the result of this aggregation is combined with the current value of the node to produce its new value. As you might notice, to this point nothing has changed from what we discussed above. We only separated the process into two steps and we will continue to see how they operate internally.

For both aggregation and combination, the input is passed through an order-invariant function (also known as permutation invariant function) like the sum, average, maximum, etc. which can also be combined with matrices of learnable weights (For example, with X as input and W as the learnable weights: W * max(X), max(W*X), etc. ). These order-invariant functions are commonly used in graph neural networks because the order in how nodes are processed will not affect the output at all. So, they match well with the order-invariant nature of graphs.

And now, my favorite part about GNNs. Are you ready? here is where the magic comesā¦ š§āāļø

Similar to convolutional neural networks, there are different types of GNNs layers. For example, some of the most popular is the convolutional GNN and the graph attention network. And all of these GNNs layers have something in common, can you guess what is it? Exactly, they apply the two functions we have been talking about, they operate through aggregation and combination. That is why the process that involves these two steps is called the General Framework of GNNs. Because it is like a kind of recipe or a blueprint that indicates the minimal operations a GNN layer needs to follow.

The next question here would be, how do we differentiate between the different types of layers then? Simple, they differ in the order-invariant functions they use, plus additional operations in some specific cases. But overall, we can say a nice trick to better understand the mechanisms of a GNN layer is to split the algorithm into two parts trying to identify the step of aggregation and the step of combination.

Now, letĀ“s return to the topic with an example. Take a look at the image above, on the left side you can see a graph with five nodes and some colorful figures. For now, we can think about these figures as the attributes of the nodes and we will eventually replace them with some values. But first, the process of the general framework must be clear :)

If we focus only on node a, the updating process of this node will look like the image on the right side. The information from nodes b, c, and d will be passed to an aggregation function and the current value of node a will be combined with the result of the previous operation, returning the new value for node a. Easy piece, right?

Before continuing, take some time to think about how this process might look for the rest of the nodes and how it takes place for all the nodes in parallel. If you need some help with this, you can cheat a bit š and see the animation below.

# 5- The Maths

The last part of this reading is about the maths behind the GNNs. But since we are now experts in propagation and the general framework there is nothing to worry about šØš¼āš. Letās begin the partyā¦

We will start by replacing the colorful figures with more authentic values. In the literature, the nodeĀ“s attributes are represented as a matrix X in R^NxC where N indicates the number of nodes in the graph and C is the number of features for each node.

In the image above, there is again the graph with the five nodes. This time the figures were replaced with arrays or lists of numbers. You can see that each node has three attributes, they can represent for example the number of neurons, the probability of a vote, acidity, etc. In any case, the matrix X for this graph will have 5x3 elements. The values of X are used by the General Framework to initialize the matrix Hā°= X at the beginning of the propagation process (timestep 0). LetĀ“s see the formulas first, and then we can apply the operations over the numbers we defined in the image.

For every iteration k = {1, 2, ā¦, K} being k is the current layer, k-1 the previous layer, and K the maximum number of propagation layers, the following steps are executed:

where N(v) is the set of neighbors of node v (the central node) and H^(k-1) is the matrix of the nodeĀ“s features in the previous iteration k ā 1. For every node u in the list of neighbors N(v) from node v, aggregate the features of the node from the previous layer H^(k-1)_u and store the result in the matrix a^k_v (aggregation for node v in iteration k). Then combine a^k_v with the previous attributes from node v defined as H^(k-1)_v, to obtain the new attributes for node v in the matrix H^k_v. Once the matrix of attributes is calculated, it is fed into a differentiable trainable function Éø (e.g. fully connected layer or a multi-layer perception), or in other words, it can be multiplied by a matrix of learnable weights W. The configuration you use next depends on the problem you are trying to solve, whether it is a graph-level task or a node-level task. There are additional layers that can be added as well like dropout or global pooling, but those details can be left for another post.

Then, are we done? Pretty much yes. Those are all the calculations we need :)

And now that we know how the General Framework for GNNs operates, we can use the formulas to illustrate the updating process of node a from image 4:

Are we good with the image? There is one last example to check.

Do you remember we said the different GNNs layers can be identified through the functions they use? Well, for the last example, we will perform aggregation using a regular sum and combination with a max function. There might or might not exist a GNN layer with this configuration, but we will use it anyway just to finalize with an overall picture of the updating process of a graph.

In the image below, you can see the variables we need to update each node and also the values we defined as the nodes' attributes. For example, in aggregation for node d (yes, I know there are other nodes than just node a š), the neighbors are node a with its attributes defined as [0, 1, 1] and node e with the matrix [0, 1, 0,4]. The result of the sum is set as aĀ¹_d (aggregation in layer 1 for node d) with a value of [0, 2, 1.4]. For the combination we apply the max function using as input the values from the previous step [0, 2, 1.4] and the current values of node d [0.3, 0.1, 0.2]. So, after applying the max function the value for HĀ¹_d (attributes for node d in layer 1) will be [0.3, 2, 1.4]. āØ Tadaaa āØ.

I will let you the rest of the nodes, you can follow the same procedure to obtain something like this:

Finally, the original graph is updated to look like the right one on the image and the new values can be used to feed a MLP or any other type of layer depending on the configuration you set.

And with that, we have concluded the last section of this reading šā¦ applauses for you š. I hope you enjoyed this blog and learned the basics of Graph neural networks. Until next time!