# Random Data and Graph Neural Networks

The purpose of this blog is to summarize our understanding of the effects of the graph convolution operation(s) on a simple random data model, specifically the stochastic block model with Gaussian features. Then, we will discuss how graph convolutional and attentional networks perform in the context of node classification on such random data. Finally, we will relax many of the simple assumptions about randomized data, such as Gaussian features, and introduce a theoretically optimal message-passing architecture for node classification.

Specifically, we will discuss the following topics:

- How a single averaging graph convolution changes the mean and variance of the data.
- How it improves linear classification.
- How multiple graph convolutions improve linear classification as a function of the number of convolutions.
- How multi-layer graph convolutional networks improve classification compared to a neural network that does not utilize the graph data.
- A simple result on out-of-distribution generalization.
- How the placement of convolutions in the graph convolutional network can affect classification performance.
- When graph attention improves performance compared to graph convolution and when it doesn’t.
- Optimal message-passing architectures for node classification.

Although the actual technical analysis of the results we will present can be quite involved, in this blog we hope to make it easy for the reader to intuitively understand basic results on this topic.

Let’s start with the obvious question: why random data?

# Why random data?

When we first decided to work on graph neural networks in 2020, most papers proposed new models with experiments on various benchmarks. Authors would usually claim improvements over the current state-of-the-art and provide some intuition about why their proposed architecture or training procedure achieved better performance. However, everything looked like a black box to us. Why does stacking graph convolutional layers on top of each other work well? How does the graph convolution operation transform the data, and why can this lead to better performance? How is the performance of graph neural networks affected by noise in the data? How do we even define noise in the data to start with?

Random data helped us understand the performance of graph neural networks for a predetermined hypothesis and allowed us to perform a detailed and deep analysis of why a model performs well or not. After all, if our new architecture does not perform well on controlled data, then why would we expect it to work on real data, which are far more complex? The difficulty, of course, is to pick a data model that can resemble real data and at the same time be analyzable.

# The application: node classification

For this blog, we will focus on the node classification problem. This problem involves classifying the nodes of a given graph based on the features of those nodes. We demonstrate a toy example in the figure below.

# Which random data model?

There are many options; however, back in 2020, it seemed obvious to take one of the most popular and simple random data models for graphs, which allowed us to control the parameters of the underlying distributions and made intuitive sense for the problem of node classification. Therefore, we picked the Contextual Stochastic Block Model (CSBM) [1,2].

The CSBM is a coupling of the popular Stochastic Block Model (SBM) [3,4] with a mixture of Gaussians for the features of the nodes, where within a class, the features are drawn from the same component of the mixture model. We discuss the CSBM in detail below.

## History of random graphs

Random graphs first appeared in the work of Helen Jennings, “Statistics of Social Configurations,” in 1938 [5]. In Helen’s work, a random graph model was described as a “chance sociogram.” Helen used random graphs to study relationships within groups of people, comparing their real-world data to random relationships.

In 1947, Paul Erdős published the first theoretical paper utilizing random graphs [6]. Paul used random graphs to prove lower bounds on Ramsey numbers; he later used the same probabilistic method to prove the existence of graphs with large girth and large chromatic number.

Twelve years later, in 1959, Paul Erdős and Alfréd Rényi wrote one of the first papers on the theory of random graphs. The paper was titled “On Random Graphs I” [3]. In the next eight years, Paul and Alfréd wrote a series of papers that popularized the topic of random graphs.

Perhaps less famous, but of equal importance, is the contribution of Edgar Gilbert, who, concurrently with Paul Erdős and Alfréd Rényi, wrote one of the first theoretical papers on random graphs [4]. In fact, one of the most widely used random graph models today is due to Edgar.

Let’s take a look at a simple example of Edgar’s model. Given a set of nodes, we draw an edge between any two nodes with probability p. An illustration of Edgar’s model on a set of five nodes with p=0.5 is shown below.
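As a quick illustration, here is a minimal NumPy sketch of sampling from Edgar's model (the function name and parameter choices are our own):

```python
import numpy as np

def gnp_adjacency(n, p, rng):
    """Sample a symmetric 0/1 adjacency matrix from Gilbert's G(n, p) model:
    each of the n(n-1)/2 possible edges is drawn independently with probability p."""
    coin_flips = rng.random((n, n)) < p
    A = np.triu(coin_flips, k=1)   # keep one coin flip per unordered pair, no self-loops
    return (A | A.T).astype(int)   # symmetrize

rng = np.random.default_rng(0)
A = gnp_adjacency(5, 0.5, rng)    # five nodes with p = 0.5, as in the illustration
```

For a large number of nodes, the fraction of realized edges concentrates around p, which is a handy sanity check.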

Note that the above example of Edgar’s model can be generalized. One way would be to use different probabilities to draw edges for each pair of nodes. However, for simplicity, we will stick with simple versions of the data models in our discussion. As we will observe later, the simple versions of the data models reveal a lot of details about the performance of graph neural networks.

## The Stochastic Block Model (SBM)

In 1983, Paul Holland, together with Kathryn Blackmond Laskey and Samuel Leinhardt, was the first to introduce the SBM [7]. Motivated by understanding social phenomena among people, they introduced the SBM based on real social networks characterized by class structure. By class structure, we mean that the nodes of the network are partitioned into classes, and the distribution of the edges between nodes depends on the classes to which the nodes belong.

Let’s take a look at an example of Paul’s SBM with two classes. We assign a class membership to a given set of nodes. The assignment of nodes to classes is done randomly. For example, one can use the uniform distribution, where a node has a probability of 0.5 to be in the first class. This results in two balanced classes with high probability. However, other non-uniform distributions can be used as well to produce unbalanced classes. Given the class membership of the nodes, we draw an edge between two nodes in the same class with probability p, and an edge between two nodes in different classes with probability q. An illustration of the SBM with two classes of nodes is shown below.
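The two-class SBM described above can be sampled with a few lines of NumPy (a sketch with our own naming):

```python
import numpy as np

def sbm_two_class(n, p, q, rng):
    """Sample node labels and an adjacency matrix from a two-class SBM.

    Each node joins class 0 or class 1 with probability 0.5. Same-class
    pairs are connected with probability p, different-class pairs with q."""
    y = rng.integers(0, 2, size=n)                  # uniform class assignment
    same_class = y[:, None] == y[None, :]
    edge_probs = np.where(same_class, p, q)         # per-pair edge probability
    A = np.triu(rng.random((n, n)) < edge_probs, k=1)
    return y, (A | A.T).astype(int)

y, A = sbm_two_class(10, 0.8, 0.1, np.random.default_rng(0))
```

On a larger sample, the empirical same-class and different-class edge densities should be close to p and q respectively.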

The above example of the SBM can be generalized to multiple unbalanced classes and different probabilities for each edge.

## Important assumptions of the SBM

We will make the following assumptions for most of the blog. However, when we discuss optimal message-passing architectures, we will remove almost all assumptions and we will work with a very general data model.

*Assumptions*

- There are two classes.
- Class membership is assigned uniformly at random, i.e., with a probability of 0.5, a node belongs to the first class. This assumption implies that with high probability, the classes are balanced, meaning that the size of each class is roughly n/2, where n is the given number of nodes.
- The graphs are “dense”. This means that the parameters p and q of the SBM satisfy p, q = ω(log^3(n)/n).

The third assumption implies that each node has roughly n(p + q)/2 neighbours with high probability. Moreover, the above density assumption implies “regularity” in the number of neighbours, i.e., concentration of the number of neighbours around their expected value. This is an important technical property in analyzing the models. Later on, we will remove the density assumption and work with sparse graphs as well. The first and second assumptions will also be removed when we discuss optimal message-passing architectures. We note that the cube power above can be improved to a square power, but we leave such details out of this blog.

## The Contextual Stochastic Block Model (CSBM)

Fast forward 34–35 years to 2017 and 2018, where we see the introduction of the CSBM by two teams [1,2]. In CSBM, in addition to the SBM, we have features that characterize the nodes, which are sampled from a mixture of distributions such as the Gaussian Mixture Model.

Let’s look at the model. Similarly to SBM, in CSBM we assign a class membership to the given set of nodes randomly. The edges are drawn using the SBM as described above. Additionally, for each node, we draw a random feature vector using the Gaussian distribution corresponding to the node’s class.
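The whole sampling procedure fits in one short function; here is a self-contained sketch (names and defaults are our own) combining the SBM graph with class-dependent Gaussian features:

```python
import numpy as np

def csbm(n, p, q, mu, nu, sigma, rng):
    """Sample a two-class CSBM: an SBM graph plus Gaussian node features.

    Class-0 nodes draw features from N(mu, sigma^2 I) and class-1 nodes
    from N(nu, sigma^2 I); edges follow the two-class SBM with p and q."""
    y = rng.integers(0, 2, size=n)
    edge_probs = np.where(y[:, None] == y[None, :], p, q)
    A = np.triu(rng.random((n, n)) < edge_probs, k=1)
    A = (A | A.T).astype(int)
    class_means = np.where(y[:, None] == 0, mu, nu)   # per-node mean vector
    X = class_means + sigma * rng.standard_normal((n, len(mu)))
    return y, A, X

mu, nu = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
y, A, X = csbm(200, 0.6, 0.1, mu, nu, sigma=1.0, rng=np.random.default_rng(0))
```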

The CSBM example above can be generalized in terms of the number of classes, class membership distribution, and edge and feature distributions.

## Important quantities of the CSBM

The **relative signal strength for the graph** is defined as the ratio of the absolute difference between p and q over their sum, i.e., |p - q|/(p + q).

The smaller this quantity is, the noisier the graph will be. For example, consider the case where p > q. This means that the SBM part of the CSBM draws denser connections within each class than between classes, so the sampled graphs look more community-like. This is often called the homophily setting in the literature. In this setting, if p and q are close relative to their sum, then many edges are drawn between nodes in different classes relative to the overall number of edges.

On the other hand, if p < q, then the SBM draws denser connections between classes than within classes, i.e., nodes in different classes are more likely to be connected than nodes within the same class. This is often called the heterophily setting in the literature. Surprisingly, although this setting seems hopeless since there are no clear communities, graph neural networks can still classify the nodes correctly. In this setting, if p and q are close relative to their sum, then many edges are drawn between nodes within the same class relative to the overall number of edges.

Another important quantity of the graph is the **expected number of neighbours** for each node. We will use this quantity in our analysis a lot.

**The signal-to-noise ratio for Gaussian features** measures the distance between the means of the two Gaussian distributions in the CSBM relative to the standard deviation. The smaller this quantity is, the noisier the data are. For example, if the distance between the means is very small relative to the standard deviation, then this implies that there is overlap between the Gaussian distributions, which, as we will see later, makes the sampled data from these distributions difficult to classify.
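Putting the three quantities together, here is a small sketch with hypothetical parameter values; the formula for the expected number of neighbours assumes two balanced classes:

```python
import numpy as np

# Hypothetical CSBM parameters, chosen only to illustrate the quantities.
n, p, q, sigma = 1000, 0.30, 0.10, 1.0
mu = np.array([1.0, 0.0])   # mean of the first Gaussian
nu = np.array([-1.0, 0.0])  # mean of the second Gaussian

rel_signal_strength = abs(p - q) / (p + q)     # noisier graph as this shrinks
expected_degree = n * (p + q) / 2              # expected number of neighbours (balanced classes)
feature_snr = np.linalg.norm(mu - nu) / sigma  # noisier features as this shrinks

print(rel_signal_strength, expected_degree, feature_snr)
```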

# Graph Convolution

There are many graph convolution operations. Let’s focus on one of the most popular graph convolutions, the averaging graph convolution [8]. An example for node 3 of a toy graph is shown below.

## Graph convolution reduces the distance between the means

The input to the graph convolution consists of random variables generated by the Gaussian mixture component of the CSBM. Therefore, it makes sense to ask the question: how does the mean of the random variables change after graph convolution? One graph convolution operation applied to the data moves the means closer by a factor equal to the relative signal strength of the graph [Proof of Theorem 2, 9].

Why does this happen? Let’s take a look at the result of the graph convolution for node 3 in our toy graph:

The output of the convolution for node 3 depends on node 7 as well. However, node 7 corresponds to a Gaussian distribution with mean ν, while nodes 1, 3, and 5 correspond to a Gaussian distribution with mean μ. Therefore, the output of the convolution for node 3 is pulled towards the mean of node 7.

## Graph convolution reduces the variance of the data

Graph convolution reduces the variance of the data by a factor equal to the expected number of neighbours [Proof of Theorem 2, 9]:

The variance reduction can be shown by simply applying the variance definition to the graph convolution operation for an arbitrary node and using the fact that the number of neighbours for a node is roughly equal to the expected number of neighbours shown above. The latter fact is an implication of our density assumption for the SBM.

Note that depending on the values of p and q, the expected number of neighbours can be a much larger factor than the relative signal strength of the graph. This implies that the variance reduction can be much larger compared to how much the means are moving closer, resulting in cleaner data. See below for an illustration. As we will see later on, this can result in classification improvement for models that use the graph data.
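Both effects are easy to check numerically. The sketch below (parameter values are our own choice) samples a CSBM with the signal in the first feature coordinate, applies one averaging convolution, and compares the shrinkage of the class-mean gap to |p - q|/(p + q) and the shrinkage of the Gaussian noise variance to the expected degree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, sigma, d = 2000, 0.5, 0.1, 1.0, 200
y = rng.integers(0, 2, size=n)

# Sample the CSBM: SBM graph plus Gaussian features (signal in coordinate 0).
edge_probs = np.where(y[:, None] == y[None, :], p, q)
A = np.triu(rng.random((n, n)) < edge_probs, k=1)
A_hat = (A | A.T).astype(float) + np.eye(n)          # self-loops included
deg = A_hat.sum(axis=1, keepdims=True)
node_means = np.zeros((n, d))
node_means[:, 0] = np.where(y == 0, 2.0, -2.0)
X = node_means + sigma * rng.standard_normal((n, d))

# One averaging graph convolution.
Xc = (A_hat @ X) / deg

# The gap between class means shrinks by roughly |p - q| / (p + q) = 2/3.
gap_before = X[y == 0, 0].mean() - X[y == 1, 0].mean()
gap_after = Xc[y == 0, 0].mean() - Xc[y == 1, 0].mean()
print(gap_after / gap_before)                        # close to 2/3

# The Gaussian noise variance shrinks by roughly the expected degree n(p + q)/2 = 600.
residual = Xc - (A_hat @ node_means) / deg           # noise left after averaging
print(sigma**2 / residual.var())                     # close to 600
```

Note that the variance factor is measured on the noise conditioned on the sampled graph; the unconditional within-class variance also picks up fluctuations in each node's neighbourhood composition.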

## Perfect classification for linear classifiers and one graph convolution

Let’s take a look at how graph convolution can improve the parameter regime where the data are perfectly classifiable. Perfect classification means that the data will be classified with no mistakes with high probability.

First, we will discuss a result for the case where we don't use graph convolution, and then we will compare it with a classification result that uses graph convolution. The result below provides the parameter regime where any linear classifier fails to perfectly classify the data with high probability.

*Theorem (linear classifier without the graph) [Theorem 1, 9]: No linear classifier can perfectly classify the data with high probability if*

This means that if the distance between the means is smaller than the standard deviation times the square root of log n, then the data cannot be perfectly classified with high probability by simply using a linear classifier.

On the contrary, when we convolve the data first, they become perfectly classifiable in regimes where previously they were not.

*Theorem (with graph convolution) [Theorem 2, 9], [Theorem 4.2, 10]. If the relative signal strength of the graph satisfies:*

*then there exists a linear classifier which can perfectly classify the data with high probability if*

## Perfect classification for multiple convolutions for linear classifiers

We now apply k graph convolutions, one after the other, to the input data. The table below shows that the perfect classification threshold is improved exponentially up to a saturation level, after which performance does not worsen *[Theorem 4.2, 10]*.

The quantity ρ which appears in the bound is equal to

Our assumptions on the parameters ensure that ρ < 1. Let’s see an example of how this quantity affects the required number of convolutions before saturation. Let’s assume that the parameter p is a constant, meaning we are working with a dense graph. This implies that |p — q|/(p + q) is a constant, and the expected number of neighbors is O(n). Thus, ρ becomes O(1/n). In this case, one convolution already results in saturation, and more convolutions do not improve the separability threshold. If p = 1/sqrt(n), then two convolutions are enough.
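The saturation effect is easy to observe numerically. Below is a sketch (parameter values are our own) that applies k = 0, 1, 2 averaging convolutions to a dense CSBM and tracks a simple separability proxy, the distance between class means divided by the within-class standard deviation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, sigma = 1500, 0.5, 0.1, 1.0
y = rng.integers(0, 2, size=n)
edge_probs = np.where(y[:, None] == y[None, :], p, q)
A = np.triu(rng.random((n, n)) < edge_probs, k=1)
A_hat = (A | A.T).astype(float) + np.eye(n)          # self-loops included
deg = A_hat.sum(axis=1, keepdims=True)
X = np.where(y == 0, 2.0, -2.0)[:, None] + sigma * rng.standard_normal((n, 1))

def separability(Z):
    """Distance between class means divided by the pooled within-class std."""
    gap = abs(Z[y == 0].mean() - Z[y == 1].mean())
    within_sd = np.sqrt((Z[y == 0].var() + Z[y == 1].var()) / 2)
    return gap / within_sd

scores, Z = [], X
for k in range(3):
    scores.append(separability(Z))
    Z = (A_hat @ Z) / deg            # apply one more averaging convolution
print(scores)  # for a dense graph, the first convolution gives most of the gain
```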

To prove the results in the table above, we need an additional assumption, which is:

where k is the number of graph convolutions. This assumption implies that the relative signal strength in the graph needs to be large enough. In fact, the more convolutions are used, the cleaner the graph has to be. This assumption is a sufficient condition. It’s unknown if it is necessary as well. Later on, when we discuss partial classification results, the requirement that the graph needs to become cleaner as a function of the number of convolutions will be removed.

## Partial classification for multiple convolutions for linear classifiers

Let us now look into a more realistic, partial classification result. We call this more realistic since, in practice, we don’t usually achieve perfect classification. In this case, it makes sense to study the misclassification error of a linear classifier applied to the graph-convolved data. The misclassification error formula *[Theorem 4.1, 10]* is given below:

Note that only the second term depends on the number of convolutions and the signal-to-noise ratio of the features. This term measures the amount of error introduced by the variance in the features and decreases exponentially with k. If the signal-to-noise ratio of the features is constant, we reach the optimal error bound of ρ squared when k = O(log log n). In the worst case, we may need log n convolutions to reach the optimal bound.

Similarly to the perfect classification result, the partial classification result requires an additional assumption about the quality of the distribution of the graph. In particular, we assume that the following sufficient condition is true:

Note that, compared to the corresponding assumption for the perfect classification result, the number of convolutions and the factor square root log n have been removed in this case.

## Out-of-distribution generalization for linear classification

All results discussed above hold as long as the assumptions for the parameters of the SBM are satisfied. This means that if one trains a model on data generated with specific p and q in the SBM, the model will generalize to any data generated by an SBM with different parameters p and q, as long as the new parameters satisfy the assumptions and p - q has the same sign, i.e., the homophily or heterophily regime is preserved [9, 10].

# Graph attention convolution

The averaging graph convolution operation uses the same uniform edge weight for each neighbour of a node. However, we have seen a toy example earlier where node 3 has a neighbour that belongs to another class. If we are working with the homophily setting, i.e., p > q, then this could hurt performance because node 3 is uniformly averaging features that belong to another class. Similarly, if we are working with the heterophily setting, i.e., q > p, the opposite scenario would degrade performance. The Graph Attention Network (GAT) [11] was introduced to address this issue. In particular, GAT introduces edge weights, gamma, which are learnable parameters. The learning process allows GAT to set the edge weights appropriately.

The definition for gammas is provided in the following figure.

By training the MLP for each layer, graph attention is able to modify the convolution operation, and thus, it has the potential to solve the node classification issues mentioned above.
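To make the mechanism concrete, here is a minimal single-head sketch of the attention coefficients from [11], with a one-layer score function in place of a general MLP; for such a score function, the concatenation score a^T [W x_i || W x_j] splits into a_src-dot-h_i plus a_dst-dot-h_j. The weight values below are arbitrary, not trained:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0.0, z, slope * z)

def gat_attention_weights(A, X, W, a_src, a_dst):
    """Single-head GAT attention coefficients gamma_ij, following [11]:
    e_ij = LeakyReLU(a^T [W x_i || W x_j]), normalized by a softmax over
    each node's neighbourhood (self-loops are assumed to be in A)."""
    H = X @ W                                            # transformed features
    e = leaky_relu((H @ a_src)[:, None] + (H @ a_dst)[None, :])
    e = np.where(A > 0, e, -np.inf)                      # restrict to neighbours
    e = e - e.max(axis=1, keepdims=True)                 # numerical stability
    gamma = np.exp(e)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy example: a triangle with self-loops and random 2-d features.
rng = np.random.default_rng(0)
A = np.ones((3, 3))
X = rng.standard_normal((3, 2))
gamma = gat_attention_weights(A, X, W=np.eye(2),
                              a_src=np.array([1.0, 0.0]),
                              a_dst=np.array([0.0, 1.0]))
# Each row of gamma sums to one over the node's neighbourhood.
```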

## How successfully can GAT distinguish good from bad edges?

We will answer this question using the CSBM and the same assumptions as for GCN. We will present the theorems in the form of plots.

In the figure below, the straight arrow represents the distance between the means of the Gaussians. We separate the distance into two regimes: the easy regime, which corresponds to a distance larger than sigma times the square root of log n, and the hard regime, which corresponds to a distance smaller than K times sigma. The parameter K can be a constant or can grow with n.

Let’s start with the easy regime and assume that p > q; the analogous results for q > p can be found in [12]. In this regime, GAT is able to distinguish same-class from different-class edges, which means that the edge weights for different-class edges are significantly down-weighted [Theorem 3, 12].

This further implies that GAT is also able to perfectly classify the nodes independently of how close q is to p [Corollary 5, 12]. However, in this regime, a classical argument implies that there exists a linear classifier that perfectly classifies the data with high probability without using the graph [*Theorem 1, 9*] [Proposition 7, 12]. Therefore, in this regime, the graph is actually not needed. On the contrary, the performance of GCN, which does not ignore different-class edges, degrades as q gets closer to p.

The hard regime is where GAT also starts to make mistakes in distinguishing same-class from different-class edges. In particular, GAT fails to distinguish a large fraction of same-class edges from different-class edges, since 90% of the edge weights are roughly uniform [Theorem 10, 12].

This is similar to GCN, where the weights for same-class and different-class edges are exactly uniform. This implies that in the hard regime, node classification depends on how close the different-class edge probability q is to p, as opposed to the easy regime, where it does not depend on q.

*Behaviour of attention weights as a function of the distance between the means.* Let’s also empirically study the behaviour of gamma in GAT as a function of the distance between the means.

In this plot, we illustrate the average gamma and its standard deviation as a function of the distance between the means. The blue line with the star marker denotes the same-class gammas, while the orange line with the x-marker denotes the different-class gammas.

It is clear that in the hard regime, that is, when the distance between the means is less than the standard deviation, the gammas concentrate around the uniform value 1/N_i, where N_i is the number of neighbours of node i. In the easy regime, that is, when the distance between the means is larger than the standard deviation, we start to observe a separation between the same-class and different-class gammas. In particular, the same-class gammas are much larger than the different-class gammas.

*Why does GAT fail to distinguish edges in the hard regime?*

An obvious question is why this is happening. There is a very simple reason that GAT fails to distinguish neighbours in the hard regime. To understand it, let’s first take a look at how current graph attention mechanisms work.

Each edge is represented by a concatenation of the feature vectors of its nodes. For example, the edge between nodes i and j is represented by the vector [X_i, X_j].

Then, the learnable edge weights, gamma, are parameterized using an MLP that takes as input the concatenation of the node features. Therefore, the problem of separating same-class from different-class edges reduces to separating the concatenated vectors.

Since the concatenated vectors also follow a Gaussian distribution, this means that the problem of separating same-class from different-class edges reduces to classifying a Gaussian mixture version of the XOR problem. This problem consists of four Gaussians, with the Gaussians separated into same-class and different-class edges.

When the distance between the means is larger than the standard deviation times the square root of log n, the data generated by the Gaussians are separable with high probability. However, when the distance between the means is less than sigma, there is too much overlap among the distributions, making it highly probable that the different-class edges can’t be separated from the same-class edges.

# Graph neural network

We will use the graph convolutional neural network [8], which utilizes the averaging graph convolution operation. A graph convolution layer is defined as:

A multi-layer graph convolutional neural network consists of multiple layers of graph convolution, where the output of the previous layer is given as input to the next layer.

## Performance for multi-layer graph neural networks

For this topic, we still use the CSBM; however, instead of a two-component mixture of Gaussians, we use four. A two-dimensional version of the model is illustrated below:

In this dataset, there are two classes, but each class consists of two Gaussian components. For example, the features of nodes in class zero are generated with equal probability by a Gaussian with mean μ or −μ, and the features of nodes in class one are generated with equal probability by a Gaussian with mean ν or −ν. The variance is the same as in previous results. The SBM is set up similarly to previous results. Therefore, the only thing that changes is the distribution of the features, which makes the data non-linearly classifiable. This justifies using multi-layer graph neural networks, since a linear classifier cannot perfectly classify the data.
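A sketch of sampling these XOR-style features (function and parameter names are our own):

```python
import numpy as np

def xor_csbm_features(y, mu, nu, sigma, rng):
    """XOR-CSBM features: class-0 nodes draw from N(+mu, sigma^2 I) or
    N(-mu, sigma^2 I) with equal probability; class-1 nodes draw from
    N(+nu, sigma^2 I) or N(-nu, sigma^2 I) with equal probability."""
    n = len(y)
    signs = rng.choice([-1.0, 1.0], size=(n, 1))   # which mixture component
    base_mean = np.where(y[:, None] == 0, mu, nu)
    return signs * base_mean + sigma * rng.standard_normal((n, len(mu)))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = xor_csbm_features(y, mu=np.array([3.0, 0.0]), nu=np.array([0.0, 3.0]),
                      sigma=0.5, rng=rng)
# Each class is centred at the origin overall, so no linear classifier separates them.
```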

We will briefly summarize the performance results for a graph neural network with up to three layers. Deeper architectures have not been analyzed yet for the CSBM. Additionally, the results are for perfect classification, as partial classification results have not been studied yet. In [Theorem 2, 13] it is shown that when node features are accompanied by a graph, a single graph convolution at the second or third layer enables a multi-layer network to classify the nodes in a wider regime compared to methods that do not utilize the graph. In particular, a single graph convolution improves the perfect classification threshold to

For comparison, in [Theorem 1, 13] it is shown that when the graph is not used, the perfect classification threshold is worse:

Also, note that the convolution is not used in the first layer. This is because of the particular setting of XOR for CSBM, where after one convolution the data collapse with high probability around the same mean. An illustration is shown below.

To prove the above results, the authors additionally assume that

which is an assumption on the quality of the graph. This assumption can be relaxed; the interested reader can check Theorem 2 in [13].

Furthermore, assuming a slightly denser graph, meaning that we assume

it is shown in [Theorem 2, 13] that with any combination of two graph convolutions in the second and/or the third layers, a multi-layer network can classify the data in an even wider regime:

It is important to note that the classification capacity is determined by the number of graph convolutions rather than the number of layers in the network. As mentioned above, [Corollary 2.1, 13] examines the gains obtained by placing graph convolutions in particular layers, comparing the benefits of placing all convolutions in a single layer versus distributing them across different layers. It is shown that the performance is similar for all combinations with the same number of graph convolutions. Below, we illustrate this result using a large-scale popular dataset.

The reader can find more synthetic and real examples in [13].

# Optimal architectures

In [14] we introduce a notion of Bayes optimality for node classification tasks, called asymptotic local Bayes optimality, and compute the optimal classifier according to this criterion for a fairly general statistical data model with arbitrary distributions of the node features and edge connectivity. The optimal classifier is implementable using a message-passing graph neural network architecture.

## Assumptions

Let us first start by discussing the assumptions:

- Features can follow any continuous distribution with a defined density function or any discrete distribution with a mass function.
- Any number of classes is allowed. Each class can have any size.
- The graph follows a SBM with any number of classes and any edge probabilities among nodes.
- The graph is sparse. This means that any edge probability is of the form a/n, where a is a constant and n is the number of nodes. Consequently, in expectation, each node has a constant number of neighbours. However, this assumption does not imply concentration of the number of neighbours around its expectation. Therefore, the graph is not almost regular with high probability, as was the case under the dense assumption earlier.

We note that the above assumptions are far more general than anything that has been discussed in the previous results.

## Ell-local classifier

Briefly, an ell-local classifier takes as input the feature vector of a node and the feature vectors of all neighbours that are up to ell hops away, and outputs a label for the node. A few examples are illustrated below.
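For concreteness, here is a small sketch of collecting the ell-hop neighbourhood that an ell-local classifier is allowed to see, using an adjacency-list representation (names are our own):

```python
from collections import deque

def ell_hop_neighbourhood(adj, root, ell):
    """Breadth-first search returning {node: hop distance} for every node
    within ell hops of root; these nodes' features are the inputs of an
    ell-local classifier for the root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == ell:       # do not expand beyond ell hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Path graph 1 - 2 - 3 - 4 - 5: the 2-local view from node 1 sees nodes 1, 2, 3.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(ell_hop_neighbourhood(adj, root=1, ell=2))   # {1: 0, 2: 1, 3: 2}
```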

*Ell-local optimal Bayes classifier:* The classifier that minimizes the probability of misclassifying a node among all ell-local classifiers.

## Example: two-local optimal classifier

The equation for the optimal classifier in a generic data model with an arbitrary number of classes, feature probabilities, and edge probabilities, is a complex formula, which is provided in Theorem 1 of [14]. However, we can illustrate the optimal classifier for the case of two classes [Corollary 1.1, 14], where the feature probabilities follow some absolutely continuous distribution, and the edge probabilities are arbitrary for the SBM. This example is sufficient to demonstrate how the optimal classifier works and how it differs from the basic graph neural networks discussed earlier. The computation of the classifier for node 1 in a toy graph is illustrated in the following figure.

In this simplified setting, we observe that messages propagated from nodes in the local neighborhood of node 1 are clipped according to a function of their hop-distance from node 1. Specifically, the clipping function φ can be expressed in terms of the relative signal strength of the graph, |p-q|/(p+q). It is noteworthy that the clipping threshold decreases rapidly as the hop-distance increases. Since the relative signal strength of the graph is less than one, the value of messages propagated to node 1 from nodes at hop-distance k decreases exponentially with k, which is crucial for predicting the label of node 1.

Additionally, it’s important to highlight that this architecture differs from a simple convolution where a node averages the features of its neighbours. Here, the architecture first transforms the data using their log-likelihood ratio and then clips the result based on the distance to node 1. Finally, the clipped messages are summed at node 1.
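A heavily simplified sketch of this computation is below. The log-likelihood-ratio function and the clipping thresholds here are placeholders of our own choosing; in [14] the thresholds are explicit functions of |p-q|/(p+q) and the hop distance:

```python
import numpy as np
from collections import deque

def two_local_score(adj, feats, root, llr, clip_by_hop):
    """Shape of the two-local classifier from [Corollary 1.1, 14]: transform
    each neighbour's features by a log-likelihood ratio, clip the resulting
    message by a hop-dependent threshold, and sum everything at the root.
    A positive score favours class 0, a negative score class 1."""
    dist = {root: 0}
    queue = deque([root])
    while queue:                       # BFS out to 2 hops
        u = queue.popleft()
        if dist[u] == 2:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    score = llr(feats[root])           # the root's own evidence is not clipped
    for v, d in dist.items():
        if d > 0:
            t = clip_by_hop[d]         # clipping shrinks fast with hop distance
            score += np.clip(llr(feats[v]), -t, t)
    return score

# Toy usage: 1-d Gaussian features with means +1 / -1 and unit variance,
# so the log-likelihood ratio is llr(x) = 2x; the thresholds are hypothetical.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
feats = {0: 0.9, 1: 1.2, 2: -0.3, 3: 0.8}
s = two_local_score(adj, feats, 0, llr=lambda x: 2.0 * x,
                    clip_by_hop={1: 1.0, 2: 0.25})
```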

*Performance on a simple sparse two-block CSBM*

Studying how such classifiers perform in practice on real data is an open problem.

# References

[1] N. Binkiewicz, J. T. Vogelstein, and K. Rohe. Covariate-assisted spectral clustering. Biometrika, 104:361–377, 2017.

[2] Y. Deshpande, A. Montanari, S. Sen, and E. Mossel. Contextual stochastic block models. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[3] P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6(290–297):18, 1959.

[4] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.

[5] J. L. Moreno and H. H. Jennings. Statistics of social configurations. Sociometry, 1(3/4):342–374, 1938.

[6] P. Erdős. Some remarks on the theory of graphs. Bull. Amer. Math. Soc. 53 (1947), 292–294.

[7] P. W. Holland, K. Blackmond Laskey and S. Leinhardt. Stochastic blockmodels: first steps. Social Networks 5 (1983), 109–137.

[8] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR), 2017.

[9] A. Baranwal, K. Fountoulakis and A. Jagannath. Graph convolution for semi-supervised classification: improved linear separability and out-of-distribution generalization. Proceedings of the 38th International Conference on Machine Learning, PMLR 139:684–693, 2021.

[10] R. Wang, A. Baranwal and K. Fountoulakis. Analysis of corrected graph convolutions. arXiv:2405.13987, 2024.

[11] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò and Y. Bengio. Graph attention networks. International Conference on Learning Representations (ICLR), 2018.

[12] K. Fountoulakis, A. Levi, S. Yang, A. Baranwal and A. Jagannath. Graph attention retrospective. Journal of Machine Learning Research, 24(246):1−52, 2023.

[13] A. Baranwal, K. Fountoulakis and A. Jagannath. Effects of graph convolutions in multi-layer networks. The Eleventh International Conference on Learning Representations, 2023.

[14] A. Baranwal, K. Fountoulakis and A. Jagannath. Optimality of message-passing architectures for sparse graphs. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.