Graph Neural Networks: Applying the Attention Mechanism and Improving Computation

Dw
Mar 27, 2022


Motivation

A graph is a data structure that shows up everywhere in the internet era. For instance, people's relationships on a social network can be viewed as a graph. Even the interaction dynamics among athletes can be viewed as a graph, with the players as nodes and their interactions as the edges connecting them. Whenever we consider interactions among different entities, a graph is a natural representation: entities become nodes, interactions become edges, and the dynamics among all entities are captured by the graph as a whole.

So, if we would like to model the dynamics among the entities we care about, what should we use? A neural network (NN) is a good candidate: unlike traditional models that require domain-specific prior knowledge, or machine learning models that require detailed and tedious feature engineering, an NN is more convenient and fits better when a large dataset is available.

However, ordinary NNs cannot easily digest information stored in a graph-like structure without heavy data preprocessing. This is where the Graph Neural Network (GNN) comes in.

As someone interested in the financial market, I will take my examples from finance, an area many researchers focus on. The stock market can be seen as a graph in which firms interact with one another as suppliers, partners, and competitors. Why not model the financial market with a GNN? Let's give it a try.

We can use stock returns, trading volume, and fundamentals as node features and treat the interactions among firms as edges. Relationships among these firms certainly exist: they are all part of the same economy, and some belong to the same industry. But what mechanism should we use to model their interactions and aggregate neighboring information into a node? We could simply average the neighbors or take their maximum, as a Convolutional Neural Network does, but that clearly ignores the fact that some neighbors matter more than others.

For a gentle introduction to GNNs and the Graph Convolutional Network (GCN), my friend QiShuo Wang has published another post: https://medium.com/@wqs977107/seminar-on-graph-neural-network-aeaf2e176e0c

Now, back to our question: how do we model interactions among public firms?

Attention is All You Need.

Graph Attention Network (GAT)

—“GRAPH ATTENTION NETWORKS”

We can apply attention among the node vectors (the representation vector we maintain for each firm). For firms that are related, we compute a "correlation-like" score representing their pairwise importance to each other, which is exactly what attention does.

To understand more about the mechanism, we still need some formulas.

In GAT, attention is computed in three steps:

  1. Compute raw self-attention scores from the shared weight matrix W and the node features
  2. Apply a LeakyReLU nonlinearity followed by a softmax to obtain the normalized attention weights
  3. Combine several attention heads (multi-head attention)
Self-attention: W is a weight matrix that applies a linear transformation to h, the node feature vector.

There are two ways to write the weight between node i and node j: option 1 normalizes the raw self-attention score with a softmax, while option 2 additionally applies a LeakyReLU before normalizing.
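Written out as equations (following the GAT paper, with a the learnable attention vector and || denoting concatenation):

```latex
% Unnormalized attention score between node i and node j
e_{ij} = \vec{a}^{\top}\left[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j\right]

% Option 1: normalize over the neighborhood of i with a softmax
\alpha_{ij} = \operatorname{softmax}_j(e_{ij})
            = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}

% Option 2: apply a LeakyReLU before the softmax, as used in the paper
\alpha_{ij} = \frac{\exp\!\big(\operatorname{LeakyReLU}(e_{ij})\big)}
                   {\sum_{k \in \mathcal{N}_i} \exp\!\big(\operatorname{LeakyReLU}(e_{ik})\big)}
```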

Multi-head attention: || denotes concatenation of the outputs of the K attention heads. N_i is the set of adjacent nodes of node i. α^k_ij are the normalized attention coefficients computed by the k-th attention head, and W^k is the corresponding input linear transformation's weight matrix.
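In equation form, the multi-head layer with concatenation is:

```latex
% Concatenate the outputs of K attention heads (sigma is a nonlinearity)
\vec{h}'_i = \Big\Vert_{k=1}^{K} \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij}\, \mathbf{W}^{k} \vec{h}_j\right)
```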

On the final (prediction) layer, instead of concatenating the heads in the usual way, the GAT paper averages them. Multi-head attention also stabilizes learning and adds expressive power compared with a single self-attention head.

Multi-head Attention using averaging
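In equation form:

```latex
% Average the K heads instead of concatenating (used on the final prediction layer)
\vec{h}'_i = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij}\, \mathbf{W}^{k} \vec{h}_j\right)
```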

Self-attention and the aggregation process of a multi-head graph attentional layer are illustrated in Figure 1 of the GAT paper.

This is how GAT computes its node vectors. As mentioned before, the major contribution of applying this mechanism to a GNN is that, unlike GCN, it assigns a different weight to every adjacent node. Another advantage is that the computation can be parallelized and has a computational complexity similar to GCN's.
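To make the mechanism concrete, here is a minimal single-head graph attention layer in PyTorch. It is only a sketch of the equations above, not the reference implementation from the paper; it assumes a dense binary adjacency matrix with self-loops and a small graph (the dense N×N score matrix would not scale to large graphs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Minimal single-head graph attention layer (sketch of the GAT equations)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared linear transformation W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) binary adjacency with self-loops
        Wh = self.W(h)                                   # (N, out_dim)
        N = Wh.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j)
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)
        # Restrict the softmax to each node's neighborhood by masking non-edges
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(e, dim=1)                      # normalized attention coefficients
        return F.elu(alpha @ Wh)                         # aggregate neighbor features
```

Multi-head attention would simply run K such layers in parallel and concatenate (or, on the final layer, average) their outputs.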

Beyond the theory of Graph Attention Networks, it is helpful to see how they are used in practice.

Applications in the Financial Market

An article named "FinGAT: Financial Graph Attention Networks for Recommending Top-K Profitable Stocks" introduces how Graph Attention Networks can be applied to the stock market to help investors select stocks.

This work observes that existing approaches to modeling stock-price time series either neglect the relationships among stocks and sectors (i.e., categories of stocks) or fix them in advance. Ignoring the relationships misses the information shared between stocks, while pre-defined relationships cannot capture the latent interactions and influences between stock prices, and some relationships are confidential in the first place. Moreover, relationships among firms are dynamic, so relying on a fixed, pre-defined connection can lead to overfitting or fail to reflect the latest changes.

In this paper, researchers aim at recommending the top-K profitable stocks in terms of return ratio using time series of stock prices and sector information. They propose a novel deep learning-based model, Financial Graph Attention Networks (FinGAT), to tackle the task under the setting that no pre-defined relationships between stocks are given.

The idea of FinGAT is three-fold.

First, a hierarchical learning component is devised to learn short-term and long-term sequential patterns from the stock time series. This is stock-level learning.

Second, a fully-connected graph between stocks and a fully-connected graph between sectors are constructed, along with graph attention networks, to learn the latent interactions among stocks and sectors. This covers stock-level modeling and sector-level modeling.

Third, a multi-task objective is devised to jointly recommend the profitable stocks and predict the stock movement (recommending the most profitable stocks splits into two correlated parts: ranking stocks by their predicted return ratios, and identifying stocks with positive future movement). To achieve this, instead of a plain MSE, FinGAT uses a loss that combines a cross-entropy term with a ranking-based term.
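As a rough illustration only (the exact loss in the FinGAT paper may be formulated differently), a multi-task objective that combines a pairwise ranking term on predicted returns with a cross-entropy term on movement could look like the sketch below; `lam` is a hypothetical weighting hyperparameter:

```python
import torch
import torch.nn.functional as F

def multitask_ranking_loss(pred_return, true_return, move_logit, true_move, lam=1.0):
    """Sketch of a ranking + movement-classification objective (not the paper's exact loss)."""
    # Pairwise ranking term: penalize stock pairs whose predicted order
    # disagrees with the order of their true return ratios
    pred_diff = pred_return.unsqueeze(1) - pred_return.unsqueeze(0)   # (N, N)
    true_diff = true_return.unsqueeze(1) - true_return.unsqueeze(0)   # (N, N)
    rank_term = F.relu(-pred_diff * torch.sign(true_diff)).mean()
    # Cross-entropy term on the binary up/down movement prediction
    move_term = F.binary_cross_entropy_with_logits(move_logit, true_move)
    return rank_term + lam * move_term
```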

Experiments conducted on Taiwan Stock, S&P 500, and NASDAQ datasets show strong recommendation performance for FinGAT compared with state-of-the-art methods.

MRR: Mean Reciprocal Rank; K: number of stocks selected

FinGAT is introduced here to show the potential of GAT in the stock market; for more details, including how such a sophisticated experiment is designed, readers can refer to the reference listed below. The model is not the final word, however. More exploration is needed, and the paper contains no backtest of the concrete performance of trading strategies generated by the model. Only predicting the top-K stocks may not work in many situations, and whether the strategy is implementable in reality, given trading frictions and liquidity constraints, remains questionable: a strategy built around low-liquidity assets tends to look better in an idealized setting than in live trading.

This is only a brief introduction to applying GAT in the financial market. Having covered the GAT architecture, we can now look at how to handle the computation over sparse matrices.

Accelerating Sparse Graph Neural Network Computation

—“TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs”

Beyond the mechanisms applied to GNNs to obtain better predictions, computation over the adjacency matrix is a persistent difficulty: the matrix is extremely sparse, so a lot of computational power is wasted. A node is rarely related to a large portion of the other nodes in the graph. Taking the stock market as an example again, no company is significantly correlated with all other companies. Connections are usually sparse.

To address the problem, the researchers work on reducing the sparsity and shrinking the set of adjacent nodes the model needs to consider.

  1. Graphs (a) and (b) in the paper show how the sparse matrix is condensed: the non-zero "1" elements (the connectivity information) are packed into a smaller region and the remaining area is ignored, which effectively reduces the sparsity.
  2. Graph (c) shows how to shrink the set of connections to consider when we only care about the neighbors of several nodes. Specifically, to obtain the union of their neighbor sets, we first gather every node's neighbor list into one array, duplicates and all. We then sort the array and deduplicate it to get the unique set of nodes we actually need. Sorting first keeps the deduplication cheap in both time and extra memory, and the resulting unique set reduces the cost of all later operations (see the sketch after this list).
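A small Python sketch of that gather, sort, and deduplicate step (the function name and data layout here are illustrative, not the paper's actual implementation):

```python
def unique_neighbors(target_nodes, adjacency_lists):
    """Collect the union of neighbor sets for a group of nodes by
    gathering (with duplicates), sorting, then deduplicating."""
    gathered = []
    for node in target_nodes:
        gathered.extend(adjacency_lists[node])   # step 1: gather, duplicates allowed
    gathered.sort()                              # step 2: sort so duplicates are adjacent
    unique = []
    for n in gathered:                           # step 3: one linear pass removes duplicates
        if not unique or unique[-1] != n:
            unique.append(n)
    return unique

# Example: union of the neighbor sets of nodes 0 and 2 in a toy adjacency-list graph
adj = {0: [1, 3, 4], 1: [0], 2: [3, 4, 5], 3: [0, 2], 4: [0, 2], 5: [2]}
print(unique_neighbors([0, 2], adj))             # -> [1, 3, 4, 5]
```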

This covers only the part of the paper about reducing computational complexity; for how to fit the GNN workload onto Tensor Core Units, readers can refer to the original paper.

Compared with state-of-the-art approaches such as Sparse GEMM, Dense GEMM, and Hybrid Sparse-Dense, TC-GNN achieves a clear improvement in four aspects.

Effective Memory Access is the ratio between the size of the accessed data that is actually involved in later computation and the total size of the data accessed; Computation Intensity is the ratio of the number of computing operations to the amount of data accessed; Effective Computation is the ratio between the operations that contribute to the final result and the total operations.

Generally, TC-GNN achieves lower memory consumption, more effective memory access, higher computation intensity, and more effective computation.

Observations and Insights

  • Many mechanisms have been applied to GNNs, such as convolution (GCN) and attention (GAT). Applying other mechanisms to GNNs might also work, since the major difference is only that the data is structured as a graph.
  • However, most GNN layers only consider a node's immediate neighbors and do not directly include the "neighbors of their neighbors".
  • In the future, to capture the influence of distant nodes, perhaps an LSTM-like structure could be applied, although in this setting it would be better described as long-"distance" short-"distance" memory. More generally, it is worth trying to merge other mature mechanisms into GNNs.

Conclusion

• GNNs have large potential in applications with graph-like datasets.

• Settings from traditional neural networks, such as the convolution and attention mechanisms, can be transferred to GNNs.

• GNNs are computationally inefficient because of the sparse adjacency matrix, but there are methods to accelerate the computation.

References:

  1. Veličković P., Cucurull G., Casanova A., et al. Graph Attention Networks. ICLR, 2018.
  2. Wang Y., Feng B., Ding Y. TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. 2021.
  3. Hsu Y.-L., Tsai Y.-C., Li C.-T. FinGAT: Financial Graph Attention Networks for Recommending Top-K Profitable Stocks. IEEE TKDE, 2021.
