Stock Market Forecasting with Differential Graph Transformer

Stanford CS 224W (Machine Learning with Graphs) course project by Xiang Li and Farzad Pourbabaee

Many have sought to make money by investing in the stock market, from professional fund managers to everyday people, but few are able to beat the market. In fact, S&P Dow Jones Indices’ 2024 SPIVA U.S. Mid-Year report shows that 57% of all active large-cap U.S. equity managers underperformed the S&P 500, consistent with the 60% underperformance rate observed in 2023 [Ganti, 2024].

Predicting stock prices is known to be extremely challenging due to the market's non-linearity, short- and long-term temporal dependencies, and the complex relationships between stocks. In this work, we explore the benefits of modelling interstock relationships for stock price prediction. We propose a novel graph transformer architecture, the Differential Graph Transformer, that learns to update stock correlation graphs on the fly to track volatile market conditions. We demonstrate that our model significantly outperforms baselines that do not incorporate interstock relations and show that conditioning on statistical correlation priors helps the model generalize to future, unseen market conditions.

Stocks Have Shared Trends

Stock prices often move together due to the influence of fundamental market factors. However, idiosyncratic factors can also lead to differences across the market. For instance, in Figure 1 we plot the time series of prices for three highly correlated stocks: Apple (AAPL), Discover Financial Services (DFS), and Tesla (TSLA). As one can see, there is a shared market pattern followed by all three stocks, as well as stock-specific variations.

Figure 1. Historical stock prices for 3 highly correlated stocks

Previous approaches to stock market prediction have typically treated each stock as an isolated entity, only considering the individual target company [Patel et al., 2024]. In reality, however, stock prices are not independent and can be influenced by the movements of other stocks. For instance, stocks within the same sector or industry tend to be correlated, meaning that they can move in the same direction (positive correlation) or in opposite directions (negative correlation). This type of information is often overlooked by traditional approaches, but it can be a valuable signal for predicting stock market movements.

The interdependencies between different stocks can be measured by their correlations. In Figure 2, we present a graphical representation of 10 stocks that are highly correlated with Apple (AAPL). The higher the correlation, the closer the corresponding nodes are placed, and the edge colors also express the strength of the correlation.

Figure 2. Correlation graph of 10 highly correlated stocks

Statistical Correlation Graphs are Preferred for Their Ease of Construction and Timeliness

To capture interstock relations, three main types of stock graphs are commonly used: corporate-relational, textual, and statistical graphs [Patel et al., 2024]. Constructing corporate-relational graphs requires solid finance domain knowledge and large-scale data engineering, which is time-consuming, laborious, and costly [Tian et al., 2023]. These predefined graphs are also usually static, not always up-to-date, and contain wrong, missing, or irrelevant connections between stocks, which can introduce considerable noise into learning and inhibit generalization [Tian et al., 2023, Kim et al., 2019, Ma et al., 2024]. Textual graphs have similar shortcomings. While public news and investor sentiment influence future prices, the impact of sentiment data deteriorates over time, and it takes an uncertain amount of time for investor reactions to the news to be reflected in stock price movements [Shantha Gowri and Ram, 2019]. Hence, the field has been moving towards statistical graph construction that relies solely on historical data. The most commonly used statistical measure is the Pearson correlation coefficient between stock pairs. More recently, mutual information has been proposed to address the limitation of Pearson correlation in capturing nonlinear stock patterns [Feng et al., 2022, Yan et al., 2020].

While statistical measures are shown to capture important relationships between stocks like industrial sectors [Yan et al., 2020] and relational dependence among stocks [Patel et al., 2024], it remains challenging for GNNs to take full advantage of those measures because of their dense and fully-connected nature. To prevent over-smoothing, correlation values below a given threshold are commonly cut off and the remaining positive correlations are binarized [Yin et al., 2021]. While thresholding produces a sparse graph suitable for GNNs to process, it also discards key information like the magnitude of the correlation between the stocks. It has also been shown that the performance of the GNN is sensitive to the threshold value and this hyperparameter may not be optimal across all stocks [Yin et al., 2021]. In the subsections below, we trace through the development of statistical graph construction from historical data, which went from global correlation graphs to fully dynamic, learnable graphs.
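For illustration, here is a minimal sketch (our own, not the construction used later in this post) of the common threshold-and-binarize recipe applied to a Pearson correlation matrix; the 0.6 cutoff is an arbitrary placeholder:

import numpy as np

def thresholded_correlation_graph(prices, threshold=0.6):
    # prices: [T, N] matrix of daily prices for N stocks
    corr = np.corrcoef(prices, rowvar=False)      # [N, N] Pearson correlation matrix
    adj = (corr >= threshold).astype(np.float32)  # cut off weak/negative edges and binarize
    np.fill_diagonal(adj, 0.0)                    # remove self-loops
    return adj

Note how both the magnitude of strong correlations and all negative correlations are discarded by this recipe, which is exactly the information loss discussed above.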

Global Correlation Graphs are Easy to Construct but Have Trouble Capturing Changing Market Dynamics

Yin et al. [2021] pioneered the use of a global correlation graph based on all past prices. An undirected Pearson correlation graph is constructed from all historical prices in the training data, and only strong connections above an arbitrary threshold are kept to combat over-smoothing. A hybrid Graph Convolutional Network (GCN) and Gated Recurrent Unit (GRU) model is used to predict the next-day price of selected stocks in the Dow Jones Industrial Average (DJIA) and exchange-traded funds (ETFs). Rather than considering each stock independently at each time step of the GRU, their model uses a GCN to create embeddings for all stocks at a given time step, conditioned on the previous hidden states of those stocks and the current prices. While outperforming a GRU baseline that treats stocks independently, this type of fixed global correlation graph is inherently limiting. Because of the way message-passing GCNs function, stocks can only receive messages from a fixed set of neighbors predefined by global correlation. This makes it challenging for the model to capture changing relationships between stocks over time.

Combination of Local and Global Correlation Graphs Better Captures Market Dynamics at Multiple Resolutions

Expanding beyond a static global correlation graph, Ma et al. [2024] combines multiple correlation graphs at different time scales to capture global and local relationships. A similar thresholding trick to Yin et al. [2021] is used to achieve sparsity, and a Multi-Graph Convolutional Network (Multi-GCN) model is used to combine global and local correlation graphs. Multi-GCN outperforms variants using only global correlations on the ETF, DJIA, and Shanghai Stock Exchange (SSE) datasets. While correlations at multiple resolutions help adapt the model to volatile market conditions, predefined statistical correlations may not form optimal message-passing paths between stocks, as they blend many complex stock interdependencies into scalar correlation strengths.

Learnable Dynamic Graphs Allow for More Flexibility in Interstock Relations

Tian et al. [2023] goes a step further and uses a dynamic graph neural network to automatically learn the evolving dependencies from historical stock features, achieving better performance than predefined correlation and sector stock graphs. An adaptive dynamic graph learning (ADGL) module is developed to explicitly learn stock dependencies from the raw stock data at each time point by introducing a multihead sparse self-attention network. While the attention mechanism naturally generates a dense weight matrix, only the top-k largest elements of each row are kept, out of concern that dense weights diffuse the attention scores. The sparsified weight matrix is then normalized by softmax and passed through a hybrid GCN-GRU architecture to model the dynamic evolution of the learned stock dependencies.
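A minimal sketch of this kind of top-k row sparsification (our own illustration, not Tian et al.'s exact implementation):

import torch

def topk_sparsify(attn_logits, k=10):
    # Keep only the k largest entries in each row of the attention logits;
    # the rest are set to -inf so that a subsequent softmax assigns them zero weight.
    topk_vals, _ = attn_logits.topk(k, dim=-1)
    kth_largest = topk_vals[..., -1:]
    return attn_logits.masked_fill(attn_logits < kth_largest, float("-inf"))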

Differential Graph Transformer as a Unifying Framework for Stock Market Forecasting

Throughout the development of graph construction from historical data, graph sparsification remains ad hoc, relying on a single threshold hyperparameter that may not be optimal for all stocks as the market evolves. The challenges of GCN over-smoothing and diffused attention weights call for a new graph model that can effectively learn to focus on relevant nodes in a dense, fully-connected graph. Further, the model should be able to capture stock relations at different scales, both global and local, to navigate the volatile market.

We propose the Differential Graph Transformer (DGT), a novel graph transformer model designed to filter through market noise and discover relevant connections in a dense graph. The model flexibly incorporates predefined local and global correlation graphs and can modify them on the fly by leveraging the recently proposed differential attention mechanism [Ye et al., 2024]. Fully dynamic graph construction is also possible: it is as simple as passing an identity matrix as the correlation graph. Combined with causal attention along the temporal dimension (Figure 3), our approach effectively models complex spatial-temporal interdependencies between stocks.

Figure 3. The basic structure of our Differential Graph Transformer, illustration courtesy of [Zheng et al., 2019]. The nodes represent stocks and the weighted edges represent predefined correlations between stocks. At each time step t, differential graph attention is applied to the predefined correlations and results in a dynamic attention matrix. Across time steps, temporal attention captures short and long temporal dependencies.

Problem Statement

Given the prices of N stocks over T days, P = {p_1, p_2, …, p_T} ∈ R^(T×N), and a stock correlation matrix A ∈ R^(N×N), the task is to predict the prices p_(T+1) on the next day.

Input Projection Fuses Price with Stock and Time Information

To create input embeddings for our model, we project each stock’s price at each time step to a d-dimensional embedding and add it to learnable stock and time embeddings:
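Concretely, the projection can be written as follows, where W_p, e_stock, and e_time are our own symbols for the learnable price projection, stock embedding, and time embedding:

x_(s,t) = W_p · p_(s,t) + e_stock(s) + e_time(t),   with W_p ∈ R^(d×1) and e_stock(s), e_time(t) ∈ R^d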

Temporal Attention Encodes Temporal Dependencies Within Each Stock

For each node s, a decoder-only transformer learns to encode the (t+1)-th time step given the previous t node embeddings.
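A sketch of this per-stock masked self-attention (single-head, using our own notation for the projection matrices):

X′^s = softmax( (X^s W_Q)(X^s W_K)^T / √d + M ) (X^s W_V)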

where X^s denotes input node embeddings of the stock s across all time steps, X′^s denotes the output node embeddings across all time steps. M is a causal attention mask that ensures future time steps do not leak into the past.

Differential Attention Learns to Cancel Attention Noise and Focus on Relevant Parts of the Context

Before elaborating on how we adapt differential attention to a graph setting, we start by explaining the gist of this new attention mechanism. The main job of attention is to focus on relevant parts of the input. However, experiments show that Transformers often allocate only a small proportion of attention scores to the relevant parts, while disproportionately focusing on irrelevant context (i.e., attention noise) [Ye et al., 2024]. Differential attention is thus proposed to amplify attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Differential Transformer outperforms Transformer in various practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers [Ye et al., 2024]. Differential attention is formulated as follows:
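Following Ye et al. [2024], the single-head form is:

[Q_1; Q_2] = X W_Q,   [K_1; K_2] = X W_K,   V = X W_V
DiffAttn(X) = ( softmax(Q_1 K_1^T / √d) − λ · softmax(Q_2 K_2^T / √d) ) V
λ = exp(λ_q1 · λ_k1) − exp(λ_q2 · λ_k2) + λ_init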

where W_Q, W_K, W_V are learnable parameters and λ is a learnable scalar weight parametrized by λ_q1, λ_q2, λ_k1, and λ_k2. The weight has an initial offset λ_init, which is empirically set to 0.2 for the first layer. Differential attention straightforwardly extends to multi-head self-attention [Vaswani et al., 2023] by computing differential attention on each head and concatenating the per-head outputs.

Differential Graph Attention Extends Differential Attention to Condition on a Graph

At a given time step t, differential graph attention incorporates
predefined correlation graphs into multi-head self attention to learn a dynamic graph. Rather than calculating two softmax attention maps and taking their difference, we inject the adjacency matrix of a predefined correlation graph as a prior into the attention calculation. A dynamic attention matrix is then subtracted from the prior to account for changes in the market. Differential graph attention is formulated as follows:
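A sketch of the per-head computation, written to match the reference implementation below (⊙ denotes element-wise multiplication; Q_1, Q_2, K_1, K_2, V are projections of X_t as in differential attention):

X′_t = ( softmax(Q_1 K_1^T / √d) ⊙ A_t^(h) − λ · softmax(Q_2 K_2^T / √d) ) V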

where X_t denotes the input temporal node embeddings at time t, X′_t denotes the output node embeddings, and A_t^(h) denotes the adjacency matrix of the correlation graph for head h at time t. The first set of keys, queries, and values scales the input adjacency matrix, and the second set acts as a bias. Since local and global stock correlations are shown to complement each other [Ma et al., 2024], a local correlation prior can be applied to one head and a global prior to another head to dynamically capture multi-scale dependencies. Below, we show a reference implementation of differential graph attention, adapted from the differential attention module in UniLM. We only show the forward() function for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadDiffAttn(nn.Module):
    # Only forward() is shown for brevity; __init__ (projections, lambda parameters,
    # head counts, scaling, and the `subln` sub-layer norm) and the repeat_kv helper
    # follow the UniLM differential attention module.
    def forward(
        self,
        x,
        A=None,
        attn_mask=None,
    ):
        bsz, tgt_len, embed_dim = x.size()
        src_len = tgt_len

        # Project input x into query, key, and value
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Split queries and keys into two groups of heads, one per softmax attention map
        q = q.view(bsz, tgt_len, 2 * self.num_heads, self.head_dim)
        k = k.view(bsz, src_len, 2 * self.num_kv_heads, self.head_dim)
        v = v.view(bsz, src_len, self.num_kv_heads, 2 * self.head_dim)

        q = q.transpose(1, 2)
        k = repeat_kv(k.transpose(1, 2), self.n_rep)
        v = repeat_kv(v.transpose(1, 2), self.n_rep)
        q *= self.scaling

        # Compute attention weights by multiplying query and key
        attn_weights = torch.matmul(q, k.transpose(-1, -2))
        attn_weights = torch.nan_to_num(attn_weights)
        # Apply attention mask
        if attn_mask is not None:
            attn_weights += attn_mask
        # Calculate attention scores using softmax
        attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).type_as(
            attn_weights
        )

        # Calculate the lambda used for differential attention
        lambda_1 = torch.exp(torch.sum(self.lambda_q1 * self.lambda_k1, dim=-1).float()).type_as(q)
        lambda_2 = torch.exp(torch.sum(self.lambda_q2 * self.lambda_k2, dim=-1).float()).type_as(q)
        lambda_full = lambda_1 - lambda_2 + self.lambda_init

        # **Optionally condition the differential attention on a graph prior A**
        attn_weights = attn_weights.view(bsz, self.num_heads, 2, tgt_len, src_len)
        attn_weights = attn_weights[:, :, 0] * (1 if A is None else A) - lambda_full * attn_weights[:, :, 1]

        # Compute output embeddings by mixing values based on their attention scores
        attn = torch.matmul(attn_weights, v)
        attn = self.subln(attn)
        attn = attn * (1 - self.lambda_init)
        attn = attn.transpose(1, 2).reshape(bsz, tgt_len, self.num_heads * 2 * self.head_dim)
        return attn, attn_weights
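To give a feel for how the graph prior enters, the snippet below builds a per-head prior tensor; the head count of 2 and the shape convention are our own assumptions, chosen so that A broadcasts against the per-head attention maps of shape [bsz, num_heads, tgt_len, src_len] used above:

import torch

N, num_heads = 500, 2                      # assumed: 500 stocks, 2 attention heads
A_local = torch.rand(N, N)                 # local correlation prior (e.g., previous quarter)
A_global = torch.rand(N, N)                # global correlation prior (entire training set)

# Condition one head on the local graph and the other on the global graph.
A = torch.stack([A_local, A_global]).unsqueeze(0)   # [1, num_heads, N, N]
# A is then passed as the `A` argument to MultiheadDiffAttn.forward(x, A=A).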

Putting it All Together into a Differential Graph Transformer

The final embeddings at time step T are projected out as the next day's prices. During training, we predict the next day's price at each time step t in parallel using teacher forcing, similar to a standard decoder.

Our loss function is the L2 distance between the predicted and real prices:
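In symbols, averaging over the N stocks and the teacher-forced time steps (p̂ denotes the model's predictions):

L = (1 / (T · N)) · Σ_(t=1..T) Σ_(s=1..N) ( p̂_(t+1,s) − p_(t+1,s) )²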

We evaluate our model with common regression metrics, including root mean square error (RMSE) and mean absolute error (MAE) [Patel et al., 2024]:
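With the sums running over the n stock-day pairs in the test set, these are the standard definitions:

RMSE = sqrt( (1/n) · Σ_i ( p̂_i − p_i )² ),   MAE = (1/n) · Σ_i | p̂_i − p_i |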

Construction of the S&P500 Dataset: A Realistic Benchmark for Stock Forecasting

Our dataset consists of the daily close prices of S&P 500 stocks spanning 10 years, from October 31, 2014 to October 2, 2024. The S&P 500 is commonly used as a benchmark for stock market forecasting because the index is a key indicator of the U.S. stock market, representing 80% of the total market capitalization of U.S. public companies [Patel et al., 2024, Global, 2024]. We obtain this dataset of 2496 trading days directly from Yahoo Finance and split it into chunks of 64 days, with each chunk roughly corresponding to a fiscal quarter. The first 80% of the chunks are used for training, the next 10% are reserved for validation, and the last 10% are reserved for testing. In other words, the first 8 years (31 quarters) are used for training, followed by 1 year (4 quarters) for validation and the last year for testing. This split structure ensures that the model is trained, validated, and tested on sequential, non-overlapping time periods, allowing us to accurately assess its performance on future unseen data. All raw prices are transformed via z-score normalization using training-set statistics, following Tian et al. [2023].
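One way to reproduce this preprocessing is sketched below, assuming the yfinance package and a small placeholder ticker list (the full S&P 500 constituent list would be substituted in practice):

import numpy as np
import yfinance as yf

tickers = ["AAPL", "MSFT", "TSLA"]   # placeholder subset of the S&P 500
prices = yf.download(tickers, start="2014-10-31", end="2024-10-02")["Close"].to_numpy()

chunk = 64                           # roughly one fiscal quarter of trading days
n_chunks = prices.shape[0] // chunk
train_end = int(n_chunks * 0.8) * chunk
val_end = int(n_chunks * 0.9) * chunk

# z-score normalization using training-set statistics only
mean, std = prices[:train_end].mean(axis=0), prices[:train_end].std(axis=0)
train, val, test = [(p - mean) / std for p in (prices[:train_end], prices[train_end:val_end], prices[val_end:])]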

Exploratory Data Analysis on S&P500 Confirms Mutual Information as the Better Correlation Measure

Previously, we introduced the Pearson correlation coefficient and mutual information as the two main statistical measures used for constructing correlation graphs. To get an intuitive sense of their effectiveness, we conducted some exploratory data analysis on the S&P 500 dataset. We start by plotting the top 3 most correlated stocks with Apple over the entire 10-year dataset using mutual information and Pearson correlation, as shown in Figures 4 and 5.

Figure 4. Top 3 correlated stocks with Apple (AAPL) using global mutual information
Figure 5. Top 3 correlated stocks with Apple (AAPL) using global Pearson correlation coefficient

Visually, we can see that global mutual information is better at finding stocks sharing similar trends than Pearson correlation. This is likely because Pearson correlation only measures the strength and direction of linear relationships between stocks and is thus inadequate for capturing long-term trends in the volatile market. However, this performance gap between the two measures narrows drastically when we consider a small window roughly the length of a fiscal quarter (about 64 days), as shown in Figures 6 and 7. We can see that mutual information and Pearson correlation pick the same top two stocks sharing a similar trend with Apple, differing only on the third most similar stock.

Figure 6. Top 3 correlated stocks with Apple (AAPL) using local mutual information
Figure 7. Top 3 correlated stocks with Apple (AAPL) using local Pearson correlation coefficient
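Rankings like the ones behind these figures can be computed with a sketch like the following, assuming a prices array of shape [days, stocks] and scikit-learn's mutual_info_regression; the window length and target ticker are the ones used above:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def top_correlated(prices, tickers, target="AAPL", k=3, window=None):
    # Rank the k stocks most related to `target` by Pearson correlation and mutual information.
    if window is not None:
        prices = prices[-window:]                 # local scope, e.g., the last 64 days
    target_idx = tickers.index(target)
    target_series = prices[:, target_idx]

    pearson = np.corrcoef(prices, rowvar=False)[target_idx]
    mi = mutual_info_regression(prices, target_series)

    def top_k(scores):
        order = np.argsort(scores)[::-1]
        return [tickers[i] for i in order if i != target_idx][:k]

    return top_k(pearson), top_k(mi)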

PyG Temporal Comes in Handy for Constructing a Temporal Graph Dataset for S&P500

We use the PyG Temporal package to construct a temporal dataset with either static or dynamic graphs. In the case of global correlations, we create a StaticGraphTemporalSignal dataset where the graph is fixed but the signal changes along the time axis. In the case of local correlations, we create a DynamicGraphTemporalSignal dataset where the graph also changes along the time axis. Regardless of the dataset type, there are 4 key functions we need to implement:

  1. _get_edges(): Returns the sequence of edge lists for the graphs. For a static graph, this is a single numpy array of edges. For dynamic graphs, this is a list of numpy arrays of edges, each array representing the graph at that time step. In our case, all graphs are fully-connected, so we can simply expand a square matrix filled with 1s into the corresponding edge list using np.nonzero.
  2. _get_edge_weights(): Returns the edge weights for the graphs. The edge weights should have the same shape as the edges, except the last dimension is 1 instead of 2.
  3. _get_targets_and_features(): Returns a tuple of features and targets. The features are the inputs to the model, in our case the past 64 days of prices for all S&P 500 stocks. The targets are the expected outputs of the model, in our case the 65th-day prices of all the stocks.
  4. get_dataset(): Puts the results of the above 3 functions together into a StaticGraphTemporalSignal or a DynamicGraphTemporalSignal.

Below is a simplified version of our S&P500 dataset, where helper functions for loading the correlation matrices and code for splitting the dataset are omitted for brevity:

from typing import Union

import numpy as np
from torch_geometric_temporal.signal import StaticGraphTemporalSignal, DynamicGraphTemporalSignal


# Dataset loader for SP500 stock prices
class SP500CorrelationsDatasetLoader(object):
    def __init__(self, corr_name, corr_scope):
        self._read_csv(corr_name, corr_scope)

    def _get_edges(self, times, overlap):
        # Construct a fully-connected graph
        def helper(corr_index):
            return np.array(np.ones(self._correlation_matrices[corr_index].shape[:2]).nonzero())

        if len(self._correlation_matrices) == 1:
            _edges = helper(0)
        else:
            _edges = []
            for time in range(0, self._dataset.shape[0] - self.batch_size, overlap):
                if time not in times:
                    continue
                corr_index = max(0, time // self.days_in_quarter - 1)
                _edges.append(helper(corr_index))
        return _edges

    def _get_edge_weights(self, times, overlap):
        # Edge weights are the correlations between stocks
        def helper(corr_index):
            w = self._correlation_matrices[corr_index]
            # Flatten the first two dimensions
            return w.reshape((w.shape[0] * w.shape[1],) + w.shape[2:])

        if len(self._correlation_matrices) == 1:
            _edge_weights = helper(0)
        else:
            _edge_weights = []
            for time in range(0, self._dataset.shape[0] - self.batch_size, overlap):
                if time not in times:
                    continue
                corr_index = max(0, time // self.days_in_quarter - 1)
                _edge_weights.append(helper(corr_index))
        return _edge_weights

    def _get_targets_and_features(self, times, overlap, predict_all):
        # Given the previous batch_size stock prices...
        features = [
            self._dataset[i : i + self.batch_size, :]
            for i in range(0, self._dataset.shape[0] - self.batch_size, overlap)
            if i in times
        ]
        # ...predict next-day stock prices
        targets = [
            (self._dataset[i + 1 : i + self.batch_size + 1, :, 0]).T
            if predict_all
            else (self._dataset[i + self.batch_size, :, 0]).T
            for i in range(0, self._dataset.shape[0] - self.batch_size, overlap)
            if i in times
        ]
        return features, targets

    def get_dataset(self, batch_size) -> Union[StaticGraphTemporalSignal, DynamicGraphTemporalSignal]:
        # Returns the data iterator where the train split is designed for many-to-many predictions
        # (each day predicts the next day's price), while the validation and test splits are
        # many-to-one predictions (many past days predict tomorrow's price).
        # Only the train split is shown here; the splitting code is omitted for brevity.
        self.batch_size = batch_size

        total_times = list(range(0, self._dataset.shape[0] - self.batch_size, self.batch_size))

        times = list(range(total_times[int(len(total_times) * 0)], total_times[int(len(total_times) * 0.8)]))
        overlap = self.batch_size
        predict_all = True

        _edges = self._get_edges(times, overlap)
        _edge_weights = self._get_edge_weights(times, overlap)
        features, targets = self._get_targets_and_features(times, overlap, predict_all)
        dataset = (DynamicGraphTemporalSignal if isinstance(_edges, list) else StaticGraphTemporalSignal)(
            _edges, _edge_weights, features, targets
        )
        return dataset
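A hypothetical usage, where the corr_name and corr_scope values are assumed placeholders; iterating a PyG Temporal signal yields one graph snapshot (a torch_geometric Data object) per time window:

loader = SP500CorrelationsDatasetLoader(corr_name="mutual_information", corr_scope="local")
dataset = loader.get_dataset(batch_size=64)
for snapshot in dataset:
    # snapshot.x holds the past 64 days of normalized prices, snapshot.y the next-day targets,
    # and snapshot.edge_index / snapshot.edge_attr the (possibly dynamic) correlation graph
    print(snapshot.x.shape, snapshot.y.shape)
    break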

Experiments on S&P500 Show the Superiority of Differential Attention Conditioned on Graph Priors

To study the usefulness of incorporating interstock relations in price prediction and the effect of different types and scopes of correlations, we trained a GRU baseline and variants of DGT on the S&P 500 dataset. For the DGT variants, we experiment with 2 types of dependence measures: Pearson correlation and mutual information. For each correlation type, we examine 3 scopes: global, local, and dual. Global correlations span the entire training dataset, while local correlations focus on the previous fiscal quarter (64 days). In practice, we use the correlation matrix calculated on the previous batch for all predictions in the current batch to reduce computation costs. Dual correlation combines both global and local correlations and feeds the model both graphs at each time step. In the DGT model, we then condition one differential attention head on the global correlation graph and another on the local one. Below are the experiment results:

Figure 8. Experiment results on S&P500 of the GRU baseline and variants of DGT

Takeaway 1: DGT without spatial attention outperforms GRU, showcasing the power of temporal attention

Compared to RNN models like GRU, our temporal attention module achieves significantly lower test errors. This result confirms the power of transformers in modelling complex, nonlinear temporal dependencies within a stock.

Takeaway 2: DGTs conditioned on local correlations outperform fully dynamic DGT and DGT without spatial attention

Overall, we found that DGT with local mutual information achieves the best performance, reducing the Test RMSE by 45% compared to fully dynamic DGT and 51% compared to DGT without spatial attention. DGT with local Pearson correlation also outperforms these two variants by a large margin. We hypothesize that local correlations create a useful inductive bias that enables better generalization. Coupled with differential graph attention, the model can dynamically adjust the up-to-date correlations to achieve better performance on unseen data. We also observe that local correlations consistently outperform global correlations, confirming the need for more granular statistics for market prediction.

Takeaway 3: DGTs conditioned on dual correlations can achieve competitive performance but are sensitive to type of correlation

A natural question to ask is whether combining both global and local correlations can result in better performance, given Ma et al. [2024]'s observation that correlations at different resolutions complement each other. Our experiment shows that dual Pearson performs competitively against the local mutual information variant, achieving a better Test MAE but a worse Test RMSE. However, we found that dual mutual information performs worse than all models except the GRU baseline in terms of Test RMSE. We are not sure of the exact cause, but we hypothesize that the reduced number of heads per correlation variant in the dual setting can cause instability during training. With dual correlation, each correlation graph has only 1 corresponding attention head, compared to 2 heads in the case of global or local correlations alone. This can make training less stable, especially given the dynamic nature of the local correlation variant.

Visualization of Model Predictions on Test Set

Finally, we can visualize the predicted stock prices on the test set using DGTs conditioned on different correlation types and scopes. For visual clarity, we separate the visualization into 2 figures. Figure 9 shows the predicted Apple stock price on the test set using DGTs conditioned on Pearson correlation, and Figure 10 shows the predicted price conditioned on mutual information. We can see that the test errors shown in Figure 8 are reproduced visually. Figure 9 shows that dual Pearson closely matches the ground-truth stock price, followed by local and lastly global Pearson. Figure 10 shows that local mutual information closely matches the real stock price. The global variant is further off, while the dual variant is noticeably less stable than either the local or global variants.

Figure 9. Predicted vs. real Apple (AAPL) stock price on the test set with DGT conditioned on Pearson correlation coefficient
Figure 10. Predicted vs. real Apple (AAPL) stock price on the test set with DGT conditioned on mutual information

Check out the full implementation on Colab:

Edit (Jan 19, 2025)

Thanks to reader Leadgit, a few mistakes in the original code were uncovered, which involved using reshape or view instead of permute to transpose dimensions of a PyTorch tensor. These mistakes corrupted input data to the transformers, while the GRU baseline was not affected. I fixed the mistakes in the Colab notebook, reran the experiments, and uploaded the new notebook and weights to GitHub. In general, the performance of all DGT variants is greatly improved after the fix, and here is the new result table:

The new experiment results show that dual MI performs the best on test, followed by local PCC and dual PCC. The three takeaways above are still generally valid; only the absolute ranking changed slightly. The unexpectedly poor performance of dual MI in the original experiment was likely due to input corruption, and both dual variants perform very well after the fix.

References

S. Feng, C. Xu, Y. Zuo, G. Chen, F. Lin, and J. XiaHou. Relation-aware dynamic attributed graph attention network for stocks recommendation. Pattern Recognition, 121:108119, 2022. ISSN 0031–3203. doi: https://doi.org/10.1016/j.patcog.2021.108119. URL: https://www.sciencedirect.com/science/article/pii/S003132032100306X.

A. R. Ganti. SPIVA® U.S. Mid-Year 2024, 2024. URL: https://www.spglobal.com/spdji/en/spiva/article/spiva-us.

S. Global. S&P 500, Oct. 2024. URL: https://www.spglobal.com/spdji/en/indices/equity/sp-500/.

X. Hou, K. Wang, C. Zhong, and Z. Wei. ST-Trader: A spatial-temporal deep neural network for modeling stock market movement. IEEE/CAA Journal of Automatica Sinica, 8(5):1015–1024, 2021. doi: 10.1109/JAS.2021.1003976.

D. Y. Kenett, X. Huang, I. Vodenska, S. Havlin, and H. E. Stanley. Partial correlation analysis: Applications for financial markets, 2014. URL: https://arxiv.org/abs/1402.1405.

R. Kim, C. H. So, M. Jeong, S. Lee, J. Kim, and J. Kang. HATS: A hierarchical graph attention network for stock movement prediction, 2019. URL: https://arxiv.org/abs/1908.07999.

D. Ma, D. Yuan, M. Huang, and L. Dong. VGC-GAN: A multi-graph convolution adversarial network for stock price prediction. Expert Systems with Applications, 236:121204, 2024. ISSN 0957–4174. doi: https://doi.org/10.1016/j.eswa.2023.121204. URL: https://www.sciencedirect.com/science/article/pii/S0957417423017062.

M. Patel, K. Jariwala, and C. Chattopadhyay. A systematic review on graph neural network-based methods for stock market forecasting. ACM Comput. Surv., 57(2), Oct. 2024. ISSN 0360–0300. doi: 10.1145/3696411. URL: https://doi.org/10.1145/3696411.

B. Shantha Gowri and V. S. Ram. Influence of news on rational decision making by financial market investors. Investment Management and Financial Innovations, 16(3):142–156, 2019.

H. Tian, X. Zhang, X. Zheng, and D. D. Zeng. Learning dynamic dependencies with graph evolution recurrent unit for stock predictions. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 53(11):6705–6717, 2023. doi: 10.1109/TSMC.2023.3284840.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023. URL: https://arxiv.org/abs/1706.03762.

Y. Yan, B. Wu, T. Tian, and H. Zhang. Development of stock networks using part mutual information and Australian stock market data. Entropy, 22(7), 2020. ISSN 1099–4300. doi: 10.3390/e22070773. URL: https://www.mdpi.com/1099-4300/22/7/773.

T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei. Differential Transformer, 2024. URL: https://arxiv.org/abs/2410.05258.

X. Yin, D. Yan, A. Almudaifer, S. Yan, and Y. Zhou. Forecasting stock prices using stock correlation graph: A graph convolutional network approach. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.

C. Zheng, X. Fan, C. Wang, and J. Qi. GMAN: A graph multi-attention network for traffic prediction, 2019. URL: https://arxiv.org/abs/1911.08415.
