Predicting stock market crashes with statistical machine learning techniques and neural networks

With this blog post I am introducing the design of a machine learning algorithm that aims to forecast crashes in stock markets solely based on past price information. I start with a quick background on the problem and elaborate on my approach and findings. All the code and data are available on GitHub.

A stock market crash is a sharp and rapid drop in the total value of a market, with prices typically declining more than 10% within a few days. Famous examples of major stock market crashes are Black Monday in 1987 and the burst of the real estate bubble in 2008. A crash is usually attributable to the burst of a price bubble and results from a massive sell-off that occurs when a majority of market participants try to sell their assets at the same time.

The occurrence of price bubbles implies that markets are not efficient. In inefficient markets prices do not always reflect fundamental asset values but are inflated or deflated based on traders’ expectations. These expectations are reinforced by traders’ subsequent actions which further inflate (or deflate) prices. This leads to positive (or negative) price bubbles which eventually burst. This phenomenon was described by George Soros as reflexivity and is the basic assumption for forecasting methods used in technical analysis.

Today, there is not much debate over the existence of bubbles in financial markets. However, understanding these inefficiencies and predicting when price bubbles will burst is a highly difficult task. Imagine you could identify a bubble as it builds up and predict when the market will crash. You would not only be able to make a profit while prices are increasing but also sell at the right moment to avoid losses.

Some mathematicians and physicists have attempted to tackle this problem by investigating the mathematics behind price structures. One such physicist is Professor Didier Sornette, who successfully predicted multiple financial crashes [1]. Sornette uses log-periodic power laws (LPPLs) to describe how price bubbles build up and burst. In essence, the LPPL fits the price movements leading up to a crash with a faster-than-exponentially increasing function with a log-periodic component (reflecting price volatility of increasing magnitude and frequency).

And this is where the idea for this project comes from. If the recurring price structures found by researchers exist, should it not be possible for a machine learning algorithm to learn these patterns and predict crashes? Such an algorithm would not need to be aware of the underlying mathematical laws; instead, it would be trained on data with pre-identified crashes and would identify and learn these patterns on its own.

Data and Crashes

The first step was to collect financial data and identify crashes. I was looking for daily price information from major stock markets with low cross-correlation, which is important for valid cross-validation and testing of the model. The matrix below shows the cross-correlation of daily returns from 11 major stock markets.

Correlation matrix of daily price returns for 11 major stock market indices

To avoid having any two data sets with a cross-correlation greater than 0.5 in my collection, I proceeded with only data from the S&P 500 (USA), Nikkei (Japan), HSI (Hong Kong), SSE (Shanghai), BSESN (India), SMI (Switzerland) and BVSP (Brazil).

To identify crashes in each data set, I first calculated price drawdowns. A drawdown is a persistent decrease in price over consecutive days, from the last price maximum to the next price minimum. The example below shows three such drawdowns in the S&P 500 over the period from the end of July to mid-August 2018.

Example of three drawdowns. The first one shown lasted from July 25th to July 30th 2018 and has a total loss of approximately (2846–2803)/2846 = 1.5%
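As an illustration, a drawdown split like the one above can be computed with a simple scan over the price series. This is a minimal sketch of the approach, not the exact code used in the analysis:

```python
def find_drawdowns(prices):
    """Split a price series into drawdowns: persistent declines from a
    local price maximum to the next local minimum."""
    drawdowns = []
    i, n = 0, len(prices)
    while i < n - 1:
        while i < n - 1 and prices[i + 1] >= prices[i]:
            i += 1                      # climb to the next local maximum
        peak = i
        while i < n - 1 and prices[i + 1] < prices[i]:
            i += 1                      # descend to the next local minimum
        if i > peak:
            loss = (prices[peak] - prices[i]) / prices[peak]
            drawdowns.append((peak, i, loss))
    return drawdowns

# two drawdowns: 105 -> 99 (~5.7%) and 103 -> 98 (~4.9%)
print(find_drawdowns([100, 102, 105, 101, 99, 103, 98]))
```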

I considered two different methodologies to identify crashes. The first one follows a suggestion by Emilie Jacobsson [2], who defines crashes in each market as drawdowns in the 99.5% quantile. With this methodology I found drawdown thresholds that classify a crash ranging from around 10% for less volatile markets like the S&P 500 to more than 20% for volatile markets such as the Brazilian one. The second methodology follows the suggestion of Johansen and Sornette [3], who identify crashes as outliers, that is, drawdowns that lie far from the fitted Weibull distribution when the logarithm of the rank of drawdowns in a data set is plotted against the drawdown magnitude.

Distribution of drawdowns by rank as an example for the Shanghai index since 1996.

I tested my algorithms with both crash identification methodologies and concluded that the first methodology (Jacobsson) is advantageous for two reasons. First, Sornette does not clearly state how much deviation from the Weibull distribution classifies a drawdown as a crash, so human judgement is necessary. Second, his methodology identifies fewer crashes, which leads to heavily imbalanced data sets. This makes it harder to collect a sufficient amount of data for a machine learning algorithm to train on.
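With the Jacobsson methodology, the crash threshold for each market is simply the 99.5% quantile of its drawdown magnitudes. A minimal sketch on synthetic data (the Weibull-like sample below is illustrative, not real market drawdowns):

```python
import numpy as np

# Hypothetical drawdown magnitudes; in the analysis these come from the
# drawdown calculation on each market's daily prices.
rng = np.random.default_rng(0)
drawdowns = rng.weibull(0.7, size=10_000) * 0.02

# Jacobsson's rule: a drawdown is a crash if it exceeds the 99.5%
# quantile of all drawdowns observed in that market.
threshold = np.quantile(drawdowns, 0.995)
crashes = drawdowns[drawdowns > threshold]
print(f"threshold: {threshold:.1%}, crashes: {len(crashes)}")
```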

With the collection of the seven data sets mentioned above I accumulated a total of 59,738 rows of daily stock prices and identified a total of 76 crashes.

Problem statement and feature selection

I formulated a classification problem with the goal of predicting, for each point in time (e.g. each trading day), whether or not a crash will occur within the next 1, 3, or 6 months.

If past price patterns are indicative of future price events, the relevant information to make a prediction on a certain day is contained in the daily price changes of all days prior to that day. Thus, to predict a crash on day t, the daily price changes from each day prior to t could be used as features. However, because models presented with too many features become slower and less accurate (the “curse of dimensionality”), it makes sense to extract a few features that capture the essence of past price movements at any point in time. I therefore defined 8 different time windows that measure mean price changes over the past year (252 trading days) for each day. I used increasing window sizes from 5 days (leading up to day t) to 126 days (for t−126 to t−252) to get a higher resolution of price changes in more recent times. Because price volatility is not captured when averaging price changes over multiple days, I added 8 features for the mean price volatilities over the same time windows. For each data set I normalized the mean price changes and volatilities.
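To make the window construction concrete, here is a sketch of such a feature extraction with pandas. The exact window boundaries are not fully specified above, so the sizes below (growing from 5 to 126 days and together covering 252 trading days) are assumptions:

```python
import numpy as np
import pandas as pd

# Assumed window boundaries (days before t); window (a, b) covers
# days t-b+1 .. t-a, with finer resolution for recent days.
WINDOWS = [(0, 5), (5, 10), (10, 20), (20, 35),
           (35, 56), (56, 84), (84, 126), (126, 252)]

def make_features(returns: pd.Series) -> pd.DataFrame:
    """Mean daily return and volatility (std of daily returns) per
    window, normalized per data set."""
    feats = {}
    for a, b in WINDOWS:
        past = returns.shift(a)  # drop the most recent `a` days
        feats[f"mean_{a}_{b}"] = past.rolling(b - a).mean()
        feats[f"vol_{a}_{b}"] = past.rolling(b - a).std()
    df = pd.DataFrame(feats)
    return (df - df.mean()) / df.std()

returns = pd.Series(np.random.default_rng(1).normal(0, 0.01, 1000))
X = make_features(returns).dropna()
print(X.shape)  # 8 mean + 8 volatility features per day
```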

To evaluate the feature selection, I performed a logistic regression and analyzed the regression coefficients. Each logistic regression coefficient corresponds to the change in log odds associated with its feature, that is, the logarithm of how the odds (the ratio of the probability of a crash vs. no crash) change with a change in that feature when all other features are held constant. For the plot below I transformed the log odds to odds. Odds greater than 1 indicate that the crash probability increases with an increase in the corresponding feature.

Logistic regression coefficients indicating the influence of the features on the predictive variable
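As a sketch of that transformation with scikit-learn (the two toy features below are stand-ins, not the actual window features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one informative feature (a stand-in for recent volatility)
# and one pure-noise feature; labels flag whether a "crash" follows.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

clf = LogisticRegression().fit(X, y)
odds = np.exp(clf.coef_[0])  # transform log odds into odds
print(odds)  # first value well above 1, second close to 1
```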

The coefficient analysis shows that the volatility over the past several days is the strongest indicator of an upcoming crash. A recent price increase, however, does not seem to indicate a crash. This is surprising at first glance because a bubble is typically characterized by an exponential increase in price. However, many of the identified crashes did not occur immediately after a price peak; instead, prices decreased over some time leading up to the crash. A high price increase over the past 6 to 12 months increases the likelihood of a predicted crash, indicating that a general long-term price increase makes a crash more likely and that price movements over longer time periods contain valuable information for crash forecasting.

Training, validation, and test set

I selected the S&P 500 data set for testing and the remaining six data sets for training and validation. I chose the S&P 500 for testing because it is the largest data set (daily price information since 1950) and contains the largest number of crashes (20). For training I performed 6-fold cross-validation: each model was run six times, using five data sets for training and the remaining one for validation.
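This leave-one-market-out scheme can be sketched with scikit-learn's LeaveOneGroupOut (the data below is a random placeholder; in the study each group would be one of the six training markets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder stacked training data; `groups` marks which of the six
# markets each row comes from, so every fold holds out one whole market.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))
y = rng.integers(0, 2, size=600)
groups = np.repeat(np.arange(6), 100)

logo = LeaveOneGroupOut()
for fold, (train_idx, val_idx) in enumerate(logo.split(X, y, groups)):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    print(f"fold {fold}: validated on market {groups[val_idx][0]}")
```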

Scoring

To evaluate the performance of each model I used the F-beta score. The F-beta score is the weighted harmonic mean of precision and recall. The beta parameter determines how precision and recall are weighted. A beta larger than one prioritizes recall and a beta smaller than one prioritizes precision.

I chose a beta of 2, which puts more emphasis on recall, meaning that an undetected crash is penalized more strongly than a predicted crash that did not occur. This makes sense under a risk-averse approach, assuming that failing to predict a crash that occurs has more severe consequences (loss of money) than expecting a crash that does not occur (missing out on potential profits).
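For reference, the F-beta score is (1 + β²)·precision·recall / (β²·precision + recall). A small worked example with scikit-learn:

```python
from sklearn.metrics import fbeta_score

# beta = 2 weights recall more heavily than precision
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]  # precision 2/4, recall 2/3
score = fbeta_score(y_true, y_pred, beta=2)
print(round(score, 3))  # 5 * 0.5 * (2/3) / (4 * 0.5 + 2/3) = 0.625
```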

Regression models, Support vector machines and Decision trees

I started with linear and logistic regression models. Regression models find the optimal coefficients for a function by minimizing the difference between the predicted and the actual target variable over all training samples. While linear regression estimates a continuous target variable, logistic regression estimates probabilities and is therefore generally better suited for classification problems. However, when I compared the prediction results of both models, logistic regression outperformed linear regression only in some cases. While this came as a surprise, it is important to note that even though logistic regression might provide a better fit for estimating crash probabilities, the sub-optimal fit of a linear regression is not necessarily a disadvantage in practice if the chosen threshold effectively separates the binary predictions. This threshold was optimized to maximize the F-beta score on the training set.
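The threshold optimization for the linear regression outputs can be sketched as a simple search over candidate thresholds (toy data standing in for the window features and crash labels):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

# The linear regression outputs a continuous score; a threshold chosen
# to maximize the F2 score on the training data binarizes it.
reg = LinearRegression().fit(X, y)
scores = reg.predict(X)
thresholds = np.linspace(scores.min(), scores.max(), 200)
f2 = [fbeta_score(y, (scores > t).astype(int), beta=2, zero_division=0)
      for t in thresholds]
best_t = thresholds[int(np.argmax(f2))]
print(f"best threshold: {best_t:.3f}, F2: {max(f2):.3f}")
```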

Next, I tested support vector machines (SVMs). SVMs use a kernel function to project the input features into a higher-dimensional space and determine a hyperplane that separates positive from negative samples. Important parameters to consider are the penalty parameter C (a measure of how much misclassifications should be avoided), the kernel function (polynomial or radial basis function), the kernel coefficient gamma (determines how far the influence of a single training sample reaches) and the class weight (determines how to balance positive vs. negative samples). The best SVM models achieved scores similar to those of the regression models, which makes the regression models preferable since they train much faster. Decision trees were not able to perform at the level of any of the other tested models.
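A sketch of the parameter search over C, kernel, gamma and class weight described above (the grid values and data are illustrative, not the ones used in the study):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 1.0).astype(int)

param_grid = {
    "C": [0.1, 1, 10],                   # penalty for misclassifications
    "kernel": ["rbf", "poly"],           # kernel function
    "gamma": ["scale", 0.01],            # kernel coefficient
    "class_weight": [None, "balanced"],  # positive vs negative balance
}
search = GridSearchCV(SVC(), param_grid, cv=3,
                      scoring=make_scorer(fbeta_score, beta=2,
                                          zero_division=0))
search.fit(X, y)
print(search.best_params_)
```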

Recurrent Neural Networks

The next step was to implement recurrent neural networks (RNNs). As opposed to traditional machine learning algorithms and traditional artificial neural networks, recurrent neural networks are able to consider the order in which they receive a sequence of input data and thus allow information to persist. This seems like a crucial characteristic for an algorithm that deals with time series data such as daily stock returns. This is achieved through loops that connect cells, so that at time step t the input is not only the feature xₜ but also the output from the previous time step, hₜ-₁. The figure below illustrates this concept.

Recurrent Neural Network

However, a major issue with regular RNNs is that they have problems learning long-term dependencies. If there are too many steps between xₜ-ₙ and hₜ, hₜ might not be able to learn anything from xₜ-ₙ. To help with that, Long Short-Term Memory networks (LSTMs) were introduced. In essence, LSTMs not only pass the output of the previous cell hₜ-₁ into the next cell, but also the “cell state” cₜ-₁. The cell state gets an update at each step based on the inputs (xₜ and hₜ-₁) and in turn updates the output hₜ. In each LSTM cell, four neural network layers are responsible for the interactions between the inputs xₜ, hₜ-₁, cₜ-₁ and the outputs hₜ and cₜ. For a detailed description of the LSTM cell architecture please refer to colah’s blog [4].

Recurrent Neural Network with Long Short Term Memory (LSTM)

RNNs with LSTM have the capability to detect relationships and patterns that simple regression models would not be able to find. So if an RNN LSTM were able to learn complex price structures that precede crashes, should such a model not outperform the previously tested models?

To answer that question I implemented two different RNNs with LSTM in the Python library Keras and went through rigorous hyper-parameter tuning. The first decision was the length of the input sequence for each layer. The input sequence for each time step t consists of the daily price changes from a sequence of days leading up to t. This length has to be chosen with care since longer input sequences require more memory and slow down the computation. In theory the RNN LSTM should be able to find long-term dependencies; however, in the Keras LSTM implementation, cell states are only passed from one sequence to the next if the parameter stateful is set to true. In practice this is cumbersome: to prevent the network from learning long-term dependencies that span different data sets and epochs during training, I had to manually reset the state whenever the training data switched data sets. Since this approach did not deliver strong results, I instead set stateful to false but increased the sequence length from 5 to 10 time steps and supplied the network with additional sequences of mean price changes and volatilities from time windows reaching from 10 days up to 252 trading days back (similar to the features selected for the previously tested models). Finally, I tuned hyper-parameters and tried different loss functions, numbers of layers, numbers of neurons per layer and dropout vs. no dropout. The best-performing RNN LSTM has a sequential input layer followed by two LSTM layers with 50 neurons each, uses the Adam optimizer, a binary cross entropy loss function and a sigmoid activation function for the last layer.
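The best architecture described above can be sketched in Keras as follows. The sequence length and feature count are assumptions for illustration:

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

seq_len, n_features = 10, 16  # assumed input dimensions

model = Sequential([
    Input((seq_len, n_features)),
    LSTM(50, return_sequences=True),  # first LSTM layer, 50 neurons
    LSTM(50),                         # second LSTM layer, 50 neurons
    Dense(1, activation="sigmoid"),   # crash probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(32, seq_len, n_features).astype("float32")
print(model.predict(X, verbose=0).shape)  # one probability per sample
```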

Evaluation

While hyper-parameter tuning, increasing the sequence length and adding long-term features led to faster training (optimal results on the validation set after around 10 epochs), none of the RNN LSTM models were able to outperform the previously tested models.

Recall vs precision for all models

The plot above shows the precision and recall performance of the different models. Different colors indicate different models and different shapes indicate different predictive variables (crash in 1, 3 or 6 months). The bar plot below visualizes the F-beta scores of all models for a 1, 3 and 6 month crash prediction. “Random” stands for the expected performance of a model with no predictive power that predicts a crash as often as the tested models do.

F-Beta score for all models

The best results show an F-beta score of 0.41, 0.37 and 0.29 for the prediction of a crash within 6, 3 and 1 month, respectively. The precision ranged from 12–16% and the recall from 45–71%. This means that while around 50% of the crashes are detected, around 85% of the crash signals are “false alarms”.

Conclusion

First the bad news: the RNN LSTM is seemingly not able to learn complex price patterns that would enable it to outperform the simpler regression models. This suggests that there are no complex price patterns that occur prior to all (or almost all) crashes but do not occur otherwise. This does not mean that Sornette’s hypothesis of crashes preceded by certain price patterns fit by the log-periodic power law is invalid. It does mean, however, that if such patterns exist, then (1) these patterns also occur in cases where they are not followed by a crash, (2) there are many crashes that are not preceded by these patterns, or (3) there is not enough data for an RNN to learn these patterns. While more data would definitely provide more clarity, part of the problem might be a combination of (1) and (2). Sornette fits log-periodic power laws to certain crashes identified as outliers but does not do so for all crashes with drawdowns of similar magnitude. To build an algorithm that finds the crashes described by Sornette, the training data would need to be labeled with only the crashes that fit these patterns. This might improve the identification of those crashes but would not help with (2), since crashes of a different type would still not be detected. However, given sufficient data and a large enough list of identified crashes, it would certainly be worth rerunning the RNN LSTM models.

The good news is that simple price patterns, defined through long-term changes in price and changes in volatility, do seem to occur regularly before crashes. The best models were able to learn these patterns and forecast a crash significantly better than a comparable random model. For example, for a crash prediction in 3 months the best regression model achieved a precision of 0.15 and a recall of 0.59 on the test set, while a comparable random model with no predictive power would be expected to achieve 0.04 precision and 0.16 recall. The results look similar for 1 month and 6 month crash prediction, with the F-beta score being best for 6 month prediction and worst for 1 month prediction. Whether these results are good enough to optimize an investment strategy is debatable. However, risk-averse investors might well allocate their portfolio positions more conservatively if the discussed regression crash indicator continuously warns of an upcoming crash.

Prediction for a crash in 3 months by the logistic regression model for the S&P 500 from 1958 to 1976

A look at the price index charts and the crash predictor indicator on the test data at the time of crashes showed that while some crashes were detected remarkably well, others occurred with little or no warning from the crash predictor. The figure above shows an example of an undetected crash (in 1962) and three consecutive, fairly well predicted crashes (in 1974). That some crashes are detected better than others is in line with the assumption that certain typical price patterns exist that precede some but not all crashes. The different algorithms mostly struggled with the same crashes, which is why I did not attempt to combine different models.

By taking a weighted average of the binary crash predictions over the past 21 days (with more recent predictions weighted more strongly), the logistic regression model estimates the likelihood of a crash in the S&P 500, as of November 5, 2018, at 98.5% within 6 months, 97% within 3 months and 23% within one month. After reading this study I leave it up to you what to do with this information.
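That 21-day smoothing step can be sketched as follows; the linearly increasing weighting scheme is an assumption, since the exact weights are not specified above:

```python
import numpy as np

def smoothed_crash_signal(binary_preds, window=21):
    """Weighted average of the last `window` binary crash predictions,
    with linearly increasing weights so more recent predictions count
    more (the linear scheme is an assumed choice)."""
    recent = np.asarray(binary_preds[-window:], dtype=float)
    weights = np.arange(1, len(recent) + 1)  # newest day weighted heaviest
    return float(np.average(recent, weights=weights))

preds = [0] * 10 + [1] * 11  # crash signals on the most recent 11 days
print(round(smoothed_crash_signal(preds), 3))  # 0.762
```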


References

[1] Why Stock Markets Crash, Didier Sornette (book).

[2] How to Predict Crashes in Financial Markets with the Log-Periodic Power Law, Emilie Jacobsson.

[3] Large Stock Market Price Drawdowns Are Outliers, Anders Johansen and Didier Sornette (2001).

[4] Understanding LSTM Networks, colah’s blog.