Using Machine Learning to Predict Stock Prices
Machine learning and deep learning have found their place in financial institutions for their power in predicting time series data with a high degree of accuracy, and research to improve the models is still ongoing. This post is the advanced continuation of my introductory template project on using machine learning to predict stock prices. Find the link below:
Machine Learning and deep learning have become new and effective strategies commonly used by quantitative hedge funds…medium.com
It is based on my project AIAlpha, which is a stacked neural network architecture that predicts the stock prices of various companies. This project is also one of the finalists at iNTUtion 2018, a hackathon for undergraduates here in Singapore.
The workflow for this project essentially follows these steps:
- Acquire stock price data
- Denoise data using Wavelet Transform
- Extract features using Stacked Autoencoders
- Train LSTM using features
- Test model for predictive accuracy
In this post, I will go through the specifics of each step and why I made certain decisions.
1. Data Acquisition
Stock price data is easy to acquire thanks to the pandas_datareader API for Yahoo Finance, so it takes a single call:
from pandas_datareader import data as pdr
stock_data = pdr.get_data_yahoo(self.ticker, self.start, self.end)
2. Denoising Data
Due to the complexity of stock market dynamics, stock price data is often filled with noise that might distract the machine learning algorithm from learning the trend and structure. Hence, it is in our interest to remove some of the noise while preserving the trends and structure in the data. At first, I wanted to use the Fourier Transform (those unfamiliar should read this article), but I thought the Wavelet Transform might be a better choice because it preserves the time dimension of the data instead of producing a purely frequency-based output.
The wavelet transform is closely related to the Fourier Transform. The differences are the function used as the basis of the transform (a localized wavelet rather than a sinusoid) and the way the transformation is applied.
The process is as follows:
- The data is transformed using the wavelet transform.
- Coefficients that are more than a full standard deviation away (out of all the coefficients) are removed.
- The new coefficients are inverse transformed to get the denoised data.
Here is an example of how wavelet transforms denoise time series data:
As you can see, the random noise that was present in the initial signal is absent in the denoised versions. This is exactly what we are looking to do with our stock price data.
Here is the code to denoise data:
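import numpy as np
import pywt

# Window of 11 prices starting at row i of column j (i and j come from the enclosing loops)
x = np.array(self.stock_data.iloc[i: i + 11, j])
(ca, cd) = pywt.dwt(x, "haar")  # approximation (ca) and detail (cd) coefficients
# Soft-threshold each band, using its own standard deviation as the threshold
cat = pywt.threshold(ca, np.std(ca), mode="soft")
cdt = pywt.threshold(cd, np.std(cd), mode="soft")
tx = pywt.idwt(cat, cdt, "haar")  # inverse transform back to the time domain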
pywt is excellent for wavelet transforms and has lessened my load tremendously.
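To make the three steps concrete without depending on pywt, here is a minimal NumPy sketch of a single-level Haar transform with soft thresholding. The helper names (haar_dwt, haar_idwt, soft_threshold, denoise) are illustrative and not part of AIAlpha, and the sketch assumes an even number of samples.

```python
import numpy as np


def haar_dwt(x):
    """Single-level Haar transform: returns (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of samples")
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)


def haar_idwt(ca, cd):
    """Inverse of haar_dwt: interleaves the reconstructed even and odd samples."""
    x = np.empty(2 * len(ca))
    x[0::2] = (ca + cd) / np.sqrt(2)
    x[1::2] = (ca - cd) / np.sqrt(2)
    return x


def soft_threshold(c, t):
    """Zero coefficients with |c| <= t and shrink the rest toward zero by t."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)


def denoise(x):
    """Transform, threshold each band by its own standard deviation, invert."""
    ca, cd = haar_dwt(x)
    return haar_idwt(soft_threshold(ca, np.std(ca)), soft_threshold(cd, np.std(cd)))


prices = np.array([10.0, 10.4, 10.1, 10.9, 11.2, 11.0])  # hypothetical window
smoothed = denoise(prices)
```

Note that soft thresholding zeroes the coefficients whose magnitude is below the threshold and shrinks the remaining ones toward zero by the threshold amount, so the choice of threshold decides how much of the signal survives.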
3. Extracting Features
In a usual machine learning context, extracting features requires expert domain knowledge. This is a luxury that I do not have. I could perhaps try using technical indicators such as moving averages, moving average convergence divergence (MACD), or momentum measures, but I felt that using them blindly might not be optimal.
However, automated feature extraction can be achieved by using stacked autoencoders or other machine learning algorithms like restricted Boltzmann machines. I have chosen to use stacked autoencoders due to the interpretability of the encoding as compared to the probabilities from the restricted Boltzmann machines.
In essence, stacked autoencoders get very good at compressing data and reproducing it again. What we are interested in is the compression part, as it means the information required to reproduce the data is in some way encoded in the compressed form. This suggests that the compressed data can serve as the features we are trying to extract. The following is the network structure of a stacked autoencoder:
The input data is compressed into however many neurons are desired, and the network is forced to rebuild the initial data from that compressed form. This forces the model to extract key elements of the data, which we can interpret as features. One key thing to note is that this model falls under unsupervised learning: there are no labelled input-output pairs, since the input and the output are the same.
We can use keras to build such a model, and it is more useful to use the functional API as opposed to the sequential one.
from keras import regularizers
from keras.layers import Dense, Input
from keras.models import Model
import numpy as np
import pandas as pd


class AutoEncoder:  # class name assumed; the original snippet starts at __init__
    def __init__(self, encoding_dim):
        self.encoding_dim = encoding_dim

    def build_train_model(self, input_shape, encoded1_shape, encoded2_shape, decoded1_shape, decoded2_shape):
        input_data = Input(shape=(1, input_shape))

        # Encoder: compress the input down to encoding_dim values
        encoded1 = Dense(encoded1_shape, activation="relu", activity_regularizer=regularizers.l2(0))(input_data)
        encoded2 = Dense(encoded2_shape, activation="relu", activity_regularizer=regularizers.l2(0))(encoded1)
        encoded3 = Dense(self.encoding_dim, activation="relu", activity_regularizer=regularizers.l2(0))(encoded2)

        # Decoder: rebuild the input from the encoding
        decoded1 = Dense(decoded1_shape, activation="relu", activity_regularizer=regularizers.l2(0))(encoded3)
        decoded2 = Dense(decoded2_shape, activation="relu", activity_regularizer=regularizers.l2(0))(decoded1)
        decoded = Dense(input_shape, activation="sigmoid", activity_regularizer=regularizers.l2(0))(decoded2)

        autoencoder = Model(inputs=input_data, outputs=decoded)
        encoder = Model(input_data, encoded3)
        # Assumed settings: the compile call is not shown in the original snippet
        autoencoder.compile(optimizer="adam", loss="mean_squared_error")

        # Now train the model using data we already preprocessed
        train = pd.read_csv("preprocessing/rbm_train.csv", index_col=0)
        ntrain = np.array(train)
        train_data = np.reshape(ntrain, (len(ntrain), 1, input_shape))
        autoencoder.fit(train_data, train_data, epochs=1000)
I trained the autoencoder with the denoised stock price data from 2000 till 2008. After training for 1000 epochs, the RMSE decreased to around 0.9. Then, I used that model to encode the rest of my stock price data into features.
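To show the "train on reconstruction, keep the encoding" idea in isolation, here is a small NumPy stand-in. It is deliberately a single-layer linear autoencoder trained by plain gradient descent, not the stacked Keras model above, and the function name, toy data and hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def train_linear_autoencoder(X, encoding_dim, lr=0.05, epochs=1000, seed=0):
    """Fit X ~ (X @ W_enc) @ W_dec by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = 0.1 * rng.standard_normal((n_features, encoding_dim))
    W_dec = 0.1 * rng.standard_normal((encoding_dim, n_features))
    losses = []
    for _ in range(epochs):
        H = X @ W_enc                      # compressed representation (the features)
        R = H @ W_dec - X                  # reconstruction error
        losses.append(float(np.mean(R ** 2)))
        grad_dec = 2.0 * H.T @ R / R.size  # d(loss)/d(W_dec)
        grad_enc = 2.0 * X.T @ (R @ W_dec.T) / R.size  # d(loss)/d(W_enc)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, losses


# Toy data: 8 observed columns driven by 2 hidden factors
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 8))

W_enc, W_dec, losses = train_linear_autoencoder(X, encoding_dim=2)
features = X @ W_enc  # the encoding that would be handed to the next model
```

Only the encoder half is used after training: new data is multiplied by W_enc to obtain features, which mirrors how the encoder model is used to encode the remaining stock price data.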
4. LSTM Model
The LSTM model needs no introduction, as it has become very widespread and popular for predicting time series. It gets its exceptional predictive ability from the existence of the cell state, which allows it to understand and learn longer-term trends in the data. This is especially important for our stock price data. Below, I will discuss some of the design choices that I feel are important.
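For readers who want to see where the cell state enters, here is a minimal NumPy sketch of a single LSTM time step. The gate ordering inside the stacked weight matrices is an arbitrary convention chosen for this sketch (Keras orders them differently), and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b stack the weights of the input (i),
    forget (f) and output (o) gates and of the candidate update (g)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # shape (4 * n,)
    i = sigmoid(z[0:n])               # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])           # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * n:3 * n])       # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:4 * n])       # candidate cell update
    c = f * c_prev + i * g            # cell state carries the longer-term information
    h = o * np.tanh(c)                # hidden state is the output of this step
    return h, c


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = rng.standard_normal((4 * n_hidden, n_in))
U = rng.standard_normal((4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.standard_normal((10, n_in)):  # run over a sequence of 10 steps
    h, c = lstm_step(x, h, c, W, U, b)
```

The cell state c is only ever scaled by the forget gate and added to, which is what lets information persist across many steps.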
The type of optimizer used can greatly affect how fast the algorithm converges to the minimum value. It is also important that there is some notion of randomness, to avoid getting stuck in a local minimum and never reaching the global minimum. There are a few great algorithms, but I have chosen to use the Adam optimizer. The Adam optimizer combines the perks of two other optimizers: ADAgrad and RMSprop.
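The ADAgrad optimizer essentially uses a different learning rate for every parameter and every time step. The reasoning behind ADAgrad is that parameters that are updated infrequently should have larger learning rates, while parameters that are updated frequently should have smaller learning rates. In other words, the stochastic gradient descent update for ADAgrad becomes
$$\theta_{t+1,i} = \theta_{t,i} - \eta_{t,i}\, g_{t,i}$$
where $\theta_{t,i}$ is parameter $i$ at step $t$, $g_{t,i}$ is the gradient of the loss with respect to that parameter, and $\eta_{t,i}$ is its learning rate at that step.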
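The learning rate is calculated based on the past gradients that have been computed for each parameter. Hence,
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}$$
where $G_t$ is the diagonal matrix whose entry $G_{t,ii}$ is the sum of squares of the past gradients of parameter $i$, $\eta$ is the base learning rate, and $\epsilon$ is a small constant that avoids division by zero. The issue with this optimization is that the learning rates start vanishing very quickly as the iterations increase.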
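RMSprop fixes the diminishing learning rate by replacing the full sum of past squared gradients with an exponentially decaying average, so that only recent gradients matter. The updates become
$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
where $\gamma$ is the decay rate of the average.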
Now that we understand how those two optimizers work, we can look into how Adam works.
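Adaptive Moment Estimation, or Adam, is another method that computes adaptive learning rates for each parameter by considering the exponentially decaying average of past squared gradients and the exponentially decaying average of past gradients. This can be represented as
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $\beta_1$ and $\beta_2$ are the decay rates of the two averages.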
The m and v can be considered estimates of the first and second moments of the gradients respectively, hence the name Adaptive Moment Estimation. When this was first used, researchers observed that there was an inherent bias towards 0, and they countered this by using the following bias-corrected estimates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
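This leads us to the final gradient update rule
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates, $\eta$ is the base learning rate, and $\epsilon$ is a small constant that avoids division by zero.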
This is the optimizer that I used, and its benefits can be summarized as follows:
- The learning rate is different for every parameter and every iteration.
- The learning rate does not diminish as it does with ADAgrad.
- The gradient update uses estimates of the moments of the gradients, allowing for a more statistically sound descent.
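The Adam update described above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy quadratic, not the Keras implementation; the function name adam_minimize, the target values and the step count are assumptions chosen for the example, while the default decay rates are the commonly used ones.

```python
import numpy as np


def adam_minimize(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Minimize a function with Adam, given a function that returns its gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # decaying average of past gradients (first moment)
    v = np.zeros_like(theta)  # decaying average of past squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta


# Toy problem: minimize sum((theta - target)^2), whose gradient is 2 * (theta - target)
target = np.array([3.0, -2.0])
theta = adam_minimize(lambda th: 2.0 * (th - target), np.zeros(2))
```

Because each parameter's step is divided by the square root of its own second-moment estimate, both coordinates move at a similar pace even though their gradients differ in size.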
Another important aspect of training the model is making sure the weights do not get too large and start focusing on one data point, which leads to overfitting. So we should always include a penalty for large weights (what counts as large depends on the type of regulariser used). I have chosen to use Tikhonov regularization, which can be thought of as the following minimization problem:
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} V\big(f(x_i), y_i\big) + \lambda \lVert f \rVert_{\mathcal{H}}^{2}$$
Here $V$ is the loss function, $(x_i, y_i)$ are the training pairs, $\lambda$ controls the strength of the penalty, and $\mathcal{H}$ is the function space being searched.
The fact that the function space is a Reproducing Kernel Hilbert Space (RKHS) ensures that the notion of a norm exists. This allows us to encode the notion of the norm into our regularizer.
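The effect of such a penalty is easiest to see in the simplest special case, linear least squares with a squared Euclidean norm penalty (ridge regression). The sketch below is that special case only, with made-up data; it is not the regularizer configuration used in the network.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + 0.1 * rng.standard_normal(50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(np.linalg.norm(w), 3))  # the weight norm shrinks as lam grows
```

A larger penalty strength pulls the weights toward zero, trading a slightly worse fit on the training data for a model that is less sensitive to individual points.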
A newer method of preventing overfitting considers what happens when some of the neurons suddenly stop working. This forces the model not to depend too heavily on any group of neurons and to make use of all of them. Dropout has found its use in making networks more robust, allowing them to predict the trend without relying on any one neuron. Here are the results of using dropout:
As you can tell, with dropout the error continues to decrease, while without dropout the error plateaus.
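Mechanically, dropout is just a random mask applied during training. Here is a minimal NumPy sketch of the common "inverted dropout" formulation; the function name and the toy activations are illustrative, and in the actual model this is handled by the dropout arguments of the Keras layers.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations while training."""
    if not training or rate == 0.0:
        return activations  # at prediction time every neuron is used
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a neuron survives
    return activations * mask / keep_prob  # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = dropout(a, rate=0.2, rng=rng)
```

With a rate of 0.2, roughly one in five activations is zeroed on each pass and the survivors are scaled by 1 / 0.8, so downstream layers see the same average signal but cannot rely on any single neuron being present.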
5. Model Implementation
All of the analysis above can be implemented with relative ease thanks to keras and its functional API. This is the code for the model (to view the entire code, check out my GitHub: AIAlpha)
from keras import layers as kl
from keras import regularizers
from keras.models import Model
import numpy as np
import pandas as pd


class NeuralNetwork:  # class and method names assumed; the original snippet omits them
    def __init__(self, input_shape, stock_or_return):
        self.input_shape = input_shape
        self.stock_or_return = stock_or_return

    def make_train_model(self):
        input_data = kl.Input(shape=(1, self.input_shape))
        lstm = kl.LSTM(5, input_shape=(1, self.input_shape), return_sequences=True,
                       activity_regularizer=regularizers.l2(0.003),
                       recurrent_regularizer=regularizers.l2(0), dropout=0.2, recurrent_dropout=0.2)(input_data)
        perc = kl.Dense(5, activation="sigmoid", activity_regularizer=regularizers.l2(0.005))(lstm)
        # The original call is cut off after recurrent_regularizer; applying it to perc is an assumption
        lstm2 = kl.LSTM(2, activity_regularizer=regularizers.l2(0.01),
                        recurrent_regularizer=regularizers.l2(0.001))(perc)
        out = kl.Dense(1, activation="sigmoid", activity_regularizer=regularizers.l2(0.001))(lstm2)

        model = Model(input_data, out)
        model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mse"])

        # load data (read the features once, then reshape to (samples, 1, features))
        train = np.array(pd.read_csv("features/autoencoded_train_data.csv", index_col=0))
        train = np.reshape(train, (len(train), 1, self.input_shape))
        train_y = np.array(pd.read_csv("features/autoencoded_train_y.csv", index_col=0))
        # train_stock = np.array(pd.read_csv("train_stock.csv"))

        # train model
        model.fit(train, train_y, epochs=2000)
These are the results of my prediction for various companies.
The results suggest that this neural network architecture performs decently and could be profitable if built into a strategy.
In addition to the model learning from historical data, I wanted the model to keep learning, even from its own predictions. Hence, I made it an online model that both learns and predicts. In other words, it learns from historical data and predicts tomorrow’s price; tomorrow, when the actual price is known, it learns from that too. So the model is always improving.
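The predict-then-learn loop can be sketched as follows. A tiny linear model trained by stochastic gradient descent stands in for the trained network here, and the data stream is synthetic; the class name and all numbers are assumptions made for illustration.

```python
import numpy as np


class OnlineLinearModel:
    """Stand-in for the trained network: predicts, then learns from the realised value."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x + self.b)

    def update(self, x, y_true):
        error = self.predict(x) - y_true
        self.w -= self.lr * error * x  # one gradient step on the squared error
        self.b -= self.lr * error


rng = np.random.default_rng(0)
model = OnlineLinearModel(n_features=2)
abs_errors = []
for _ in range(500):
    x = rng.uniform(-1, 1, size=2)                # today's features
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5             # tomorrow's value (synthetic)
    abs_errors.append(abs(model.predict(x) - y))  # predict before the truth is known
    model.update(x, y)                            # learn once the truth arrives
```

With a Keras model, the same pattern could plausibly be built from a predict call followed by a single training step such as train_on_batch on the newly observed pair.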
In addition to using the actual price to improve, I have also considered making a secondary model that uses sentiment values of news and Twitter about the company. This was also a chance to experiment with some natural language processing. I shall first outline how those data were acquired.
The first major struggle was obtaining the tweets for free, as the Twitter API to fetch the entire archive was paid. However, I found an API that allowed me to get the tweets over the past 10 days, to which I could then apply some form of NLP to extract the sentiment data. This was not optimal but still useful for my online learning model.
The Twitter API was used to scrape tweets from the past 10 days, and the sentiment score of each tweet was calculated using TextBlob and averaged over all the tweets.
Similar to Twitter, getting news data was incredibly difficult. I tried analysing the URLs of the Bloomberg articles but realized that manually scraping websites all the way back to 2000 was almost impossible. Hence, I settled on the Aylien API, which has quite a powerful scraping model.
The news articles were scraped with the conditions that they only include stocks and financial news and come from the top 150 Alexa websites. The sentiment scores were averaged using an exponentially weighted moving average, so that recent news counts more than older news.
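As a rough illustration of the exponentially weighted averaging, here is a plain Python sketch. The smoothing factor alpha and the sample scores are hypothetical; the article does not state which value was used.

```python
def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average; scores are ordered oldest to newest."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    averaged = []
    current = None
    for s in scores:
        # The newest score gets weight alpha; everything older shares 1 - alpha
        current = s if current is None else alpha * s + (1.0 - alpha) * current
        averaged.append(current)
    return averaged


daily_sentiment = [0.0, 0.0, 1.0]        # hypothetical daily scores in [-1, 1]
print(ewma(daily_sentiment, alpha=0.5))  # [0.0, 0.0, 0.5]
```

A larger alpha makes the average react faster to the latest news, while a smaller one smooths over more history.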
Given my sentiment scores, I used an extra neural network layer to correct the error of my predictions even further. However, the results of this are not available at the time of this article, since it takes one day to produce one data point.
Neural networks are very adept at predicting time series data, and when coupled with sentiment data, can really make a practical model. Although the results here were impressive, I am still finding ways to improve it, and maybe actually develop a full trading strategy from it. Currently, I am looking into using Reinforcement Learning to develop a trading agent that uses the results from the predictive model.