📈 Tuning LSTM to predict stock price in SET50
It is always better to know the future, especially in stock trading, as we can then plan when to buy and when to sell to make a profit.
As I searched over the internet, I found some examples of using deep-learning algorithms like Long Short-Term Memory (LSTM) to predict stock prices.
However, I haven’t seen many examples of tuning LSTM for different prediction ranges to get the best result, or of measuring the model’s accuracy when it is applied to a larger number of stocks.
🎯 In this blog post, I’ll share how I did the following activities and their results:
1. Build a stock predictor using LSTM and tune the parameters for one selected stock to predict its Adjusted Close price in the next 1, 5, and 10 days. The way that I intend to tune the parameters is:
- Start by setting all parameters to their lowest values and allow only one parameter to be adjusted.
- Loop through the cycle of building, training, and validating the model with different values of that parameter to find its best value.
- Do this for every parameter to see which parameter and which value give the lowest error.
- Update that particular parameter to its best value while keeping the others at their lowest.
- Repeat all steps until the error doesn’t get any lower; the result is the set of best parameter values.
2. Build a model using the set of parameters found in the previous step to predict another set of stocks, and measure how many stocks the model can predict with an acceptable error rate.
3. Build a user-friendly script with which the user can:
- Provide a list of stocks and a date range of historical data, and train the model.
- Query the predicted price by selecting the stock and the date range of price prediction.
Data Source
As I reside in Thailand, the target list of stocks for my model will be the 50 highest-value stocks in the Stock Exchange of Thailand (SET50). The list is here: https://www.settrade.com/C13_MarketSummary.jsp?detail=SET50. The stock that I’ll use as the sample for tuning LSTM parameters will be one of the stocks from SET50. The data comes from Yahoo Finance — https://finance.yahoo.com/ — which provides historical Open, High, Low, Close, and Adjusted Close prices, and trade volume.
Metrics
The metric for each topic of study will be a measurement of prediction accuracy. Since this is a measurement of how well the predicted line fits the actual line, Mean Squared Error (MSE) is an appropriate metric (the lower, the better). We can even take its square root to find the actual distance between the data and the predicted line.
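As a quick illustration (my own minimal sketch, not code from the notebook), MSE and its square root can be computed from actual and predicted prices like this:
import numpy as np

# Hypothetical actual and predicted Adjusted Close prices, for illustration only
actual = np.array([52.0, 52.5, 53.0, 52.75])
predicted = np.array([51.8, 52.9, 52.6, 53.1])

# Mean Squared Error: the average squared difference between actual and predicted (lower is better)
mse = np.mean(np.square(actual - predicted))

# Its square root (RMSE) is the typical distance between the predicted line and the actual prices
rmse = np.sqrt(mse)

print(mse, rmse)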
⚠️ Long-Post and Technical Content Ahead
The content from here on will be a deep-dive implementation of each topic listed above. You can read the Jupyter Notebook version with executable source code here — https://github.com/pathompong-y/stock_predictor/blob/master/stock_predictor_tuning_study.ipynb
If you want to run the notebook and follow along with the content, please check the instructions in the project’s repository — https://github.com/pathompong-y/stock_predictor/
👌 TLDR — The result and how to do it by yourself for the busy guys :)
- By iteratively adjusting each LSTM parameter, we can reduce the MSE of the model. The most influential parameters in my trial are epoch and history points (the range of historical data used for prediction).
- To minimize the error, tuning stock by stock works better than generalizing one set of parameters to every stock.
- You can try my parameters by using this Jupyter Notebook.
https://github.com/pathompong-y/stock_predictor/blob/master/stock_predictor.ipynb
It requires no installation on your computer as it runs on https://colab.research.google.com. The instructions are already inside the notebook.
- With this notebook, you can (1) provide a list of stocks and a date range of historical data to train the model (my parameters will be applied automatically), and (2) query the forecast price by providing a day range from the end date of the training data. It can predict up to 10 days at maximum. The MSE of the prediction is also returned to you.
- I do not guarantee any prediction accuracy result :)
1. Build LSTM and optimize parameters for one stock in SET50 for 1, 5, and 10 days prediction
1.1 Select the workspace and install yfinance library
First, we need to get stock data from Yahoo Finance. The yfinance library is the package we need to install; it provides functions to fetch data from Yahoo Finance. The documentation can be found here — https://pypi.org/project/yfinance/.
Install yfinance using the !pip install command.
!pip install yfinance
import yfinance as yf
As LSTM requires TensorFlow to build and train the model, I develop this notebook on Google Colaboratory — https://colab.research.google.com — which provides a free Jupyter Notebook workspace with TensorFlow and GPU support installed. It can also read/write files to Google Drive, which is quite handy for me in this situation as my machine doesn’t have a GPU.
Keras is a deep-learning library with LSTM implemented, which I’m going to use in this exploration — https://keras.io/.
1.2 Prepare data
Since we will use all of the SET50 data in the next topic, I’ll download all of it. The stock that I select to explore is INTUCH.BK, which is one that I have traded recently.
I use Colaboratory’s Google Drive mounting feature to store the downloaded data and also intermediate results while working on this notebook.
yfinance has the handy commands below, which can download historical data in two lines. First, we initiate a yfinance instance using the ticker name. After that, we can use the history function to download the historical data. More detail is in yfinance’s documentation: https://pypi.org/project/yfinance/
# Instantiate object from stock ticker
stock_data = yf.Ticker(stock)

# yfinance's history function lets us define the period of historical data to download
pd.DataFrame(stock_data.history(period='max', auto_adjust=False, actions=False)).to_csv(file)
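Since the next topic needs every SET50 stock, a minimal sketch of looping the same two lines over a ticker list could look like the code below; set50_tickers here is a hypothetical shortened list (Thai tickers on Yahoo Finance carry the .BK suffix), and the real list comes from the settrade.com page above.
import pandas as pd
import yfinance as yf

# Hypothetical subset of SET50 tickers; replace with the full 50-ticker list
set50_tickers = ['INTUCH.BK', 'PTT.BK', 'AOT.BK']
gdrive_path = '/content/drive/My Drive/Colab Notebooks/'

for stock in set50_tickers:
    # Download the full price history of each ticker and save it as a CSV on Google Drive
    stock_data = yf.Ticker(stock)
    history = stock_data.history(period='max', auto_adjust=False, actions=False)
    pd.DataFrame(history).to_csv(gdrive_path + stock + '.csv')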
After I save the data to CSV, I explore it a bit to check for completeness, null data, and the expected features (Open, High, Low, Close, Adjusted Close, Volume).
Based on a quick check, the data is quite ready to use.
After we have all of the data, we have to make it ready for training the model. Here is the list of things to do:
- Drop null rows (if any) as we can’t use them anyway.
- Drop Date as we can’t use it as a feature in model training.
- Normalize the data to values between 0 and 1, as this helps the neural network perform better, per this post: https://towardsdatascience.com/why-data-should-be-normalized-before-training-a-neural-network-c626b7f66c7d. To normalize the data and scale it back up later, we can use scikit-learn’s preprocessing.MinMaxScaler(). What we have to do is keep the object that we used to scale the data down and use the same object to scale it back up.
- Transform the data format. We will predict the Adj Close for each prediction day range (1, 5, and 10). So, each row of the dataset will consist of the Open, High, Low, and Volume of each day over the number of history points that we use to make a prediction.
- For example, if we use 30 history points, one row of our dataset will consist of the following features:
[dayAopen, dayAclose, dayAvolume, dayAhigh, dayAlow,dayA-1open, dayA-1close, dayA-1volume, dayA-1high, dayA-1low....dayA-29low]
Here is the code that I use to perform all of the activities above.
# Construct the CSV filepath for INTUCH.BK
stock = 'INTUCH.BK'
filename = gdrive_path + stock + '.csv'

# Read the file and drop null rows
df = pd.read_csv(filename)
df_na = df.dropna(axis=0)

# Drop Date as this is time series data and Date isn't used as a feature.
# Also drop Close as we will predict Adj Close.
df_na = df_na.drop(['Date', 'Close'], axis=1)

# As neural networks perform better with normalized data, we normalize the data before training and prediction.
# After we get the predicted result, we scale it back to the actual value to measure the error rate.
# Normalise all data to the value range of 0-1.
data_normaliser = preprocessing.MinMaxScaler()
y_normaliser = preprocessing.MinMaxScaler()
data_normalised = data_normaliser.fit_transform(df_na)

# The length of the lookback window, the number of days to predict, and the number of features
history_points = 30
predict_range = 1

# Prepare the data in the format of [day-1-open, day-1-max, day-1-min, ... day-history_point]
# as one input row for predicting the 'predict_range' prices, for train and test
ohlcv_histories_normalised = np.array(
    [data_normalised[i:i + history_points].copy()
     for i in range(len(data_normalised) - history_points - predict_range + 1)])

# Get the actual prices [day1-adj close, day2-adj close, ... day-predict_range adj close] for train and test
next_day_adjclose_values_normalised = np.array(
    [data_normalised[i + history_points:i + history_points + predict_range, 3].copy()
     for i in range(len(data_normalised) - history_points - predict_range + 1)])

# Create the same array as the normalised adj close but with the actual values, not the scaled-down values.
# This is used to calculate the prediction accuracy.
next_day_adjclose_values = np.array(
    [df_na.iloc[i + history_points:i + history_points + predict_range]['Adj Close'].values.copy()
     for i in range(len(df_na) - history_points - predict_range + 1)])

# Fit the y normaliser on the actual values so that we can scale the predicted result back to actual values
y_normaliser.fit(next_day_adjclose_values)
Now, the data is ready. As we are going to train the model, we will have to split the data to train and test.
The older data will be the training set and the newer data will be the test set.
I select 90% of the data as train data and 10% of the data to be test data.
So, we can use Python’s array slicing to split the data. The code below is the example from my function. ohlcv_histories is the data that we prepared earlier.
n = int(ohlcv_histories.shape[0] * 0.9)

ohlcv_train = ohlcv_histories[:n]
y_train = next_day_adj_close[:n]

ohlcv_test = ohlcv_histories[n:]
y_test = next_day_adj_close[n:]
1.3 Build, Train and Validate the model
Then, it is ready to create the LSTM model, train it, and validate it using mean squared error. The LSTM that I will use is a simple one consisting of a hidden layer, a dropout layer, and a forecast layer.
I create a function so that I can change the parameters of the model. The parameters that we can change when we build the LSTM model are:
- hidden layer number — the size of the LSTM hidden layer
- dropout probability — the probability of dropping (forgetting) information from the previous layer
- history points — the range of data used for each prediction (e.g. 30 days of history for each row of the training set)
- feature number — the number of features; if we add more features, this number has to change
- optimizer (mostly we will use ‘adam’)
Here is the code inside the function.
# Initialize LSTM using the Keras library
model = Sequential()

# Define the hidden layer size and the shape of the input (number of history points and number of features)
model.add(LSTM(layer_num, input_shape=(history_points, features_num)))

# Add a forget (dropout) layer with the probability passed as an argument
model.add(Dropout(dropout_prob))

# End the network with a dense layer sized to the number of forecast days, e.g. 1, 5, 10
model.add(Dense(predict_range))

# Build and return the model with the selected optimizer
model.compile(loss='mean_squared_error', optimizer=optimizer)
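For completeness, here is a minimal sketch of how the snippet above might be wrapped into the get_LSTM_Model function that is called later; the imports and exact signature are my assumption rather than a copy from the original notebook:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

def get_LSTM_Model(layer_num, history_points, features_num, predict_range, optimizer, dropout_prob):
    # Hidden LSTM layer; the input is history_points timesteps of features_num features
    model = Sequential()
    model.add(LSTM(layer_num, input_shape=(history_points, features_num)))

    # Dropout layer that randomly forgets part of the information during training
    model.add(Dropout(dropout_prob))

    # Dense forecast layer sized to the prediction range (1, 5 or 10 days)
    model.add(Dense(predict_range))

    # Compile with MSE loss and the chosen optimizer (e.g. 'adam')
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    return model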
After compile() returns the model, we can fit it with the training data. Additional parameters that we can change when we fit the data are:
- batch size
- epoch
model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)
Once the model has completed training, we can use the test data to predict the result and compare it with the actual result by calculating the mean squared error (MSE). However, the actual values that we have are the unscaled ones (the normal prices, not the normalized 0–1 values that we get from the model).
Before calculating MSE, we have to scale the predicted price back.
# The model is trained. Test it with the test dataset
y_test_predicted = model.predict(ohlcv_test)

# Scale the result back to actual values with the y_normaliser that we fitted earlier
y_test_predicted = y_normaliser.inverse_transform(y_test_predicted)

# Calculate the error with MSE, then scale it by the price range so stocks are comparable
real_mse = np.mean(np.square(unscaled_y_test - y_test_predicted))
scaled_mse = real_mse / (np.max(unscaled_y_test) - np.min(unscaled_y_test)) * 100
Now we have the complete code to prepare the data, build, train, and validate the model, and we are also able to change the parameters when we build and train the model to find the set that gives the lowest MSE.
For the first attempt, I try with all parameters at their lowest values for 1-day prediction. The history points value that I use is 30 days, on all of the historical data that was downloaded.
# Must be the same history points that we used to prepare the data
history_points = 30

# Must be the same number of features as when we prepared the data
features_num = 5

# LSTM parameters
layer_num = 30
predict_range = 1
optimizer = 'adam'
dropout_prob = 1.0

# Create the LSTM model object
model = get_LSTM_Model(layer_num, history_points, features_num, predict_range, optimizer, dropout_prob)

# Parameters for model training
batch_size = 10
epoch = 10

# Train the model with our train data
model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)
After we get the result, we can plot the predicted price and the actual price to see how they differ.
real = plt.plot(unscaled_y_test, label='real')
pred = plt.plot(y_test_predicted, label='predicted')

plt.legend(['Real', 'Predicted'])
plt.show()
We can say that the model captures the trend quite well. It constantly predicts lower than the actual price when the price is in an uptrend, and higher than the actual price when it is in a downtrend.
1.4 Optimize parameters for 1, 5, and 10 days prediction
Then, it’s time to strengthen our model by finding the best parameter values. In summary, here is the list of parameters to optimize:
- hidden layer number
- dropout probability
- history points
- batch size
- epoch
The way that I do this is to create a function that loops through a range of values for one parameter while all other parameter values are fixed, to see which value of that particular parameter gives the lowest MSE. So, I have 5 functions in total.
Here is an example of such a function. The other functions share the same structure, just changing the parameter that is varied.
def get_best_history_points(predict_range, max_history_points, stock_list, hidden_layer=10,
                            batch_size=10, epoch=10, dropout_probability=1.0, mode='file'):
    mse_list = []
    exception_list = []
    for history_points in range(30, max_history_points + 1, round(max_history_points / 10)):
        for stock in stock_list:
            try:
                model, scaled_mse = train_and_validate_stock_predictor(
                    stock, history_points, predict_range, hidden_layer,
                    batch_size, epoch, dropout_probability, mode)
                print("Predict {} days for {} with MSE = {}".format(str(predict_range), str(stock), str(scaled_mse)))
                mse_list.append([history_points, stock, scaled_mse])
                pd.DataFrame(mse_list).to_csv('/content/drive/My Drive/Colab Notebooks/stocklist_'
                                              + str(predict_range) + '_mse_history_' + mode + '.csv')
            except Exception as e:
                print("exception " + str(e) + " on " + stock)
                exception_list.append([predict_range, stock, str(e)])
                pd.DataFrame(exception_list).to_csv('/content/drive/My Drive/Colab Notebooks/exception_list.csv')
                continue
Then, I start by running all of the functions to see which parameter at which value gives the lowest MSE.
From the first round, we found that epoch = 90 gives the lowest MSE at ~2.85.
We then run all of the functions except the epoch one again, with the epoch fixed at its best value as input to all of them. This is to find other parameters that could decrease the MSE further. I repeat these steps until the MSE doesn’t decrease anymore.
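To make the iteration concrete, here is a toy, self-contained sketch of that outer loop (my own illustration): each searcher tries a range of values for one parameter while the others stay fixed, the best value is kept, and the process stops once the MSE no longer drops. The evaluate() function here is just a stand-in for train_and_validate_stock_predictor(), and a real run would cover all five parameters.
# Toy illustration of the iterative (one-parameter-at-a-time) tuning loop
def make_searcher(param_name, candidates, evaluate):
    def search(fixed_params):
        results = [(value, evaluate({**fixed_params, param_name: value})) for value in candidates]
        return min(results, key=lambda r: r[1])   # (best value, best MSE) for this parameter
    return search

def evaluate(params):
    # Stand-in for train_and_validate_stock_predictor(); returns a fake scaled MSE for one parameter set
    return (params['epoch'] - 90) ** 2 / 1000 + (params['history_points'] - 90) ** 2 / 2000 + 2.79

searchers = {
    'epoch': make_searcher('epoch', range(10, 101, 10), evaluate),
    'history_points': make_searcher('history_points', range(30, 121, 10), evaluate),
}

best_params = {'epoch': 10, 'history_points': 30}   # start every parameter at its lowest value
best_mse = evaluate(best_params)

improved = True
while improved:
    improved = False
    for name, search in searchers.items():
        value, mse = search(best_params)
        if mse < best_mse:                          # keep the new value only if it lowers the MSE
            best_params[name], best_mse = value, mse
            improved = True

print(best_params, best_mse)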
Finally, I got the set of parameters that gives the lowest MSE at ~2.79, as below:
- hidden layer number = 10
- dropout probability = 1.0
- batch size=10
- epoch=90
- history point=90
I tried to optimize the MSE further by adding some technical analysis indicators that are commonly used to trade stocks. I select MACD and EMA, which are not complicated to calculate. The example code is below; it adds MACD and EMA at 20 and 50 days to the stock data DataFrame.
# Extract Close data to calculate MACD
df_close = df[['Close']]
df_close.reset_index(level=0, inplace=True)
df_close.columns = ['ds', 'y']

# Calculate MACD by using DataFrame's EWM
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html
exp1 = df_close.y.ewm(span=12, adjust=False).mean()
exp2 = df_close.y.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2

# Merge MACD back as a new column of the input df
df = pd.merge(df, macd, how='left', left_on=None, right_on=None, left_index=True, right_index=True)

# Rename DataFrame columns
df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'MACD']

# Add new columns using the EMA window sizes (20 and 50 days). EWM can be used directly.
df[ema1] = df['Close'].ewm(span=20, adjust=False).mean()
df[ema2] = df['Close'].ewm(span=50, adjust=False).mean()

return df
However, with this additional data the MSE increases to around ~6.7 instead. So adding them might not help for 1-day prediction.
The overall steps to find the parameters for 1-day prediction are as described earlier. I repeat all of the steps for 5-day and 10-day prediction and get the following results:
1 day prediction at 2.78 MSE
- history points : 90
- hidden layer : 10
- batch size : 10
- dropout probability : 1.0
- epoch : 90
- add MACD and EMA? : No
5 days prediction at 7.56 MSE
- history points : 30
- hidden layer : 70
- batch size : 10
- dropout probability : 1.0
- epoch : 60
- add MACD and EMA? : No
10 days prediction at 14.55 MSE
- history points : 50
- hidden layer : 60
- batch size : 10
- dropout probability : 0.3
- epoch : 80
- add MACD and EMA? : No
It is quite a surprise to me that adding MACD and EMA doesn’t help predict INTUCH at any range. However, I’ll still keep the function and try it with the other stocks in SET50.
Now we have the parameters for each prediction day range. We can try them with the SET50 stocks to see how many stocks can be predicted with acceptable accuracy for 1-, 5-, and 10-day prediction.
2. Apply the parameter sets above to SET50 to find how well they work with other stocks
It takes a long time to run through 50 stocks for 1, 5, and 10 days. I have to make a copy of my notebook and run them simultaneously on 3 browser tabs to save time.
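A rough sketch of that batch run is below (my own reconstruction): it reuses the train_and_validate_stock_predictor function from earlier with the tuned parameters from section 1.4; the shortened ticker list and output paths are assumptions.
import pandas as pd

# Tuned parameters per prediction range, taken from the results in section 1.4
best_params = {
    1:  {'history_points': 90, 'hidden_layer': 10, 'batch_size': 10, 'dropout': 1.0, 'epoch': 90},
    5:  {'history_points': 30, 'hidden_layer': 70, 'batch_size': 10, 'dropout': 1.0, 'epoch': 60},
    10: {'history_points': 50, 'hidden_layer': 60, 'batch_size': 10, 'dropout': 0.3, 'epoch': 80},
}

# Hypothetical shortened SET50 ticker list; the real run covers all 50 stocks
set50_tickers = ['INTUCH.BK', 'PTT.BK', 'AOT.BK']

for predict_range, p in best_params.items():
    results = []
    for stock in set50_tickers:
        # Train and validate each stock with the tuned parameters for this prediction range
        model, scaled_mse = train_and_validate_stock_predictor(
            stock, p['history_points'], predict_range, p['hidden_layer'],
            p['batch_size'], p['epoch'], p['dropout'], 'file')
        results.append([stock, scaled_mse])
    # Save one MSE file per prediction range (column '1' then holds the scaled MSE)
    pd.DataFrame(results).to_csv(
        '/content/drive/My Drive/Colab Notebooks/set50_' + str(predict_range) + '_mse.csv')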
I visualize the result for each prediction day range using a histogram of the MSE.
The longer the prediction range, the higher the MSE.
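For reference, a minimal sketch of how such histograms can be plotted from the saved per-range MSE files (the file-name pattern follows the 5-day file used below; column '1' holds the scaled MSE):
import pandas as pd
import matplotlib.pyplot as plt

# One MSE histogram per prediction range
for predict_range in [1, 5, 10]:
    df_mse = pd.read_csv('/content/drive/My Drive/Colab Notebooks/set50_' + str(predict_range) + '_mse.csv')
    plt.hist(df_mse['1'], bins=50, alpha=0.7, label=str(predict_range) + ' days')

plt.xlabel('Scaled MSE')
plt.ylabel('Number of stocks')
plt.legend()
plt.show()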
I also explore further by running the 5-day prediction again with the same set of parameter values, but this time adding MACD and EMA to the dataset to see how the results differ with and without the additional data.
The histogram of the MSE difference is as follows:
# Load the 5-day MSE results with and without the additional MACD/EMA features
df_set50_five_days = pd.read_csv('/content/drive/My Drive/Colab Notebooks/set50_5_mse.csv')
df_set50_five_days_add = pd.read_csv('/content/drive/My Drive/Colab Notebooks/set50_5_mse_add_data.csv')

# Difference in MSE per stock (positive means the additional features lowered the MSE)
df_set50_five_days_diff = pd.DataFrame(df_set50_five_days['1'] - df_set50_five_days_add['1'])

plt.hist(df_set50_five_days_diff['1'], bins=100, color='#0504aa', alpha=0.7, rwidth=0.85)
It is quite interesting to see that about half of the stocks get a better result and the other half get a poorer one.
Some conclusions that we can draw are:
- The parameters for building and training the model were tuned on one example stock; however, they can still be used to predict other stocks, some of which even have better accuracy than the example case.
- We can see that the MSE increases for some stocks and decreases for others in the 5-day prediction after we add MACD and EMA as additional training features. Based on this, it is clear that to optimize stock price prediction accuracy we should scope down to one stock, as one parameter set won’t give good results for all stocks; each stock’s price moves under the influence of different factors.
3. Create a user-friendly function for users to select the stocks they are interested in, train, and query for stock price predictions
After getting the set of LSTM parameters that works best for 1-, 5-, and 10-day prediction, I’ll try to build a script that the user can use to train the model and query for stock price predictions.
Since the LSTM model built with Keras requires TensorFlow and also a decent machine to run on, building a web server to build and train the model on would be quite costly.
I tried the free tier of a web server on Heroku, but the free tier limits the execution time of any function. Therefore, it isn’t possible to train the model there, as training needs more than 5 minutes.
So, I decided to create a separate notebook. The new one has 2 code cells: one for receiving a list of stocks and the range of training data, and another for querying the predicted price.
To do this, I pack all of my functions into one file so that everything can easily be used by uploading the notebook and the functions file together to a new Colaboratory space, making it ready to run.
The user can provide the list of stocks and the date range of data to train the model, then query for the predicted price, with a limit of 10 days of prediction.
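As a rough illustration of those two cells (the function names train_stocks and query_prediction below are my own placeholders, not the exact names used in the notebook):
# Cell 1 (hypothetical): train models for the selected stocks over a chosen date range
stock_list = ['INTUCH.BK', 'PTT.BK']              # stocks the user is interested in
start_date, end_date = '2015-01-01', '2020-06-30' # date range of historical training data
train_stocks(stock_list, start_date, end_date)    # the tuned parameters are applied automatically

# Cell 2 (hypothetical): query the predicted price N days after the end of the training data
predicted_price, mse = query_prediction('INTUCH.BK', days_ahead=5)  # up to 10 days at maximum
print(predicted_price, mse)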
You can grab a copy of this notebook from my repository. The setup instructions are already provided — https://github.com/pathompong-y/stock_predictor/blob/master/stock_predictor.ipynb
Again, the most convenient way to run it is by using https://colab.research.google.com
💻 Conclusion
- One way to tune the LSTM parameters is to adjust them one by one: start with every parameter at its lowest value, iteratively change only one parameter at a time to find the value with the lowest MSE, then keep that value and iterate to find the best value of the next parameter.
- The longer the prediction range, the higher the error.
- Adding technical analysis indicators like MACD and EMA improved the prediction accuracy for about 50% of the stocks in the 5-day prediction.
- At the end of the day, to optimize prediction accuracy we have to tune stock by stock, as the features that affect each stock vary.
👓 Further improvement
- There are many possibilities for improving the model further. For example, we could adjust the EMA ranges, or add other technical analysis indicators and experiment with combinations of them.
- We could also group stocks whose price movements are driven by the same set of factors. For example, grouping Oil & Gas companies together and adding the oil price as an additional data feature, as these companies’ stock prices are affected by oil price movements.
That’s it. Thank you for reading this long post, and I hope it gives you some insights that inspire you to improve on it further.
As I’m also new to the field, please let me know of any mistakes or areas of improvement in the work shared here. I would be very grateful.
See you again on the next project!