Predicting stock price based on historical data

Rama Teja G
5 min read · Jul 22, 2020


In this article, I want to engage you in the art of prediction. I've been curious about something for a while now. Most of us have dreamed of earning huge sums of money from the financial markets. It looks quite enticing to the uninitiated, but once we dip our toes in, we begin to understand its complexity: trade secrets that veterans have accumulated over years of trial and error, rules that earn the best traders consistent money and that took long, grueling effort to arrive at.
What the 21st century brings is a new opportunity: an opportunity for anyone to discover those rules by handing the data to a program and asking it to figure them out. This is what I want to explore, to show you a sliver of what the future might hold for us.

Scope of discussion

I am going to show how to design and build a time series prediction model.

Data

OHLC stands for the Open, High, Low, and Close values of a given stock in a particular time frame. Historical OHLC data is generally available for free from different vendors. I've used the Yahoo Finance API: with a Python library called pandas_datareader, we can download historical data for different stocks over any specified time frame. Below is a sample of the data received.

Downloaded Data in CSV format

I’ve selected the Nifty50 index for this project; its symbol is NIFTY, but when querying Yahoo Finance use “^NSEI” as the symbol to get Nifty50 data. I’ve taken data from 2007 up to the start of 2020, which came to 3020 days. I cut it off before the Covid-19 mayhem hit the market, as that is an abnormal development (post-COVID market prediction will come in future blog posts).
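As a rough sketch, the download step might look like this; the exact start and end dates are assumptions based on the description above:

```python
import datetime
from pandas_datareader import data as pdr

# "^NSEI" is the Yahoo Finance symbol for the Nifty50 index
start = datetime.datetime(2007, 1, 1)   # assumed start date
end = datetime.datetime(2020, 1, 1)     # cut off before Covid-19
df = pdr.DataReader("^NSEI", "yahoo", start, end)

print(df[["Open", "High", "Low", "Close"]].head())
```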

I’ve split the data into train and test sets: 2400 days for training and 620 days for validation. After that I normalized the data. The data is then shaped so that each training sample has 30 consecutive days’ Open values as input and the 31st day’s Open value as the expected output.

Data Normalization: it is important to scale the values down to a reasonable range for convergence. I divided the data by 1000, which brought the range to 5–10.

Training data tensor structure
The goal is a network that can predict the next day's Open value from the previous days' Open values. The number of previous days can be varied and tested; I've used 30 days, and reshaped the dataset accordingly, as in the sketch below.
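Here is a minimal sketch of that reshaping, assuming the downloaded frame `df` from above; the helper name `make_windows` is mine, not from the original code:

```python
import numpy as np

def make_windows(series, window_size=30):
    """Slice a 1-D series into (input, target) pairs:
    30 consecutive Open values -> the 31st day's Open value."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)

opens = df["Open"].values / 1000.0  # normalize to roughly the 5-10 range
x_train, y_train = make_windows(opens[:2400])   # 2400 days for training
x_valid, y_valid = make_windows(opens[2400:])   # 620 days for validation
```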

Model

TensorFlow model summary

I’ve created a model with 6 layers. I tried different combinations: only one LSTM layer, varying the layer parameters, dropping the Convolutional layer, and so on. The structure above gave the best fit on the testing data; a code sketch of this architecture follows the layer list below.

  1. lambda_13 reshapes the input as required by the next layer
  2. conv1d_4 is a 1-dimensional convolutional layer with 32 filters and the “relu” activation function
  3. bidirectional_14 is an LSTM layer with return_sequences set to True and 32 units; as it is Bidirectional, the effective output size becomes 64
  4. bidirectional_15 is another LSTM with 32 units; again Bidirectional, so the effective output size is 64
  5. dense_9 is a single neuron that predicts the output value
  6. lambda_14 scales the dense_9 output up by 100 to retain more significant digits of the data
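A minimal Keras sketch matching that layer list might look as follows; the Conv1D kernel size and padding are assumptions, since the summary above doesn't state them:

```python
import tensorflow as tf

window_size = 30

model = tf.keras.models.Sequential([
    # Add a channel axis: (batch, 30) -> (batch, 30, 1) for Conv1D
    tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1),
                           input_shape=[window_size]),
    # kernel_size and padding are assumed; the article states only
    # 32 filters and relu activation
    tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                           padding="causal", activation="relu"),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
    # Scale the output up by 100 (step 6 above)
    tf.keras.layers.Lambda(lambda x: x * 100.0),
])
model.summary()
```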

Training

I’ve used the SGD optimizer and the Huber loss. The main metric I’ve looked at to judge model performance is MAE (mean absolute error).

In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. Source: Wikipedia

As there are bound to be sudden spikes caused by external factors, using the Huber loss helps the network not get affected by those outliers.
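A compile step consistent with this setup is sketched below; the learning rate and momentum are assumptions, as the article names only the optimizer, the loss, and the metric:

```python
model.compile(
    loss=tf.keras.losses.Huber(),       # robust to outlier spikes
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9),
    metrics=["mae"],                    # the metric tracked throughout
)
```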

The model was trained with batch sizes ranging from 2 to 10 and for up to 200 epochs, until the MAE went below 90 Rs on the training data. In my opinion, playing with the batch size is important.

Low batch sizes lead to very long training times. With large batch sizes, even though the training loss and MAE keep decreasing, performance on the testing set is very poor: as in the image, even though the prediction follows the curve, there is a vertical offset from the original data.
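The training call itself is then just a matter of picking the batch size and epoch count; a sketch using the arrays built earlier:

```python
history = model.fit(
    x_train, y_train,
    batch_size=3,                        # small batches worked best here
    epochs=200,                          # up to 200 epochs
    validation_data=(x_valid, y_valid),
)
```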

Predicting

I’ve taken a rolling window of 30 days’ data and used it to predict the next day’s expected Open value on the validation set.

Do not forget to scale the input the same way as in training; here it is divided by 1000.

The blue line is the original Open price of the stock; orange is the predicted price based on the previously known Open prices. With a batch size of 3, I got a good testing MAE of 83.97, which means the predicted Open price is, on average, within about 84 Rs of the original stock price.
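In code, the one-step-ahead evaluation might look like this, reusing the validation windows built earlier (already scaled by 1/1000) and converting the error back to rupees:

```python
# One-step-ahead predictions: each row of x_valid is a 30-day window
preds = model.predict(x_valid).squeeze()

# Convert the mean absolute error back to rupees
mae_rs = np.mean(np.abs(preds - y_valid)) * 1000.0
print(f"Testing MAE: {mae_rs:.2f} Rs")
```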

Testing data: Original vs Predictions

Conclusion

I’ve trained a model to predict a stock’s opening price based on its previous history. The key takeaway: the seasonality of the stock and various other patterns are derived automatically by the network, without us giving it any direction. There is still scope for improvement in terms of using multi-modal data, for example using not just the opening price but all of OHLC to predict the next day’s opening price. In essence, neural networks give us the power of prediction without hard-coding the rules of the game.
