Stock Price Prediction: XGBoost

Train2Test · Published in The Startup · Jun 1, 2020 · 9 min read

In our latest entry in the Stock Price Prediction Series, let’s learn how to predict stock prices with the help of the XGBoost model.

In case you want to dig into the other approaches to stock price prediction, have a look at our other blogs in this series.

The following article introduces the steps and training procedure you need to follow while carrying out time series forecasting with XGBoost.

Contents

  • What is XGBoost?
  • Basic Terminology
  • Preprocessing the Dataset I: Adding Features
  • Determining Training and Validation Data
  • Preprocessing the Dataset II: Normalisation
  • Model
  • Training the Model: Step 1, Step 2, Step 3
  • Prediction
  • Conclusion

What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting.
Based on an implementation of gradient-boosted decision trees, XGBoost has recently been dominating applied machine learning due to its speed and performance. The XGBoost algorithm was developed as a research project at the University of Washington; Tianqi Chen and Carlos Guestrin presented their paper at the SIGKDD Conference in 2016. XGBoost is one huge leap for the field of machine learning. This article provides a detailed understanding of the algorithm.

Basic Terminology

Before we dig in, let’s get acquainted with some commonly used terms in time series forecasting. Besides each term, the variable name used while coding has also been mentioned:

  • Horizon (H): The number of days for which the stock price needs to be predicted
  • Number of Lag Features (N): Stock Price today depends largely on yesterday’s stock price. But what about the stock price a month ago? Does it equally affect today’s stock price? No. The number of lag features is the number of days into the past that influence today’s stock price value. Later in this blog, you’ll learn how we determined this value.
  • Features (features): Stock Price might depend on what time of the year (holidays like December 25?), month of the year (the month in which the budget is released?) and so on. We use fastai to generate such features.
  • Normalized Stock Price Value of N Timestamps (features_lag): The values of the stock price at the N timestamps need to be normalized (using the mean and standard deviation) before being sent as input to the model.
  • Stock Price DataFrame (data): Obviously, you need to have some data! We use the New Germany Fund Data. We store this data in a pandas dataframe.
  • Size of Training Data, Validation Data (train_size, val_size) : self-explanatory!

Preprocessing the Dataset I: Adding Features

First off, we suggest using fastai to generate columns that introduce more details about the timestamp. For instance, whether the concerned timestamp is at the start of a month, week, year and so on. This would help to incorporate trends and seasonalities in your model. Suppose the quantity you are trying to predict shoots up on the first of every month. Wouldn’t you want your model to take that into consideration?

Import add_datepart from fastai.tabular.
DATE is the column that contains all the dates in DateTime format.
We deleted the DATEElapsed column as we felt we didn’t require it.
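A minimal sketch of this step, assuming (as in the rest of this post) that the dataframe is named data and that fastai v1 is installed:

from fastai.tabular import add_datepart

# Generates date-derived columns such as DATEYear, DATEMonth,
# DATEIs_month_start and DATEElapsed, in place.
# 'DATE' is the column that contains all the dates in DateTime format.
add_datepart(data, 'DATE', drop=False)

# We felt we didn't require the elapsed-time column, so we drop it.
data.drop(columns=['DATEElapsed'], inplace=True)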

Next up, we add lag features. col_name is the column of the quantity you want to predict, like the opening stock price. num is the number of lag features you want to compute. We recommend that you choose a higher value, like 15. We will narrow down the number of lag features later on.

Function to Add Columns of Lag Features
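A minimal sketch of such a function, assuming the rows are daily prices sorted in chronological order:

def add_lag_features(df, col_name, num):
    # The nth lag column holds the value of col_name n days before the current row.
    for n in range(1, num + 1):
        df[col_name + '_lag_' + str(n)] = df[col_name].shift(n)
    return df

# Compute 15 lag columns for the opening price; we narrow this down later.
data = add_lag_features(data, 'OPEN', num=15)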

The nth lag column contains the value of the stock price n days before the current timestamp. For example, the first lag column would contain yesterday’s stock price value in today’s row. The snapshot of the modified dataframe below explains this better:

2014-06-23 is the first date for which we have the value of the stock price. We don’t have the value of the stock price a day before this date. That’s why the first value in OPEN_lag_1 is NaN. Similarly, the first two values in OPEN_lag_2 are NaN.

Data with OPEN Prices and Lag Columns
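An illustrative slice of that dataframe (the prices here are made-up placeholders, not actual fund values):

DATE          OPEN    OPEN_lag_1    OPEN_lag_2
2014-06-23    16.10   NaN           NaN
2014-06-24    16.25   16.10         NaN
2014-06-25    16.18   16.25         16.10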

Determining the Train and Validation Dataset

Correlation of Each Column with OPEN (continuous data only)

The above preprocessing has added a lot of columns. But do we require all of them?

What if your model doesn’t depend on what day of the week it is? You don’t need to consider that column then, right?

But how do you learn which columns your to-be-predicted quantity depends on?
Simple! Plot the correlation (for continuous variables, yes, including lag features) and boxplots (for categorical variables).
We prefer to choose only the columns which have a correlation greater than 0.97.

We drop all continuous columns except for the first 7 lag features.
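A sketch of this selection, assuming the 15 lag columns created earlier (in our run, every continuous column except the first 7 lag features fell below the threshold):

# Correlation of each continuous column with the opening price.
correlations = data.corr()['OPEN'].sort_values(ascending=False)
print(correlations)

# Drop the lag columns that don't clear the 0.97 threshold.
drop_cols = ['OPEN_lag_' + str(n) for n in range(8, 16)]
data.drop(columns=drop_cols, inplace=True)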

Preprocessing the Dataset II: Normalisation

Inputs into our model should preferably be normalized. But this isn’t as simple as finding the statistical values (namely, mean and standard deviation) of all stock price values and performing a simple calculation.

There’s a slight modification. Remember how we specified that the stock price today would largely depend on the N values before it. So instead of the entire data, we use the mean and the standard deviation of the t-1 to t-N values to normalize the stock price at t=0.
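A minimal sketch of this computation, with N set to the 7 lag features we kept above:

def add_rolling_stats(df, col_name, N):
    # Rolling mean/std over a window of N entries, up to and including the
    # current one. min_periods=1 yields a value as soon as one observation
    # is available in the window.
    mean_list = df[col_name].rolling(window=N, min_periods=1).mean()
    std_list = df[col_name].rolling(window=N, min_periods=1).std()
    # Shift one step so each row gets the statistics of the N rows before it
    # (t-1 to t-N), excluding the value at t itself.
    df[col_name + '_mean'] = mean_list.shift(1)
    df[col_name + '_std'] = std_list.shift(1)
    return df

data = add_rolling_stats(data, 'OPEN', N=7)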

The pandas function rolling computes the average over the window entries before (and including) the current entry. min_periods is the minimum number of observations in the window required to have a value (otherwise the result is NaN).

We are concerned with the average of the N values before (and excluding) the current entry. This is because the stock price at t=0 cannot be used to predict the stock price at t=0: we don’t know the stock price at t=0 beforehand! Shifting the computed mean_list and std_list back by one step produces the desired result.

This results in two new columns, col_name_mean and col_name_std.

The mean and standard deviation columns are finally used to normalize the stock price columns, as sketched after this explanation.

The column col_name_mean contains the mean of the N values prior to the stock price value in col_name. So we simply subtract this mean and divide by col_name_std to scale down the col_name values.

However, the lag features cannot simply be scaled down by col_name_mean. Why? Because you need the statistical values (mean and standard deviation) of the N values prior to the lag feature’s own timestamp to scale it down in the right manner. Remember, we aren’t using the statistical values of all stock prices to scale down the values; we are using the statistical values of the previous N timestamps. Put simply, to scale down yesterday’s stock price, I need to compute the mean and standard deviation of the N days before yesterday.
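A sketch of both normalizations under the scheme just described (the column names follow those created above):

def scale_column(df, col_name):
    # Normalize today's price with the mean/std of the N values preceding it.
    df[col_name + '_scaled'] = (df[col_name] - df[col_name + '_mean']) \
                               / df[col_name + '_std']
    return df

def scale_lag_features(df, col_name, num):
    # The nth lag value sits n rows back, so it must be scaled with the
    # rolling statistics of the N values preceding *that* row: the mean/std
    # columns shifted back by n further steps.
    for n in range(1, num + 1):
        lag_col = col_name + '_lag_' + str(n)
        df[lag_col + '_scaled'] = (df[lag_col] - df[col_name + '_mean'].shift(n)) \
                                  / df[col_name + '_std'].shift(n)
    return df

data = scale_column(data, 'OPEN')
data = scale_lag_features(data, 'OPEN', num=7)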

Model

Training XGBoost to predict H values isn’t merely a .fit() call followed by a .predict() call.
Why?
1. The training dataset contains N scaled lag columns, which the test dataset won’t have at first.
2. Future stock prices largely depend on recent values. Thus, the validation dataset is a better determinant of the forecasted values for the test dates than the older training values.

A popular method known as recursive forecasting is deployed instead.

Training the Model: Step 1

The function train_model prepares your train and test datasets and sends them to the user-defined function fit_model.
This function is called with the combined dataset of train and validation entries.

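A minimal sketch of train_model consistent with the walkthrough below; fit_model is the helper defined in Step 2, and predict_H_days is our hypothetical name for the recursive-forecast function of Step 3:

import numpy as np

def train_model(train_val, train_size, H, N, features, features_lag,
                n_estimators, max_depth, val_size):
    # val_size == len(train_val) - train_size; kept to mirror the call below.
    preds, mape_list, mae_list, rmse_list = [], [], [], []
    for i in range(train_size, len(train_val) - H + 1, H // 2):
        # Form the train and test datasets: train on the train_size rows
        # before day i, test on the H days immediately after it.
        train = train_val.iloc[i - train_size:i].dropna()  # drop NaN rows
        test = train_val.iloc[i:i + H]

        trainX = train[features + features_lag]  # date features + scaled lags
        trainY = train['OPEN_scaled']            # model learns normalized prices
        testY = test['OPEN'].values              # metrics use actual prices

        # Scaled values of the N days preceding the first test day
        # seed the recursive forecast.
        prev_values = train_val['OPEN_scaled'].iloc[i - N:i].values

        model, imp_features = fit_model(trainX, trainY, n_estimators, max_depth)
        forecast = predict_H_days(model, test, prev_values, features, N, H)

        mape_list.append(np.mean(np.abs((testY - forecast) / testY)) * 100)
        mae_list.append(np.mean(np.abs(testY - forecast)))
        rmse_list.append(np.sqrt(np.mean((testY - forecast) ** 2)))
        preds.append(forecast)

    return preds, np.mean(mape_list), np.mean(mae_list), np.mean(rmse_list), imp_features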

Form the Train and Test Datasets
In the first iteration, the first train_size values form the training dataset. Once the model is ready, we’d like to predict H values. Thus, the test dataset consists of the H entries immediately after the last entry of the train dataset.
In the next iteration, the train dataset shifts by H/2 values (the step of the for loop). Again, the test dataset is formed by the H entries immediately after the final entry of the train dataset.
This process is repeated until we reach the last prediction we can make, i.e., for the last H values of the validation dataset.

Delete NaN Values
The lag and rolling-statistic columns leave NaN values in the first few rows, and NaN values aren’t accepted by the model, so these rows are dropped.

Prepare the Train and Test Datasets
The train and test datasets have the features we shortlisted in section Preprocessing I. Additionally, the train dataset also contains the scaled lag features.
Notice that trainY contains scaled-down opening price values while testY contains the actual values.
Why?
The model is trained with normalized values to prevent bias. To compute accuracy metrics like MAPE and RMSE correctly, the rescaled prediction values are compared with the actual (not scaled-down) testY values.

Arrays of Previous N Values
You’ll soon see why this is necessary! For the time being, remember that we need to tell our model that these N values are going to have a significant say in what the predicted (N+1)th value is; prev_values holds the scaled values of the N days preceding the first test day.

Training the Model: Step 2
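A sketch of fit_model using the scikit-learn wrapper of XGBoost (the random_state value is our own choice):

from xgboost import XGBRegressor

def fit_model(trainX, trainY, n_estimators, max_depth):
    # Gradient-boosted tree regressor from the XGBoost scikit-learn API.
    model = XGBRegressor(n_estimators=n_estimators, max_depth=max_depth,
                         random_state=42)
    model.fit(trainX, trainY)
    # Estimates of feature importance from the trained model.
    imp_features = model.feature_importances_
    return model, imp_features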

The function fit_model, sketched above, instantiates the model and fits it on the training dataset generated under section Training the Model: Step 1.

imp_features returns the estimates of feature importance from a trained predictive model. This article beautifully describes how to interpret this variable.

Training the Model: Step 3

This function is the crux of recursive forecasting. You’ll finally learn how to exploit the N values to get the best forecast.
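A minimal sketch of this recursion. predict_H_days is our hypothetical name; it assumes the OPEN_mean and OPEN_std columns computed earlier are available in the test rows for rescaling, and that features_lag named the columns OPEN_lag_n_scaled:

import numpy as np

def predict_H_days(model, test, prev_values, features, N, H):
    lag_cols = ['OPEN_lag_' + str(n) + '_scaled' for n in range(1, N + 1)]
    scaled_history = list(prev_values)  # scaled values driving the recursion
    forecast = np.array([])             # rescaled predictions get appended here
    for day in range(H):
        # Scaled values of the N days just preceding the day of prediction.
        forecast_scaled = scaled_history[-N:]
        # Copy the row for this timestamp and build the lag columns
        # the model was trained with.
        row = test.iloc[[day]][features].copy()
        for n in range(1, N + 1):
            row[lag_cols[n - 1]] = forecast_scaled[-n]
        pred_scaled = model.predict(row[features + lag_cols])[0]
        # The scaled prediction feeds the next day's lag features...
        scaled_history.append(pred_scaled)
        # ...while the rescaled value is what we report and score.
        mean = test['OPEN_mean'].iloc[day]
        std = test['OPEN_std'].iloc[day]
        forecast = np.append(forecast, pred_scaled * std + mean)
    # The last H values are the days that have been predicted.
    return forecast[-H:]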

The function iterates over each day in the set of H days. Each day is forecasted one at a time using the model we fitted in the section Training the Model: Step 2.

Remember, our model was trained with a dataset that had lag features as well. This function creates those lag features for each test date and sends them to the model in order to obtain a prediction.

How are the lag features created?
For each day, we create a variable forecast_scaled, which consists of the scaled-down values of the N days just preceding the day of prediction. (Notice how, at the first iteration, this is nothing but the prev_values we had created in section Training the Model: Step 1.)
We then take a copy of the row of the dataset whose timestamp we’d like to predict the opening value for, create the lag column names, and add these columns to the row with their corresponding values. The history of scaled values is updated in each iteration, so forecast_scaled always contains the previous N values.

Each prediction is rescaled and appended to the NumPy array forecast. The last H values of this array are returned, because these are the days that have been predicted.

Prediction

In accordance with the above-defined functions, stock price prediction can be carried out by:

predictions, mape_net, mae_net, rmse_net, features_importance = train_model(train_val, train_size, horizon, N_final, features, features_lag, n_estimators, max_depth, validation.shape[0])

Conclusion

This article gives you a glimpse of how to use XGBoost to carry out time-series prediction. Shout out to Yibin NG for his excellent article here, which helped clarify these concepts for us.

Feel free to reach out in case you think we got something wrong.

This blog was written by Nikita Saxena.
