Forecasting SP500 stocks with XGBoost and Python Part 2: Building the model
Welcome to part 2 of the “Forecasting SP500 stocks with XGBoost and Python” series, a two-part series where I take you through creating a basic XGBoost model for time series forecasting.
The goal with this series is to take you through how I created a basic forecasting model, based on XGBoost’s gradient boosting trees algorithm, or simply XGBRegressor.
I started out by wanting to learn how easily I could get just a working model, and now I am sharing that process with you — in other words this series is especially aimed at people starting out with forecasting.
Part 1 was all about sourcing the data. We went through using pandas-datareader to download SP500 stocks from Jan 1st 2017 up to Dec 31st 2022 for all tickers. At the end I showed you how to calculate the Daily Return % to know how much the adjusted close value changes day over day.
As usual, you can find all the supporting code on GitHub here.
So, for this model we start by… Actually, let’s go through the dataset once more before doing machine learning.
- Date: of the stock values, there is one line per day per stock
- High: highest value for the day per stock
- Low: lowest value for the day per stock
- Open: opening value for the day per stock
- Close: closing value for the day per stock
- Volume: number of transactions on that date per stock
- Adj Close: or adjusted close, “is the closing price after adjustments for all applicable splits and dividend distributions”
- Daily Return %: day over day growth of Adj Close
Now that you’re familiar with the data, let’s dive into feature engineering.
The primary goal with feature engineering for this model was to convert all non-numeric columns in the original dataset into numeric features. In other words, I had to map the tickers into integers, and split the date parts into their own features.
The first cell creates a separate dataframe (DF) that holds all the distinct tickers from the Ticker column and maps them to an integer. This was not needed for training, but relevant for evaluating the model later.
The second cell creates a separate column for each date feature: year, month, and day. Again, we can only have numeric features for training.
The third cell maps the tickers to integers, but this time it is the feature for training, not a separate DF as before.
Finally, the fourth cell removes unwanted columns.
I did not go further with feature engineering for this model so here’s what the DF looks like at this point:
Don’t worry about the Date column — there is one last thing we need it for.
Next up is splitting the dataset into training and evaluation datasets. The former is used to train the model, the latter to evaluate how well the model performs.
Compared to a regular regression or classification model, the order of the data matters here. This is a time series model i.e. we want the rows to be sorted by date. In other words, the train-evaluation split is a direct split of the DF, in this case the first 90% rows used for the training, the remaining 10% to evaluate the model.
I calculated which list index represents the 90% split for the unique dates available: the date at the 90% split index is the date used to split the DF rows.
The query function is quite handy to filter DFs in a more pythonic syntax, it even lets you reference variables with an @.
Now that I had a DF for each dataset, I just had to separate features and labels — note how Date was not included in the features, it was kept at the end of feature engineering purely to split the data.
Build the Model
Now comes the time to build the ML model. I am skipping everything about hyperparameter tuning and presenting a set of values, but bear in mind these came from experimentation and automatic tuning. I highly recommend checking Optuna, it was the library used for tuning hyperparameters automatically.
Again, after a lot of trial and error and some automatic tuning I found this to be the best set of hyperparameters — this includes the 90% split for training and eval.
And here is the code for training the model:
The output shows all the parameters used for the model with their default values, except for the four parameters I specified when defining the model in the previous code cell.
XGBoost has a neat method to plot the feature importance to the model, i.e. how much each training feature contributes to the model’s predictive prowess.
The daily highest value was the most important feature for forecasting, followed by the ticker. Personally I was expecting the ticker and another financial feature at the top, but expected the Daily Return % to be more relevant. Turns out my intuition was proven wrong by data :)
Evaluate the model — regression metrics
At this point the model is complete, but I want to show you how I evaluated the model. I did it in two parts:
- First I used common regression metrics because I had both the actual value and the value predicted by the model, what we call supervised training
- And second I plotted the predicted against the actual values for multiple tickers
Note I used regression metrics because the model predicts a concrete number, in this case dollars, not a numeric value to assign a class to each data point.
First off I had to rebuild the evaluation data. This meant recreating the dates from year, month and day, making predictions for the evaluation data and mapping the ticker numbers back to the names.
The base DF for this was the x_eval DF i.e. the eval DF without the label/Adjusted Close column.
The first cell creates that new DF for evaluation and recreates the Date column from the separate year, month, and day features.
The second cell uses the original evaluation dataset to predict the Adjusted Close values — remember x_eval and df_pred are essentially the same at this point with the same number of rows, but x_eval has the necessary features for making predictions. This is only a matter of saving the predictions in the right DF, to be fair I could have made predictions with df_pred before creating the Date column.
Remember that separate DF created at the beginning with the ticker names and numbers? It returns in the third code cell to map the ticker numbers back to names.
With the DF complete, now we can calculate:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Mean Absolute Error Percentage (MAEP)
In a nutshell, all three columns calculate an aggregate value for the difference between the predicted and actual values, but in different units.
According to the metrics the model has, on average, 1.13% of difference between the predicted and the real values (MAEP). From MAE’s perspective, each prediction is, on average, wrong by 1.56 USD (the data provided is in US dollars).
Evaluate the model — plotting predictions vs actuals
Okay at this point I won’t bother you with any more code. The complete notebook is available on GitHub here so you can read at your own leisure how the plots were calculated. Let’s jump straight into the plots produced.
Picking AMD as an isolated stock, the model was pretty close especially until August 2021, but then the difference grows ever so slightly over time, being unable to predict some patterns in the real data — namely in the period between October and November 2021.
The first plot showed AMD because it was a familiar company, but NVR was chosen in the second plot plot because it was one of the companies where the model showed the weakest predictive power during initial tests. It wasn’t completely horrible, but it failed to predict most of the changes in patterns over time.
Looking at predictions for AMD and Intel side by side we see the model performs quite decently. Even with Intel’s stocks being much more regular in the bottom visual, we can see again that the model did a decent job predicting the fluctuations in AMD’s stocks over time.
Looking at another set of stocks with irregular evolutions for demo purposes, we can see the model did not fully adapt to the fluctuations, especially for GOOG’s stocks.
Okay enough looking at stocks, overall what companies did the model learn the least about? Well, here’s a bar chart of the top 10 companies with the highest MAPE.
Wow, NLOK has an exceedingly high MAPE… wonder what the predictions vs actuals looks like…
Hmm I see, not that great. The model learned the stock was highly volatile, but did not learn how exactly the volatility of NLOK worked.
Okay, let’s do one last plot: evolution of MAPE over time:
Nothing unexpected here: the further into the future the lower the predictive power of the model because financial data is affected by so many variables in the real world.
First and foremost, this was a fun ML project to work on in my spare time. It had little science involved if I’m completely honest, but doing even basic feature engineering and hyper parameter tuning to see how low I could get the MAPE and how close I could get the predicted and actual lines in the plots was a rewarding iterative process.
Clearly there are improvements to be done in this model, different data to be included in training and probably even a different model algorithm to be used. Moreover, this was a generic model for all SP500 stocks — there is clearly a case to design more specific models, such as a model for a specific industry.
However, finally getting hands-on experience with a time series model and making it work was quite the adventure and I hope sharing the experience helps you get started with XGBoost and forecasting. We all do mistakes and arrive at bad results when starting, so I wanted to share my experience to give you a head start in your next project :)