Applying XGBoost to Analyze S&P Sector Indices

Sam Erickson
4 min read · Dec 26, 2023


In my last article, I applied ordinary least squares (OLS) to the problem of predicting the S&P 500 index from the S&P sector indices. The goal wasn’t so much to predict the S&P 500 index as it was to understand which sectors influence the S&P 500 the most. However, I quickly discovered that the model was plagued by multicollinearity, which occurs when the features used in a linear model are highly correlated with one another. Multicollinearity can make a model behave poorly and makes its coefficients harder to interpret. In this case, I found that although the information technology and materials sector indices are equally correlated with the S&P 500 index at 0.97, the two sectors are also highly correlated with each other, at 0.91. I believe this hurt model interpretation: I ended up with a linear model in which the information technology sector had the largest coefficient while the materials sector coefficient was comparatively small.
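To make the multicollinearity check concrete, here is a minimal sketch of computing variance inflation factors (VIFs) with statsmodels. This is my own illustration rather than part of the original analysis, and the vif_table helper assumes a feature data frame like the one built in the next section:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(features: pd.DataFrame) -> pd.Series:
    # VIFs above roughly 10 are a common rule of thumb for
    # problematic multicollinearity among the regressors.
    exog = features.values
    vifs = [variance_inflation_factor(exog, i) for i in range(exog.shape[1])]
    return pd.Series(vifs, index = features.columns).sort_values(ascending = False)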

In this article, I will use a more robust model to determine which indices contribute the most to the S&P 500 index. To do this, I will use XGBoost, which is a favorite tool of mine. It applies gradient boosting to regression trees, and it often performs very well in practice. I also really like that XGBoost provides a method for ranking feature importance. We will use this to determine which features are the most important in the model.
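As a quick illustration of the idea behind gradient boosting (a toy sketch of my own, not XGBoost’s actual implementation): each new shallow tree is fit to the residuals of the ensemble built so far, so the predictions improve additively.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size = (200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale = 0.1, size = 200)

learning_rate = 0.1
pred = np.full_like(y_toy, y_toy.mean())  # start from the mean prediction
for _ in range(50):
    residuals = y_toy - pred              # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth = 3).fit(X_toy, residuals)
    pred += learning_rate * tree.predict(X_toy)

XGBoost builds on this scheme with regularization, shrinkage, and a number of other refinements.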

Data Preparation

First, I will process the data as before. The resulting data frame contains the standardized adjusted closing price for each S&P sector index, as well as for the S&P 500 index (the spx column):

import pandas as pd
import numpy as np

def read_and_process_data(sector_name_list):
    merged_df = None
    for sector_name in sector_name_list:
        # Each CSV has Yahoo-Finance-style columns; keep only the
        # date and the adjusted close, renamed to the sector name.
        sector_df = pd.read_csv(sector_name + '.csv')
        sector_df['date'] = pd.to_datetime(sector_df['Date'])
        sector_df[sector_name] = sector_df['Adj Close']
        sector_df.drop(columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'],
                       inplace = True)

        # Inner join on date so every series covers the same trading days.
        if merged_df is None:
            merged_df = sector_df
        else:
            merged_df = merged_df.merge(sector_df, how = 'inner', on = 'date')
    merged_df.set_index('date', inplace = True)
    merged_df.dropna(inplace = True)
    # Standardize each column (z-score) and round to one decimal place.
    return np.round((merged_df - merged_df.mean())/merged_df.std(), 1)


df = read_and_process_data(['communication_services', 'consumer_discretionary', 'consumer_staples',
                            'energy', 'financials', 'health_care',
                            'industrials', 'information_technology', 'materials',
                            'real_estate', 'utilities', 'spx'])
df.head()
Processed data (output of df.head())

Visualizing Correlation Matrix

Next, let’s visualize the correlation matrix with a heat map:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()
sns.set(font_scale=0.6)
sns.heatmap(corr, annot = True)
plt.title("Correlation matrix of S&P Sectors")
plt.show()

The S&P 500 index (spx) is most correlated with the information technology and materials sectors, both at 0.97. The next most correlated sectors are industrials and health care at 0.95, followed by consumer staples at 0.92 and consumer discretionary at 0.90. Energy is the least correlated with the S&P 500 index, at 0.44.
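If you want this ranking without reading it off the heat map, one quick way (my own addition) is to sort the spx column of the correlation matrix:

# Rank the sectors by their correlation with the S&P 500 index.
spx_corr = corr['spx'].drop('spx').sort_values(ascending = False)
print(spx_corr)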

Creating the XGBoost Model

To get the best model, I will use cross-validated grid search to choose the XGBoost hyperparameters:

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

X = df[['communication_services', 'consumer_discretionary', 'consumer_staples',
        'energy', 'financials', 'health_care',
        'industrials', 'information_technology', 'materials',
        'real_estate', 'utilities']].values
y = df['spx'].values

param_grid = {
    'subsample': [0.5, 0.7, 1],
    'reg_lambda': [0.1, 0.5, 0.9, 1.3]
}

xgb_model = xgb.XGBRegressor(n_estimators = 2000, max_depth = 3, learning_rate = 0.1)
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X, y)

print("Best set of hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Next, let’s fit the model using these parameters:

X = df[['communication_services', 'consumer_discretionary', 'consumer_staples',
        'energy', 'financials', 'health_care',
        'industrials', 'information_technology', 'materials',
        'real_estate', 'utilities']].values
y = df['spx'].values

xgb_model = xgb.XGBRegressor(n_estimators = 2000, max_depth = 3, learning_rate = 0.1,
                             reg_lambda = 0.1, subsample = 0.5, importance_type = 'total_gain')

xgb_model.fit(X, y)
xgb_model.score(X, y)
R² of the XGBoost model (output of xgb_model.score)

The model performs well, at least better than the previous OLS model that I tried. Keep in mind that this is the R² on the training data, since the model was fit and scored on the same X and y.
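For a fairer comparison between the two models, one option (a sketch of my own, not part of the original analysis) is to compare cross-validated RMSE for the tuned XGBoost model and plain OLS:

# Compare XGBoost and OLS on cross-validated RMSE instead of
# training R^2. Assumes X and y as defined above.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

for name, model in [('xgboost', xgb_model), ('ols', LinearRegression())]:
    scores = cross_val_score(model, X, y, cv = 5,
                             scoring = 'neg_root_mean_squared_error')
    print(name, -scores.mean())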

Visualizing Feature Importance

Now that we have an XGBoost model that performs well, it is time to visualize the feature importances, which can be read as an indication of which sectors have the most influence on the S&P 500:

import matplotlib.pyplot as plt
import seaborn as sns

x_columns = np.array(['communication_services', 'consumer_discretionary', 'consumer_staples',
                      'energy', 'financials', 'health_care',
                      'industrials', 'information_technology', 'materials',
                      'real_estate', 'utilities'])

# Min-max scale the importances to a 0-100 range for readability.
feature_importance = xgb_model.feature_importances_
feature_importance_std = (feature_importance - np.min(feature_importance)) / (np.max(feature_importance) - np.min(feature_importance))
feature_importance = feature_importance_std * 100

# Sort the sectors by importance so the bar plot reads top-down.
sorted_idx = np.argsort(feature_importance)
feature_importance = feature_importance[sorted_idx]
x_columns = x_columns[sorted_idx]

plt.title('S&P Sector Importance in the S&P 500')
sns.barplot(y = x_columns, x = feature_importance, orient = 'h');

The results are more in line with what I was expecting, given the correlations we visualized earlier. I am still somewhat surprised, though, that information technology scored so low.
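One caveat worth adding (my own note, not from the original analysis): when features are highly correlated, boosted trees can route their splits through whichever correlated sector happens to be chosen first, so gain-based importances can understate a sector like information technology. A permutation-importance cross-check is one way to probe this:

# Sketch: permutation importance as a cross-check on the gain-based
# ranking. Assumes xgb_model, X, and y from above.
from sklearn.inspection import permutation_importance

feature_names = ['communication_services', 'consumer_discretionary', 'consumer_staples',
                 'energy', 'financials', 'health_care',
                 'industrials', 'information_technology', 'materials',
                 'real_estate', 'utilities']

result = permutation_importance(xgb_model, X, y, n_repeats = 10, random_state = 0)
for idx in result.importances_mean.argsort()[::-1]:
    print(feature_names[idx], round(result.importances_mean[idx], 3))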
