Applying Ordinary Least Squares to Analyze S&P 500 Sector Indices

Sam Erickson
4 min read · Dec 16, 2023


The other day, I found myself curious about which financial market sectors have the biggest impact on the S&P 500 index. Sure, I could just Google the question and be done with it, but I thought it would be way cooler to apply my data science skills and answer it with an analysis of my own.

To this end, I decided that one way to measure the effect of a sector on the S&P 500 is to measure the correlation between the S&P 500 index and the S&P sector indices, which can be found here. However, correlation alone does not provide a complete answer, because I want to determine the effect of each sector on the S&P 500 index's movements. To do that, I will fit a linear regression model using ordinary least squares.

I will do three things in this article: (1) collect and preprocess the data, (2) measure the correlations between all of the indices, and (3) fit a linear regression model with the S&P 500 index as the target (y) and the sector indices as the features (X).

Preprocessing the Data

First, let's preprocess the data. All of the data comes from Yahoo Finance and covers December 2018 to December 2023, for the following sectors: utilities, real estate, materials, industrials, information technology, health care, financials, energy, consumer staples, consumer discretionary and communication services. Note that spx is the S&P 500 index itself. The function below also standardizes each series (subtracting the mean and dividing by the standard deviation), which puts all of the indices on a common scale so the regression coefficients are comparable:

import pandas as pd
import numpy as np

def read_and_process_data(sector_name_list):
    merged_df = None
    for sector_name in sector_name_list:
        # Each sector's price history lives in its own CSV from Yahoo Finance
        sector_df = pd.read_csv(sector_name + '.csv')
        sector_df['date'] = pd.to_datetime(sector_df['Date'])
        # Keep only the adjusted close, renamed to the sector name
        sector_df[sector_name] = sector_df['Adj Close']
        sector_df.drop(columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'],
                       inplace = True)

        # Inner-join the sectors on date so every row has data for all of them
        if merged_df is None:
            merged_df = sector_df
        else:
            merged_df = merged_df.merge(sector_df, how = 'inner', on = 'date')
    merged_df.set_index('date', inplace = True)
    merged_df.dropna(inplace = True)
    # Standardize each column (z-score) so all indices share a common scale
    return np.round((merged_df - merged_df.mean())/merged_df.std(), 1)


df = read_and_process_data(['communication_services', 'consumer_discretionary', 'consumer_staples',
                            'energy', 'financials', 'health_care',
                            'industrials', 'information_technology', 'materials',
                            'real_estate', 'utilities', 'spx'])
df.head()
Head of Processed Data

Measuring Correlations

We can efficiently look at all of the correlations with a heat map:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between all of the standardized indices
corr = df.corr()
sns.set(font_scale=0.6)
sns.heatmap(corr, annot = True)
plt.title("Correlation matrix of S&P Sectors")
plt.show()
Heat Map of Correlation Matrix for S&P Sector Data
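
To read the exact numbers off without squinting at the heat map, we can also sort the spx column of the correlation matrix directly (a quick sketch using the corr frame computed above):

# Each sector's correlation with the S&P 500, strongest first
print(corr['spx'].drop('spx').sort_values(ascending=False).round(2))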

The S&P 500 index (spx) is most correlated with the information technology and materials sectors, both at 0.97. The next runners-up are industrials and health care at 0.95, followed by consumer staples at 0.92 and consumer discretionary at 0.9. Energy is the least correlated with the S&P 500 index at 0.44. Based on this, I expect the least squares model to have all positive coefficients, with the information technology and materials sectors having the largest ones.

Creating the OLS Model

First let’s define X and y and fit our linear model:

from sklearn.linear_model import LinearRegression

# Features: the eleven sector indices; target: the S&P 500 index
X = df[['communication_services', 'consumer_discretionary', 'consumer_staples',
        'energy', 'financials', 'health_care',
        'industrials', 'information_technology', 'materials',
        'real_estate', 'utilities']].values
y = df['spx'].values

reg = LinearRegression().fit(X, y)
reg.score(X, y)  # R² of the fit on the training data
R² Score
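
scikit-learn's LinearRegression is ordinary least squares under the hood, but it only reports the fit, not the uncertainty around each coefficient. As a sketch, the same regression could be run through statsmodels (assuming it is installed) to get standard errors and confidence intervals along with the coefficients:

import statsmodels.api as sm

# Same OLS fit, but with standard errors, t-statistics and confidence intervals
ols_model = sm.OLS(y, sm.add_constant(X)).fit()  # statsmodels needs an explicit intercept
print(ols_model.summary())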

Next let’s determine which sectors have the largest impact on the S&P 500:

x_columns = ['communication_services', 'consumer_discretionary', 'consumer_staples',
             'energy', 'financials', 'health_care',
             'industrials', 'information_technology', 'materials',
             'real_estate', 'utilities']

# Sort the fitted coefficients from largest to smallest
argsorted = np.argsort(reg.coef_)[::-1]
for idx in argsorted:
    print('{} has a coefficient of {}'.format(x_columns[idx],
                                              np.round(reg.coef_[idx], 2)))
S&P Sector Coefficients

The results are interesting: although information technology has the largest coefficient, materials has a much smaller one, despite the fact that both are equally correlated with the S&P 500 index. This tells me a couple of things: (1) there are interactions that can (and should) be modeled, and (2) the fact that our features are so highly correlated with one another means there is substantial multicollinearity in our model, which can pose problems for ordinary least squares. We can address both of these problems by including interaction terms and by using a more robust model such as Lasso or Ridge regression, or even a tree-based regression model such as gradient boosting or random forests. A sketch of the Ridge idea follows below.
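
As a rough sketch of that last idea, here is what Ridge regression with a cross-validated regularization strength might look like on the same features (RidgeCV and the alpha grid here are my choices, not tuned values):

from sklearn.linear_model import RidgeCV

# Ridge shrinks correlated coefficients toward each other, stabilizing the
# fit in the presence of multicollinearity
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print('R^2:', ridge.score(X, y))
for name, coef in sorted(zip(x_columns, ridge.coef_), key=lambda p: -p[1]):
    print('{} has a coefficient of {}'.format(name, np.round(coef, 2)))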

Thank you for reading my article! If you enjoy reading my articles, please subscribe!
