Applying K-Medoids Clustering to S&P Sector Indices, Part I

Sam Erickson
Dec 31, 2023


I have recently been writing a lot about the S&P sector indices. Specifically, I have been interested in determining which sectors influence the S&P 500 index the most, as can be seen here and here. One of the most interesting parts of both of those articles is where I visualize a correlation matrix with a heat map. Not only is the heat map beautiful, but it also led me to wonder how I could make more use of the correlation matrix. For the sake of brevity, here is the code that produces the correlation matrix for the different sectors:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def read_and_process_data(sector_name_list):
    merged_df = None
    for sector_name in sector_name_list:
        sector_df = pd.read_csv(sector_name + '.csv')
        sector_df['date'] = pd.to_datetime(sector_df['Date'])
        sector_df[sector_name] = sector_df['Adj Close']
        sector_df.drop(columns = ['Date', 'Open', 'High', 'Low', 'Close',
                                  'Adj Close', 'Volume'],
                       inplace = True)

        if merged_df is None:
            merged_df = sector_df
        else:
            merged_df = merged_df.merge(sector_df, how = 'inner', on = 'date')
    merged_df.set_index('date', inplace = True)
    merged_df.dropna(inplace = True)
    return np.round((merged_df - merged_df.mean())/merged_df.std(), 1)


df = read_and_process_data(['communication_services', 'consumer_discretionary',
                            'consumer_staples', 'energy', 'financials',
                            'healthcare', 'industrials',
                            'information_technology', 'materials',
                            'real_estate', 'utilities'])

corr = df.corr()
sns.set(font_scale = 0.6)
plt.figure(figsize = (8, 6))
sns.heatmap(corr, annot = True)
plt.title("Correlation matrix of S&P Sectors")
plt.show()

I recently realized that I could cluster the sectors using a dissimilarity metric based on correlation: take 1 - correlation as the dissimilarity between each pair of sectors, then apply the k-medoids algorithm to the resulting dissimilarity matrix. Here is the code to get the dissimilarity matrix:

dissimilarity_measures = df.corr()
# Correlations are mostly positive, so set negatives to 0
# since they are small anyway
dissimilarity_measures[dissimilarity_measures < 0] = 0
dissimilarity_measures = 1 - dissimilarity_measures

sns.set(font_scale=0.6)
plt.figure(figsize = (8, 6))
sns.heatmap(dissimilarity_measures, annot = True)
plt.title("Dissimilarity Matrix of S&P Sectors")
plt.show()

In the dissimilarity matrix above, a score of 1 means two sectors are totally dissimilar (uncorrelated), while a score of 0 means they are totally similar (perfectly correlated); scores in between measure the degree of dissimilarity.
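To build some intuition for how 1 - correlation behaves as a dissimilarity, here is a small standalone sketch on made-up data (the series names and noise levels are hypothetical, not the sector data): two nearly identical series end up close together, while an unrelated series ends up far from both.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is 'a' plus a little noise, 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),  # nearly identical to a
    'c': rng.normal(size=500),            # unrelated to a and b
})

# Same transformation as above: clip small negative correlations to 0,
# then take 1 - correlation as the dissimilarity.
diss = 1 - toy.corr().clip(lower=0)
print(diss.round(2))
# Self-dissimilarity is exactly 0, and the a-b pair is much
# closer than either is to c.
```

The diagonal is always 0 because every series has correlation 1 with itself, which is exactly the property a dissimilarity measure should have.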

Here is the code that I wrote to do the clustering on this dissimilarity matrix:

class KMedoids:
    def __init__(self, dissimilarity_measures, cluster_assignments, X):
        '''
        Implement the k-medoids algorithm on the dissimilarity measures and
        initial cluster assignments.

        dissimilarity_measures: 1 - correlation_matrix
        cluster_assignments: list of lists, the best initial guess at the
            initial assignments for the clusters
        X: list of strings, the members that we are going to cluster
        '''
        self.dissimilarity_measures = dissimilarity_measures
        self.clusters = cluster_assignments
        self.X = X

    def __update_cluster_centers(self):
        # The medoid of each cluster is the member with the smallest total
        # dissimilarity to the other members of that cluster.
        updated_cluster_centers = []
        for cluster in self.clusters:
            min_distance = np.inf
            cluster_center = None
            for member in cluster:
                current_distance = 0
                for other_member in cluster:
                    current_distance += self.dissimilarity_measures.loc[(member, other_member)]
                if current_distance < min_distance:
                    min_distance = current_distance
                    cluster_center = member
            updated_cluster_centers.append(cluster_center)

        self.cluster_centers = updated_cluster_centers

    def __update_clusters(self):
        # Reassign every member to the cluster with the nearest medoid.
        updated_clusters = [[] for _ in self.clusters]

        for x in self.X:
            min_distance = np.inf
            min_cluster = None
            for cluster_center in self.cluster_centers:
                if self.dissimilarity_measures.loc[(x, cluster_center)] < min_distance:
                    min_distance = self.dissimilarity_measures.loc[(x, cluster_center)]
                    min_cluster = cluster_center
            idx = self.cluster_centers.index(min_cluster)
            updated_clusters[idx].append(x)

        return updated_clusters

    def fit(self):
        # Alternate the two update steps until the clusters stop changing.
        while True:
            self.__update_cluster_centers()
            updated_clusters = self.__update_clusters()

            changed = False
            for cluster, old_cluster in zip(updated_clusters, self.clusters):
                if set(cluster) != set(old_cluster):
                    changed = True
                    break
            if changed:
                self.clusters = updated_clusters
            else:
                break
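Before fitting on the real data, it is worth sanity-checking the medoid-update rule. Here is a standalone sketch of that same rule (the 4×4 dissimilarity matrix and member names are made up for illustration, not the sector data):

```python
import pandas as pd

# Hypothetical dissimilarity matrix: 'a' and 'b' are close,
# 'c' and 'd' are close, and the two pairs are far apart.
members = ['a', 'b', 'c', 'd']
D = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]],
    index=members, columns=members)

def medoid(cluster, D):
    # Same rule as __update_cluster_centers: pick the member with the
    # smallest summed dissimilarity to every other member of its cluster.
    return min(cluster, key=lambda m: sum(D.loc[m, o] for o in cluster))

print(medoid(['a', 'b', 'c'], D))  # prints 'b'
```

Member 'b' wins because its summed dissimilarity to 'a' and 'c' (0.1 + 0.8 = 0.9) is smaller than the sums for 'a' (1.0) and 'c' (1.7).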

Finally, here is the code I used to do the cluster fitting:

# Our initial guess for cluster assignments
cluster_assignments = [
    ['communication_services', 'consumer_discretionary'],
    ['consumer_staples', 'healthcare', 'information_technology'],
    ['energy'],
    ['financials', 'industrials', 'materials'],
    ['real_estate', 'utilities']
]

# All of the 11 sector indices
X = ['communication_services', 'consumer_discretionary', 'consumer_staples',
     'energy', 'financials', 'healthcare',
     'industrials', 'information_technology', 'materials',
     'real_estate', 'utilities']

kmedoids = KMedoids(dissimilarity_measures, cluster_assignments, X)
kmedoids.fit()
kmedoids.clusters
[Figure: Resulting Clusters from K-Medoids]

I am somewhat surprised that information technology and healthcare did not end up in the same cluster, given that they have a correlation of 0.92 with each other. The fact that real estate and energy end up in their own clusters does make sense to me. The results might indicate that the number of clusters needs to change from five to some other number. One way to choose the number of clusters is to plot the number of clusters against the sum of the dissimilarities from each cluster member to its cluster's medoid; the best choice is where an "elbow" occurs in the plot. Maybe this will be the topic of my next article. Another issue could be my choice of dissimilarity measure. Finally, randomizing the initial cluster assignments (and keeping the best of several runs) could improve the results as well.
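The elbow quantity just described can be sketched as follows. This is a standalone illustration with a made-up dissimilarity matrix and hand-picked clusterings for each candidate number of clusters, not a run on the sector data:

```python
import pandas as pd

# Hypothetical dissimilarity matrix with two obvious pairs: {a, b} and {c, d}.
members = ['a', 'b', 'c', 'd']
D = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]],
    index=members, columns=members)

def total_cost(clusters, D):
    # The elbow quantity: sum over clusters of each member's
    # dissimilarity to its cluster's medoid.
    cost = 0.0
    for cluster in clusters:
        medoid = min(cluster, key=lambda m: sum(D.loc[m, o] for o in cluster))
        cost += sum(D.loc[medoid, x] for x in cluster)
    return cost

# The cost drops sharply from k=1 to k=2, then barely improves at k=3,
# so the elbow here is at k=2.
for clusters in ([['a', 'b', 'c', 'd']],
                 [['a', 'b'], ['c', 'd']],
                 [['a', 'b'], ['c'], ['d']]):
    print(len(clusters), total_cost(clusters, D))
```

Plotting these costs against the number of clusters and looking for the bend is exactly the elbow heuristic; on the real sector data the same `total_cost` could be computed from `kmedoids.clusters` and `dissimilarity_measures` for each candidate number of clusters.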

As always, thank you for reading my blog!
