Applying K-Medoids Clustering to S&P Sector Indices, Part I

Sam Erickson
Dec 31, 2023


I have recently been writing a lot about the S&P sector indices. Specifically, I have been interested in determining which sectors influence the S&P 500 index the most, as can be seen here and here. One of the most interesting parts of both of those articles is where I visualize a correlation matrix with a heat map. Not only is the heat map beautiful, but it also led me to wonder how I could make more use of the correlation matrix. For the sake of brevity, here is the code that produces the correlation matrix for the different sectors:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def read_and_process_data(sector_name_list):
    merged_df = None
    for sector_name in sector_name_list:
        sector_df = pd.read_csv(sector_name + '.csv')
        sector_df['date'] = pd.to_datetime(sector_df['Date'])
        sector_df[sector_name] = sector_df['Adj Close']
        sector_df.drop(columns = ['Date', 'Open', 'High', 'Low', 'Close',
                                  'Adj Close', 'Volume'],
                       inplace = True)

        if merged_df is None:
            merged_df = sector_df
        else:
            merged_df = merged_df.merge(sector_df, how = 'inner', on = 'date')
    merged_df.set_index('date', inplace = True)
    merged_df.dropna(inplace = True)
    return np.round((merged_df - merged_df.mean())/merged_df.std(), 1)


df = read_and_process_data(['communication_services', 'consumer_discretionary',
                            'consumer_staples', 'energy', 'financials',
                            'healthcare', 'industrials',
                            'information_technology', 'materials',
                            'real_estate', 'utilities'])

corr = df.corr()
sns.set(font_scale = 0.6)
plt.figure(figsize = (8, 6))
sns.heatmap(corr, annot = True)
plt.title("Correlation matrix of S&P Sectors")
plt.show()

I recently realized that I could cluster the sectors using a dissimilarity metric based on correlation: take 1 - correlation as the dissimilarity between each pair of sectors, then apply the k-medoids algorithm to the resulting dissimilarity matrix. Here is the code to get the dissimilarity matrix:

dissimilarity_measures = df.corr()
# Correlations are mostly positive, so set negatives to 0
# since they are small anyway
dissimilarity_measures[dissimilarity_measures < 0] = 0
dissimilarity_measures = 1 - dissimilarity_measures

sns.set(font_scale=0.6)
plt.figure(figsize = (8, 6))
sns.heatmap(dissimilarity_measures, annot = True)
plt.title("Dissimilarity Matrix of S&P Sectors")
plt.show()

In the dissimilarity matrix above, a score of 1 means two sectors are totally dissimilar (uncorrelated), while a score of 0 means they are totally similar (perfectly correlated); scores in between measure the degree of dissimilarity.
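To build some intuition for how 1 - correlation behaves as a dissimilarity, here is a small standalone sketch on made-up data (the series names and noise levels are hypothetical, not the sector data): two nearly identical series end up close together, while an unrelated series ends up far from both.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is 'a' plus a little noise, 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),  # nearly identical to a
    'c': rng.normal(size=500),            # unrelated to a and b
})

# Same transformation as above: clip small negative correlations to 0,
# then take 1 - correlation as the dissimilarity.
diss = 1 - toy.corr().clip(lower=0)
print(diss.round(2))
# Self-dissimilarity is exactly 0, and the a-b pair is much
# closer than either is to c.
```

The diagonal is always 0 because every series has correlation 1 with itself, which is exactly the property a dissimilarity measure should have.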

Here is the code that I wrote to do the clustering on this dissimilarity matrix:

class KMedoids:
    def __init__(self, dissimilarity_measures, cluster_assignments, X):
        '''
        Implement the k-medoids algorithm on the dissimilarity measures and
        initial cluster assignments.

        dissimilarity_measures: 1 - correlation_matrix
        cluster_assignments: list of lists, the best initial guess at the
            initial assignments for the clusters
        X: list of strings, the members that we are going to cluster
        '''
        self.dissimilarity_measures = dissimilarity_measures
        self.clusters = cluster_assignments
        self.X = X

    def __update_cluster_centers(self):
        # The medoid of each cluster is the member with the smallest total
        # dissimilarity to the other members of that cluster.
        updated_cluster_centers = []
        for cluster in self.clusters:
            min_distance = np.inf
            cluster_center = None
            for member in cluster:
                current_distance = 0
                for other_member in cluster:
                    current_distance += self.dissimilarity_measures.loc[(member, other_member)]
                if current_distance < min_distance:
                    min_distance = current_distance
                    cluster_center = member
            updated_cluster_centers.append(cluster_center)

        self.cluster_centers = updated_cluster_centers

    def __update_clusters(self):
        # Reassign every member to the cluster with the nearest medoid.
        updated_clusters = [[] for _ in self.clusters]

        for x in self.X:
            min_distance = np.inf
            min_cluster = None
            for cluster_center in self.cluster_centers:
                if self.dissimilarity_measures.loc[(x, cluster_center)] < min_distance:
                    min_distance = self.dissimilarity_measures.loc[(x, cluster_center)]
                    min_cluster = cluster_center
            idx = self.cluster_centers.index(min_cluster)
            updated_clusters[idx].append(x)

        return updated_clusters

    def fit(self):
        # Alternate the two update steps until the clusters stop changing.
        while True:
            self.__update_cluster_centers()
            updated_clusters = self.__update_clusters()

            changed = False
            for cluster, old_cluster in zip(updated_clusters, self.clusters):
                if set(cluster) != set(old_cluster):
                    changed = True
                    break
            if changed:
                self.clusters = updated_clusters
            else:
                break
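Before fitting on the real data, it is worth sanity-checking the medoid-update rule. Here is a standalone sketch of that same rule (the 4×4 dissimilarity matrix and member names are made up for illustration, not the sector data):

```python
import pandas as pd

# Hypothetical dissimilarity matrix: 'a' and 'b' are close,
# 'c' and 'd' are close, and the two pairs are far apart.
members = ['a', 'b', 'c', 'd']
D = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]],
    index=members, columns=members)

def medoid(cluster, D):
    # Same rule as __update_cluster_centers: pick the member with the
    # smallest summed dissimilarity to every other member of its cluster.
    return min(cluster, key=lambda m: sum(D.loc[m, o] for o in cluster))

print(medoid(['a', 'b', 'c'], D))  # prints 'b'
```

Member 'b' wins because its summed dissimilarity to 'a' and 'c' (0.1 + 0.8 = 0.9) is smaller than the sums for 'a' (1.0) and 'c' (1.7).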

Finally, here is the code I used to do the cluster fitting:

# Our initial guess for cluster assignments
cluster_assignments = [
    ['communication_services', 'consumer_discretionary'],
    ['consumer_staples', 'healthcare', 'information_technology'],
    ['energy'],
    ['financials', 'industrials', 'materials'],
    ['real_estate', 'utilities']
]

# All of the 11 sector indices
X = ['communication_services', 'consumer_discretionary', 'consumer_staples',
     'energy', 'financials', 'healthcare',
     'industrials', 'information_technology', 'materials',
     'real_estate', 'utilities']

kmedoids = KMedoids(dissimilarity_measures, cluster_assignments, X)
kmedoids.fit()
kmedoids.clusters
[Figure: Resulting Clusters from K-Medoids]

I am somewhat surprised that information technology and healthcare did not end up in the same cluster, given that they have a correlation of 0.92 with each other. The fact that real estate and energy end up in their own clusters does make sense to me. The results might indicate that the number of clusters needs to change from five to some other number. One way to choose the number of clusters is to plot the number of clusters against the sum of the dissimilarities from each cluster member to its cluster's medoid; the best choice is where an "elbow" occurs in the plot. Maybe this will be the topic of my next article. Another issue could be my choice of dissimilarity measure. Finally, randomizing the initial cluster assignments (and keeping the best of several runs) could improve the results as well.
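The elbow quantity just described can be sketched as follows. This is a standalone illustration with a made-up dissimilarity matrix and hand-picked clusterings for each candidate number of clusters, not a run on the sector data:

```python
import pandas as pd

# Hypothetical dissimilarity matrix with two obvious pairs: {a, b} and {c, d}.
members = ['a', 'b', 'c', 'd']
D = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]],
    index=members, columns=members)

def total_cost(clusters, D):
    # The elbow quantity: sum over clusters of each member's
    # dissimilarity to its cluster's medoid.
    cost = 0.0
    for cluster in clusters:
        medoid = min(cluster, key=lambda m: sum(D.loc[m, o] for o in cluster))
        cost += sum(D.loc[medoid, x] for x in cluster)
    return cost

# The cost drops sharply from k=1 to k=2, then barely improves at k=3,
# so the elbow here is at k=2.
for clusters in ([['a', 'b', 'c', 'd']],
                 [['a', 'b'], ['c', 'd']],
                 [['a', 'b'], ['c'], ['d']]):
    print(len(clusters), total_cost(clusters, D))
```

Plotting these costs against the number of clusters and looking for the bend is exactly the elbow heuristic; on the real sector data the same `total_cost` could be computed from `kmedoids.clusters` and `dissimilarity_measures` for each candidate number of clusters.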

As always, thank you for reading my blog!
