Applying K-Medoids Clustering to S&P Sector Indices, Part II

Sam Erickson
5 min read · Jan 7, 2024


In my last article, I applied K-Medoids to S&P sector data using a dissimilarity measure derived from the correlation coefficient. I noted that the clusters seemed somewhat "off", since healthcare and information technology ended up in different clusters. I concluded that although the clusters did make some sense, the clustering could be improved by doing one (or all) of the following: (1) choosing the best number of clusters, (2) improving the dissimilarity measure, and (3) randomizing the initial cluster assignments and keeping the assignment that gives the best score.

In this article, I will first change the dissimilarity measure to (1 - R)/2, where R is the correlation coefficient. In the last article I rounded all negative correlations up to 0 and used 1 - R, but that throws away information about how strongly two sectors move against each other, so I believe (1 - R)/2 is the better choice.
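To see what this measure does, here is a quick sanity check (the correlation values below are chosen only for illustration, not taken from the data): a correlation of 1 maps to a dissimilarity of 0, a correlation of 0 maps to 0.5, and a correlation of -1 maps to 1.

import numpy as np

# Illustrative correlation values only
r = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
print((1 - r) / 2)  # [0.   0.25 0.5  0.75 1.  ]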

First let’s load the data and create the dissimilarity matrix:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def read_and_process_data(sector_name_list):
    merged_df = None
    for sector_name in sector_name_list:
        # Each sector's price history lives in its own CSV file
        sector_df = pd.read_csv(sector_name + '.csv')
        sector_df['date'] = pd.to_datetime(sector_df['Date'])
        sector_df[sector_name] = sector_df['Adj Close']
        sector_df.drop(columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'],
                       inplace = True)

        # Inner-join the sectors on the date column so all series are aligned
        if merged_df is None:
            merged_df = sector_df
        else:
            merged_df = merged_df.merge(sector_df, how = 'inner', on = 'date')
    merged_df.set_index('date', inplace = True)
    merged_df.dropna(inplace = True)
    # Standardize each sector and round to one decimal place
    return np.round((merged_df - merged_df.mean())/merged_df.std(), 1)


df = read_and_process_data(['communication_services', 'consumer_discretionary', 'consumer_staples',
                            'energy', 'financials', 'healthcare',
                            'industrials', 'information_technology', 'materials',
                            'real_estate', 'utilities'])

dissimilarity_measures = df.corr()
dissimilarity_measures = (1 - dissimilarity_measures)/2

sns.set(font_scale=0.6)
plt.figure(figsize = (8, 6))
sns.heatmap(dissimilarity_measures, annot = True)
plt.title("Dissimilarity Matrix of S&P Sectors")
plt.show();

As a reminder, the dissimilarity measure ranges from 0 to 1. A score of 0 between two sectors means they are perfectly positively correlated (they move together), while a score of 1 means they are perfectly negatively correlated (they move in opposite directions).

Next, I will modify my code to use random initial assignments, running the algorithm many times and keeping the run with the lowest total dissimilarity. I will also run the algorithm for two through six clusters, and plot the total dissimilarity score against the number of clusters to determine how many clusters to use:

from random import shuffle

class KMedoids:
    def __init__(self, dissimilarity_measures, K, X):
        '''
        Implement the k-medoids algorithm on the dissimilarity measures and initial cluster assignments.

        dissimilarity_measures: DataFrame of pairwise dissimilarities, (1 - correlation_matrix)/2
        K: int, the number of clusters to use
        X: list of strings, the members that we are going to cluster
        '''
        self.dissimilarity_measures = dissimilarity_measures
        self.K = K
        # Shuffle a copy of the members so each run starts from a different random partition
        self.X = [sector for sector in X]
        shuffle(self.X)
        self.clusters = []
        idx_multiplier = round(len(X)/self.K)
        for i in range(self.K):
            if i < self.K - 1:
                self.clusters.append(self.X[i*idx_multiplier:(i+1)*idx_multiplier])
            else:
                self.clusters.append(self.X[i*idx_multiplier:])

    def __update_cluster_centers(self):
        # The medoid of each cluster is the member with the smallest total
        # dissimilarity to all other members of that cluster
        updated_cluster_centers = []
        for cluster in self.clusters:
            min_distance = np.inf
            cluster_center = None
            for member in cluster:
                current_distance = 0
                for other_member in cluster:
                    current_distance += self.dissimilarity_measures.loc[(member, other_member)]
                if current_distance < min_distance:
                    min_distance = current_distance
                    cluster_center = member
            updated_cluster_centers.append(cluster_center)

        self.cluster_centers = updated_cluster_centers

    def __update_clusters(self):
        # Reassign every member to the medoid it is least dissimilar to
        updated_clusters = [[] for _ in self.clusters]

        for x in self.X:
            min_distance = np.inf
            min_cluster = None
            for cluster_center in self.cluster_centers:
                if self.dissimilarity_measures.loc[(x, cluster_center)] < min_distance:
                    min_distance = self.dissimilarity_measures.loc[(x, cluster_center)]
                    min_cluster = cluster_center
            idx = self.cluster_centers.index(min_cluster)
            updated_clusters[idx].append(x)

        return updated_clusters

    def fit(self):
        # Alternate between updating medoids and reassigning members
        # until the cluster assignments stop changing
        while True:
            self.__update_cluster_centers()
            updated_clusters = self.__update_clusters()

            changed = False
            for cluster, old_cluster in zip(updated_clusters, self.clusters):
                if set(cluster) != set(old_cluster):
                    changed = True
                    break
            if changed:
                self.clusters = updated_clusters
            else:
                break

    def score(self):
        # Total dissimilarity of every member to its cluster's medoid (lower is better)
        sum_score = 0
        for cluster, center in zip(self.clusters, self.cluster_centers):
            for cluster_member in cluster:
                sum_score += self.dissimilarity_measures.loc[(cluster_member, center)]
        return sum_score

cluster_sizes = [2, 3, 4, 5, 6]

# All of the 11 sector indices
X = ['communication_services', 'consumer_discretionary', 'consumer_staples',
     'energy', 'financials', 'healthcare',
     'industrials', 'information_technology', 'materials',
     'real_estate', 'utilities']

min_scores = []
min_clusters = []
for K in cluster_sizes:
    min_score = np.inf
    min_cluster = None
    # 100 random restarts for each K; keep the run with the lowest total dissimilarity
    for i in range(100):
        kmedoids = KMedoids(dissimilarity_measures, K, X)
        kmedoids.fit()
        score = kmedoids.score()
        if score < min_score:
            min_score = score
            min_cluster = kmedoids
    min_scores.append(min_score)
    min_clusters.append(min_cluster)

plt.plot(cluster_sizes, min_scores)
plt.title('Number of Clusters vs Dissimilarity Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Total Dissimilarity');

As you can see, the total dissimilarity score goes down as we add more clusters. This is to be expected: in the extreme case of 11 clusters of size 1, the total dissimilarity would be 0, because a sector has zero dissimilarity with itself. We generally want to avoid using too many clusters, as this leads to overfitting. A good rule of thumb is to pick the smallest number of clusters whose dissimilarity score is almost as good as that of larger cluster counts; in other words, look for an elbow in the plot and use the corresponding number of clusters. In this case, the elbow occurs at five clusters.
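If you prefer not to eyeball the elbow, here is a minimal sketch of one common heuristic, using the min_scores and cluster_sizes computed above: pick the K where the curve bends the most, i.e. where the second difference of the scores is largest. This is just a sanity check, not part of the original analysis, and it will not always agree with a visual inspection.

# Rough elbow heuristic (illustrative only)
second_diffs = np.diff(min_scores, n=2)               # one value per interior K (3, 4, 5)
elbow_k = cluster_sizes[np.argmax(second_diffs) + 1]  # +1 maps back into cluster_sizes
print('Elbow at K =', elbow_k)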

The k-medoids clustering algorithm gave the following cluster centers and their corresponding clusters:

kmedoids = min_clusters[3]   # index 3 of cluster_sizes corresponds to K = 5
clusters, centers = kmedoids.clusters, kmedoids.cluster_centers

for center, cluster in zip(centers, clusters):
    print('Center {} has the following members:'.format(center))
    print(cluster)
    print()
Cluster Assignments

It looks like the fixes I proposed in part I of this series have worked! The only remaining point of contention I can think of is that the dissimilarity measure could be further tweaked to be more meaningful. With the way it is currently defined, two sectors with a perfect negative correlation of -1 have a dissimilarity of 1, while two sectors with a correlation of 0 have a dissimilarity of 0.5. In other words, totally uncorrelated sectors count as more similar to each other than perfectly negatively correlated sectors. That does not make as much sense to me, because I believe a correlation of 0 makes two sectors more different from each other than a correlation of -1 does. Maybe I will look into this next time!
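Purely as an illustration of that intuition (and not something I have tested on the sector data), one candidate tweak would be to base the measure on the absolute correlation, so that uncorrelated sectors end up the most dissimilar:

# Hypothetical alternative measure, for illustration only:
# 1 - |R| makes uncorrelated sectors (R = 0) the most dissimilar,
# while both R = 1 and R = -1 count as maximally similar.
r = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
print(1 - np.abs(r))   # [0.  0.5 1.  0.5 0. ]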

As always, thank you for reading my blog!
