Cluster Analysis-Unsupervised ML for Pairs Trading using Indian Stock data

Sanjjushri Varshini R
Published in featurepreneur
6 min read · Mar 11, 2022

In this article, you will learn how unsupervised ML works and how to find candidate pairs for pairs trading from stock data.
We will use three cluster analysis methods:
i) K-Means Clustering
ii) Hierarchical Clustering
iii) Affinity Propagation Clustering

Download the necessary dataset:
DOWNLOAD THE DATASET FROM HERE
This dataset lists the top 1000 Indian companies by market capitalisation as of March 2020.

Indian stock data is not readily available in a single file, so we need this list of companies to iterate over and fetch the price data for each one.

1. Import the necessary modules:

from nsepy import get_history
from datetime import date
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler

2. Reading the Excel file to get the list of companies:

df = pd.read_excel("MCAP_31032020_TOP1000.xlsx")
stock_1000 = list(df.Symbol)

3. Getting the top 500 companies list:

stock = stock_1000[:500]

4. Iterate through the list and obtain our data for each of the companies:

the_stock_data = {}
for symbol in stock:
    try:
        the_stock_data[symbol] = get_history(symbol=symbol, start=date(2019, 1, 1), end=date(2022, 1, 31))
    except Exception:
        # Skip symbols whose history cannot be downloaded
        continue

Here we specify the start date (1 January 2019) and the end date (31 January 2022).

5. Concatenating the obtained data:

data = pd.concat(the_stock_data)

6. Resetting the index:

data = data.reset_index()

7. Extracting the required data, i.e. the date and the closing price:

data = data.pivot(index='Date', columns='Symbol', values='Close')
data.head()

8. Describing the data, with the display precision set to 3 decimal places:

pd.set_option('display.precision', 3)
data.describe().T.head(10)

9. Handling null values:

Checking if null values exist:

data.isnull().values.any()

Visualizing the null values using missingno:

import missingno
missingno.matrix(data)

Dropping the columns in which more than 20% of the values are null:

print('Data Shape before cleaning =', data.shape)
missing_percentage = data.isnull().mean().sort_values(ascending=False)
dropped_list = sorted(list(missing_percentage[missing_percentage > 0.2].index))
data.drop(labels=dropped_list, axis=1, inplace=True)
print('Data Shape after cleaning =', data.shape)

Filling the remaining null values with a forward fill followed by a backward fill:

data = data.ffill()
data = data.bfill()

Now our dataset does not contain any null values.
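To make the two fill steps concrete, here is a minimal plain-Python sketch of what a forward fill followed by a backward fill does to one column. The helper names `ffill` and `bfill` and the prices are made up for illustration; this is not the pandas implementation.

```python
def ffill(values):
    """Forward fill: replace each None with the last seen non-None value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def bfill(values):
    """Backward fill: a forward fill applied to the reversed sequence."""
    return ffill(values[::-1])[::-1]


closes = [None, 101.0, None, None, 104.5, None]  # made-up closing prices
step1 = ffill(closes)  # the leading None survives: nothing earlier to copy
step2 = bfill(step1)   # the leading gap takes the next available price
print(step1)
print(step2)
```

The forward fill runs first so that gaps inside the series take the most recent known price; the backward fill only has to patch gaps at the very start of a column.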

12. Storing the data for future use:

data.to_csv('NSE500_stock_data')

13. Calculating the returns & volatility and creating a data frame:

# Daily percentage returns for every stock
daily_returns = data.pct_change()
# Annualised mean return, taking 266 trading days per year
returns = pd.DataFrame(daily_returns.mean() * 266)
returns.columns = ['returns']
# Annualised volatility
returns['volatility'] = daily_returns.std() * np.sqrt(266)
data = returns
data.head()
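The same calculation can be followed by hand for a single stock. The sketch below uses only the standard library and a made-up price list; `TRADING_DAYS` mirrors the factor of 266 used in this step, and `pct_change` is a hypothetical helper that imitates the pandas method of the same name.

```python
import math
import statistics

TRADING_DAYS = 266  # annualisation factor used in this step


def pct_change(prices):
    """Simple daily returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1 for prev, curr in zip(prices, prices[1:])]


prices = [100.0, 102.0, 101.0, 103.0, 104.0]  # made-up closing prices
daily = pct_change(prices)

# Mean daily return scaled linearly to a year
annual_return = statistics.mean(daily) * TRADING_DAYS
# Sample standard deviation (n - 1 denominator, the pandas default for .std()),
# scaled by the square root of the number of trading days
annual_volatility = statistics.stdev(daily) * math.sqrt(TRADING_DAYS)

print(round(annual_return, 4), round(annual_volatility, 4))
```

Returns scale with the number of days while volatility scales with its square root, which is why the two lines use different multipliers.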

14. Visualizing the data (returns and volatility):

sns.displot(data, x="returns", y="volatility")

15. Using StandardScaler to transform the data:

# Fit the scaler and transform the data in one step
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)
X = scaled_data
X.head()
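As a rough illustration of what the scaling does to each column, the sketch below standardises one made-up column by hand. `standardize` is a hypothetical helper; it uses the population standard deviation, which is what StandardScaler uses as well.

```python
import statistics


def standardize(column):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # population std (denominator n)
    return [(x - mu) / sigma for x in column]


returns = [0.10, 0.25, -0.05, 0.30]  # made-up annualised returns
z = standardize(returns)
print([round(v, 3) for v in z])
```

After scaling, every column has mean 0 and standard deviation 1, so returns and volatility contribute on the same scale to the Euclidean distances that the clustering methods rely on.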

16. Visualizing the data after applying StandardScaler:

sns.displot(X, x="returns", y="volatility")

17. Performing K-Means Clustering:

Finding the number of clusters

i) Elbow Method:

from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

K = range(1, 15)
distortions = []
# Fit k-Means for each k and record the inertia
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)
# Plot the results
fig = plt.figure(figsize=(15, 5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.grid(True)
plt.show()
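The quantity plotted on the y-axis, `inertia_`, is the sum of squared distances from each point to the centre of its cluster. A minimal sketch with made-up 2-D points shows why it drops as clusters are added; the `inertia` function here is a hypothetical helper, not part of scikit-learn.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centroids[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Two well-separated groups of made-up points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(points, [0, 0, 0, 0], [(5.0, 1.0)])
two_clusters = inertia(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)
```

Going from one cluster to two removes most of the distortion here; adding further clusters could only shave off a little more, and that flattening is the "elbow" the plot is meant to reveal.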

Using the kneed library to locate the elbow, i.e. the suggested number of clusters:

from kneed import KneeLocator
kl = KneeLocator(K, distortions, curve="convex", direction="decreasing")
kl.elbow

ii) Silhouette Method:

from sklearn.metrics import silhouette_score

# For the silhouette method k needs to start from 2
K = range(2, 15)
silhouettes = []
# Fit k-Means for each k and record the silhouette score
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10, init='random')
    kmeans.fit(X)
    silhouettes.append(silhouette_score(X, kmeans.labels_))
# Plot the results
fig = plt.figure(figsize=(15, 5))
plt.plot(K, silhouettes, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette Method')
plt.grid(True)
plt.show()

Using the kneed library to suggest the number of clusters:

kl = KneeLocator(K, silhouettes, curve="convex", direction="decreasing")
print('Suggested number of clusters: ', kl.elbow)
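Because a higher silhouette score means better-separated clusters, a silhouette curve is often read by its maximum rather than by a knee. As a cross-check on the suggestion above, the sketch below simply picks the k with the highest score. The function name `best_k_by_silhouette` and the example scores are made up for illustration.

```python
def best_k_by_silhouette(k_values, scores):
    """Return the k whose silhouette score is highest."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return list(k_values)[best_index]


example_K = range(2, 7)
example_scores = [0.41, 0.47, 0.52, 0.44, 0.39]  # made-up silhouette scores
suggested_k = best_k_by_silhouette(example_K, example_scores)
print('k with the highest silhouette score:', suggested_k)
```

With the lists built in this step, the equivalent call would be `best_k_by_silhouette(K, silhouettes)`.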

Building our k-Means algorithm with 4 clusters:

c = 4
# Fit the model
k_means = KMeans(n_clusters=c)
k_means.fit(X)
prediction = k_means.predict(X)
# Plot the results
centroids = k_means.cluster_centers_
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=k_means.labels_, cmap="rainbow", label=X.index)
ax.set_title('k-Means Cluster Analysis Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=10)
plt.show()

Finding how many instances each cluster has:

clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]

plt.figure(figsize=(12, 8))
plt.barh(range(len(clustered_series.value_counts())), clustered_series.value_counts())
plt.title('Clusters')
plt.xlabel('Stocks per Cluster')
plt.ylabel('Cluster Number')
plt.show()

18. Performing Hierarchical Clustering:

# x-axis: stocks, y-axis: distance between them
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(15, 10))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(X, method='ward'))

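Finding the number of clusters by cutting the dendrogram with a horizontal line; the number of vertical branches the line crosses is the number of clusters:

plt.figure(figsize=(15, 10))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.axhline(y=9.5, color='purple', linestyle='--')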

Building our Hierarchical Clustering algorithm with 5 clusters:

# Fit the model
clusters = 5
# Ward linkage always uses Euclidean distance, so no distance argument is needed
hc = AgglomerativeClustering(n_clusters=clusters, linkage='ward')
labels = hc.fit_predict(X)
# Plot the results
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels, cmap='rainbow')
ax.set_title('Hierarchical Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.show()

19. Performing Affinity Propagation Clustering:

Building our Affinity Propagation Clustering algorithm:

from sklearn.cluster import AffinityPropagation

# Fit the model
ap = AffinityPropagation()
ap.fit(X)
labels1 = ap.predict(X)
#Plot the results
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0], X.iloc[:,1], c=labels1, cmap='rainbow')
ax.set_title('Affinity Propagation Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.show()

Obtaining the number of clusters and arranging them for a better look:

from itertools import cycle

# Extract the cluster centers and labels
cci = ap.cluster_centers_indices_
labels2 = ap.labels_
# Print the number of clusters
clusters = len(cci)
print('The number of clusters is:', clusters)
# Plot each cluster, joining every member to its exemplar
X_ap = np.asarray(X)
plt.close('all')
plt.figure(figsize=(15, 10))
colors = cycle('cmykrgb')
for k, col in zip(range(clusters), colors):
    cluster_members = labels2 == k
    cluster_center = X_ap[cci[k]]
    plt.plot(X_ap[cluster_members, 0], X_ap[cluster_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=12)
    for x in X_ap[cluster_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.show()

20. Using the silhouette score to find which method performs best:

print("k-Means Clustering", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("Hierarchical Clustering", metrics.silhouette_score(X, hc.labels_, metric='euclidean'))
print("Affinity Propagation Clustering", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))

Here the K-Means algorithm scored best, so its clusters are used in the next step.
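For readers who want to see what the score measures, here is a small from-scratch sketch of the mean silhouette coefficient on made-up points. For each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The function is a hypothetical illustration, not the scikit-learn implementation, and it assumes at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dist_by_cluster.get(labels[i])
        if not own:  # a cluster with a single member scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d)
                for lab, d in dist_by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]  # made-up data
good = silhouette(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette(points, [0, 1, 0, 1])   # groups cut across the geometry
print(round(good, 3), round(bad, 3))
```

A labelling that matches the geometry scores close to 1, while one that cuts across it goes negative, which is why the method with the highest score is preferred.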

21. Extracting the trading pairs:

i) The total number of clusters and the total number of pairs is found:

cluster_size_limit = 1000
counts = clustered_series.value_counts()
symbol_count = counts[(counts>1) & (counts<=cluster_size_limit)]
print("Number of clusters: %d" % len(symbol_count))
print("Number of Pairs: %d" % (symbol_count*(symbol_count-1)).sum())
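Note that n*(n-1) counts ordered pairs, so (A, B) and (B, A) are counted separately; the number of distinct unordered pairs is half of that. A short sketch with placeholder symbols (not real tickers) makes the difference visible:

```python
from itertools import combinations


def ordered_pair_count(n):
    """Pairs where (A, B) and (B, A) are counted separately."""
    return n * (n - 1)


def unordered_pair_count(n):
    """Distinct pairs, ignoring order."""
    return n * (n - 1) // 2


cluster = ["AAA", "BBB", "CCC", "DDD"]  # placeholder symbols
unique_pairs = list(combinations(cluster, 2))

print(ordered_pair_count(len(cluster)))    # ordered pairs
print(unordered_pair_count(len(cluster)))  # unordered pairs
print(unique_pairs)
```

A pair search that only visits j > i for each i walks through exactly the unordered pairs that `combinations` produces here.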

ii) Reading the data which we stored earlier:

data1 = pd.read_csv("NSE500_stock_data")

iii) Finding the cointegrated pairs within each cluster:

from statsmodels.tsa.stattools import coint

def find_cointegrated_pairs(data, significance=0.05):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    # range(1) tests only the pairs that include the first symbol of the cluster;
    # use range(n) to test every pair, which is much slower for large clusters
    for i in range(1):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < significance:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

# Run the test cluster by cluster
cluster_dict = {}
for i, clust in enumerate(symbol_count.index):
    symbols = clustered_series[clustered_series == clust].index
    score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(data1[symbols])
    cluster_dict[clust] = {}
    cluster_dict[clust]['score_matrix'] = score_matrix
    cluster_dict[clust]['pvalue_matrix'] = pvalue_matrix
    cluster_dict[clust]['pairs'] = pairs

# Collect the pairs found in all clusters
pairs = []
for cluster in cluster_dict.keys():
    pairs.extend(cluster_dict[cluster]['pairs'])
print("Number of pairs:", len(pairs))
print("In those pairs, we found %d unique symbols." % len(np.unique(pairs)))
print(pairs)
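For context on the p-value used here: the `coint` function in statsmodels is an augmented Engle-Granger two-step test whose null hypothesis is that the two series are not cointegrated. In outline, one price series is regressed on the other and the residual is tested for a unit root:

```latex
S_{1,t} = \alpha + \beta\, S_{2,t} + \varepsilon_t,
\qquad
H_0:\ \varepsilon_t \text{ has a unit root (no cointegration)}
```

A p-value below the chosen significance level (0.05 in the function) rejects the null hypothesis, which is why such pairs are kept as candidates.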

22. Visualizing the trading pairs using t-SNE (t-distributed stochastic neighbour embedding):

from sklearn.manifold import TSNE
import matplotlib.cm as cm

# Keep only the stocks that appear in at least one pair
stocks = list(np.unique(pairs))
in_pairs_series = clustered_series.loc[stocks]
X_pairs = X.loc[stocks]
X_pairs.head()

X_tsne = TSNE(learning_rate=30, perplexity=5, random_state=42, n_jobs=-1).fit_transform(X_pairs)

plt.figure(1, facecolor='white', figsize=(15, 10))
plt.clf()
plt.axis('off')
# Join the two stocks of each pair with a line
for pair in pairs:
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]
    plt.plot([x1, x2], [y1, y2], '-', alpha=0.3, c='b')
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=215, alpha=0.8, c=in_pairs_series.values, cmap=cm.Paired)
plt.title('TSNE Visualization of Pairs')
# Label each point with its stock symbol
for x, y, name in zip(X_tsne[:, 0], X_tsne[:, 1], X_pairs.index):
    plt.annotate(name, (x, y), textcoords="offset points", xytext=(0, 10), ha='center')
plt.show()

The visualization output:

Hope you enjoyed learning Unsupervised ML!
Keep exploring!
