Practical Application of K-Means Clustering to Stock Data

Abhay Dodiya
Mar 4, 2023

Clustering is the task of grouping objects with similar characteristics into one bucket and separating them from the rest. This article explains the clustering process using one of the simplest clustering algorithms, K-means. The idea is then applied to stock market data to see how the method can extract meaningful information from it.

Introduction: Data clustering groups objects that are similar in some characteristic. More precisely, it is a technique in which logically similar information is stored together: objects with similar properties are placed in one class, and a single access to the group makes the entire class available.

A loose definition of clustering is “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects in other clusters. Managing a large body of data is a complex job, and grouping it into clusters brings order to it. The goal of clustering is clearest when we are trying to discover the underlying structure of the data.

Clustering Algorithms: Clustering algorithms can be broadly classified into two categories: unsupervised linear clustering algorithms and unsupervised non-linear clustering algorithms. K-means, hierarchical, and Gaussian clustering fall under the linear category, whereas kernel K-means and density-based clustering fall under the non-linear category. “Unsupervised” means that the algorithm is given no information about which data points belong to which clusters.

K-Means clustering: K-means clustering is an example of a partitioning algorithm. Data points are grouped by similarity, but how homogeneous the resulting clusters are depends largely on how many clusters the algorithm is told to find.

K-Means initializes cluster centroids with randomly selected data points and then iteratively assigns the data points to their closest cluster and updates the centroids to the mean of the respective data points.

Closeness is measured with the Euclidean distance, which is the straight-line distance between two points. It is named after the “Father of Geometry”, the Greek mathematician Euclid.
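As a minimal sketch, the straight-line distance can be computed as the square root of the summed squared coordinate differences. The function name `euclidean_distance` and the sample points are illustrative and not part of the original exercise.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two illustrative 2-D points (a classic 3-4-5 right triangle).
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
print(euclidean_distance(point_a, point_b))  # 5.0
```

The same function works unchanged for points with more than two coordinates.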

K-Means clustering inference: K-means clustering is one of the basic clustering algorithms in the machine learning domain.

The result of this algorithm depends on the value of ‘K’, the number of clusters to be found in an n-dimensional dataset. Since the K-means algorithm assumes there are ‘k’ clusters, there are also ‘k’ cluster means (center points), where each cluster mean is the average of all the data points that fall in that cluster.
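To make the idea of a cluster mean concrete, here is a small sketch that averages the points of one cluster coordinate by coordinate. The helper name `cluster_mean` and the three sample points are assumptions made for illustration.

```python
def cluster_mean(points):
    """Coordinate-wise average of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("a cluster needs at least one point")
    n = len(points)
    # zip(*points) regroups the data by coordinate: all x values, all y values, ...
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(cluster_mean(cluster))  # (3.0, 4.0)
```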

The end objective of the algorithm is that each data point in the dataset is assigned to one of the ‘k’ clusters, each with its own cluster mean. If the data points lie tightly around their cluster means, the clustering is considered good.

K-Means clustering algorithm:

  1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group of centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
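The four steps above can be sketched in plain Python as follows. This is one possible reading of the steps, not the implementation used in this exercise: the function name `k_means`, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's centroid where it was are all assumptions.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Iterate assignment and centroid updates on a list of equal-length tuples.

    Returns (centroids, labels). `seed` and `max_iter` are illustrative
    parameters, not taken from the article.
    """
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated illustrative groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

On data like this, the first three points should end up sharing one label and the last three the other. Which group is labelled 0 depends on the random initial centroids.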

Implementation: So far we have covered what K-means is; now we will look at how to apply it to real-world data. Stocks form a universe in which each is related to the others in some way, and extracting meaningful information from that universe is valuable. In this exercise, we take the 50 stocks that constitute the NIFTY index and use the return series of these constituents, which puts them on a uniform scale, to create clusters.

To carry out this exercise we need two parameters for each stock, so we take the mean and standard deviation of each stock's returns over the last two years of data and plot them to see what they look like.
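A sketch of how the two parameters might be derived from a price series is shown below. Several things here are assumptions, because the article does not specify them: simple period-over-period returns (rather than log returns), the sample standard deviation, and the made-up prices and ticker names, which are not real NIFTY data.

```python
import statistics

def simple_returns(prices):
    """Period-over-period percentage changes of a price series."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def return_features(prices):
    """(mean return, sample standard deviation of returns) for one stock."""
    returns = simple_returns(prices)
    return statistics.mean(returns), statistics.stdev(returns)

# Made-up closing prices, for illustration only (not real NIFTY data).
closes = {
    "STOCK_A": [100.0, 110.0, 99.0, 108.9],
    "STOCK_B": [50.0, 50.5, 51.0, 50.0],
}
features = {name: return_features(p) for name, p in closes.items()}
for name, (mean_ret, std_ret) in features.items():
    print(name, round(mean_ret, 4), round(std_ret, 4))
```

Each stock is then a single point (mean return, standard deviation) on the scatter plot, which is the input the clustering works on.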

Stocks Data Basic Statistics

Once they are plotted, we need to choose the initial ‘K’ cluster centroids to start the clustering process.

Initial Clusters

In K-means the objective is to minimize the distance between each data point and its centroid. For this reason, the next step is to find the Euclidean distance from each data point to every centroid.
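This step amounts to building a table with one row per data point and one column per centroid. A minimal sketch follows; the name `distance_table` and the sample coordinates are illustrative.

```python
import math

def distance_table(points, centroids):
    """Row i holds the distances from points[i] to every centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]

# Illustrative points and centroids, not taken from the stock data.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
centroids = [(0.0, 0.0), (6.0, 8.0)]
table = distance_table(points, centroids)
for row in table:
    print(row)
```

With these values, the middle point is equally far from both centroids, which is the kind of tie an assignment rule has to break.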

After finding the distances, we need to identify the minimum distance for each data point, which tells us the cluster that data point belongs to.
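Given one row of distances per data point, picking the cluster is a matter of finding the position of the smallest value in each row. The sketch below assumes a hypothetical helper `assign_to_nearest` and invented distances, and resolves ties in favour of the lowest cluster index.

```python
def assign_to_nearest(distance_rows):
    """For each row of distances, return (cluster index, minimum distance)."""
    assignments = []
    for row in distance_rows:
        # Index of the smallest distance; ties go to the lowest index.
        nearest = min(range(len(row)), key=row.__getitem__)
        assignments.append((nearest, row[nearest]))
    return assignments

# Invented distances from three data points to three centroids.
rows = [[0.2, 1.5, 0.9], [1.1, 0.3, 0.8], [0.7, 0.9, 0.1]]
print(assign_to_nearest(rows))  # [(0, 0.2), (1, 0.3), (2, 0.1)]
```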

We then sum the minimum distances; our objective is to minimize this sum in order to arrive at optimized clusters. With the help of a solver, we minimized the sum of minimum distances and obtained the optimized cluster centroids.
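The sketch below writes out that objective and minimizes it with a very simple derivative-free search. The search is only a stand-in for whatever solver was actually used, and the function names, step sizes, and sample points are assumptions. Note that this objective sums plain distances, whereas textbook K-means minimizes squared distances, so the optimum found this way need not coincide exactly with the cluster means.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_min_distance(points, centroids):
    """Sum over all points of the distance to the nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points)

def compass_search(points, centroids, step=1.0, tol=1e-6):
    """Stand-in for a solver: nudge one centroid coordinate at a time,
    keep any move that lowers the objective, and halve the step when
    no move helps."""
    current = [list(c) for c in centroids]
    best = total_min_distance(points, current)
    while step > tol:
        improved = False
        for c in current:
            for axis in range(len(c)):
                for delta in (step, -step):
                    old = c[axis]
                    c[axis] = old + delta
                    value = total_min_distance(points, current)
                    if value < best:
                        best, improved = value, True
                    else:
                        c[axis] = old  # undo a move that did not help
        if not improved:
            step /= 2
    return [tuple(c) for c in current], best

# Illustrative points: two pairs, one near x = 0 and one near x = 10.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
start = [(1.0, 1.0), (9.0, 1.0)]
before = total_min_distance(points, start)
optimized, after = compass_search(points, start)
print(round(before, 3), round(after, 3))  # 5.657 4.0
```

A production solver would typically converge faster and handle constraints, but the quantity being minimized is the same sum of minimum distances.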

We can plot the optimized clusters against the return series and also see how data points moved between the initial and the optimized clusters.

Conclusion: This article introduces clustering and discusses K-means, a linear form of unsupervised clustering. The idea is then implemented on a universe of stocks, described by the mean and standard deviation of their returns, to find the optimal cluster for each stock.

List of Clusters for Stocks

Abhay Dodiya

Quant Finance | Data Science | Artificial Intelligence | Machine Learning. https://github.com/abhaydd22