Stock Picks using K-Means Clustering

Published in uptick-blog

6 min read · Feb 8, 2019

Disclaimer: I am not vested in any of these stocks and I am not an equity analyst. Please do your due diligence before investing!

TLDR: I wanted to pick the best stocks to invest in. I used K-means clustering to filter out a winning group and discovered a group of 57 stocks with outstanding performance.

Being a Finance graduate, I wanted to put my money to good use, i.e. investing in the stock market to make better returns than the bank’s interest rate. What got me stumped was: which stock should I invest in? I knew that I had to do my own due diligence and read up on the companies’ financial statements and whatnot, but which company should I start with?

With this in mind, I would like to use my data science knowledge to narrow down my search for the winning stocks. I plan to use a clustering technique to group similar stocks, and hopefully filter out the better performing ones. As the name suggests, clustering algorithms aim to segment a data set and group similar data points together.

Summary

I applied K-means clustering to my data set and the final outcome was very favorable. I managed to narrow my stock picks from 1300 stocks (NYSE and NASDAQ stocks) to 57 stocks whose average annual return was 24% with a variance of 5%, over the time period of 2012 to 2018.

What is K-means clustering? (Explained like I’m five)

K-means clustering is a type of unsupervised learning model. Unsupervised models learn from a data set that is not labeled or classified. They identify commonalities in the data set and react based on the presence or absence of such commonalities in each data point.

A K-means clustering model initializes K centroids and assigns each data point to the centroid it is closest (most similar) to; the points sharing a centroid form a cluster. The centroids are then moved to the mean of their assigned points, and the two steps repeat until the assignments stop changing.
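To make the assign-then-update loop concrete, here is a minimal sketch of plain K-means (Lloyd's algorithm) in NumPy. It is not the code used for this post; the function name `kmeans` and the toy (return, variance) pairs are made up for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of (return, variance) pairs.
points = np.array([[0.05, 0.02], [0.06, 0.03], [0.04, 0.02],
                   [0.30, 0.20], [0.32, 0.22], [0.31, 0.19]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random start.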

Image taken from https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm

Data Processing

The data set was pulled using the Stocker and Yahoo-Finance Python packages. Below is an example of what the data set looks like.

Based on the data set, I will fit these two variables into the K-means model:

Annual returns
Annual variance

I decided to use these variables as they tell us about each stock's performance and its volatility (risk).

I transformed the raw data into a more usable data frame that gives each stock's average annual return and variance (over the last 7 years).

Example of what the transformed data looks like
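A transformation along these lines can be sketched with pandas. This is an assumption-laden illustration, not the original workings: the column names (`ticker`, `date`, `close`), the toy prices, and the choice to define a year's return as last close over first close are all hypothetical.

```python
import pandas as pd

# Hypothetical raw data: one row per ticker per trading day (column names are assumed).
raw = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "date": pd.to_datetime(["2017-01-03", "2017-12-29", "2018-01-02", "2018-12-31"] * 2),
    "close": [100.0, 110.0, 110.0, 132.0, 50.0, 45.0, 45.0, 54.0],
})

raw = raw.sort_values(["ticker", "date"])
raw["year"] = raw["date"].dt.year

# Return for each (ticker, year): last close of the year over the first close, minus 1.
yearly = raw.groupby(["ticker", "year"])["close"].agg(["first", "last"])
yearly["annual_return"] = yearly["last"] / yearly["first"] - 1

# One row per ticker: mean and (sample) variance of its yearly returns.
features = yearly.groupby("ticker")["annual_return"].agg(["mean", "var"])
features.columns = ["avg_annual_return", "annual_variance"]
print(features)
```

The resulting `features` frame has one row per stock and the two columns that get fed into the K-means model.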

Evaluating the Model (K-means)

I used two metrics to evaluate the model:

Sum of squared errors (SSE) within cluster. The SSE tells us how close the data points are to the center of their cluster.
Silhouette score. The silhouette score measures how similar a data point is to its own cluster compared to the other clusters.

We want a low SSE value and a high silhouette score (the silhouette score ranges from -1 to 1).
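Both metrics are available in scikit-learn. The sketch below, run on synthetic data standing in for the (return, variance) pairs, computes the SSE by hand and gets the silhouette score from `silhouette_score`. The data and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for (average annual return, annual variance) pairs.
X = np.vstack([
    rng.normal(loc=[0.05, 0.02], scale=0.01, size=(50, 2)),
    rng.normal(loc=[0.30, 0.20], scale=0.01, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# SSE by hand: squared distance from each point to its assigned centroid, summed.
# scikit-learn exposes the same quantity as model.inertia_.
sse = float(((X - model.cluster_centers_[model.labels_]) ** 2).sum())

# Silhouette: mean over all points of (b - a) / max(a, b), where a is the mean
# distance to points in the same cluster and b the mean distance to the nearest other cluster.
sil = silhouette_score(X, model.labels_)

print(f"SSE={sse:.4f}  silhouette={sil:.3f}")
```

On two tight, well-separated groups like these, the SSE is small and the silhouette score is close to 1.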

Model Outcome: Golden Cluster Discovered after 3 Iterations of Clustering

I conducted 3 iterations of K-means clustering; my thought process is described below.

To decide the value of K in each iteration, I plotted 2 graphs showing the within-cluster SSE and the silhouette score for a range of K values.

First Iteration Resulted in Skewed Distribution

For example, in my first iteration I looped over 25 values of K, from K=1 to K=25. This resulted in the two graphs seen below.

Deciding my K value using these 2 graphs

From the first graph, we can see that once the number of clusters goes past 7, the within-cluster SSE plateaus. We then refer to the second graph, where there is a huge drop in silhouette score when the number of clusters increases from 14 to 15. Hence, I chose K=14 for my model. We can see the outcome of the first iteration of clustering below.
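A scan over K like this one could be sketched as follows. The data is synthetic (three made-up groups), and picking the K with the highest silhouette score is a simpler stand-in for reading the two graphs by eye as described above. The loop starts at K=2 because scikit-learn's `silhouette_score` needs at least two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic groups standing in for the real stock features.
centers = [[0.05, 0.02], [0.30, 0.20], [0.15, 0.40]]
X = np.vstack([rng.normal(loc=c, scale=0.01, size=(40, 2)) for c in centers])

k_values = list(range(2, 11))  # silhouette is undefined for K=1
sse, sil = [], []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)                       # within-cluster SSE
    sil.append(silhouette_score(X, model.labels_))   # mean silhouette score

# Simplified selection rule (an assumption): take the K with the best silhouette.
best_k = k_values[int(np.argmax(sil))]
for k, e, s in zip(k_values, sse, sil):
    print(f"K={k:2d}  SSE={e:.4f}  silhouette={s:.3f}")
print("best K by silhouette:", best_k)
```

Plotting `sse` and `sil` against `k_values` gives the two graphs used to choose K.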

We can see that a majority of the stocks (1193) are concentrated in one cluster while the other clusters are very sparse. The returns and variance of those other clusters are scarily high! A savvy investor would definitely not invest in those stocks. Since the current result did not narrow down my search for the winning stocks, I decided to conduct another K-means clustering on the majority cluster, cluster 0.
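Re-clustering only the majority cluster amounts to filtering the rows by label and fitting a fresh model on the subset. Here is a rough sketch on synthetic data (a dense bulk plus a few extreme points); the K values and the data are illustrative assumptions, not the ones from the actual runs.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Synthetic data: a dense bulk of "ordinary" stocks plus a few extreme outliers.
bulk = rng.normal(loc=[0.10, 0.05], scale=0.05, size=(200, 2))
outliers = rng.normal(loc=[3.0, 5.0], scale=0.5, size=(5, 2))
X = np.vstack([bulk, outliers])

# First pass: cluster everything.
first = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Find the cluster holding the most points and keep only its members.
counts = np.bincount(first.labels_, minlength=3)
majority_label = int(counts.argmax())
majority = X[first.labels_ == majority_label]

# Second pass: cluster the majority cluster on its own.
second = KMeans(n_clusters=5, n_init=10, random_state=0).fit(majority)
print("first-pass sizes:", counts)
print("second-pass sizes:", np.bincount(second.labels_, minlength=5))
```

Because the outliers no longer pull the centroids around, the second pass can split the bulk into finer groups.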

Second Iteration Resulted in 2 Potential Clusters

In the second iteration of K-means clustering, I have chosen my K value to be 5 (using the same logic as above). The results (as seen below) are better spread out and we can see that cluster 0 and cluster 2 are the better performing clusters.

It was quite difficult to discern which of the two was the better performing cluster, as their average annual returns and variances were roughly proportionate. As such, I decided to add a metric, the Sharpe Ratio, that better reflects stock performance.

Sharpe Ratio is a measure that helps us understand the return of an investment compared to its risk. You can read more about it here if you are interested.
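As a rough illustration, the Sharpe Ratio is the excess return over a risk-free rate divided by the standard deviation of returns (the square root of the variance). In the sketch below, the function name, the risk-free rate of 0 and the reading of "5% variance" as 0.05 are assumptions, since the post does not state them.

```python
import math

def sharpe_ratio(avg_return, variance, risk_free_rate=0.0):
    """Excess return per unit of risk, with risk measured as standard deviation."""
    return (avg_return - risk_free_rate) / math.sqrt(variance)

# Hypothetical inputs: 24% average annual return, variance of 0.05, risk-free rate of 0.
example = sharpe_ratio(0.24, 0.05)
print(round(example, 3))  # 1.073
```

A higher risk-free rate lowers the ratio, so the exact figures depend on the rate chosen.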

To have a better visualization of the numbers I have plotted out 2 box-plots.

In figure B, we can see that cluster 2 has the best Sharpe Ratio distribution. It ranged from 0.2 to 2.2 (typically a Sharpe Ratio of > 1 is considered good and > 2 as very good). Furthermore, we can see that there are some outliers (outperforming stocks) in cluster 2 and that the number of stocks in that cluster is still fairly large (257 stocks). Hence, I continued to conduct a third iteration of K-means clustering on cluster 2 to take a more in-depth look, and attempt to filter out the golden stocks.

Third Iteration Resulted in the Golden Cluster

In the third iteration of K-means clustering, we find the golden cluster! Cluster 3 has an average annual return of 24%, a variance of 5%, and a Sharpe Ratio ranging from 0.7 to 2.2!

Closing thoughts

I have managed to narrow my search down to 57 stocks on which to conduct my due diligence. Further analysis of their financial statements should be conducted before any investment decision is made. With this cluster, I could even form an alpha portfolio (portfolio optimization will be left for another blog post).

Some limitations and assumptions of this study were that:

The period of the financial crisis (2007 and 2008) was not taken into account.
The volatility of each stock within a year was not taken into account either. The golden cluster’s annual return could have a low variance of 5% over the 7-year period, but there could be huge movements in the stocks within a year.

Disclaimer: I am not vested in any of these stocks and I am not an equity analyst. Please do your due diligence before investing!

If you are interested in my workings and code, you can click on this link to visit them on my GitHub.

That’s all folks!


Written by Timothy Ong