K-means Clustering

Christopher Hartig
3 min read · Jun 26, 2020

K-means clustering is a data science algorithm that groups points with similar x and y values into a chosen number of clusters (the K in the name). At the center of each cluster is a centroid, and each point is assigned to the cluster whose centroid it’s closest to. Initially these centroids are chosen at random, but they are fine-tuned by repeatedly moving each one to the mean position of the points assigned to it and then reassigning the points, which shrinks the distance between the points and their centroids. The visualizations in this post should make it clear how the process works.
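To make the assign-then-update loop concrete, here is a minimal sketch in plain Python. It is not the implementation I wrote for my project; the function name `kmeans` and its parameters (`n_iters`, `seed`) are just illustrative choices, and it assumes the points are tuples of numbers.

```python
import math
import random

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-means sketch (Lloyd's algorithm); illustrative only."""
    rng = random.Random(seed)
    # Start from k data points chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]

    def nearest(p):
        # Index of the centroid closest to point p.
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Nothing moved, so the clusters are stable.
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    return centroids, labels
```

Calling `kmeans(points, 4)` on a list of `(x, y)` tuples would return four centroids and one cluster label per point. Because the starting centroids are random, different seeds can give different clusters.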

In this K-means clustering example, there are four different clusters.

I got into data science primarily through my interest in sports, so I thought it would be fun to look at baseball data. I know baseball isn’t the most exciting game for everyone, but its discrete nature and abundance of statistics make it easy to examine from a data science perspective.

In a previous project, I found that a team’s run differential in a season (runs scored minus runs allowed) is very predictive of a team’s record. I wanted to examine the relationship between run differential and making the playoffs.

First, I looked at how likely it was for a team to make the playoffs.
Then I looked at the distribution of run differential in my data set.
Finally, I plotted the relationship in a scatter plot.

With so many data points, it’s hard to glean much from this graph. You can tell that teams that make the playoffs tend to have better run differentials than teams that don’t, but not much else. K-means clustering, however, makes the relationship easier to see and understand.

The data is a lot easier to make sense of when it’s separated into four clusters. Essentially, the teams with bad run differentials (red cluster) never made the playoffs, the below-average to mediocre teams (orange) rarely made it, the above-average to good teams made it some of the time, and the best teams (green) made it most of the time.
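One way to put numbers on "never", "rarely" and "most of the time" is to compute the share of team-seasons in each cluster that reached the playoffs. The sketch below assumes you already have a cluster label and a playoff flag for every team-season; the helper name `playoff_rate_by_cluster` and the sample values are made up for illustration and are not my data.

```python
from collections import defaultdict

def playoff_rate_by_cluster(labels, made_playoffs):
    """Share of team-seasons in each cluster that reached the playoffs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, made in zip(labels, made_playoffs):
        totals[label] += 1
        hits[label] += int(made)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Placeholder values for illustration only, not the article's data.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
made_playoffs = [False, False, False, True, True, False, True, True]
rates = playoff_rate_by_cluster(labels, made_playoffs)
```

Here `rates` maps each cluster label to a fraction between 0 and 1, which makes the clusters easy to compare side by side.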

Writing my own K-means algorithm this week got me thinking more about what K-means is actually doing, and it made me more thankful for Scikit-learn. I will definitely be using Scikit-learn more in my career, but it was fun to build my own methods.
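For comparison, here is roughly what the same clustering looks like with Scikit-learn's `KMeans`. This is a sketch under assumptions: the run differentials are placeholder numbers rather than my data set, and clustering on run differential alone (a single feature) is a simplification chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder run differentials, for illustration only (not the article's data).
run_diff = np.array([-200, -180, -60, -40, 30, 50, 180, 210], dtype=float)

# Scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = run_diff.reshape(-1, 1)

# n_init restarts the algorithm from several random starts and keeps the best run.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X)             # cluster index for each team-season
centers = model.cluster_centers_.ravel()  # one centroid per cluster
```

The cluster numbers themselves are arbitrary, so sort `centers` if you want to read the clusters from worst to best run differential.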
