Applications of K-Means Clustering

Eqxtech Admin
Published in Equinox Media Tech
May 3, 2022

Originally published at https://eqxtech.com on May 3, 2022, by Stacy Ryssiouk

At Equinox, we frequently look at groups of similar members to learn more about their behaviors. To do so, we need both to define new groupings in a methodical way and to continually re-evaluate our existing groupings.

What is Clustering?

Clustering is an unsupervised machine learning method that groups numerical data points into “clusters” of similar points. Since there is no need to label the data, and well-documented Python libraries are available, it makes for a very practical machine learning project.

Below we will explore the algorithm, a code implementation, and our use cases. There are various clustering algorithms out there, but I will be discussing K-means, which is the most well known and is easy to implement.

What is K-Means?

The K-means algorithm allocates a set of data points into K clusters; the “means” part refers to the method used to calculate the centroids (centers of the clusters).

First, set your value for K.

There are two ways of selecting K for K-means:

  1. Pick K based on what makes sense for your use-case. For example, if you’re clustering flower measurements and there are 3 species of that flower, set K = 3.
  2. Use the elbow method to mathematically determine the optimal value for K (see appendix).
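
As a rough illustration of option 2, the sketch below fits K-means for several values of K and records the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The data is synthetic (three made-up groups) and the helper name `inertia_by_k` is hypothetical; neither comes from our actual analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 1-D groups (illustrative only).
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal(1.0, 0.3, 100),
    rng.normal(5.0, 0.3, 100),
    rng.normal(9.0, 0.3, 100),
]).reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)


def inertia_by_k(X, k_values, seed=0):
    """Fit K-means once per K and return {K: within-cluster sum of squares}."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


inertias = inertia_by_k(X, range(1, 7))
for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

Plotting these values against K and looking for the point where the curve stops dropping sharply (the "elbow") gives the suggested K. With this synthetic data the sharp drop should end at K = 3.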

Next, iterate through the data to find the optimal centers for your clusters.

To do this:

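  • Pick K random points (let’s call them n₁,n₂,n₃,…,nₖ) as the initial centroids
  • Take all the remaining points and assign each one to the closest n
    - This will result in K clusters
  • Calculate the mean of each resulting cluster; these means become the new centroids n₁,n₂,n₃,…,nₖ
  • Reassign every point to its closest new centroid, and repeat until the assignments stop changing
  • Because the result depends on the random starting points, the whole process is typically repeated with several different random starts; the run whose centroids have the smallest total variance between them and the points in their clusters gives the optimal centroids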
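
To make the iteration concrete, here is a minimal sketch in plain Python for one-dimensional data. It follows the standard K-means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing changes. The function name `kmeans_1d` and the sample values are made up for illustration.

```python
import random
from statistics import mean


def kmeans_1d(points, k, max_iter=100, seed=0):
    """Cluster 1-D values with the standard K-means loop.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid assigned to points[i].
    """
    rng = random.Random(seed)
    # Pick K distinct data values as the starting centroids.
    centroids = rng.sample(sorted(set(points)), k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: abs(p - centroids[j])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up weekly visit averages, purely for illustration.
visits = [0.2, 0.5, 0.4, 2.1, 2.4, 1.9, 5.0, 5.5, 4.8]
centroids, labels = kmeans_1d(visits, k=3)
print(sorted(round(c, 2) for c in centroids))
print(labels)
```

A single run like this can settle on a poor solution if the starting points are unlucky. Library implementations such as scikit-learn's `KMeans` address this by repeating the loop from several random starts (the `n_init` parameter) and keeping the run with the lowest within-cluster sum of squares.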

Now we’re ready to label the data.

Each row gets a label (a number from 1 to K, the number of clusters) mapping it to one of the groups.

Where have we already used it?

Before jumping into the implementation, let me give some more context on our use cases. So far, we have explored the ‘re-evaluate our existing groupings’ part of our mission for clustering applications.

Check-in Engagement

We frequently use a metric, referred to as engagement segment, which groups members based on their average in-club visits. The bucketing for these segments was created over 10 years ago using general business logic. Ultimately, we wanted to answer the question: do our engagement segments still make sense? To find out, we ran K-means twice on in-club visits to see how the clusters compare to the segments we use now:

  1. With K = 5, matching the current categories: Hardcore, Committed, Motivated, Occasional, and Disengaged
  2. With K = 4, the value suggested by the elbow method

On Demand Class Engagement

Last summer we launched our new Equinox+ app, which includes on-demand classes that members can take outside of the club.

To categorize users’ engagement, we created segments using on-demand class completes, which are bucketed in the same way as Check-in Engagement. We again ran K-means in the two ways described above to see whether the results were similar and whether the buckets should be different for digital class engagement.

Code Implementation

1. To start, import the Python libraries in a Jupyter Notebook.

2. Import the data into a pandas DataFrame by connecting directly to a Redshift database and running a SQL query.

3. Convert the data into a NumPy array and run KMeans with 5 clusters.

4. Label the data in the pandas DataFrame by adding a new column with the cluster labels.

5. Identify how the clusters relate to the use case to understand the results. In our case, the lowest-valued cluster would equate to the group ‘Disengaged’, the second lowest to ‘Occasional’, etc. In our output, the clusters were not in “order”, so we added labels based on looking at the minimum and maximum of each cluster.

6. Visualize the results. Here, I used a catplot and a boxplot from the seaborn library.

7. Evaluate the results: how do the counts and ranges compare?
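
The following is a hedged sketch of steps 3, 4, 5, and 7, not the original notebook code. It swaps the Redshift query for a randomly generated DataFrame, and the column names (`member_id`, `avg_weekly_visits`) are hypothetical. It also ranks clusters by centroid rather than by inspecting each cluster's minimum and maximum, which gives the same order for one-dimensional data. The low-to-high order of the middle segment names is an assumption based on how the categories are listed above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the Redshift query result: one row per member.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "member_id": np.arange(500),
    "avg_weekly_visits": rng.gamma(shape=2.0, scale=1.0, size=500).round(2),
})

# Step 3: convert to a 2-D NumPy array and run K-means with 5 clusters.
X = df[["avg_weekly_visits"]].to_numpy()
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Step 4: add the raw cluster labels (arbitrary integers 0-4) as a new column.
df["cluster"] = kmeans.labels_

# Step 5: rank the clusters by centroid so the names follow engagement order.
# Assumed order, lowest to highest engagement.
segment_names = ["Disengaged", "Occasional", "Motivated", "Committed", "Hardcore"]
order = np.argsort(kmeans.cluster_centers_.ravel())  # cluster ids, low to high
name_by_cluster = {int(c): segment_names[rank] for rank, c in enumerate(order)}
df["segment"] = df["cluster"].map(name_by_cluster)

# Step 7: compare the count and the range of each segment.
summary = (
    df.groupby("segment")["avg_weekly_visits"]
    .agg(["count", "min", "max"])
    .loc[segment_names]
)
print(summary)
```

The `min` and `max` columns in `summary` are the cluster cutoffs, which can then be compared against the existing segment boundaries.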

What were the results?

In-Club Visit Engagement

Using the elbow method, we found that four groups occurred naturally instead of five, and that our ‘Hardcore’ group should have a slightly higher cutoff than the one we currently use. However, since the cutoff only shifted by 1.5 weekly visits, it did not warrant a change to an established metric.

On Demand Class Completes

For class completes, the elbow method also revealed four clusters, but the spread of the clusters in both K=4 and K=5 was much more notable. The cutoff for the ‘Hardcore’ cluster shifted by over seven average weekly class completes, quite a difference. Because the ‘Hardcore’ cutoff increased so much, all the other clusters’ ranges became wider by between 0.63 and 6 class completes. Since this is a newly established metric, we will be changing the cutoffs based on the results from the five-cluster run to keep the five categories consistent with the Engagement Segments.

How can we use Clustering in the future?

Clustering on member attributes

Clustering can be used to identify groups within our member base. Various numerical metrics can be used; in our case, some examples are club visits, spend, and group fitness class utilization. With this information, we can dive deeper into these groupings to understand their behaviors and how they differ from one another. The knowledge we gain allows us to formulate different strategies for retaining members and increasing their engagement. The ultimate goal is to understand our members to the point where we can provide the right experience based on what they are most likely to be interested in.

Appendix

Educational Resources:

K-Means:

https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
http://www.learnbymarketing.com/methods/k-means-clustering/

Elbow Method:

https://en.wikipedia.org/wiki/Elbow_method_(clustering)

Elbow Method from Yellowbrick:

https://www.scikit-yb.org/en/latest/api/cluster/elbow.html

Picture Credit:

https://medium.com/@krause60/news-article-clustering-using-unsupervised-learning-7647600a04fd
