(Machine) Learning by Example: Clustering

by Andrew McCarley

Opex Analytics
Jul 24, 2019

In 2017, I took a giant leap from my comfortable career in industry to the strange new world of data science consulting. For the first few months, my head spun as I learned about machine learning models, AI trends, and the intricacies of the data science toolkit.

One of my favorite things about my new role was its focus on creating data-driven solutions. In industry, longstanding business rules often govern many organizational processes; while sometimes effective, they’re usually not designed to keep up with an evolving business environment. Naturally, I was excited when my first project used machine learning to offer an intelligent, data-driven alternative to a client’s set of business rules.

During that project (and reinforced in every other since), I realized two things:

  • Knowing AI would have made my life so much easier in prior roles. In the first few weeks alone, I found solutions to multiple problems my teams faced daily (and considered totally unsolvable).
  • Business leaders don’t always see how AI can help them. Many people in industry may have a loose understanding of some AI/machine learning techniques, but aren’t knowledgeable enough to imagine how these methods could apply to the problems their businesses face.

Considering these two realizations, as well as the fact that a large part of my new role involves communicating complex technical concepts to business leaders, I decided to come up with some real-world examples that help ground explanations of important ideas in data science.

In this post, I’d like to share something that I developed during my first project here at Opex: a golf example that breaks down a specific machine learning technique called clustering, with an emphasis on practical application.

Scenario

Pretend you were just hired as a college golf coach. Your mandate: improve the team’s performance as quickly as possible.

As a new coach, you’ve naturally had no real exposure to the players currently on the team. However, let’s say that you have a sheet with their average scores over the past year, as well as access to detailed player performance data (which had heretofore gone unused).

Crucially, as the new coach, you don’t just want to work with the best golfers or the worst golfers: you want every team member to become the best golfer he/she can be.

Problem

A player’s average score shows how well he/she has played lately, but does not provide insight into how that player can improve his/her game. For that, you’ll have to investigate the other data sets in your possession.

Players and their average scores

Solution Approach

In my experience, clustering has proven effective in situations like this.

Clustering algorithms collect similar objects together into groups called, you guessed it, clusters. The output of a clustering process is a set of these clusters that are created based on the mathematical similarities and dissimilarities of the objects being analyzed.

In this example, we’ll use clustering to automatically find groups of golfers based on their attributes, and then we’ll interpret these clusters to help give targeted coaching.
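To make "mathematical similarity" a little more concrete, here is a minimal sketch. It assumes each golfer is described by a few numeric features that have already been scaled to comparable ranges. The player names, feature choices, and values are invented for illustration, and Euclidean distance is only one of several common ways to measure dissimilarity.

```python
import math

# Hypothetical, already-scaled feature vectors (all values are made up):
# (driving distance, putting accuracy, late-round fade)
golfers = {
    "Player A": (0.20, 0.80, 0.90),
    "Player B": (0.25, 0.75, 0.85),
    "Player C": (0.90, 0.30, 0.10),
}

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(name):
    """Return the other golfer whose features are closest to `name`."""
    others = (g for g in golfers if g != name)
    return min(others, key=lambda g: euclidean(golfers[name], golfers[g]))

print(nearest_neighbor("Player A"))
```

A clustering algorithm applies this kind of comparison across the whole team at once, so golfers with small distances between them tend to land in the same cluster.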

Graphic credit: Wikipedia

Step 1: Feature Creation

Starting from the raw data, brainstorm a list of features you can extract or create that might help characterize specific aspects of a golfer’s game. This list of features will serve as the input to your clustering method.

As a coach, you know what to measure to help assess a player’s performance. Regardless of the problem, it’s important that those performing the clustering analysis work closely with relevant subject matter experts (SMEs).

Some features that might represent different aspects of a golfer’s game
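As a rough illustration of feature creation, the sketch below turns round-level records into one row of features per player. The record layout, field names, and numbers are all hypothetical; the real performance data would dictate which features are possible.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw round records; field names and values are invented.
rounds = [
    {"player": "Avery", "drive_yards": 245, "front_nine": 38, "back_nine": 43},
    {"player": "Avery", "drive_yards": 250, "front_nine": 37, "back_nine": 42},
    {"player": "Blake", "drive_yards": 290, "front_nine": 39, "back_nine": 38},
    {"player": "Blake", "drive_yards": 285, "front_nine": 40, "back_nine": 40},
]

def build_features(records):
    """Aggregate round-level records into one feature row per player."""
    by_player = defaultdict(list)
    for record in records:
        by_player[record["player"]].append(record)

    features = {}
    for player, player_rounds in by_player.items():
        features[player] = {
            "avg_drive_yards": mean(r["drive_yards"] for r in player_rounds),
            # Positive values mean the player scores worse on the back nine.
            "late_round_fade": mean(
                r["back_nine"] - r["front_nine"] for r in player_rounds
            ),
        }
    return features

features = build_features(rounds)
for player, row in features.items():
    print(player, row)
```

A feature like `late_round_fade` is the kind of derived measure an SME would suggest: it does not exist in the raw data, but it captures something a coach cares about.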

Step 2: Feature Selection

Clustering can suffer when too many features are included (a problem known as the curse of dimensionality), so now you must carefully pare down your feature set. During this stage, it’s important to evaluate each feature’s distribution, as well as its correlation with other features in the data set. It seems somewhat counterintuitive, but keeping highly correlated features often leads to less effective models: the trait they share is effectively counted more than once when similarity is measured, so other features don’t get the credit they deserve.

A reduced feature set
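One simple way to screen for redundant features is to compute pairwise correlations and flag any pair above a cutoff. The sketch below does this with a hand-written Pearson correlation; the feature columns are invented, and the 0.9 threshold is an arbitrary assumption rather than a rule.

```python
import math
from itertools import combinations

# Hypothetical per-player feature columns (same player order in each list).
columns = {
    "avg_drive_yards": [245, 290, 260, 300, 235],
    "max_drive_yards": [262, 310, 280, 318, 250],
    "putts_per_round": [31, 30, 33, 29, 32],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlated_pairs(cols, threshold=0.9):
    """List feature pairs whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(cols, 2)
        if abs(pearson(cols[a], cols[b])) >= threshold
    ]

print(correlated_pairs(columns))
```

When a pair is flagged, a reasonable next step is to keep whichever feature the SME finds easier to act on and drop the other.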

Step 3: Fitting the Model

In this step, the data scientist will evaluate different clustering models using the features finalized in the previous step.

There are many different kinds of clustering algorithms (a classic one is K-means clustering, which is worth reading up on when you’re done here). In most cases, data scientists consider multiple methods before narrowing to one, with the best choice determined by the intent of the analysis and the type of data being clustered. (See the visual below for a comparison of different clustering methods.)
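For readers who want to see what fitting looks like, here is a bare-bones sketch of K-means using only the Python standard library. In practice you would more likely reach for a library such as scikit-learn. The player data is invented, the choice of two clusters is an assumption, and the deterministic "farthest point" initialization is a simplification chosen to keep the example reproducible.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) from a simple K-means loop."""
    # Simplified, deterministic start: first point, then the farthest points.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the fit has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels, centroids

# Hypothetical features per player: (avg_drive_yards, late_round_fade)
names = ["Avery", "Blake", "Casey", "Devon", "Emery", "Finley"]
raw = [(240, 5.0), (295, -0.5), (245, 4.5), (290, 0.0), (238, 5.5), (300, 0.5)]

labels, _ = kmeans(standardize(raw), k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Standardizing first matters because yards and strokes sit on very different scales; without it, driving distance would dominate the distance calculation.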

No single approach stands out as the best for our golf example, so I’d recommend trying a few and seeing what the groups look like. Which leads us to our final step…

Graphic credit: Scikit-Learn

Step 4: Model Interpretation

Now that the clustering model has been fit, you’ll need to interpret the results of the extracted clusters. In this stage, the data scientist will likely work closely with the business’s SMEs to ensure that results are being translated properly.

One great way to make the results of a clustering exercise stick: name each cluster and give it an action plan. If a certain cluster doesn’t have a clear action plan, you may want to revisit your approach.

Let’s see how this works in our golf example.

It’s usually best to start by figuring out what traits are common to all members of the cluster. Imagine that all the players in a certain cluster share the following three traits:

  • Low average driving distance
  • Decreasing performance as course distance increases
  • Decreasing performance as the round progresses

Think hard about what these similarities mean in the context of the greater problem. As a golf coach, you have enough experience to recognize these as symptoms of strength and endurance issues. Consequently, it makes sense to title this cluster “Strength Training Needed” and develop an appropriate coaching regimen to help improve these players’ strength. See below for the other clusters of golfers on the team.

Some example cluster profiles
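A cluster profile can be as simple as comparing each cluster’s average to the team average. The sketch below assumes cluster labels have already been produced by some clustering step; the feature values, the labels, and the naming rule are all hypothetical, and in a real project the name and action plan would come from the coach rather than from code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: one row per player plus an assumed cluster label each.
feature_names = ["avg_drive_yards", "late_round_fade"]
rows = [(240, 5.0), (295, -0.5), (245, 4.5), (290, 0.0), (238, 5.5), (300, 0.5)]
labels = [0, 1, 0, 1, 0, 1]

def cluster_profiles(rows, labels, names):
    """For each cluster, report how its feature averages differ from the team's."""
    team_avg = [mean(col) for col in zip(*rows)]
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    profiles = {}
    for label, members in grouped.items():
        cluster_avg = [mean(col) for col in zip(*members)]
        profiles[label] = {
            name: round(c - t, 2)
            for name, c, t in zip(names, cluster_avg, team_avg)
        }
    return profiles

def suggest_name(profile):
    """Hypothetical rule of thumb; a coach or SME should make the final call."""
    if profile["avg_drive_yards"] < 0 and profile["late_round_fade"] > 0:
        return "Strength Training Needed"
    return "Needs SME review"

profiles = cluster_profiles(rows, labels, feature_names)
for label, profile in profiles.items():
    print(label, profile, "->", suggest_name(profile))
```

Here the first cluster drives shorter than the team average and fades late in the round, which matches the "Strength Training Needed" profile described above.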

Better Than Business Rules

With the golf example, I’ve broken down what clustering is and how it can be used. But how is this better than the alternative?

In a prior life, had I not used machine learning (and believe me, I wouldn’t have), I’d have been forced to develop different hypotheses, and then have separate reports created to analyze each individual hypothesis. Any hypotheses that turned out to be true could be used to create business rules.

This process could drag on forever, and not result in anything substantively better than guesswork.

Clustering is superior to the analog approach in several ways:

  1. Clustering is less likely to miss hidden patterns. In the golf example, most people taking the analog approach probably wouldn’t have thought to test an “endurance issues” hypothesis against the player data. Even if they had used the data, they wouldn’t have clusters that automatically reveal player similarities.
  2. Clustering is less prone to bias. How many times have you seen someone develop a hypothesis and then find a way to make the data support it? Since clustering is an unsupervised method, meaning we’re not predicting anything or trying to hit a specific KPI, there’s less temptation to cherry-pick or manipulate data.
  3. Clustering is efficient. With a manual process, you have to continue to brainstorm different hypotheses, develop individual reports, and repeat for each new idea. Clustering, however, just reveals the patterns that exist in data — all you need to do is interpret them.
  4. Clustering quickly incorporates new data. New patterns may emerge as the business environment changes, and though the cluster profiles may (and likely should) change, clustering can easily be repeated with additional data. Manual hypothesis testing, for all the reasons discussed above, cannot be rerun as easily.

Other Applications

On the off chance that this blog post isn’t picked up by Golf Digest, I want to highlight a few common examples where clustering might apply in business settings.

Evaluating supplier quality performance

Analyzing supplier performance with clustering allows you to ask some great questions:

  • Do certain suppliers perform better on the same product types in some plants than in others?
  • Are there headwinds in a specific commodity or region that are creating quality issues for multiple suppliers?
  • Do certain suppliers have consistently poor performance on the same specification?

Demand planning

Many products and/or customers may have similar demand characteristics. It’s possible that using different forecasting methods for each cluster could improve forecast accuracy.

Identifying patterns in stockouts or on-time delivery performance

Assessing order fulfillment performance in your supply chain with clustering might allow you to ask:

  • Are there incorrect inventory policies for specific items or distribution centers?
  • How closely does production run time match the plan in specific plants or on specific production lines?
  • Are transportation or distribution constraints more binding in specific locations or times of the year?

Customer segmentation

Segmenting customers based on buying behavior with clustering allows the business to develop specific promotional campaigns for each cluster, or pursue other targeted marketing initiatives.

Identifying patterns in store ordering behavior

Some examples of patterns that you might pick up more easily with clustering are:

  • There might be stores that consistently override automated store ordering systems. Perhaps they need more training, or maybe adjustments should be made to the store ordering system.
  • Some stores could be highly seasonal (e.g. located in a beach town), and need specific policies to support them.
  • Certain groupings of stores may see a decline in demand due to new competition, while others remain unaffected.

These are just a few cases where clustering can be applied to business problems — there are many more out there. With this golf example in mind, stay on the lookout for more opportunities to add data-driven decision-making to your organization.
