The Clustering Process

In my experience as a Data Scientist in an e-commerce company, when asked a question, clustering is very often the answer. The problem with clustering being the answer is that it raises a 100 other questions.

In this article I’m going to attempt to walk through the steps I normally take while tackling a clustering problem.

Step 1: Try your level best to stop this from being a clustering problem.

Step 2: Pray to the Gods of Data Science that a data set you want to pipe into your clustering algorithm already exists.

Step 3: Now that you’ve realized 1 & 2 never work, you start down the path of curating the data. This is, in my opinion, the most vital step when it comes to clustering. You’ve probably heard this a million times over but I’m going to say it anyway.

Garbage In, Garbage Out. Or in the words of Charles Babbage, “On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

So back to the point; Building the dataset is key. The goal should be to capture enough inputs, not too many, not too few, that are independent of each other, that capture enough information to produce meaningful clusters, but at the same time don’t directly inform the clusters in an obvious fashion that is bound to produce brilliant clusters, and that don’t overfit the model as to make the clusters too specific. Simple enough.

In other words: Keep your list concise, don’t use obvious inputs in your feature space, and don’t over complicate when you can simplify. Which brings me to another one of my favorites.

Among competing hypotheses, the one with the fewest assumptions should be selected. — William Ockham

Step 4: Scrub and Clean

Different algorithms have different requirements. Some can’t take NAs [Not Available/Missing Data], some can’t take categorical values, some can’t handle zeroes very well, some need the data to be scaled. So now it’s time to scrub and clean as your algorithm demands.

Some quick and dirty R tricks that will help you do this:

#Get only Numeric Columns
numeric.indices <- sapply(data,is.numeric)
data.numeric <- data[,numeric.indices]
#Rudimentary scaling of data. Subtract Mean then Divide by SD
data.scaled <- scale(data)
#Get rid of rows with NAs. Be vary, if your data is sparse, you may lose more than you bargained for. <- data[complete.cases(data),]

Step 5: Pick Your Algorithm

Eyeball: Keep It Simple Stupid (KISS)

  • If your data is two dimensional. Revert to Step 1. Plot it and look at it, your brain can cluster two dimensions faster than any algorithm can.

Agglomerative Clustering: Down → Up

  • For example: Hierarchical Clustering. Start with individual observations and keep merging. End result is a tree that you can chip chop into how many ever clusters you want. Best practice is to plot, eyeball and then cut where you have the longest drop before a split. Notice that the bottom cut maintains the first two clusters, while splitting the third into 2 sub clusters.

Here is some sample R code:

#Product Distance Matrix
distance.matrix <- dist(as.matrix(data),method = "euclidean")
#Produce Model <- hclust(distance.matrix)
#Plot Dendrogram
plot( #You can eyeball the graph and determine how many cuts to make.
#Cut the dendrogram. k is number of clusters
clusters <- cutree(, k = 5)
#Reassign Clusters back to original data for further analysis.
data$clusters <- clusters

Partitioning: Up → Down

  • Start with all of the data and try different permutation combinations of clusters. The algorithm iteratively optimizes until some condition of “good enough” is met. Computationally very intensive.
  • For example: K-Means, C-Means etc. Here is some sample code. Yes it can be as simple as this.
kmeans.cut <- kmeans(data.numeric,centers= 5, algorithm = "MacQueen")
$Assign Clusters back to original data set
data$clusters <- kmeans.cut$cluster

The next technique I turn to is called model based clustering, but thats a discussion for a whole other post. The library in R is called mclust, if you’d like to look into it.

As this post is getting way too long, I’m going to just list out the rest of the steps I take. Feel free to comment and ask me to elaborate.

Step 6: Rinse and repeat with different distance measures. [Or if you’re one of those special analysts that understand them, pick one carefully]. Jokes aside, the right measure can make or break your algorithm. Your work is only as informed as the distance measure you settle on.

Step 7: Rinse and repeat with different cluster counts. Note that you can actually automate this process in R. For e.g. with K-Means, loop through creating different fits for different number of clusters and pick the WSS [Within Sample Sum of Squares] you are most comfortable with.

Step 8: Check the health of your clusters. Silhouette plots are great for this.

Step 9: Bring back any categoricals you threw out and see if any of them show propensity to different clusters. Cluster 1 → Mainly Female, Cluster 2 → Mainly people between ages of 18–25. This too is great validation.

Step 10: Give up. Get some sleep. Get back at it tomorrow. Trust me it helps.

Disclaimer: This is what I like doing. It works sometimes, it doesn’t work sometimes. There is a lot I haven’t covered. Feel free to poke holes, ask questions, and overrule anything you disagree with.