I Used a Cluster Analysis to Pick My Lunch. Here’s What I Ate.

Alexis Idlette-Wilson · Published in Data Viz for Fun · 7 min read · Dec 1, 2018

Photo by Element5 Digital on Unsplash

I salivated as I navigated the colorful Planet Smoothie website to finalize my lunch order. For the past hour my mind had fixated on the Matcha Libre, a tangy-sweet concoction of passionfruit, matcha tea, and frozen yogurt just indulgent enough to enliven the lull of a slow weekday afternoon. After a weekend of immersing myself in health-conscious reading, I dared to leave my comfort zone for a smoothie that contained the celebrated kale leaf: a Lean Green Extreme.

I dutifully scoured the nutrition information and made a sour face when I found the sugar content: 55 grams of sugar in a medium Lean Green Extreme. Honestly, I lacked a true sense of whether that was good or bad, but 55 sounded like a major sugar rush. My favorite Matcha Libre contained only 38 grams of sugar. Matcha Libre won that round.

In an inspired moment, I noticed that the tidy little table holding all these numeric data points would make for an amusing analysis project. I wondered whether some function could tell me which smoothies were most similar to the Matcha Libre. I framed this as a clustering problem: I wanted to see which smoothies would be grouped with the Matcha Libre.

Scraping the Nutrition Data from the PDF

The Planet Smoothie nutrition information resided in a PDF document accessible online. I aimed to complete this exercise in R. To get the data from the PDF, I used the tabulizer package.

To use tabulizer I also had to install the tabulizerjars package and follow part of this tutorial to make sure the Java references in R and on my PC were aligned. Since I already had the Java SDK installed, I skipped the Chocolatey setup the tutorial instructs and followed only steps #1 and #4 under the “Installing Java on Windows with Chocolatey” section.

Using tabulizer, I generated two CSV files. By default the extract_tables function created one file per page, and the smoothies appeared only on the first two pages of the PDF. I admit I peeked at them in Excel to make sure they looked right; I have yet to become a fan of inspecting tabular data sets in R.

R script of using tabulizer package
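The original gist is not reproduced in the text, so here is a minimal sketch of the extraction step. The local file name is an assumption for illustration, not the actual path from the post.

```r
# Sketch: pull the nutrition tables out of the PDF with tabulizer.
# "planet_smoothie_nutrition.pdf" is an illustrative file name.
library(tabulizer)

# extract_tables() returns a list with one matrix per detected table;
# the smoothies only appear on the first two pages
tables <- extract_tables("planet_smoothie_nutrition.pdf", pages = 1:2)

# Write one CSV per page so the results can be eyeballed in Excel
for (i in seq_along(tables)) {
  write.csv(tables[[i]], paste0("smoothies_page", i, ".csv"),
            row.names = FALSE)
}
```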

Preparing the Smoothie Nutrition Data

The first two rows of each file did not contain any data. I read the files back into R, removed the first two rows, and created a data frame. I reestablished the column names, then combined the two pages into a single data frame.

I didn’t think comparing the same smoothies at different sizes would be useful, so I limited the data set to 16-ounce drinks. To filter quickly, I loaded the stringr library and used the str_detect function to keep only the “16oz” options.

R script transforming data
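A hypothetical reconstruction of the prep step; the file names and the column names shown are assumptions, since the original gist is not in the text.

```r
# Sketch: re-read the extracted CSVs, skip the two empty header rows,
# combine the pages, and keep only the 16 oz drinks.
library(stringr)

page1 <- read.csv("smoothies_page1.csv", skip = 2, header = FALSE,
                  stringsAsFactors = FALSE)
page2 <- read.csv("smoothies_page2.csv", skip = 2, header = FALSE,
                  stringsAsFactors = FALSE)

smoothies <- rbind(page1, page2)

# Restore readable column names (illustrative subset, not the full list)
names(smoothies)[1:4] <- c("Smoothie", "Calories", "SugarGrams",
                           "ProteinGrams")

# Keep only rows whose name mentions the 16 oz size
smoothies16 <- smoothies[str_detect(smoothies$Smoothie, "16oz"), ]
```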

Spinning Up Some New Features

Next, I created seven features to include in the analysis of the smoothies. They were all simple proportions based on calories or weight.

  • SatFatPercent: The proportion of fat that is saturated fat
  • CalFatPercent: The proportion of smoothie calories that come from fat
  • NetCarbs: Grams of sugar minus grams of protein
  • CaloriesPerGram: The number of calories per gram of smoothie
  • FiberPerGram: Grams of fiber per gram of smoothie
  • CarbsPerGram: Grams of carbohydrates per gram of smoothie
  • ProteinPerGram: Grams of protein per gram of smoothie
R script creating new features
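The seven ratio features above can be sketched on a toy data frame. The nutrition numbers below are made up for illustration (only the 38 g and 55 g sugar figures come from the post), and NetCarbs follows the post’s own definition.

```r
# Toy data standing in for the scraped 16 oz nutrition table
smoothies <- data.frame(
  Smoothie      = c("Matcha Libre", "Lean Green Extreme"),
  WeightGrams   = c(473, 473),
  Calories      = c(360, 390),
  FatCalories   = c(20, 10),
  TotalFatGrams = c(2, 1),
  SatFatGrams   = c(1, 0),
  SugarGrams    = c(38, 55),
  FiberGrams    = c(3, 5),
  CarbGrams     = c(80, 90),
  ProteinGrams  = c(8, 6)
)

smoothies$SatFatPercent   <- smoothies$SatFatGrams / smoothies$TotalFatGrams
smoothies$CalFatPercent   <- smoothies$FatCalories / smoothies$Calories
smoothies$NetCarbs        <- smoothies$SugarGrams - smoothies$ProteinGrams
smoothies$CaloriesPerGram <- smoothies$Calories / smoothies$WeightGrams
smoothies$FiberPerGram    <- smoothies$FiberGrams / smoothies$WeightGrams
smoothies$CarbsPerGram    <- smoothies$CarbGrams / smoothies$WeightGrams
smoothies$ProteinPerGram  <- smoothies$ProteinGrams / smoothies$WeightGrams
```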

Transforming the Data Set

I whispered “division by zero” lightly under my breath as I realized that my SatFatPercent feature had some NaN values, courtesy of smoothies with zero total fat. I used the is.nan function to replace them. At this point, the data set stood at 29 observations with 20 variables. Cluster analysis requires no specific sample size, so I felt good to go.

For cluster analysis, scale matters. The calorie features and some of the nutrients are not measured in grams like the other features, so I normalized all the variables before running the cluster analysis. I attempted to use the scale function for this task after removing the first column, which contained the smoothie names. This failed miserably.

After looking at the structure of my data set, I realized my new features were formatted as characters and could not be scaled. I resolved this by removing the features and recreating them with as.numeric specified. I then ran the scale function on the numeric features. It succeeded!

There was still a snafu: the scaled TransFatGrams column came back all NaN. I reviewed the data set and found that TransFatGrams equaled zero for every smoothie, so the column’s standard deviation was zero and scale divided by it. I removed this feature completely.

R script normalizing features
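The clean-up steps above can be sketched on toy values: coerce character columns to numeric, replace the division-by-zero NaNs, drop the zero-variance column, then scale. All values here are made up.

```r
# Characters after extraction; coerce before scaling
sat_fat <- as.numeric(c("0.5", "NaN", "0.25"))
sat_fat[is.nan(sat_fat)] <- 0        # zero total fat -> zero saturated share

trans_fat <- c(0, 0, 0)              # all zeros: sd = 0, so scale() yields NaN

features <- data.frame(SatFatPercent = sat_fat, TransFatGrams = trans_fat)
features$TransFatGrams <- NULL       # drop the zero-variance feature

scaled <- scale(features)            # center to mean 0, scale to sd 1
```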

Clustering My Smoothies

At least five different clustering methods exist; I stuck with the partitioning method, of which K-means will be the most familiar to most people. The goal of K-means clustering is to minimize the distance between the points within each cluster so that, when the within-cluster variation of all clusters is added together, the total variation is as small as possible.

I used R’s built-in kmeans function to run the analysis with four initial clusters and 15 random “starts,” the random group assignments that initialize traditional centroid-based k-means clustering. Because the first step involves random assignment, the results of a k-means analysis can differ depending on the initial groupings.

The algorithm outputs an R kmeans object with several components, of which the “cluster” component holds the meaty goodness of the analysis: the cluster assignment for each observation.

I used the base plot function to view the output and saw my Matcha Libre smoothie landed in group 2 with most of the other smoothies. The number of observations in my smoothie group was so large that I decided to try five and six clusters as well.

R script running kmeans function
Plotting kmeans cluster with R base plot
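A minimal reproduction of the clustering call on stand-in data; the post runs this on the scaled smoothie features, so the matrix below is only a placeholder with the same 29 rows.

```r
# Stand-in for the 29 scaled smoothie observations
set.seed(42)
scaled <- scale(matrix(rnorm(29 * 5), nrow = 29))

# 4 clusters, 15 random starts; kmeans keeps the best of the 15 runs
fit <- kmeans(scaled, centers = 4, nstart = 15)

fit$cluster          # cluster assignment for each smoothie
table(fit$cluster)   # cluster sizes

# Quick base-plot view, colored by cluster
plot(scaled[, 1:2], col = fit$cluster, pch = 19)
```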

Underwhelmed by the plot, I loaded the cluster library to try the clusplot function (try saying that ten times fast). Its default parameters generated ellipses to visually capture my clusters. I applied the RColorBrewer function brewer.pal to generate more legible ellipse colors.

I populated the clusplot using pam, Partitioning Around Medoids, as opposed to centroids. The methods differ in that PAM minimizes the sum of (unsquared) dissimilarities to each medoid, while k-means minimizes the sum of squared distances to each centroid. Additionally, instead of calculating an artificial center, pam requires each medoid to be an actual observation in the data. My results from clusplot looked much nicer, but then I stumbled onto another option: ggbiplot.

R script running clusplot function
image of clusplot output
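The PAM-plus-clusplot step might look like the sketch below; the parameter choices and the stand-in data are assumptions, since the original gist is not shown.

```r
library(cluster)
library(RColorBrewer)

# Stand-in for the 29 scaled smoothie observations
set.seed(42)
scaled <- scale(matrix(rnorm(29 * 5), nrow = 29))

pam_fit <- pam(scaled, k = 4)       # each medoid is an actual observation

# More legible ellipse colors than the defaults
palette <- brewer.pal(4, "Set2")

clusplot(pam_fit, color = TRUE, shade = TRUE,
         col.clus = palette, main = "Smoothie clusters (PAM)")
```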

The ggbiplot package let me more easily add the principal-component information to my visualization, as shown in the screenshot. I not only get my clusters but can also tell which features best differentiate my smoothies. I referenced this DataCamp tutorial to work with the package.

R script running ggbiplot function
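A sketch of the biplot step, assuming ggbiplot is installed from GitHub (vqv/ggbiplot); the grouping and stand-in data are illustrative.

```r
# library(devtools); install_github("vqv/ggbiplot")
library(ggbiplot)

# Stand-in for the 29 scaled smoothie observations
set.seed(42)
scaled <- scale(matrix(rnorm(29 * 5), nrow = 29))
fit <- kmeans(scaled, centers = 4, nstart = 15)

# PCA supplies the components; ggbiplot overlays cluster groupings
# and draws arrows for the features that drive each component
pca <- prcomp(scaled)
ggbiplot(pca, groups = factor(fit$cluster), ellipse = TRUE)
```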

Selecting My Smoothie Line-Up

With eight clusters, my Matcha Libre smoothie only grouped with one other smoothie: Pineapple Tropi-Kale Twist. To my delight, that option contained kale — the secret ingredient that sparked my curiosity in the first place.

With fewer than eight clusters, my favorite smoothie got lumped in with quite a few others, including the 2-Piece Bikini and Captain Kid. According to my plot, the factors driving over 65% of the variation between smoothies included net carbohydrates, grams of sugar, and the percentage of total fat that was saturated.

image of ggbiplot output

Several days later, on an unusually chilly Florida afternoon, I courageously entered the smoothie establishment, ready to plunge into a new flavor experience. Content, I sprawled out in the front seat of my sedan and slowly sipped the fruits of my labor. Which did I choose? The closest match: the Tropi-Kale Twist.
