Grouping Major League Hitters with Hierarchical Methods

Jonah M Simon · Published in Analytics Vidhya · May 7, 2020

The purpose of this analysis is to build off the first clustering article by applying many of the same procedures to recent MLB hitting data, this time using hierarchical clustering. This should give you a solid overview of how hierarchical clustering works and provide a point of comparison to k-means.

The Data

As in my last article, I scraped this data from Baseball Savant, an incredibly thorough hub for baseball statistics. Since my focus here is on hitting data, my approach changed slightly.

Based on recent feedback, I changed the year filter to include only players from the 2019 season. In addition, I lowered the plate appearance filter, allowing all players with at least 100 PAs to qualify for this analysis. The initial variables chosen for the clustering analysis are below.

A few key notes from the data:

  • If you are not familiar with some of these metrics, Baseball Savant has an awesome variable dictionary that can be found here.
  • Some of the variables above, such as xba, xslg, and xobp, may look familiar. However, the “x” in each stands for expected (as opposed to the actual result). Statcast computes these metrics from its vast database, providing a better depiction of how a player *should have* produced compared to how he actually produced.

Data Processing

If you are not familiar with clustering and have not read my past article, three conditions need to be met for clustering to be effective.

  1. The data needs to be of the same type, preferably numeric
  2. There can be no NAs (or empty values) in the data set
  3. The data needs to be scaled

To satisfy the first condition, each player’s first and last name need to be removed (since they are character variables, not numeric) and will be added back in at the end of the analysis.
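For readers following along in R, here is a minimal sketch of that step. The file name and the first_name / last_name column names are assumptions on my part — adjust them to match your own Savant export.

```r
# Read the Savant export and set the name columns aside
# ("savant_2019_hitters.csv", "first_name", and "last_name" are assumed names)
hitters <- read.csv("savant_2019_hitters.csv")

player_names <- hitters[, c("first_name", "last_name")]    # saved to re-attach later
hitters_num  <- hitters[, !(names(hitters) %in% c("first_name", "last_name"))]
```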

Turning to point 2, there are two NA values present in the data, both in sprint speed.

There are a few ways to handle this, each of which has its own advantages and disadvantages.

First, I could simply remove the observations containing NAs. However, those players would not be included in the analysis, reducing the quality of the clustering.

Second, I could replace the NA values with the average of that variable across the data set, which could become a problem if the variable has high variance.

Finally, I could utilize R’s “mice” or “caret” packages, which impute missing values by modeling them from the other variables and observations, predicting what each missing value would have been.

Due to the low variance in sprint speed and the goal of keeping this analysis simple, I will simply replace the NA values with the average sprint speed across all players.
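As a rough sketch of that choice (again assuming the column is called sprint_speed), the mean imputation is a one-liner:

```r
# Replace the missing sprint speeds with the column mean across all players
hitters_num$sprint_speed[is.na(hitters_num$sprint_speed)] <-
  mean(hitters_num$sprint_speed, na.rm = TRUE)
```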

Finally, I scaled the updated dataset, fulfilling the last requirement for clustering. A snippet of the processed dataset is below.
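Continuing the sketch, base R’s scale() handles this, centering every variable to mean 0 and standard deviation 1:

```r
# Standardize every numeric variable so no single metric dominates the distances
hitters_scaled <- scale(hitters_num)
```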

Hierarchical Distance Measure

In my last piece, I utilized k-means clustering due to its effectiveness and simplicity. However, hierarchical clustering is another widely used method that is incredibly effective at grouping observations.

One key difference between the two clustering techniques is that hierarchical clustering requires you to explicitly choose a “distance measure,” while k-means defaults to Euclidean distance. A distance measure is at the core of a cluster analysis, as it determines the distance between each pair of observations and, ultimately, how similar they are considered to be. There are a variety of distance measures one can use for clustering, each of which can be read about here.

For this analysis, I will use Euclidean distance, which works well with numeric data.
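Continuing the sketch, the distance matrix is one call to dist():

```r
# Pairwise Euclidean distances between all (scaled) players
d <- dist(hitters_scaled, method = "euclidean")
```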

Hierarchical Clustering

There are a variety of hierarchical clustering methods one can utilize, each of which serves a different purpose. I will use the ward.D2 method, which merges clusters so as to minimize the increase in within-cluster variance. The initial cluster tree diagram is below.
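A sketch of that step, using hclust() from base R on the distance matrix built above:

```r
# Ward's criterion (ward.D2) on the Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")

# The full (and admittedly crowded) dendrogram
plot(hc, labels = FALSE, hang = -1,
     main = "Hierarchical clustering of 2019 hitters")
```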

This probably looks pretty crowded. In order to identify the clusters, one needs to look at the top branches of the graph. The following diagram will provide a clearer picture of the clusters identified by ward.D2, with the clusters highlighted in red.
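rect.hclust() is one simple way to produce that kind of highlighted view; the call below boxes the top-level clusters in red (three of them, as we will see in a moment):

```r
# Re-draw the dendrogram and box the top-level clusters in red
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 3, border = "red")
```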

The chart below contains the same data with a different visual.

So, there we have it. Another 3-cluster solution, this time arrived at with hierarchical clustering.
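To work with the groups directly, cutree() assigns each player a cluster label, and the names set aside earlier can be re-attached:

```r
# Cut the tree into three clusters and re-attach the player names
clusters <- cutree(hc, k = 3)
results  <- cbind(player_names, hitters_num, cluster = clusters)
```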

Results

Now, to the results. Below is a summary of the averages of each cluster.
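A summary like this can be reproduced with aggregate() (or a dplyr group_by/summarise), averaging every metric within each cluster:

```r
# Average of every metric within each cluster
aggregate(hitters_num, by = list(cluster = clusters), FUN = mean)
```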

Due to the size of each cluster, I will post the first 40 or so individuals sorted by xwoba. If you would like to see the rest of the results, shoot me a message and I will get them to you!
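Something along these lines produces each list, assuming the expected wOBA column in your export is named xwoba:

```r
library(dplyr)

# Top ~40 players in a given cluster, sorted by expected wOBA
results %>%
  filter(cluster == 1) %>%
  arrange(desc(xwoba)) %>%
  head(40)
```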

Cluster 1:

Cluster 2:

Cluster 3:

Conclusion:

So, it seems ward.D2 did a nice job of clustering the top hitters in the game. As you can see from the summary, cluster 3 is far superior to clusters 1 and 2, which is confirmed by the specific names in that cluster (Trout, Belli, Yelich, etc.).

Thoughts?

-jms
