UPDATED: Using a Cutting-Edge Machine Learning Technique to Identify Comparable Major League Pitchers

Jonah M Simon
Published in Analytics Vidhya
5 min read · May 6, 2020

Note:

After receiving a significant amount of helpful feedback, I decided to update both clustering articles with a more user-friendly format. As you read through the article, you will find many updates to my methods, as well as a link to the full lists of all clusters. Thanks for reading!

Introduction

My goal in creating a Medium profile was to provide elegant, simple analysis of sports data that can be understood by a wide range of audiences.

In my first analysis, I will demonstrate how k-means clustering can be an impactful tool when attempting to group Major League pitchers.

So, who cares and why is this helpful?

The primary reason I chose to demonstrate clustering in my first article is to provide a tangible, interesting, and helpful example of how machine learning can be applied to everyday problems. Throughout baseball, teams spend a significant amount of time and energy on predicting performance and finding underlying value. While much of that work is predictive, clustering is descriptive in nature. In other words, it uses data to describe the current situation rather than predict the future.

This is especially helpful when trying to identify pitchers who may not have performed at a level that would gain them popularity, yet are similar in nature to many of the game's better-known pitchers. Let's begin!

Data Acquisition and Processing

I obtained the data through Baseball Savant, a large database containing a wide variety of metrics collected by Major League Baseball. This analysis, as well as all future analyses, will be conducted in RStudio.

When thinking about which variables matter most for finding similarities between pitchers, I focused primarily on two areas:

  1. Past Production
  2. Pitch Arsenal

When looking at past production, I focused mostly on advanced metrics such as xFIP, xBA, and xSLG, while adding some helpful traditional statistics such as innings pitched. Pitch arsenal primarily refers to how often a certain pitch is thrown and all the underlying factors (speed, spin, break) that impact the effectiveness of that pitch.

Some other key notes on the data:

  • 2019 observations only
  • Minimum of 50 batters faced
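The two filters above can be sketched in a few lines. The article's analysis runs in R, so this is only an illustrative Python sketch, and the column names (`year`, `batters_faced`) are assumptions about how the Savant export is labeled:

```python
# Hypothetical rows mimicking the Baseball Savant export (column names assumed).
rows = [
    {"last_name": "Cole", "year": 2019, "batters_faced": 850, "xfip": 2.9},
    {"last_name": "Doe",  "year": 2019, "batters_faced": 31,  "xfip": 4.8},
    {"last_name": "Roe",  "year": 2018, "batters_faced": 600, "xfip": 3.7},
]

# Keep 2019 observations with a minimum of 50 batters faced.
filtered = [r for r in rows if r["year"] == 2019 and r["batters_faced"] >= 50]
```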

Clustering Explained

So, you may be wondering, what exactly is clustering? Clustering is a widely-used unsupervised machine learning technique that divides data into different groups based on their similarities.

There are multiple methods one can use to implement clustering, each with its own benefits. In this article, I will utilize the most popular clustering technique, k-means, due to its simplicity and relatively easy implementation.

Quick k-means summary:

  1. The algorithm randomly selects k starting points (referred to as centroids)
  2. Each observation is assigned to its closest centroid, forming clusters
  3. K-means recalculates each centroid as the mean of the observations in its cluster
  4. Steps 2–3 repeat until the cluster assignments stop changing
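The steps above can be written out from scratch. This is a minimal Python sketch on 2-D points (the article's actual analysis uses R, so this is purely illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means on 2-D points: random centroids, assign, recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random starting centroids
    for _ in range(iters):
        # step 2: assign each observation to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        # step 4: repeat until the assignments (and centroids) stop changing
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```

Running it on two obvious blobs, e.g. `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)`, splits the points cleanly into the two groups.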

If you are interested in a more thorough explanation of k-means, a solid summary can be found here.

When the algorithm is complete, the data will be split into the specified number of clusters, with each cluster containing observations that are similar to one another.

Data Processing

In order to implement k-means clustering, all variables need to be numeric and of the same type. If you are not familiar with data types, they are simply the category of representation an observation has. For example, if 'John Doe' were an observation, it would be considered a 'character'.

In the Baseball Savant data, the only discrepancy is the presence of the "last_name" and "first_name" variables, both represented as characters, while all other variables are represented as 'numeric' or 'integer' types. As a result, I had to temporarily remove the first and last name variables, ensuring all data is the same type.

After removing the first and last name variables (since they are not numeric, we will add them back in later), I needed to determine how to handle empty values. Since the data contained only 40 NA values, I decided to remove those rows entirely.

The last step before clustering is to standardize the variables. This can be done by utilizing R's helpful scale function. At this point, the Savant data is ready for the k-means algorithm.
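The processing pipeline (set the name columns aside, drop NA rows, standardize) can be sketched as follows. The column names are hypothetical, and `zscore` mirrors what R's scale function does by default: center to mean 0 and divide by the n−1 sample standard deviation.

```python
import math

# Hypothetical rows; column names are assumptions, not the real Savant schema.
rows = [
    {"last_name": "Cole", "xfip": 2.9, "velo": 97.1},
    {"last_name": "Doe",  "xfip": None, "velo": 92.3},  # row with an NA -> dropped
    {"last_name": "Roe",  "xfip": 4.1, "velo": 90.5},
]
numeric_cols = ["xfip", "velo"]

# Set the character name column aside; it gets re-attached after clustering.
names = [r["last_name"] for r in rows]

# Drop rows containing NA values.
complete = [r for r in rows if all(r[c] is not None for c in numeric_cols)]

def zscore(values):
    """Standardize to mean 0, sd 1 (n-1 sample sd, like R's scale())."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

scaled = {c: zscore([r[c] for r in complete]) for c in numeric_cols}
```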

Determining the Optimal Number of Clusters

While we are ready to utilize clustering, we first need to identify how many clusters we would like to split the data into. Without getting too technical, we can utilize the "within sum of squares" metric to identify the optimal number of clusters. I computed this in R and created a graph of my findings. The optimal number of clusters can be seen at the bend, or "elbow," of the graph below.

[Figure: within sum of squares by number of clusters, computed in ggplot2]

The bend seems to occur at the 3-cluster tick mark, suggesting a 3-cluster solution is reasonable.
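The metric behind the elbow plot is simple to compute directly. Here is a minimal Python sketch (on made-up 2-D data) of the total within-cluster sum of squares, the quantity whose decline flattens at the elbow:

```python
def within_ss(points, assignments, k):
    """Total within-cluster sum of squares for a given cluster assignment."""
    total = 0.0
    for i in range(k):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            continue
        # Cluster centroid = mean of member coordinates.
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        # Add each member's squared distance to its centroid.
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total

# Two tight groups: one cluster fits poorly, two fit almost perfectly.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
wss_k1 = within_ss(pts, [0, 0, 0, 0], 1)  # large
wss_k2 = within_ss(pts, [0, 0, 1, 1], 2)  # tiny -> the "elbow" sits at k=2 here
```

Plotting this value for k = 1, 2, 3, … produces exactly the kind of elbow curve described above.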

K-Means Clustering

Now that our data is processed and the optimal number of clusters is identified, we can run the k-means algorithm, with the cluster solution found below.
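Once the algorithm returns a cluster label for each row, the name columns removed during processing can be re-attached. A small sketch with made-up labels (the names and label values here are illustrative, not the article's actual output):

```python
# Hypothetical labels as a k-means run might return them, one per pitcher;
# the names come from the character columns set aside earlier.
names = ["Shane Bieber", "Madison Bumgarner", "Andrew Cashner"]
labels = [2, 1, 3]

# Re-attach names to labels, then group pitchers by cluster for inspection.
by_cluster = {}
for name, label in zip(names, labels):
    by_cluster.setdefault(label, []).append(name)
```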

Pretty cool, right?

Cluster Findings

K-means did an outstanding job of grouping the best major league pitchers together. Below, I will list some of the top and interesting names found in each cluster.

Cluster 1 (Middle of the Pack):

Expected Names:

  • Madison Bumgarner
  • Adam Wainwright
  • Kyle Hendricks

Interesting Names:

  • Chris Sale
  • James Paxton

Cluster 2 (Group of Best Pitchers):

Expected Names:

  • Shane Bieber
  • Gerrit Cole
  • Aaron Nola

Interesting Names:

  • Dakota Hudson
  • Joe Musgrove

Cluster 3 (Group of Least Effective Pitchers):

Expected Names:

  • Andrew Cashner
  • Jordan Zimmermann
  • Adrian Sampson

Interesting Names:

  • Corbin Burnes
  • Carlos Carrasco

All cluster results can be found in this link: https://docs.google.com/spreadsheets/d/11OnG0pc35bgPy6SVtSbBMmr0spB5xYtr20c43awlDTA/edit?usp=sharing

So, there we have it. I encourage you to sort through each cluster and come to your own conclusions. As I explained earlier, clustering is a descriptive method. There are a ton of questions that can stem from this analysis. For example:

  • Why was player A, a consensus top pitcher, in cluster 1?
  • How can we help player B (from a developmental standpoint), who is in cluster 3, get grouped in cluster 1?

I hope you enjoyed this analysis and please don’t hesitate to provide feedback!

What are some of your conclusions?

jms


Columbia graduate student interested in machine learning, predictive modeling, and cutting-edge analytical techniques. Always learning.