Comparing MLB Pitchers Using Underlying Metrics

Brandon Fung
INST414: Data Science Techniques
May 15, 2024

Introduction

Part of the struggle in building a roster in Major League Baseball is determining which players will best fit a team’s composition. More specifically, finding an ideal pitcher for a team can be immensely challenging. There seems to be an infinite number of statistics to look into, some of which depend heavily on factors outside of that pitcher’s control, such as the performance of their team on both offense and defense.

Motivation

The question “How can we identify pitchers who will be successful on a team?” is driven by several key motivations. Firstly, a successful pitcher can significantly impact a team’s win-loss record by effectively controlling the game’s tempo and limiting the opposing team’s scoring opportunities. This is crucial in a sport where pitching often plays a decisive role in the outcome of games. Secondly, a reliable and high-performing pitcher can reduce the strain on the bullpen, ensuring that relief pitchers are not overused and are available for crucial moments. Additionally, finding a pitcher who can thrive within a team’s specific system and strategy enhances overall team cohesion and performance. From a financial perspective, investing in a successful pitcher can yield substantial returns in terms of ticket sales, merchandise, and long-term competitiveness. Ultimately, the motivation extends beyond just winning games; it encompasses building a resilient, efficient, and strategically sound team that can sustain success over multiple seasons.

This question is particularly relevant for baseball team managers, coaches, and front-office executives who are responsible for assembling and managing the team’s roster. These stakeholders seek to enhance their team’s performance by ensuring they have reliable and effective pitchers who can contribute to winning games and maintaining overall team stability. Answering this question helps in scouting and recruitment by highlighting pitchers whose performance metrics align with the team’s strategic needs. It also aids in contract negotiations and financial planning by assessing a pitcher’s value relative to others similar to them. Moreover, this insight assists coaches in developing tailored training programs that maximize a pitcher’s strengths and address their weaknesses, thereby enhancing overall team performance. By leveraging data-driven insights and analyses, teams can build a more competitive and cohesive roster, ensuring long-term success on the field.

Data

The ideal data would consist of all pitchers from the past ten years, excluding any who are no longer in the league. Each pitcher should have numerous metrics that strictly measure their ability to pitch well, such as the velocity of their different pitches, their strikeout percentage, and so on.

For our analysis, we were able to collect data on all pitchers from the 2023 MLB season. It was collected from FanGraphs, a go-to website that tracks statistics for all MLB players. Our original dataset contained 299 different statistics for every player who threw a pitch in a game during the season. Using our domain expertise, we narrowed it down to nine metrics that we felt captured a pitcher’s ability to be successful and that would also be a sound basis for finding similar pitchers. These metrics factor out variables that the pitcher cannot control and focus on a pitcher’s actual skill. Afterward, we filtered our data down to pitchers who threw a minimum of 25 innings to eliminate position players and pitchers without a sufficient sample size. This slimmed the dataset down to 519 pitchers, each with nine metrics to measure their skill.

Here are the nine metrics we chose to use:

  • K_pct: the share of plate appearances against a pitcher that end in a strikeout; higher is better
  • BB_pct: the share of plate appearances against a pitcher that end in a walk; lower is better
  • H_per_9: the average number of hits a pitcher allows per nine innings pitched; lower is better
  • xFIP: estimates a pitcher’s expected run prevention independent of the performance of their team’s defense, expressed on the same scale as ERA; lower is better
  • OSwing_pct: how often hitters swing at a pitcher’s pitches outside of the strike zone; higher is better
  • CSW_pct: the share of a pitcher’s pitches that result in a called strike or a swinging strike; higher is better
  • HardHit_pct: the share of batted balls against a pitcher that are hit with an exit velocity of 95 mph or higher; lower is better
  • Stuff_plus: measures only the physical characteristics of a pitch, which include, but are not limited to, release point, velocity, vertical and horizontal movement, and spin rate; higher is better
  • Pitching_plus: measures the physical characteristics, location, and count of each pitch to try to judge the overall quality of the pitcher’s process; higher is better
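
The filtering and feature selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors’ code: the `Name` and `IP` column names, the helper `select_qualified`, and the toy rows are assumptions made for the example. Only the nine metric names and the 25-inning cutoff come from the article.

```python
import pandas as pd

# The nine metrics named in the article.
FEATURES = [
    "K_pct", "BB_pct", "H_per_9", "xFIP", "OSwing_pct",
    "CSW_pct", "HardHit_pct", "Stuff_plus", "Pitching_plus",
]


def select_qualified(df: pd.DataFrame, min_ip: float = 25.0) -> pd.DataFrame:
    """Keep pitchers with at least `min_ip` innings and only the chosen metrics.

    "Name" and "IP" are assumed column names, not confirmed by the article.
    """
    qualified = df[df["IP"] >= min_ip]
    return qualified[["Name"] + FEATURES].reset_index(drop=True)


# Toy rows with made-up values, purely to exercise the function.
toy = pd.DataFrame({
    "Name": ["Pitcher A", "Pitcher B", "Position Player C"],
    "IP": [180.0, 62.1, 1.0],
    **{col: [1.0, 2.0, 3.0] for col in FEATURES},
    "W": [12, 3, 0],  # an extra statistic that should be dropped
})
pitchers = select_qualified(toy)
print(pitchers.shape)
```

Keeping the feature list in one place makes it easy to reuse the same nine columns in the normalization and clustering steps.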

Since our data comes from a reliable source, little data cleaning was needed. We only needed to prepare the data for our machine-learning model. To do this, we normalized the data so that all features would be weighted equally. Some of our features were expressed as percentages, while others were not. The non-percentage features took values greater than 1, so they would have carried more weight than the percentage features if left untreated. We used the MinMaxScaler class from scikit-learn to rescale the non-percentage features to the range 0 to 1, putting them on the same footing as the percentages.
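
As a rough sketch of that normalization step, the snippet below rescales two non-percentage columns with scikit-learn’s `MinMaxScaler`. The numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for two non-percentage features: [Stuff_plus, xFIP].
X = np.array([
    [120.0, 3.20],
    [100.0, 4.10],
    [85.0, 4.85],
])

# Each column is mapped to [0, 1] via (x - column_min) / (column_max - column_min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so no single feature dominates a distance calculation simply because of its units.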

Original Study

In our original study, we looked for the pitchers most similar to three players: a starting pitcher who has had sustained success (Logan Webb), a breakout starting pitcher from 2023 (Kyle Bradish), and a relief pitcher with elite pitches (Devin Williams). We elected to use Euclidean distance to find the closest comparisons because we were not looking to see who was on similar trajectories; we were just looking at the previous season to see who performed most similarly to one another. All metrics used are performance-based and were normalized beforehand, so differences in scale would not distort the distances.
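
A minimal sketch of this kind of Euclidean-distance lookup is shown below. The function name `most_similar`, the placeholder pitcher names, and the two-feature toy matrix are assumptions for illustration; the study itself used nine normalized features.

```python
import numpy as np


def most_similar(X, names, target, k=10):
    """Return the k rows closest to `target` by Euclidean distance."""
    idx = names.index(target)
    # Distance from every pitcher to the target pitcher.
    dists = np.linalg.norm(X - X[idx], axis=1)
    # Sort by distance and skip the target itself.
    order = [i for i in np.argsort(dists) if i != idx][:k]
    return [(names[i], float(dists[i])) for i in order]


# Toy data: four placeholder pitchers described by two scaled features.
names = ["A", "B", "C", "D"]
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
closest = most_similar(X, names, "A", k=2)
print(closest)  # [('B', 1.0), ('C', 3.0)]
```

A smaller distance means two pitchers had more similar metric profiles in the season being compared.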

Logan Webb:

Webb has solidified himself as a top-ten starting pitcher in the game over the past three years. He is currently entering his prime and is projected to have continued success. When looking at the ten pitchers most similar to him, we have one pitcher with sustained success as a starting pitcher (Kevin Gausman), three starting pitchers who did not have surface-level stats as good as Webb’s (Jesus Luzardo, Mitch Keller, and Aaron Civale), and four relief pitchers who have been successful in their respective roles in recent history (Joe Jimenez, Seth Lugo, Matt Moore, and John Brebbia). The final two pitchers (Gabe Speier and Adbert Alzolay) are younger, less established pitchers. From Webb’s standpoint, scoring similarly to pitchers who have been able to maintain their success bodes well for his future. Assuming there are no injury issues, he should be able to continue performing at a high level for the foreseeable future.

The top ten most similar pitchers to Logan Webb in the 2023 MLB Season

Kyle Bradish:

Bradish is the epitome of carrying over the hot hand into a new season. He ended 2022 on a very strong note after being one of the worst pitchers in the first half of the season and continued his success for the entirety of 2023. Unfortunately, he has been injured to start 2024, but when looking at comparable pitchers, he is in good company. His closest player comparison is Corbin Burnes, arguably a top-five starting pitcher in baseball. If Bradish can come even somewhat close to what Burnes has been, that would be a huge success for both Bradish and his team. Other pitchers that rank in the top ten of similarity to Bradish include four of the best closers in baseball from 2023: Jordan Romano, Paul Sewald, Clay Holmes, and Jhoan Duran. While these pitchers are all relievers, it does show that Bradish could have a future in the bullpen if he does not continue his high performance as a starter.

The top ten most similar pitchers to Kyle Bradish in the 2023 MLB Season

Devin Williams:

Since his first full season in 2020, Williams has arguably been the best relief pitcher in all of baseball. He has built up a track record of success, has proven to be successful in different roles, and has been a model of consistency for relief pitchers. With starting pitchers throwing fewer and fewer innings with each passing year, relief pitchers are becoming more valuable. A team that can acquire or develop a pitcher of Williams’ quality could move from being a middling team to a contender. The list of pitchers most similar to him is headlined by Josh Hader, while Jose Alvarado, Joe Kelly, and Emilio Pagan have also had successful careers as relievers. However, the name that stands out the most is Brandon Woodruff. While he is a successful starting pitcher, he has been hampered by injuries the past couple of seasons. With strong numbers that compare well to the game’s best relief pitchers, he could be a prime candidate to move to the bullpen once he becomes healthy again.

The top ten most similar pitchers to Devin Williams in the 2023 MLB Season

New Study

In our original study, we looked at three pitchers who fit different molds in the 2023 season. To extend our study, we looked to cluster pitchers based on their performance in the 2023 MLB season. Clusters can give us a second opinion on players and allow us to see which groups of players are similar. The ability to group players together makes it easier for front-office executives to build their roster. Instead of trying to find a duplicate of a specific player, they can try to acquire a “type” of player that best matches the cluster they are looking for.

We decided to use the K-means clustering algorithm not only because it was the one we were most familiar with, but also because it performed best among the clustering algorithms we tried. For instance, we also tested Gaussian mixture models and hierarchical clustering and saw that the silhouette score for both was significantly lower than that of K-means.
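
One way such a comparison might be set up is sketched below. It assumes scikit-learn’s `KMeans`, `GaussianMixture`, and `AgglomerativeClustering` as stand-ins for the three approaches, and it uses synthetic data from `make_blobs` rather than the pitcher dataset, so the scores it prints say nothing about the study’s results.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled pitcher matrix (300 rows, 9 features).
X, _ = make_blobs(n_samples=300, centers=3, n_features=9, random_state=0)

models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "gaussian mixture": GaussianMixture(n_components=3, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
}

# Silhouette scores range from -1 to 1; higher means tighter, better-separated clusters.
scores = {
    name: silhouette_score(X, model.fit_predict(X))
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Holding the number of clusters fixed across models keeps the comparison fair, since the silhouette score also changes with the cluster count.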

Afterward, we used the elbow method to decide on the optimal number of clusters for our model.

From the graph, 3 appeared to be the best choice, but there is no clear, distinct elbow. To gather more information, we turned to dimensionality reduction to visualize our data.
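
For readers unfamiliar with the elbow method, a small sketch follows. It fits k-means for a range of cluster counts and records the inertia (the within-cluster sum of squared distances). The data here is synthetic and the range of k values is an assumption; the original graph is not reproduced.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled pitcher matrix.
X, _ = make_blobs(n_samples=300, centers=3, n_features=9, random_state=0)

k_values = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

# Plotting inertia against k and looking for the bend gives the "elbow".
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Inertia keeps falling as k grows, so the method looks for the point where adding another cluster stops paying off, which is why an indistinct elbow leaves the choice ambiguous.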

Using singular value decomposition (SVD), we reduced our nine features to two components and then plotted them. The resulting plot also showed no clear clusters, so we were still not confident in the number of clusters to use.
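
The reduction step could look something like the sketch below, which assumes scikit-learn’s `TruncatedSVD` as the SVD implementation (the article does not say which one was used) and runs on synthetic data.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the scaled pitcher matrix (9 features).
X, _ = make_blobs(n_samples=300, centers=3, n_features=9, random_state=0)

# Project the nine features onto the two leading singular directions.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

print(X_2d.shape)
print(svd.explained_variance_ratio_)
```

The two resulting columns can be used as the x and y coordinates of a scatter plot, and the explained variance ratio indicates how much of the original spread the picture retains.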

Through further research, we found the gap statistic, another tool that data scientists use to choose the number of clusters. The gap statistic also suggested 3 clusters, so after getting the same result from multiple methods, we had more faith in the number of clusters to use.

Running the k-means clustering algorithm generated these clusters:
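
A simplified sketch of the gap statistic is given below. It compares the log of the k-means inertia on the real data with the average log inertia on uniformly random reference data drawn from the same bounding box. The helper names, the number of reference sets, and the synthetic data are all assumptions for illustration, not the study’s implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def log_inertia(X, k, seed=0):
    """Log of the within-cluster sum of squares for a k-means fit."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(model.inertia_)


def gap_statistic(X, k_values, n_refs=5, seed=0):
    """Return the gap and its standard error for each k."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in k_values:
        # Reference datasets: uniform noise over the data's bounding box.
        ref = [log_inertia(rng.uniform(lo, hi, size=X.shape), k)
               for _ in range(n_refs)]
        gaps.append(np.mean(ref) - log_inertia(X, k))
        errs.append(np.std(ref) * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(errs)


# Synthetic stand-in for the scaled pitcher matrix.
X, _ = make_blobs(n_samples=150, centers=3, n_features=9, random_state=0)
k_values = [1, 2, 3, 4, 5]
gaps, errs = gap_statistic(X, k_values)

# Common rule: the smallest k whose gap is within one standard error of the next gap.
best_k = k_values[-1]
for i in range(len(k_values) - 1):
    if gaps[i] >= gaps[i + 1] - errs[i + 1]:
        best_k = k_values[i]
        break
print(best_k, np.round(gaps, 3))
```

A large gap at a given k means the real data clusters much more tightly than structureless noise would at that k.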

When looking at the clusters, there is a clear difference between the three. Cluster 2 contains most of the “good” pitchers: pitchers who had a strong season in 2023, an opinion held by both fans and statisticians of the game. Most of Cluster 0 is the opposite: the majority of these pitchers struggled for a good portion of last year. They either got hit extremely hard and had poor numbers as a result, or had trouble finding the strike zone and allowed a lot of runs that way. Cluster 1 sits in the middle. That cluster contains the middle-of-the-road pitchers, guys who did not provide much of an impact but also did not hurt their team. That description fits Ryan Thompson, a relief pitcher who was in the top 10 of pitchers most similar to Devin Williams but did not land in the same cluster as the rest of that group.

The image below shows the mean for each statistic in each group, with Cluster 2 on the left and Cluster 0 on the right. We can see that each statistic gets worse as we move from Cluster 2 to Cluster 0, with the biggest changes coming in strikeout percentage and Stuff+. In today’s era, those are two of the more important features when evaluating a pitcher, because they are among the things a pitcher can control most directly. A pitcher with a more effective arsenal is more likely to rack up strikeouts and be successful.
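
The per-cluster averages discussed above could be produced along these lines. The sketch fits k-means with three clusters and takes the mean of each feature by cluster label. It runs on synthetic data with the article’s nine feature names attached, so the printed numbers are not real pitcher statistics.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

FEATURES = [
    "K_pct", "BB_pct", "H_per_9", "xFIP", "OSwing_pct",
    "CSW_pct", "HardHit_pct", "Stuff_plus", "Pitching_plus",
]

# Synthetic stand-in for the scaled pitcher matrix.
X, _ = make_blobs(n_samples=300, centers=3, n_features=len(FEATURES), random_state=0)
df = pd.DataFrame(X, columns=FEATURES)

# Assign every row to one of three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[FEATURES])

# Mean of each metric within each cluster, one row per cluster.
cluster_means = df.groupby("cluster")[FEATURES].mean()
print(cluster_means.round(2))
```

Note that k-means numbers its clusters arbitrarily, so which label ends up holding the strongest pitchers has to be read off the means rather than assumed.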

Cluster 2 contains starting pitchers such as Aaron Nola, Corbin Burnes, Shohei Ohtani, and Zac Gallen. Those are all starting pitchers who are at the top of their respective games and have a reputation for being some of the best pitchers in the league. That cluster also includes Tarik Skubal, a young pitcher who did not have the best surface-level metrics but appears in the same cluster because of his strong underlying metrics. Being able to make this comparison can lead to an extreme advantage for an MLB executive because it can make it easier to place a value on a player if they are trying to acquire him.

On the opposite end of the spectrum, two pitchers in Cluster 0 — Adam Wainwright and Zack Greinke — used to be two of the best pitchers in the game, but have faded quickly with age and were two of the worst pitchers last year. Grouped with them is Bryce Elder, a young pitcher who started off very strong last year, but fell apart towards the end of the season. Seeing that he is in Cluster 0, an assumption could be made that he might not have a big bounce-back season due to him having poor metrics that align with poor performance. Unlike Skubal, he does not seem to have much potential for positive regression.

Limitations

While clustering pitchers based on performance metrics provides valuable insights, several limitations must be considered. Firstly, the analysis relies heavily on the accuracy and completeness of the available data. Any inconsistencies or gaps in the data can lead to misleading conclusions. The data also covers only the 2023 MLB season, so the results may not hold up across multiple seasons. Secondly, performance metrics are influenced by numerous external factors such as the quality of the opposing teams, ballpark effects, and even weather conditions, which may not be fully accounted for in the clustering process. Lastly, the human element, such as a player’s mental toughness and ability to handle pressure, is difficult to quantify but crucial for success in high-stakes situations.

In terms of technical limitations, the model had a relatively low silhouette score of 0.22, suggesting that many pitchers may not fit well into their assigned clusters and that the clusters may not be very distinct from each other. When visualizing the data, it was also concerning to see no obvious clusters. This could suggest that either k-means is not the correct model or that clustering is not the right technique for the data. To address this, a wider range of clustering models should be tested more thoroughly, and if the silhouette score remains low, it may be time to consider adding more features or even using a different machine-learning technique altogether. These limitations highlight the need for a holistic approach that combines quantitative analysis from MLB analysts with qualitative assessments from MLB scouts to make the most informed decisions.

Click here for the full source code.

Appendix

Jaron Richman — Intro, Motivation, Original Study

Brandon Fung — Data, New Study, Limitations
