Building the Ideal MLB Pitching Prospect Through Various Statistical Methods

Jonah M Simon
Published in The Startup
8 min read · Aug 24, 2020

Overview

My next analysis was inspired by my experience in the baseball industry. Before pursuing my graduate degree, I held several jobs in baseball, specifically in player evaluation and scouting. There I learned that when evaluating an amateur pitching prospect, there are many things to keep in mind, including:

  1. Player Age
  2. Athleticism Traits
  3. Data points such as spin rate, velocity, horizontal and vertical break, etc.
  4. Competition Level
  5. Body Type

With this said, traditional player evaluation typically stems from human observation. In other words, what makes this player special, judging by the traits the human eye can observe? This process has worked effectively since the inception of the game. This got me thinking, however: is there a way to quantify a baseline for a prospect? While we generally know which traits are important, can we determine which matter more than others? This led me down an interesting path that resulted in some insightful conclusions.

The Data

For this project, I utilized two data sources: Baseball Savant and the Lahman Database. If you are familiar with my past articles or are a baseball fan in general, I am sure you have heard of Baseball Savant. The Lahman Database is less widely used by the general public but contains some incredibly helpful data. While it covers a diverse set of topics such as traditional statistics, player awards, and general team info, I used it for biographical data.

The Lahman Database offers many different connection options, one being an R package, usable from RStudio, that provides access to its data. Below is a snippet of the data gathered from the Lahman Database.

Lahman — 8/8 Variables Shown

Regarding the Baseball Savant data, my primary goal was to extract pitch-arsenal information, in addition to a few modern statistics used in my success metric (which I will get into later). Here is a quick guide to the pitch arsenal features I pulled and the metrics used in my success feature:

Arsenal Data:

  • Pitch Type
  • Percent Thrown (per pitch type)
  • Velocity (per pitch type)
  • Spin (per pitch type)
  • Horizontal and Vertical Break (per pitch type)
  • Total Break (per pitch type)

Metrics:

  • Innings Pitched
  • K%
  • xBA
  • Exit Velocity

Savant — 8/50 Variables Shown

Data Processing

Transforming the data into a structure suited to the questions I am trying to answer required a great deal of processing. This was by far the most time-consuming portion of the analysis and required multiple steps. To lay out my strategy clearly, I will break the processing into descriptive steps.

Step 1: Merge the Lahman and Savant Data

This step was relatively simple. Both the Savant and Lahman data sets contain a variable with each player’s first and last name, so I used that shared feature to merge them into one master data set.

Newly Formed Data Set — 10/55 Variables Shown
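A name-based merge like the one described in this step can be sketched as a simple join. This is a minimal Python illustration, not the R code behind the article; the field names (`name`, `height`, `weight`, `year`, `k_percent`) and every row are invented for the example.

```python
# Hypothetical rows; field names and values are illustrative, not real data.
lahman_rows = [
    {"name": "Player A", "height": 73, "weight": 210},
    {"name": "Player B", "height": 70, "weight": 185},
]
savant_rows = [
    {"name": "Player A", "year": 2019, "k_percent": 24.1},
    {"name": "Player A", "year": 2020, "k_percent": 26.3},
    {"name": "Player C", "year": 2019, "k_percent": 18.0},
]

def merge_on_name(bio_rows, stat_rows):
    """Inner-join stat rows to biographical rows on the shared 'name' key."""
    bio_by_name = {row["name"]: row for row in bio_rows}
    merged = []
    for stat in stat_rows:
        bio = bio_by_name.get(stat["name"])
        if bio is not None:  # keep only players present in both sources
            merged.append({**bio, **stat})
    return merged

master = merge_on_name(lahman_rows, savant_rows)
```

Here `master` keeps both Player A seasons and drops the players found in only one source. Joining on names alone can mismatch players who share a name, so a unique player ID is the safer key wherever both sources provide one.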

Step 2: Average Out Repeated Player Data

Now that the master set was created, I had to determine how to deal with repeated player seasons. Since the Savant data tracks metrics from 2015 onward, many players appeared in multiple seasons between 2015 and 2020. As a result, I needed to average out their metrics, creating one observation per player rather than one observation per player-season. An example of this process is found below. Specifically, compare the “Fernando Abad” observations in the top table to the newly transformed table below.

Aggregated Individual Seasons 10/46 Variables Shown
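The collapse from season rows to one row per player might look like the following. It is a rough Python sketch with made-up values, not the original R code. The output names `AV_K_Perc`, `AV_xBA`, and `Total_Innings` come from the article; the input field names, and the choice to sum innings while averaging the rate stats, are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-season rows; values are made up for illustration.
season_rows = [
    {"name": "Player A", "k_percent": 20.0, "xba": 0.250, "innings": 40.0},
    {"name": "Player A", "k_percent": 24.0, "xba": 0.230, "innings": 50.0},
    {"name": "Player B", "k_percent": 18.0, "xba": 0.270, "innings": 30.0},
]

def aggregate_by_player(rows):
    """Collapse season rows into one row per player.

    Assumption: rate stats are averaged across seasons, innings are summed.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["name"]].append(row)
    out = []
    for name, seasons in grouped.items():
        out.append({
            "name": name,
            "AV_K_Perc": mean(s["k_percent"] for s in seasons),
            "AV_xBA": mean(s["xba"] for s in seasons),
            "Total_Innings": sum(s["innings"] for s in seasons),
        })
    return out

players = aggregate_by_player(season_rows)
```

A plain mean weights every season equally regardless of workload. An innings-weighted mean is a reasonable alternative if short seasons should count for less.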

Step 3: Create a ‘Success’ Feature

In order to create the ideal prospect, I needed a baseline of characteristics to look for. While there are a wide variety of paths I could have chosen, I based my success metric on the following variables: AV_K_Perc (average strikeout rate), AV_xBA (average expected batting average against), AV_EV (average exit velocity), and Total_Innings. With the variables selected, I needed a threshold for each that would classify a player as successful or unsuccessful.

To set them, I looked at league averages and determined what quantile a player would need to fall in to be considered successful. This FanGraphs article was incredibly useful and was the driving force in determining the overall thresholds. My final thresholds were as follows:

  • K% ≥ 22
  • xBA ≤ .240
  • EV ≤ 88
  • Total Innings ≥ 80

Now that these metrics were decided, I created a new binary feature in the data named ‘success’ that identifies a player as successful (1) or unsuccessful (0). It is key to note that in order for the player to be considered successful, EACH threshold must be met.
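The all-thresholds-must-pass rule can be written as a single boolean expression. The sketch below is in Python rather than the R used for the analysis. It assumes K% is stored as a percentage (22 rather than 0.22), and both example players are invented.

```python
def is_success(player):
    """Return 1 only if every threshold is met, else 0."""
    return int(
        player["AV_K_Perc"] >= 22          # assumes K% stored as a percentage
        and player["AV_xBA"] <= 0.240
        and player["AV_EV"] <= 88
        and player["Total_Innings"] >= 80
    )

# Hypothetical players, invented to exercise the rule.
meets_all = {"AV_K_Perc": 25.0, "AV_xBA": 0.225, "AV_EV": 87.0, "Total_Innings": 120.0}
misses_one = {"AV_K_Perc": 25.0, "AV_xBA": 0.225, "AV_EV": 89.5, "Total_Innings": 120.0}

labels = [is_success(meets_all), is_success(misses_one)]
```

The second player clears three thresholds but misses on exit velocity, so he is labeled 0.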

Step 4: Use cut2 Function to Put Data Into Quantiles

As with the ‘success’ feature created above, I needed to create binary features for the majority of the other variables in the data. Before I could do so, I needed to separate the data into quantiles.

For example, the data contained players with heights ranging from 67–78 inches. To determine the optimal height, ranges are required. As such, I split the height variable into the following quantile ranges: 67–70, 71–73, 74–78. The identical process was required for ALL other variables, including each individual pitch’s metrics.
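The `cut2` function named in this step's title is from R's Hmisc package. As a rough Python stand-in for the same idea, quantile binning can be sketched with the standard library. The heights below are invented, and the exact boundary handling is an assumption that may differ from `cut2`.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical heights in inches, invented for illustration.
heights = [67, 69, 70, 71, 72, 72, 73, 74, 75, 76, 77, 78]

def quantile_bins(values, groups):
    """Assign each value to one of `groups` quantile bins (0 = lowest)."""
    cut_points = quantiles(values, n=groups)  # groups - 1 interior cut points
    return cut_points, [bisect_right(cut_points, v) for v in values]

cut_points, bins = quantile_bins(heights, groups=3)
```

With these twelve heights and three groups, each bin receives four players. Real data with many tied values will generally produce less even groups.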

Step 5: Create Binary Variables for Each Cut Quantile

The final data processing step required me to take each binned variable and create a separate binary variable for each of its quantiles. To explain this, I will use the weight variable. Before the binary creation, the weight variable assigned each observation to a specific quantile.

For example, say player A weighs 210 pounds. Since 210 falls between 200 and 224, player A has a value of [200,225) in the weight variable. In order to analyze this, I needed to create a binary variable for the [200,225) observations, named ‘weight_200_224’. If a player’s weight falls within this range, the value is 1; otherwise it is 0.

While this may seem like a simple example, this process had to be repeated many times. Take the four-seam fastball (FF) velocity metric as an example. The following variables were created for FF velocity: ff_velo_80_89, ff_velo_90_92, ff_velo_93_94, ff_velo_95_96, ff_velo_97_plus.
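Turning a binned variable into one 0/1 column per bin is a form of one-hot encoding. The Python sketch below illustrates it for weight. The bins [175, 200) and [200, 225) match ranges mentioned in the article; the outer bins are assumptions added to complete the example.

```python
# Half-open [low, high) weight bins in pounds. The two middle bins match
# ranges mentioned in the article; the outer two are assumed for illustration.
WEIGHT_BINS = [(150, 175), (175, 200), (200, 225), (225, 300)]

def weight_indicators(weight):
    """Return one 0/1 column per bin, named like 'weight_200_224'."""
    columns = {}
    for low, high in WEIGHT_BINS:
        name = f"weight_{low}_{high - 1}"
        columns[name] = int(low <= weight < high)
    return columns

player_a = weight_indicators(210)
```

For the 210-pound player A, only `weight_200_224` is set to 1 and every other weight column is 0.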

Feature Importance — Optimal Arsenal

Now that the data is split into quantiles with unique binary variables per quantile, the data is ready for analysis, leading into the first question I aimed to answer: what is the optimal arsenal for a pitcher?

By utilizing the newly created success metric, I ran a random forest algorithm to determine feature importance. For this model, I created a data.frame containing only the success metric in addition to each binary pitch type variable.

From here, I was able to determine what pitches have the highest impact on overall success. The model resulted in some pretty interesting findings, which can be found below.

Random Forest Variable Importance

In order to corroborate these observations, I created a correlation matrix, aiming to identify the variables that have the highest correlation with success. As you can see below, the matrix agreed that throwing a curveball (CB) and a changeup (CH) has the highest impact on success.

Correlation Matrix
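The correlation check (not the random forest) is easy to sketch by hand. The Python example below computes the Pearson correlation between each binary pitch indicator and the success flag. The column names `throws_cb` and `throws_sl` and all eight rows are hypothetical, chosen only to show the mechanics, and the output is not a result from the article.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; for two 0/1 columns this equals the phi coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if a column is constant

# Hypothetical 0/1 columns for eight pitchers; not real results.
success   = [1, 1, 1, 0, 0, 0, 0, 1]
throws_cb = [1, 1, 1, 0, 0, 0, 1, 1]
throws_sl = [1, 0, 1, 0, 1, 0, 1, 0]

correlations = {
    "throws_cb": pearson(throws_cb, success),
    "throws_sl": pearson(throws_sl, success),
}
ranked = sorted(correlations, key=correlations.get, reverse=True)
```

Sorting by correlation gives a simple ranking to compare against the random forest importances. A column in which every pitcher has the same value has zero variance and would need to be dropped first.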

Based on the model and correlation matrix, a pitcher who throws both a CB and a CH has the highest probability of success. With the optimal arsenal identified, I subsetted the data to pitchers who threw these pitches, resulting in the following groups:

  1. Pitchers who throw a CB, CH, FF (optimal)
  2. Pitchers who throw a CB, CH, FF, SL
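Splitting pitchers into these two groups comes down to comparing sets of pitch types. The Python sketch below assumes each group is matched on the exact arsenal (no extra pitches), which the article does not state explicitly. The pitcher names and arsenals are invented.

```python
# Hypothetical pitchers mapped to the set of pitch types they throw.
arsenals = {
    "Player A": {"FF", "CB", "CH"},
    "Player B": {"FF", "CB", "CH", "SL"},
    "Player C": {"FF", "SL"},
}

def with_exact_arsenal(arsenals, pitches):
    """Names of pitchers whose arsenal is exactly the given set of pitch types."""
    target = set(pitches)
    return sorted(name for name, thrown in arsenals.items() if thrown == target)

group_1 = with_exact_arsenal(arsenals, ["CB", "CH", "FF"])
group_2 = with_exact_arsenal(arsenals, ["CB", "CH", "FF", "SL"])
```

If the groups were meant to include anyone who throws at least these pitches, replacing `thrown == target` with `target <= thrown` (a subset test) gives that behavior instead.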

Creating the Optimal Pitching Prospect

With the arsenal groups created, I conducted a similar process using all of the binary variables created in the data processing section. Below you will find the results for each arsenal, as well as a summary of the optimal pitching prospect. If you would like to see ALL the recommended features for each arsenal type, they can be found here.

Arsenal 1: A Prospect Who Throws a FF, CB, CH (optimal)

Key Characteristics:

  • Four-seam fastball spin of 2500+ RPM
  • Four-seam fastball velocity between 93–94 MPH
  • Curveball thrown 10–14% of the time
  • Height between 67–70 inches
  • Weight between 175–199 pounds

Arsenal 2: A Prospect Who Throws a FF, CB, CH, SL

Key Characteristics:

  • Four-seam fastball spin of 2500+ RPM
  • Four-seam fastball velocity of 97+ MPH
  • Changeup velocity of 85+ MPH
  • Slider total break of 10+ inches

Conclusions

There were a lot of questions answered throughout this analysis. Here are a few findings.

  1. The impact of a high-spin fastball

This feature, above all else, should be considered the most important factor in evaluating players. If a pitcher has a four-seam fastball spin rate of 2500+ RPM, there is a good chance he will be able to succeed.

  2. Curveballs!

This can be a difficult subject. Research, including another one of my personal projects, has concluded that breaking balls increase the chance of UCL (ulnar collateral ligament) injuries.

It’s also interesting to note that while curveballs are the most important pitch, the model concluded they should only be thrown 10–14% of the time.

Improvements

This was a fun and interesting project. However, there are multiple areas where it can improve, beginning with:

  1. Data, data, data

After all the data processing, the master set contained just over 1000 variables. The main limitation here was that Statcast data has only been tracked since 2015. As such, there are only so many players to work with, limiting the overall scope of the analysis.

  2. Can’t Account for Projectability

While the goal was to use the ‘success’ metric to help evaluate amateur players, the reality is that a player at 18 years old won’t be the same player at 23 years old. A ton of development occurs throughout those crucial years, and it is not accounted for in this analysis.

  3. Height/Weight Distribution

While the height/weight data was interesting to include, it is important to note that the data stems from when the player first entered the league. As a result, these biographical features (especially weight) probably differ slightly from the player’s current measurements.

So, there we have it. I genuinely hope you enjoyed this! I really appreciate feedback, so don’t hesitate to shoot me any thoughts or recommendations. If you want to see my code, it can be found here.

-jms

Jonah M Simon

Columbia graduate student interested in machine learning, predictive modeling, and cutting-edge analytical techniques. Always learning.