Predicting SwStr using an XGBoost model

Jamin Kim
University of Illinois at Urbana-Champaign
8 min read · Jan 3, 2023
Randy Johnson’s First No Hitter on June 2nd, 1990, via MLB.com.

CSW, called strikes plus whiffs percentage, is an advanced pitching statistic that measures a pitcher’s effectiveness at earning strikes without contact, whether through called strikes (pitches that aren’t swung at) or swinging strikes (pitches that are swung at and missed). Alex Fast’s article here defines a clear relationship between SIERA (Skill-Interactive Earned Run Average) and CSW.

Through the article, we can see that CSW, compared with SwStr and whiff percentages, is the strongest indicator of SIERA. Even so, since SwStr is a component of a pitcher’s CSW, it is still worth examining on its own. Another article by Mike Podhorzer outlines the benefits of swinging strike rate further: it is one of the stronger indicators of year-over-year strikeout percentage, stronger than other indicators such as called strikes.

Backtracking a bit to SIERA, the equation for SIERA is long and complicated.

But from the equation, we can see how heavily strikeouts (Ks) are weighted. Since swinging strike rate is one of the stronger indicators of strikeouts, the link between swinging strike rate and SIERA becomes clearer.

Given the value of swinging strike rate, I chose to build a model to predict it.

Variables

For my variables, I used pitch-level measurements from Statcast. The variables I used are:

- Release speed (pitch velocity)

- Pfx_x (horizontal movement in feet)

- Pfx_z (vertical movement in feet)

- Release_position_x (horizontal placement of release point)

- Release_position_z (vertical placement of release point)

- Release_extension (distance from release point to pitching rubber)

- Release_spin_rate (the spin rate at release)

- Plate_x (location at home plate horizontally)

- Plate_z (location at home plate vertically)

I then created variables to gauge a pitcher’s “deception” qualities. I used three difference stats from Devon Wright’s article: velocity difference, horizontal movement difference and vertical movement difference. Each one measures the gap between a pitcher’s fastball, either four-seam or sinker/two-seam, and their off-speed pitches.
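To make the difference stats concrete, here is a minimal sketch that subtracts an off-speed pitch’s averages from the fastball’s. The function name, the dictionary keys, the sign convention (fastball minus off-speed) and the numbers are all assumptions made for the example; they are not taken from Devon Wright’s article or from the project code.

```python
def difference_stats(fastball, offspeed):
    """Return fastball-minus-off-speed gaps for velocity and movement.

    The sign convention is an assumption for this example.
    """
    return {
        "velo_diff": fastball["release_speed"] - offspeed["release_speed"],
        "hmov_diff": fastball["pfx_x"] - offspeed["pfx_x"],
        "vmov_diff": fastball["pfx_z"] - offspeed["pfx_z"],
    }


# Hypothetical per-pitch averages for one pitcher (mph and feet).
four_seam = {"release_speed": 95.0, "pfx_x": -0.5, "pfx_z": 1.5}
slider = {"release_speed": 85.0, "pfx_x": 0.5, "pfx_z": 0.25}

diffs = difference_stats(four_seam, slider)
print(diffs)
```

A larger gap in any of the three values means the off-speed pitch looks less like the fastball by the time it reaches the plate.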

Pre-modeling

After selecting the variables, I decided how to structure the model. I chose to split the pitcher profiles into 6 pitch types: four-seam fastballs, two-seam fastballs/sinkers, cutters, curveballs, changeups and sliders. For each pitcher and pitch type, I took the mean of each variable listed above to create a summary of that pitcher’s pitch. I then filtered out potential outliers by dropping the bottom quartile (below the 25th percentile) in pitches thrown for each pitch type. Since the cutoff for most pitch types landed around 40 to 50 pitches, with a little bit of tuning and testing I found that setting the minimum at 50 pitches for most pitch types was the best filtering method.
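The summarizing and filtering step can be sketched roughly as follows. This is a simplified illustration, not the project code: the record layout, the `summarize_pitches` name and the toy data are hypothetical, and the tiny `min_pitches=2` is only there so the example fits on a screen.

```python
from collections import defaultdict
from statistics import mean


def summarize_pitches(pitches, features, min_pitches=50):
    """Average each feature per (pitcher, pitch_type), dropping small samples."""
    groups = defaultdict(list)
    for pitch in pitches:
        groups[(pitch["pitcher"], pitch["pitch_type"])].append(pitch)

    summary = {}
    for key, rows in groups.items():
        if len(rows) < min_pitches:
            continue  # too few pitches to trust the averages
        summary[key] = {f: mean(row[f] for row in rows) for f in features}
        summary[key]["n_pitches"] = len(rows)
    return summary


# Toy data: pitcher A throws three four-seamers, pitcher B only one.
pitches = [
    {"pitcher": "A", "pitch_type": "FF", "release_speed": 94.0},
    {"pitcher": "A", "pitch_type": "FF", "release_speed": 95.0},
    {"pitcher": "A", "pitch_type": "FF", "release_speed": 96.0},
    {"pitcher": "B", "pitch_type": "FF", "release_speed": 91.0},
]

summary = summarize_pitches(pitches, ["release_speed"], min_pitches=2)
print(summary)
```

Pitcher B is dropped because a single pitch says very little about what his four-seamer usually looks like.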

After filtering, I created a training set and a testing set for each pitch type, using 70% of the data for training and 30% for testing.
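A 70/30 split like this one can be sketched in a few lines. The helper name, the fixed seed and the placeholder data are assumptions for the example; the project itself may have split the data differently.

```python
import random


def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and testing lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]


profiles = list(range(100))  # stand-ins for 100 pitcher summaries
train, test = train_test_split(profiles)
print(len(train), len(test))
```

Running the split once per pitch type keeps each model’s test set limited to pitches of that type.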

Modeling

I first created a few baseline models, both to evaluate the effectiveness of the XGBoost model later on and to make sure that the variables chosen above do in fact predict SwStr rate. For each baseline, I fit a linear model on the training data set with all of the variables listed above as covariates predicting swinging strike rate. To evaluate the baseline models, I computed the RMSEs and standard deviations on the testing data sets.
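One way to read an RMSE next to a standard deviation: a model that always predicts the mean SwStr has an RMSE equal to the (population) standard deviation of the actual values, so an RMSE below that number means the covariates are adding information. The sketch below shows the comparison with made-up numbers; it assumes the population form of the standard deviation, which may not be the form used in the project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def std_dev(values):
    """Population standard deviation: the RMSE of always guessing the mean."""
    avg = sum(values) / len(values)
    return rmse(values, [avg] * len(values))


# Made-up swinging strike rates for four pitchers in a test set.
actual = [0.08, 0.10, 0.12, 0.14]
predicted = [0.09, 0.10, 0.11, 0.13]

model_rmse = rmse(actual, predicted)
baseline = std_dev(actual)
print(model_rmse < baseline)
```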

RMSEs from the Baseline Models

As you can see, the RMSEs and standard deviations are already fairly low. This indicates that the data we use has predictive power. The next step was to create a more predictive model using XGBoost.

Black Box Modeling

For my XGBoost models, I tuned each separate model’s parameters and found the lowest RMSEs from multiple rounds of boosting the trained model.
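To show what “finding the lowest RMSE across rounds of boosting” means, here is a from-scratch sketch. It is not XGBoost: it is plain squared-error gradient boosting with one-split trees (stumps), written by hand so the example has no dependencies, and it leaves out the regularization and deeper trees that XGBoost provides. The data is synthetic and every name in it is hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def fit_stump(X, residuals):
    """Pick the one feature/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so the right side is never empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, row):
    """Return the stump's leaf value for one row of features."""
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def boost(X_train, y_train, X_test, y_test, rounds=50, learning_rate=0.3):
    """Fit stumps to residuals round by round; return test RMSE per round."""
    base = sum(y_train) / len(y_train)
    train_pred = [base] * len(y_train)
    test_pred = [base] * len(y_test)
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(y_train, train_pred)]
        stump = fit_stump(X_train, residuals)
        train_pred = [p + learning_rate * predict_stump(stump, row)
                      for p, row in zip(train_pred, X_train)]
        test_pred = [p + learning_rate * predict_stump(stump, row)
                     for p, row in zip(test_pred, X_test)]
        history.append(rmse(y_test, test_pred))
    return history


def synthetic_swstr(velo, vmov):
    """Made-up relationship used only to generate example data."""
    return 0.05 + 0.005 * (velo - 90) + 0.03 * (vmov - 1.0)


# Synthetic features: [velocity in mph, vertical movement in feet].
X_train = [[90 + i % 8, 1.0 + (i % 5) * 0.1] for i in range(40)]
X_test = [[90 + i % 8, 1.0 + ((i + 2) % 5) * 0.1] for i in range(0, 40, 3)]
y_train = [synthetic_swstr(*row) for row in X_train]
y_test = [synthetic_swstr(*row) for row in X_test]

history = boost(X_train, y_train, X_test, y_test)
best_round = min(range(len(history)), key=history.__getitem__) + 1
print(best_round, history[best_round - 1])
```

The same idea carries over to a real boosting library: track the test (or validation) RMSE after every round and keep the round where it bottoms out.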

Fastball Profiles:

xSwStr- Top Ten Four-Seam Fastballs

Minimum 350 Four-Seam Fastballs Thrown

xSwStr- Top Ten Sinkers

Minimum 350 Sinkers Thrown

xSwStr- Top Ten Cutters

Minimum 350 Cutters Thrown

Off-speed Profiles:

For off-speed pitches, since they’re more commonly used in two-strike counts, I added two-strike usage percentage as a predictor variable as well. To create it, I divided the number of times a pitcher threw a certain pitch in a two-strike count by the total number of pitches he threw in two-strike counts. In total, I added two-strike usage percentage, velocity difference, horizontal movement difference and vertical movement difference to the off-speed models.
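The two-strike usage calculation for a single pitcher can be sketched like this. The record layout and function name are hypothetical, and the pitches are made up.

```python
from collections import Counter


def two_strike_usage(pitches):
    """Share of each pitch type among pitches thrown with two strikes."""
    counts = Counter(p["pitch_type"] for p in pitches if p["strikes"] == 2)
    total = sum(counts.values())
    if total == 0:
        return {}  # the pitcher never reached a two-strike count
    return {pitch_type: n / total for pitch_type, n in counts.items()}


# Made-up pitches from one pitcher; "strikes" is the count before the pitch.
pitches = [
    {"pitch_type": "FF", "strikes": 0},
    {"pitch_type": "SL", "strikes": 2},
    {"pitch_type": "SL", "strikes": 2},
    {"pitch_type": "FF", "strikes": 2},
    {"pitch_type": "CH", "strikes": 2},
    {"pitch_type": "SL", "strikes": 1},
]

usage = two_strike_usage(pitches)
print(usage)
```

Here the slider accounts for two of the four two-strike pitches, so its two-strike usage is 0.5.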

xSwStr- Top Ten Sliders

Minimum 350 Sliders Thrown

xSwStr- Top Ten Changeups

Minimum 350 Changeups Thrown

xSwStr- Top Ten Curveballs

Minimum 200 Curveballs Thrown

The RMSEs for the XGBoost models:

Interpretive Modeling on the Four-seam fastball

There’s no doubt that the black box model is going to be the most accurate at predicting swinging strike rates. However, for interpretation’s sake, the XGBoost model falls short in that we can’t fully see what it is telling us. We can see glimpses of what’s important to the XGBoost model through the importance plots.

The graph gives the weight of each variable but not any relationships between the variables. This is not to say that the XGBoost model didn’t take interaction effects into account, but rather that we can’t interpret them from the importances given. To look further into interaction effects, I added all combinations of interactions to the main effects model and ran a backwards stepwise regression with cross-validation to whittle the terms down to those that remained useful predictors.
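As a rough illustration of what adding interaction terms looks like, the sketch below multiplies every pair of features together and appends the products as new columns. It only covers two-way interactions, which is an assumption about what “all combinations” means here, and it does not attempt the backwards stepwise selection itself. The names and values are hypothetical.

```python
from itertools import combinations


def add_pairwise_interactions(row, features):
    """Return a copy of row with a product term for every pair of features."""
    expanded = dict(row)
    for a, b in combinations(features, 2):
        expanded[f"{a}:{b}"] = row[a] * row[b]
    return expanded


# Made-up four-seam summary for one pitcher.
row = {"release_speed": 95.0, "pfx_z": 1.5, "plate_z": 2.0}
expanded = add_pairwise_interactions(row, ["release_speed", "pfx_z", "plate_z"])
print(expanded)
```

With 9 pitch variables this produces 36 two-way terms, which is why some form of selection is needed afterwards.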

The resulting variables for the fastball.

From Trevor Power’s article here, we can see a lot of overlap between the variables selected and what makes a fastball more valuable in terms of swings and misses. Vertical movement is a key component of a fastball that draws swings and misses, and the key components in creating vertical movement are spin efficiency and spin axis. Spin efficiency is the percentage of the spin that contributes to the pitch’s movement. It gives the pitch the “life,” or movement, that tends to cause more whiffs.

Spin axis is a product of the arm slot, wrist angle and orientation of the fingers at release, and it shapes the pitch’s movement profile. If a pitcher has a spin axis that maximizes backspin, it’s more likely that the pitcher has a higher spin efficiency. From this article by Driveline Baseball, we can see that spin direction, along with horizontal and vertical movement, changes with the arm slot.

Graphic by Driveline Baseball

For arm slots, a pitcher throwing over the top at a 12:00 axis will more likely have vertical movement, while a pitcher that throws on a sidearm axis will have more horizontal movement.

Chris Sale’s sidearm has greater horizontal movement than Justin Verlander’s 12:00 arm slot on 4-Seam fastballs

The variables selected from the stepwise regression show that it values arm-slot, location, vertical movement and velocity. Although spin isn’t outright listed by the model, the surrounding variables chosen show that the model favors pitchers with an arm-slot that produces vertical movement.

Comparing this with the variable importances from the XGBoost model, we can see that the two models value similar variables.

Pearson’s correlation coefficient (r = .580)
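For reference, Pearson’s r between two sets of predictions can be computed as in the sketch below. The prediction lists are made-up placeholders, not the actual model outputs behind the r = .580 figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical predicted SwStr for the same four pitchers from each model.
xgb_preds = [0.09, 0.10, 0.11, 0.12]
interp_preds = [0.13, 0.10, 0.11, 0.12]

r = pearson_r(xgb_preds, interp_preds)
print(round(r, 3))
```

A value near 1 would mean the two models rank pitchers almost identically, while a moderate value leaves room for pitchers that one model likes far more than the other.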

From this we can see a few pitchers that one model values more than the other. The interpretive model values Jacob deGrom much more than the XGBoost model does: it puts his SwStr at 0.128, while the XGBoost model has him at 0.091. His actual SwStr was 0.128. What’s also interesting is that Jacob deGrom, from the list above, has the second highest SwStr from the interpretive model.

Top Ten Interactive Model Four-Seam Fastballs

Top Ten Highest SwStr from the interaction terms model

Questions-

1. Why was the fastball the only pitch looked into with interpretive modeling?

The fastball is unique in that higher spin efficiency matters more, because backspin is usually preferred. For other pitches, spin efficiency matters less, since gyro spin or side spin can take over to create more horizontal movement. Therefore, I only looked deeper into the four-seam fastball for this project.

2. Chris Sale was one of the most effective pitchers of the 2010s, with a high K% and a good swinging-strike rate. So why would pitchers similar to him have a lower swinging-strike rate from the interpretive model?

Although Chris Sale didn’t meet the criteria of pitches thrown to be in the models, the interpretive model most likely would have undervalued his swinging strike rate. This exposes a drawback of the interpretive model: you could look at it and say it undervalues a four-seam fastball with horizontal movement. However, pitchers similar to Chris Sale, with a lower arm slot that generates greater horizontal movement, produce more groundballs than swinging strikes. So although the model would undervalue Chris Sale’s SwStr, it excels at valuing a certain prototype of fastball.

Final Thoughts

This project was originally supposed to be about creating an expected SwStr through an XGBoost model, on the notion that it would be the most accurate. I still see the value in the XGBoost model, but this project has made me realize the benefits of interpretive modeling for a specific feature. Building upon this, I hope to look further into comparing interpretive modeling and a black box model for other pitch types. The four-seam fastball was the only pitch chosen because there’s a certain type of fastball that’s valued more across all pitchers. But I believe that it’s possible to look into interaction effects between variables for a specific prototype of pitcher, e.g. a sidearm pitcher that throws a hard slider, rather than every pitcher across the league.

Inspiration

This article was originally inspired by Devon Wright’s article in which he created an expected CSW stat. I wanted to add onto this and create an expected swinging strike rate stat and hopefully an expected called strike stat later on to ultimately create a stronger expected CSW stat.

If you have any constructive criticisms or questions, input would be appreciated at jaminkim0504@gmail.com

Link to the GitHub code.
