Creating a Grade on Pitch Quality
Website: pitching-plus.streamlit.app/
Disclaimer: All data came from Baseball Savant
Introduction
Pitching is one of the most important aspects of baseball; it's what drives the game. To track a pitcher's "success," counting and rate stats such as ERA, FIP, and K/9 were derived. These stats provide useful context, but they are results-based: they don't look at the overall quality of a pitcher's pitches, only the outcomes those pitches produced. They don't capture a pitcher throwing two good sliders off the plate, one of which was chased, then a hard sinker in on the hands that created a broken-bat groundout. The minute details of speed differences, changes in eye level, and in some cases where the batter stands are not considered by these stats. That is where the inspiration for the Pitching+ metric comes in. I drew inspiration from Fangraphs' pitching models while developing my own, using them as a reference to validate and refine my approach.
Methodology
I decided to tackle this by breaking it into two different pitch quality metrics, then combining the two into one. These two metrics are Stuff+ and Location+.
Stuff+ looks at the characteristics of a particular pitch: its horizontal and vertical movement, velocity, spin rate, release position, and extension. This metric only cares about the pitch before it reaches the plate; it doesn't look at location at all. With some advice, I used a pitcher's fastball speed/break averages and took the difference between those and each of their other pitches to quantify a pitch's deceptiveness.
Location+ looks at the characteristics that Stuff+ doesn’t look at. These include the location of a pitch, the count, and where the batter stands in the box.
After breaking down the metrics, I needed to figure out what to train my model against. Public models I've seen use run value. After some trials computing that metric on my own, I decided it probably wasn't worth building, especially once I was advised to try whiffs. While the objective of a pitcher is to prevent runs, runs carry a lot of context that detracts from the pitch itself. Run value stems from run expectancy, which depends on the base-out situation, and it is affected by factors that don't necessarily reflect the quality of a pitch, such as an error or a stolen base. While those outcomes are rare, I still wanted to keep things simple, so whiffs were used instead. Whiffs are a good target because any pitcher can produce one: a good pitch thrown in a good spot should have a good chance of being swung on and missed, regardless of the type of pitcher.
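Building the whiff target from Baseball Savant data can be sketched as below. The `description` values are my assumption based on common Statcast export conventions; check them against your own download.

```python
import pandas as pd

# Statcast `description` values that count as a swing-and-miss (assumed names;
# verify against your Baseball Savant export).
WHIFF_DESCRIPTIONS = {"swinging_strike", "swinging_strike_blocked"}

def add_whiff_label(pitches: pd.DataFrame) -> pd.DataFrame:
    """Add a binary `whiff` column: 1 if the batter swung and missed, else 0."""
    pitches = pitches.copy()
    pitches["whiff"] = pitches["description"].isin(WHIFF_DESCRIPTIONS).astype(int)
    return pitches
```

With a label like this, the problem becomes straightforward binary classification on a heavily imbalanced dataset (roughly 11% positives, per the averages discussed below).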
The last thing to consider is how to calculate the metric. Some models divide the predicted probability by the league average and multiply by 100. For example, if a fastball has a 30% chance of producing a whiff and 11% of all pitches are whiffs, then the value of that pitch is ~273: (.30 / .11) × 100. I didn't want such a wide spread of values, so I normalized instead by computing z-scores and fiddling with the standard deviation. A standard deviation of 15 gave a nice enough spread, so I left it at that. I then added 100 to put it on a scale where 100 is the average pitch.
Pitching+ = (z-score of predicted whiff probability × 15) + 100
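The scaling step above is small enough to show in full. This is a minimal sketch of the formula as described, applied to a batch of predicted probabilities:

```python
import numpy as np

def pitching_plus(whiff_probs, scale=15.0):
    """Convert predicted whiff probabilities to the Pitching+ scale:
    z-score each probability, stretch by `scale`, and center at 100."""
    p = np.asarray(whiff_probs, dtype=float)
    z = (p - p.mean()) / p.std()  # z-score within the batch
    return z * scale + 100.0
```

By construction the output has mean 100 and standard deviation 15, so a pitch at 115 is one standard deviation better than average.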
Building the Model
So far I've explained that I split the model into two metrics before combining the results, the features I used, and how the metric is calculated. It's time to feed the model. Because of their differing characteristics, I split my data into three groups to improve the model's confidence: Fastballs (4-seam and Sinker), Breaking (Curveball, Sweeper, Slider, Cutter, Slurve, and Knuckle Curve), and Off-speed (Changeup and Splitter). Next, I considered which model to use. I knew this was a classification problem and I needed something that works well with large datasets, so I went with XGBoost.
The dataset is really large (over a million observations, almost 2 GB), since it consists of every pitch thrown in the regular season from 2019–2022. I then tuned my model using Hyperopt. I won't go into detail about how it works, but you can find the documentation here: http://hyperopt.github.io/hyperopt/#documentation. It's pretty useful and fast at finding optimal parameter values if you don't want to use grid search.
Some notes to consider for my model (If you care more about the Pitching+ metric results, you can skip this part):
- Min Child Weight: Hovers around 8, which means each leaf node needs at least 8 pitches.
- Tree Pruning: Gamma controls the minimum loss reduction required to keep a split; it settled around 1, which is normal. If a node's gain is less than gamma, that node is pruned (removed).
- Lambda: Lambda landed around 0.02–0.1; it shrinks the similarity score as a form of regularization. These values imply a low level of regularization.
- Max Depth: Maximum tree depth was around 18, meaning trees can be at most 18 levels deep. This is a little risky as it increases the chance of overfitting, but it was kept in check by lambda and gamma, which limit tree complexity indirectly via pruning.
- Eta (learning rate): This value was not tuned and was kept at the default of 0.3. Tuning it didn't have a significant effect on the model, so the model is able to learn rather quickly.
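Taken together, the notes above correspond roughly to a parameter set like the following. The exact numbers are illustrative, picked from the middles of the ranges described:

```python
# Approximate tuned configuration implied by the notes above (illustrative values).
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "min_child_weight": 8,   # each leaf needs ~8 pitches
    "gamma": 1.0,            # minimum loss reduction to keep a split
    "reg_lambda": 0.05,      # light L2 regularization
    "max_depth": 18,         # deep trees, reined in by gamma/lambda pruning
    "eta": 0.3,              # default learning rate, not tuned
}
```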
Model Results
Since I used 3 different datasets based on pitch type, I split them into three sections for readability.
- Note: I used log loss as my evaluation metric, with the goal of minimizing it as much as possible. I didn't really use any other error metric, which could be a weakness of this model. Think of log loss as how close the model's predicted probability was to the actual outcome: whiff (1) or non-whiff (0). The goal is to get log loss as close to 0 as possible.
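For concreteness, here is the binary log loss written out directly. It is simply the average negative log-likelihood of the true labels under the predicted probabilities:

```python
import numpy as np

def binary_log_loss(y_true, p):
    """Average negative log-likelihood of true labels under predicted
    whiff probabilities; lower is better, 0 is a perfect model."""
    y = np.asarray(y_true, dtype=float)
    # clip to avoid log(0) for overconfident predictions
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

A model that is confidently right scores near 0, while a model that hedges every pitch at 50/50 scores about 0.69 (−ln 0.5), so the 0.25–0.38 values below sit comfortably better than a coin flip.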
Fastballs
Log Loss: 0.247
ROC Curve:
Feature Importance:
Breaking Pitches
Log Loss: 0.359
ROC Curve:
Feature Importance:
Off-speed Pitches
Log Loss: 0.383
ROC Curve:
Feature Importance:
Overall, I'm happy with the results of these metrics. The AUCs hover around .75, which means my model is reasonably good at separating whiffs from non-whiffs. While my model could be improved with better features, or maybe by removing some and trying other models, I do think the results are pretty good. Here's a statistical summary of each pitch.
Another thing I wanted to consider is the year-to-year correlation to see if this model is sustainable at predicting future seasons and making sure it wasn’t overfitting to one particular season. I used 2022 predicted whiff% and matched it up with 2023 actual whiff%. Here are the results.
This model correlates moderately season over season, with an R² value of 0.32. Not great, but it's somewhat sustainable and will vary season to season. (For reference, Fangraphs reports an R² of 0.42 for 2021–2022, so not too far off.)
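The year-to-year check above boils down to one computation. For a single-predictor least-squares fit, R² equals the squared Pearson correlation, so it can be sketched as:

```python
import numpy as np

def year_to_year_r2(pred_prev, actual_next):
    """R² between one season's predicted whiff% and the next season's
    actual whiff% (one value per pitcher). For a simple linear fit,
    R² is the squared Pearson correlation coefficient."""
    x = np.asarray(pred_prev, dtype=float)
    y = np.asarray(actual_next, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2
```

Feeding it pitcher-level 2022 predictions and 2023 actuals reproduces the 0.32 figure quoted above.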
Leaderboards
Now, the moment you all have been waiting for. I have compiled a leaderboard for each pitch type for 2023. For simplicity, I will limit it to the 6 pitches (Fastball, Sinker, Slider, Cutter, Curveball, Changeup). Note: Minimum pitches thrown are 200 for that particular pitch.
Four Seam Fastballs
Sinkers
Slider
Cutter
Curveball
Changeup
The last thing I'll show is the best pitch thrown in the 2023 season. It was a low sinker thrown by Sean Hjelle to strike out Jose Azocar. It had a Pitching+ value of 201, which on this scale is roughly 6.7 standard deviations above the average pitch ((201 − 100) / 15). Pretty filthy. Here's a video from the MLB Film Room.
This model isn't perfect by any means and shouldn't be interpreted as the end-all be-all of a pitcher. All this metric does is distill a set of pitch characteristics into one number. Some pitchers, like Emmanuel Clase, graded out as average because my model didn't find their pitches effective enough at producing a whiff. In Clase's case, my model isn't too wrong about him being average: his Savant profile shows he was only a little above average at generating whiffs (61st percentile). He was productive through other means that season, such as drawing soft ground-ball contact. You could also argue that he throws pitches the model doesn't favor, like cutters, which is a fair argument, but as I said before, the model only cares about which qualities of a pitch produce a whiff. Cutters may not have been the best fit in the breaking category, so maybe that's something you could try out on your own.
Conclusion
There are a lot more things I could show about my results that I decided to leave out for the sake of this article's length. I hope you were able to take something away from this, whether it's the model itself or learning about a new baseball metric. My methodology isn't perfect; no model is. I personally learned a lot about classification models, imbalanced datasets, and feature engineering. I wish it were a little more predictive for players, but there are variables my model doesn't account for, such as a change in pitch strategy or batters adjusting from the previous year. This model is much better at predicting a whiff on an individual pitch, so I'm not too concerned about its predictive ability for particular players.
Overall, I'm pretty satisfied with these results. My player leaderboards are pretty similar to other public models (such as Fangraphs), so it's on the right track. There are other areas for improvement, such as the model itself, parameter tuning, or even the features themselves. I could also change the target to run value, as that has higher variability and could be a better representation for a model. Consider this version 1.0 of the model, as I may come back and tweak areas that could improve it if I find any.
Feel free to comment or reach out to me on Twitter if you think there is something I could try to implement into the model. My handle is @da_mountain_dew. I also have a website that will have updated leaderboards for the 2024 season if you want to check that out as well. Link at the top of the article.