Predicting Baseball Pitches
Utilizing Machine Learning to Predict The Next Pitch
Introduction
The task of hitting a baseball at the major league level is extremely difficult: consider that the benchmark for a great hitter is a batting average of .300, which is only a 30% success rate. If the batter knew what pitch the pitcher was likely to throw next, his chances of success could improve significantly. Much of the current analysis does a poor job of telling the story of the game through data. Pitchers are more predictable than people realize; people just aren't using the proper tools to measure them.
Based on data about the current pitcher's prior tendencies, the batter's prior tendencies, the game state, and the last n pitches in chronological order, we use both techniques laid out in previous studies and our own novel methods to build machine learning models capable of predicting the next pitch.
Our work is intended to form the foundation for further study and, eventually, a consumer application that encapsulates the pitch prediction functionality, integrates it with other information about the current game, pitcher, and batter, and presents all of this within an attractive user interface. Design decisions were made with this eventuality in mind, and many components were designed to be generalizable and to scale with elastic resources.
Literature Review
We scoured the web for relevant work and found numerous studies and posts relating to the problem of predicting the next pitch, but one stood out.
Predicting the Next Pitch
By: Gartheeban Ganeshapillai and John Guttag, Massachusetts Institute of Technology
Using data compiled from an MLB STATS Inc. dataset, the authors trained pitcher-specific linear support vector machines on input vectors of 48 features from the 2008 season, with a binary label indicating the type of pitch. The models were tested on data from the 2009 season. A total of 359 individual pitchers were examined.
Features used included the previous pitch, the count, the current score differential, the current base runners, and information about the pitcher's tendency (a prior) to throw a pitch with a given property in various situations.
The previous actions of the pitcher on the specific count, and the pitcher's conduct against the specific batter, were found to be some of the most important features. An engineered feature referred to as the "shrunk prior" was also found to be of prime importance.
A majority-class baseline accuracy of 59.5% on average was used to assess the performance of the model, which achieved an average accuracy of 70%, an average improvement of 18% on a per-pitcher basis.
A suggestion for a future improvement was to add features pertaining to the batter who was on deck.
Data Gathering
Pitch data was collected from Statcast via the pybaseball package. Pybaseball's statcast function allows us to query pitches by date range. After identifying the start and end dates of the 2016, 2017, 2018, and 2019* regular seasons, we were able to pull in pandas DataFrames containing every pitch for each regular season. These DataFrames were then subjected to rudimentary data cleaning, described further in the Data Cleaning and Feature Engineering section of this paper. After cleaning, the DataFrames were compressed and stored in GitHub as pickle files.
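As a rough illustration, pulling a single season with pybaseball might look like the sketch below; the date strings are approximate and the output file name is our own choice.

```python
# A minimal sketch of the data pull, assuming pybaseball is installed and the
# illustrative 2018 regular-season dates below are close enough for demonstration.
import pandas as pd
from pybaseball import statcast

# Query every pitch between the (approximate) start and end of the 2018 regular season.
pitches_2018 = statcast(start_dt="2018-03-29", end_dt="2018-10-01")

# Compress and persist the raw pull so later steps don't re-hit the Statcast API.
pitches_2018.to_pickle("statcast_2018.pkl", compression="gzip")
```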
With future project developments in mind, the data, in addition to being compressed and stored in the main project GitHub repository as pickle files, was also stored in an S3 bucket as both pickle files and CSVs. These CSV files were then used to create a Redshift table containing all regular season pitches from 2010–2019. This storage architecture is revisited in the following sections.
Data Cleaning and Feature Engineering
As is often the case in data science, the majority of the time spent on this project was devoted to data cleaning, processing, and feature engineering. We built on the procedures and methods of previous iterations of similar projects. First, we took a high-level view of the project and tried to lay out a blueprint of all the necessary steps and how we wanted to transform the raw pitch data from Statcast into the format we wanted for the input vectors to our machine learning models. The main steps can be summarized as follows:
General Data Cleaning
Without going too far into the details, we encountered a few glitches in the data during the process of data wrangling and feature engineering, so we added some code up front to take care of most of them. This included things like weeding out a small number of games played overseas without the Statcast camera tracking system installed, replacing unknown pitch types with NaN values, and fixing the number of balls from 4 to 3 in a few instances.
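A minimal sketch of these upfront fixes, assuming the standard Statcast column names (game_pk, pitch_type, balls), an assumed code for unknown pitch types, and a hypothetical list of overseas game IDs:

```python
# Basic upfront cleaning; the "UN" code for unknown pitch types and the set of
# overseas game ids are assumptions for illustration.
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame, overseas_game_ids: set) -> pd.DataFrame:
    df = df[~df["game_pk"].isin(overseas_game_ids)].copy()  # drop games without camera tracking
    df["pitch_type"] = df["pitch_type"].replace({"UN": np.nan, "": np.nan})  # unknown pitch types -> NaN
    df.loc[df["balls"] > 3, "balls"] = 3  # a pitch can never be thrown on a 4-ball count
    return df
```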
Game State Features
Here, we used the balls and strikes features to create a feature representing the current count for that at-bat, and then mapped the count into three categories representing whether the count favored the pitcher, favored the batter, or was neutral. It stands to reason that pitchers approach these situations differently, and their choice of pitch should vary considerably with the favorability of the count.
Next, we created a feature representing the score differential, i.e., how many runs ahead or behind the pitcher's team was. Then, using the baserunner data, we converted the baserunner features from baserunner IDs to binary 1/0 indicators of whether a runner was on each base. We also added a feature for whether any runner was on base versus bases empty, and a feature for whether the bases were loaded. A runner on first base has the potential to steal and advance, so perhaps some pitchers are less likely to throw breaking balls in the dirt that could get away from the catcher and allow the runner to advance into scoring position. A runner on second base with a clear view of the catcher's signals might steal the sign and tip off the batter, which could also influence a pitcher's choice of pitches. Perhaps some of these tendencies could help a machine learning model make predictions. A sketch of these game-state features is shown below.
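A sketch of the game-state features, assuming the standard Statcast columns; the exact count-to-category mapping shown is illustrative rather than the one we settled on:

```python
# Game-state features: count category, score differential, and baserunner flags.
import pandas as pd

PITCHER_AHEAD = {"0-1", "0-2", "1-2", "2-2"}          # illustrative mapping choice
BATTER_AHEAD = {"1-0", "2-0", "3-0", "2-1", "3-1"}

def add_game_state_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["count"] = df["balls"].astype(int).astype(str) + "-" + df["strikes"].astype(int).astype(str)
    df["count_cat"] = df["count"].map(
        lambda c: "ahead" if c in PITCHER_AHEAD else ("behind" if c in BATTER_AHEAD else "neutral")
    )
    # Score differential from the pitcher's perspective (the home pitcher works the top half).
    home_diff = df["home_score"] - df["away_score"]
    df["score_diff"] = home_diff.where(df["inning_topbot"] == "Top", -home_diff)
    # Baserunner ids -> binary on/off indicators.
    for base in ["on_1b", "on_2b", "on_3b"]:
        df[base + "_flag"] = df[base].notna().astype(int)
    runners = df[["on_1b_flag", "on_2b_flag", "on_3b_flag"]].sum(axis=1)
    df["any_on"] = (runners > 0).astype(int)
    df["bases_loaded"] = (runners == 3).astype(int)
    return df
```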
Strike Zone and Batting Features
Using the location data from the Statcast camera system, both the boundaries of the strike zone for that particular hitter (based on his height and stance) and the dimensions of home plate, we created a feature that classified each pitch as in or out of the strike zone (regardless of whether the umpire called it a ball or a strike). Also, using the description feature, we created a binary feature representing whether or not the batter swung at the pitch. By combining these engineered features, we created a feature representing whether or not the hitter "chased" a pitch outside of the strike zone. We thought this would be useful later when creating our scouting report on each batter: if a batter has a tendency to chase certain pitch types outside the strike zone, that information may be known by the pitcher in advance of the matchup, and he may be more likely than otherwise to throw those pitch types against that player.
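A sketch of the in-zone, swing, and chase features, assuming the Statcast plate_x, plate_z, sz_top, sz_bot, and description columns (and ignoring the radius of the ball at the plate edges):

```python
# Strike-zone, swing, and chase flags derived from pitch location and description.
import pandas as pd

HALF_PLATE_FT = 17 / 2 / 12  # home plate is 17 inches wide -> half-width in feet

SWING_DESCRIPTIONS = {
    "swinging_strike", "swinging_strike_blocked", "foul", "foul_tip",
    "hit_into_play", "hit_into_play_no_out", "hit_into_play_score",
}

def add_zone_swing_chase(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["in_zone"] = (
        df["plate_x"].abs().le(HALF_PLATE_FT)
        & df["plate_z"].between(df["sz_bot"], df["sz_top"])
    ).astype(int)
    df["swung"] = df["description"].isin(SWING_DESCRIPTIONS).astype(int)
    df["chased"] = ((df["swung"] == 1) & (df["in_zone"] == 0)).astype(int)
    return df
```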
Pitch Category and Null Value Imputation
Unfortunately, a very small percentage of the pitches in the Statcast database, perhaps due to random glitches in the camera system or some other reason, had a missing pitch type classification. Rather than simply dropping those rows from the dataframe, in order to preserve the continuity of pitches on a game-by-game basis, we decided to impute these values. Our imputation strategy was to use the overall distribution of pitch types for a given pitcher and make a random guess, using that distribution as the weights. Next, we mapped the different pitch types into more general pitch type categories: fastballs (4-seam, 2-seam, cutter, sinker), breaking balls (curveball, slider, knuckle curve, screwball), and offspeed pitches (change-up, knuckleball, and eephus). A sketch of both steps is shown below.
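A sketch of the weighted random imputation and the category mapping, assuming the standard Statcast pitch_type codes:

```python
# Impute missing pitch types from each pitcher's observed mix, then map to categories.
import numpy as np
import pandas as pd

PITCH_CATEGORY = {
    "FF": "fastball", "FT": "fastball", "FC": "fastball", "SI": "fastball",
    "CU": "breaking", "SL": "breaking", "KC": "breaking", "SC": "breaking",
    "CH": "offspeed", "KN": "offspeed", "EP": "offspeed",
}

def impute_and_categorize(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    df = df.copy()
    for _, group in df.groupby("pitcher"):
        missing = group.index[group["pitch_type"].isna()]
        dist = group["pitch_type"].value_counts(normalize=True)  # pitcher's observed pitch mix
        if len(missing) == 0 or dist.empty:
            continue
        df.loc[missing, "pitch_type"] = rng.choice(dist.index, size=len(missing), p=dist.values)
    df["pitch_cat"] = df["pitch_type"].map(PITCH_CATEGORY)
    return df
```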
Batter Specific Features
Based on the intuition that most teams at the major league level create a scouting report on their opponents, we wanted to recreate that using data from the Statcast pitch database. In addition to velocity, location, break distance and angle, and various other data related to the pitch itself, the Statcast data includes some useful features regarding balls put into play off the hitter's bat. This data includes the launch speed and the angle/trajectory of the ball off the bat into the field of play. It also includes estimated values for batting average, wOBA (weighted on-base average), and ISO (a power statistic) based on aggregate data for balls hit into play with a similar speed and angle/trajectory. Sometimes a batter gets unlucky and hits a line drive squarely at a defensive player, who makes the catch; if the ball had been a few feet in either direction, it would have resulted in a base hit or an extra-base hit. These estimated statistics smooth out the variance of hitting a ball directly at a fielder and provide a more accurate representation of how often a hitter makes solid contact on a pitch and the percentage of the time that contact would result in a base hit or the batter reaching base.
For each hitter, we filtered the dataframe for all pitches that hitter faced and recorded the overall percentage of pitches faced in each pitch type category, thinking that if a batter faces an unusually high percentage of a certain pitch category, perhaps there is an underlying reason pitchers are doing that against him. We also aggregated the estimated batting average, wOBA, and ISO values for balls hit into play in each pitch category. Next, we computed the batter's taken-strike percentage for each pitch category (how often a pitch was in the strike zone and the batter chose not to swing), as well as how often the batter chased a pitch outside the strike zone. Lastly, based on the ratio of balls put into play (rather than missed or fouled) to swings at each pitch category, we created a "ball in play swung" percentage for each pitch category.
The goal of all of these batter scouting report features was essentially to capture any strengths or weaknesses a particular hitter has against the different pitch type categories, information that would be known in advance of the matchup and would replicate a real-world scouting report that influences the pitcher's choice of pitch against a given hitter.
In order to prevent data leakage, where future/unknown data about a hitter's tendencies is used to create these aggregates, the batter scouting report was created iteratively, month by month. The initial seed statistics were computed from 2017 pitch data; then, for each month of the regular season from 2018 through the end of August 2019, we calculated the batter scouting report for all pitches in that month based only on prior information. The statistics for that month were then concatenated with the prior data and included in the prior data for the next month's calculations. Also, in order to lessen the effect of outliers due to small sample sizes, if a batter had not faced 100 pitches of a given pitch category, we assigned NaN values to that hitter's scouting report for that category. Later, we used aggregate data based on that batter's position in the batting order to impute those NaN values, figuring that batters in the same batting order slot were a close enough approximation. A sketch of this iterative scheme is shown below.
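A sketch of the iterative, leakage-free build, assuming a pitch-level dataframe that already contains the engineered chased/swung flags and the Statcast estimated wOBA column; the aggregates shown are a subset of the full scouting report:

```python
# Month-by-month scouting-report build: each month's features use only prior pitches.
import numpy as np
import pandas as pd

def build_batter_reports(df: pd.DataFrame, seed_year: int = 2017) -> pd.DataFrame:
    df = df.copy()
    df["game_date"] = pd.to_datetime(df["game_date"])
    df["month"] = df["game_date"].dt.to_period("M")

    prior = df[df["game_date"].dt.year == seed_year]  # seed statistics from 2017
    out = []
    for month in sorted(df.loc[df["game_date"].dt.year > seed_year, "month"].unique()):
        current = df[df["month"] == month]
        stats = (
            prior.groupby(["batter", "pitch_cat"])
                 .agg(chase_rate=("chased", "mean"),
                      swing_rate=("swung", "mean"),
                      est_woba=("estimated_woba_using_speedangle", "mean"),
                      n_pitches=("pitch_cat", "size"))
                 .reset_index()
        )
        # Small samples (< 100 pitches of that category) become NaN and are imputed later.
        small = stats["n_pitches"] < 100
        stats.loc[small, ["chase_rate", "swing_rate", "est_woba"]] = np.nan
        out.append(current.merge(stats, on=["batter", "pitch_cat"], how="left"))
        prior = pd.concat([prior, current])  # this month joins the prior data for next month
    return pd.concat(out)
```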
At this point in the data pre-processing, we decided to save the dataframe to a pickle file and store it in GitHub; it could eventually be exported to CSV or loaded into a SQL database. This was a natural stopping point, because further pre-processing was done after choosing a particular pitcher.
Pitcher Specific Features
Once a particular pitcher has been chosen and the pitches thrown by him have been filtered from the database of all pitches, we used a similar approach to the batter scouting report to create features for that pitcher's prior tendencies. Here again we tried to avoid leakage of future/unknown data into our aggregate statistics, so we used the same month-by-month iterative approach with 2017 data as the seed/prior database. We split the data into two subsets based on the handedness of the batter, assuming that the pitcher approaches left- and right-handed batters differently enough for them to be considered separately. We calculated the tendencies of that pitcher to throw each pitch type category, both overall and further stratified by the count category (whether he was ahead, behind, or in a neutral situation relative to the batter).
We chose only a few select pitchers with a large sample of pitches, so, unlike the batter scouting report, we did not have to address imputation of NaNs or small samples from the prior database.
Game-by-Game Features
Next, on a game-by-game basis, we added some features to the dataframe related to that particular game. These features can be further categorized as relating to the overall game state or to the game flow and results of recent pitches.
Batting Order and Pitch Count
Using the at-bat number and pitch number for each at-bat (and the batter's ID), we reverse-engineered the batting order for each hitter throughout the game. We created a batting order slot feature (including pinch hitters and pitchers, along with a binary feature for whether the batter was a pitcher or a position player), and additionally kept a running pitch count for the full game. If a pitcher has tendencies such as throwing more breaking balls later in the game, the second or third time through the batting order, we wanted our machine learning models to be able to pick up on that. A sketch is shown below.
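A sketch of the batting order and pitch count features, assuming the Statcast game_pk, inning_topbot, at_bat_number, pitch_number, batter, and pitcher columns:

```python
# Recover batting-order slots from the order of each batter's first plate appearance,
# and keep a running pitch count for each pitcher within a game.
import pandas as pd

def add_order_and_pitch_count(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["game_pk", "at_bat_number", "pitch_number"]).copy()
    # First appearance order per team recovers the lineup (slots 10+ are substitutions).
    first_pa = df.drop_duplicates(["game_pk", "inning_topbot", "batter"]).copy()
    first_pa["order_slot"] = first_pa.groupby(["game_pk", "inning_topbot"]).cumcount() + 1
    df = df.merge(first_pa[["game_pk", "batter", "order_slot"]],
                  on=["game_pk", "batter"], how="left")
    # Running pitch count for each pitcher over the full game.
    df["pitch_count"] = df.groupby(["game_pk", "pitcher"]).cumcount() + 1
    return df
```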
Game Flow Features
Although a pitcher historically throws each pitch type at a certain frequency, there is a large degree of variance on a game-by-game basis, for various reasons. One major contributing factor is that on any given day, a pitcher may throw a certain pitch with more command and control than on another night. Maybe precipitation or humidity is affecting his grip and ability to generate spin on his curveball, so that game he throws a higher percentage of fastballs and change-ups. Or maybe, for whatever reason, he is struggling to throw strikes or get hitters out with his fastball, but he is having more success with his slider and curveball that particular day. In an attempt to capture some of these variables, we created a few features based on the pitches he has thrown recently.
For each of the last three pitches, we tracked the pitch type, the location, whether the batter swung or chased, and the result of the pitch (ball, strike, or hit into play). For both the previous five and the previous fifteen pitches, we created features that calculated the pitch type category percentages of those recent pitches, and we also tracked the percentage of strikes thrown. A sketch of these rolling features is shown below.
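A sketch of the trailing-pitch percentages for a single pitcher's chronologically sorted pitches, assuming the engineered pitch_cat and a binary is_strike column:

```python
# Rolling pitch-mix and strike percentages over the previous 5 and 15 pitches.
import pandas as pd

def add_trailing_features(df: pd.DataFrame, windows=(5, 15)) -> pd.DataFrame:
    df = df.copy()
    for n in windows:
        for cat in ["fastball", "breaking", "offspeed"]:
            # Share of the previous n pitches in each category; shift(1) excludes the current pitch.
            df[f"last{n}_{cat}_pct"] = (
                (df["pitch_cat"] == cat).astype(float).shift(1).rolling(n, min_periods=1).mean()
            )
        df[f"last{n}_strike_pct"] = (
            df["is_strike"].astype(float).shift(1).rolling(n, min_periods=1).mean()
        )
    return df
```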
Finally, we wanted to capture any tendencies the pitcher may have, perhaps even unknown or subconscious to the pitcher himself, on pitches thrown after giving up a walk, a base hit, a run, or a home run, or after striking a hitter out. We created binary features for each of those scenarios relating to the previous at-bat.
Pitcher-Batter Prior Matchup Features
The final category of features we engineered was based on the pitch type frequencies from all previous at-bats in which a particular hitter faced the pitcher. Again, we used the iterative month-by-month approach to prevent data leakage. For each pitch type category, we created a feature for the overall percentage of that category thrown to that batter across all prior matchups. If the pitcher had never faced a particular batter in the prior database, we fell back to his overall tendencies against batters of the same handedness, as in the sketch below.
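A sketch of the matchup feature with the handedness fallback, assuming a prior-pitch dataframe with batter, stand (batter handedness), and pitch_cat columns; only the fastball percentage is shown:

```python
# Prior-matchup pitch-mix feature with a same-handedness fallback.
import pandas as pd

def matchup_fastball_pct(prior: pd.DataFrame, batter_id: int, batter_stand: str) -> float:
    faced = prior[prior["batter"] == batter_id]
    if len(faced) == 0:
        # Never faced this batter before: fall back to tendencies vs. same-handed batters.
        faced = prior[prior["stand"] == batter_stand]
    return (faced["pitch_cat"] == "fastball").mean()
```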
Model Development and Hyperparameter Tuning
Binary Classification
Initially, we approached predictive modelling as a binary classification problem. We wanted to establish some baseline models to predict whether a pitch was likely to be a fastball or not. As a hitter, of course it would be useful to know the exact pitch type a pitcher is about to throw, but without such specific clairvoyance, knowing whether or not a fastball is coming would still provide a tremendous competitive advantage. If the batter knows a fastball is forthcoming, he can be on high alert to start his swing in time to catch up with the higher velocity. Conversely, the few extra milliseconds of anticipation and holding back for something breaking or offspeed would also confer an advantage to a major-league-caliber hitter.
Multiclass Classification
We used a similar approach as with our binary classification models, but instead of using fastball/not fastball as the target variable, we used the specific pitch type. The number of different types of pitches varies by pitcher, of course, since not every pitcher has every possible pitch type in their arsenal.
For all of our predictive modelling, we tested several different types of models and compared the accuracy of the predictions across model types for a handful of select starting pitchers who had a large sample size of pitches thrown in 2018–2019.
Encoding Categorical Variables
For our binary classification models, we encoded the categorical variables using a custom ordinal encoding strategy. For features describing the result of a previous pitch, we mapped values along a spectrum: lower values for strikes and foul balls; more neutral values for a pitchout or a ball hit into play with an unknown result; and higher values for more negative outcomes such as balls, hit by pitch, hit into play with no out recorded, and, at the top of the scale, hit into play with a run scored. A similar spectrum was used for the previous pitch result, with strike as the lowest value, hit into play (unknown) higher, and ball the highest. The count and count category (ahead, neutral, behind) were also encoded along a sliding scale, from most favorable to the pitcher at the low end to least favorable at the high end, with neutral counts in between. It is difficult to determine exactly what an appropriate scale for these values should be as part of the input vector to a machine learning model; however, compared to arbitrary label or ordinal encoding, we felt that mapping along a spectrum from good to bad outcomes/situations was the superior choice. A sketch of this encoding is shown below.
For our multiclass classification models, we decided to test a couple of different encoding strategies for the categorical variables to see how that choice would affect model accuracy. In addition to the custom ordinal encoding, we also used one-hot encoding, as well as one-hot encoding followed by Principal Component Analysis, with 99% of variance explained as the threshold; a sketch of the latter follows.
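A sketch of the custom ordinal encoding, assuming the Statcast description values; the numeric scale shown is illustrative rather than the exact one we used:

```python
# Map previous-pitch descriptions and count categories onto a good-to-bad spectrum
# (from the pitcher's perspective). The specific numbers are illustrative.
PREV_PITCH_RESULT_SCALE = {
    "called_strike": 0, "swinging_strike": 0, "foul": 1,
    "pitchout": 2, "hit_into_play": 2,
    "ball": 3, "hit_by_pitch": 4, "hit_into_play_no_out": 4, "hit_into_play_score": 5,
}

COUNT_CAT_SCALE = {"ahead": 0, "neutral": 1, "behind": 2}

def encode_prev_description(desc: str) -> int:
    # Unseen descriptions fall back to the neutral middle of the scale.
    return PREV_PITCH_RESULT_SCALE.get(desc, 2)
```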
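A sketch of the one-hot plus PCA variant, assuming a recent scikit-learn (the sparse_output argument was called sparse in older versions):

```python
# One-hot encode the categoricals, then keep enough principal components to
# explain 99% of the variance in the encoded features.
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ("pca", PCA(n_components=0.99)),  # a fraction in (0, 1) selects components by explained variance
])
```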
Numeric Feature Scaling
To prevent potential outliers in the numeric data from having an outsized effect on the machine learning models, we chose Robust Scaler as our scaling method for all of the numeric features (the batter scouting report features, pitcher tendency percentages, etc.), as sketched below.
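A sketch of how the numeric scaling and categorical encoding can be combined in a single preprocessor; the column lists are placeholders, not our full feature set:

```python
# Scale numeric features robustly (median/IQR) and one-hot encode the categoricals.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric_cols = ["batter_chase_rate_fastball", "p_fastball_pct_ahead"]  # placeholder names
categorical_cols = ["count_cat", "prev_pitch_cat"]                     # placeholder names

preprocessor = ColumnTransformer([
    ("num", RobustScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
```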
Train Test Split
For each pitcher, we split the data into an 85% / 15% train-test split. The split was based on date: the training set comprised the first 85% of pitches thrown, and the test set comprised the most recent 15%. This ensures that no future, unknown data can leak into the training of the model. A sketch is shown below.
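A sketch of the chronological split, assuming the pitcher's pitches carry game_date, at_bat_number, and pitch_number columns:

```python
# Chronological train/test split: earliest 85% for training, most recent 15% for testing.
import pandas as pd

def chronological_split(df: pd.DataFrame, train_frac: float = 0.85):
    df = df.sort_values(["game_date", "at_bat_number", "pitch_number"])
    cutoff = int(len(df) * train_frac)
    return df.iloc[:cutoff], df.iloc[cutoff:]
```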
Model Selection
For the binary classification models, we trained several types of models from the sklearn library, including Random Forests, Gradient Boosted Tree classifiers, Support Vector Machines, Linear SVC, Linear Discriminant Analysis, and a Stochastic Gradient Descent classifier. For the multiclass classification models, we used all of those, but substituted XGBoost for sklearn's gradient boosted trees, and also added a Logistic Regression classifier and a K-Nearest Neighbors classifier.
Hyperparameter Optimization
For each model, we performed either a grid search or a randomized search across a range of hyperparameters, including different regularization strategies to prevent overfitting to the training set, using a minimum of three-fold cross-validation. We stored the results of the search in a pandas dataframe and sorted by rank of the validation accuracy score. Depending on the processing power required and how long that model type took to train, we then evaluated the top 30–100 hyperparameter-tuned models of each type on the test set, and saved the 10 most accurate models of each type for further analysis and for later input into an ensemble Voting Classifier. A sketch of one such search is shown below.
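A sketch of one such search, using an illustrative Random Forest grid; X_train and y_train come from the chronological split described above:

```python
# Randomized hyperparameter search with 3-fold cross-validation; the grid is illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [200, 400, 800],
    "max_depth": [None, 6, 10, 20],
    "min_samples_leaf": [1, 5, 20],  # one of several regularization knobs
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=30, cv=3, scoring="accuracy", n_jobs=-1, random_state=0,
)
# search.fit(X_train, y_train)
# results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
```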
Model Interpretation and Feature Importances
Binary Models
The aforementioned cross-validated grid and randomized searches were performed on four different pitchers, chosen from among the starting pitchers with the largest samples of pitches in the 2018–2019 data. The pitchers selected were Jacob deGrom, Trevor Bauer, Max Scherzer, and Zack Greinke.
For each pitcher, the majority class from the training set was used as the naive-guess baseline accuracy against which to compare the different models. Specifically, whichever class of the fastball / not-fastball target variable was more frequent in the training set was used as the prediction for every pitch in the test set. The accuracy score for each of the models was then compared to the accuracy of this naive guess. Among the four pitchers, after grouping all models together and taking the mean of the difference in model accuracy versus naive guess, Jacob deGrom showed the highest percentage increase in accuracy, at just under 15% better than the naive guess. The models for Max Scherzer were far less successful, with an average difference of about 3% better than the naive guess.
When grouping by pitcher and comparing across model types, Random Forests had the highest accuracy relative to the naive-guess baseline, coming in around 12% higher. Slightly below Random Forests, gradient boosted trees, LDA models, and Linear Support Vector Machines averaged about 9–10% above the naive guess. The Stochastic Gradient Descent classifier and sklearn's SVC performed much worse, averaging around 6% above the naive guess.
We realize that an analysis of just four pitchers is unlikely to be conclusive or statistically significant; given the time constraints of this project, however, we felt it was demonstrative enough of some of the differences among model types for these selected pitchers.