Pitch Predict — Part 3

Using Machine Learning to Predict the Next Pitch

Josh Mancuso
Analytics Vidhya
8 min read · Oct 31, 2019


This is part 3 of 3 in a series of posts covering the work from a recent Data Science project at Lambda School. The project github repo can be found here. To gain some experience with Plotly Dash, I also created a dashboard app, here. Part 1 of this series can be found here, and part 2 here.

Model Development and Hyperparameters

Explanation of model development process, techniques used, and hyperparameters

Binary Classification Models

Initially, we approached predictive modelling as a binary classification problem. We wanted to establish some baseline models to predict whether a pitch was likely to be a fastball or not. As a hitter, it would of course be useful to know the exact pitch type a pitcher is about to throw, but even without such specific clairvoyance, simply knowing whether or not a fastball is coming would provide a tremendous competitive advantage. If a hitter knows a fastball is forthcoming, he can be on high alert to start his swing in time to catch up with the higher velocity. Conversely, those few extra milliseconds of anticipation and holding back for something breaking or offspeed would also confer an advantage on a major league caliber hitter.

Multi-class Classification Models

Next, we used a similar approach as with our binary classification models, but instead of using fastball/not fastball as the target variable, we used the specific pitch type. The number of different types of pitches varies by pitcher, of course, since not every pitcher has every possible pitch type in their arsenal.

For all of our predictive modelling, we tested several different types of models and compared the accuracy of the predictions across model types for a handful of select starting pitchers who had a large sample size of pitches thrown in 2018–2019.

Categorical Variable Encoding Strategies

For our binary classification models, we encoded the categorical variables using a custom ordinal encoding strategy. For features describing the result of a previous pitch, we mapped values along a spectrum: lower values for strikes and foul balls, more neutral values for pitchouts and balls hit into play with an unknown result, and higher values for more negative results such as balls, hit-by-pitches, balls hit into play with no out recorded, and, at the top of the scale, balls hit into play on which a run scored. A similar spectrum was used for the previous pitch result, with strike as the lowest value, hit into play (unknown) higher, and ball the highest. The count and count category (ahead, neutral, behind) were also encoded along a sliding scale, with situations most favorable to the pitcher at the low end, least favorable at the high end, and neutral counts in between. It is difficult to determine exactly what an appropriate scale is for these values as part of the input vector to a machine learning model; however, compared to arbitrary label or ordinal encoding, we felt that mapping along a polar spectrum from good to bad outcomes was a superior choice.
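
As a rough illustration, a sketch of that kind of mapping is shown below. The column names and exact scale values here are hypothetical placeholders, not our production mapping.

# Hypothetical sketch of the custom ordinal encoding described above;
# column names and numeric values are illustrative only.
prev_description_map = {
    'swinging_strike': 0, 'called_strike': 0, 'foul': 1,
    'pitchout': 2, 'hit_into_play_unknown': 2,
    'ball': 3, 'hit_by_pitch': 4, 'hit_into_play_no_out': 4,
    'hit_into_play_score': 5,
}
prev_result_map = {'strike': 0, 'hit_into_play_unknown': 1, 'ball': 2}
count_category_map = {'ahead': 0, 'neutral': 1, 'behind': 2}

def encode_ordinal(df):
    # Map each categorical column onto its good-to-bad spectrum
    df = df.copy()
    df['prev_description'] = df['prev_description'].map(prev_description_map)
    df['prev_result'] = df['prev_result'].map(prev_result_map)
    df['count_category'] = df['count_category'].map(count_category_map)
    return df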

For our multiclass classification models, we tested a few different encoding strategies for the categorical variables to see how that choice would affect model accuracy. In addition to the custom ordinal encoding, we used one-hot encoding, as well as one-hot encoding plus Principal Component Analysis with 99% of variance explained as the threshold.
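
The one-hot plus PCA variant can be sketched with a small sklearn pipeline like the one below; PCA accepts a fractional n_components, which keeps just enough components to explain that share of the variance. The X_train and categorical_cols names are placeholders.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# One-hot encode the categoricals, then keep enough principal
# components to explain 99% of the variance.
onehot_pca = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(n_components=0.99)),
])
# e.g. X_train_encoded = onehot_pca.fit_transform(X_train[categorical_cols])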

Numeric Feature Scaling

To prevent potential outliers in the numeric data from having an outsized effect on the machine learning models, we chose RobustScaler as our scaling method for all of the numeric features (the batter scouting report features, pitcher tendencies and percentages, etc.).
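
A minimal sketch of that scaling step, assuming X_train, X_test, and a numeric_cols list are already defined:

from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the interquartile range,
# so extreme values have less influence than with standard scaling.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled = scaler.transform(X_test[numeric_cols])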

Train-Test Split

For each pitcher, we split the data into an 85% / 15% train-test split. The split was calculated based on date, so the training set comprised the first 85% of pitches thrown and the test set the most recent 15%. This method ensures that no future data can leak into the training of the model.

def train_test_split_by_date(df, train_fraction):
    # Date of the pitch that sits at the train_fraction cutoff point
    train_idx = int(len(df) * train_fraction)
    train_end_date = df.loc[train_idx].game_date
    # Train on everything before the cutoff date; test on everything after
    train = df[df['game_date'] < train_end_date]
    test = df[df['game_date'] >= train_end_date]
    print('train shape: ' + str(train.shape))
    print('test shape: ' + str(test.shape))
    return train, test
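
For example, assuming df holds a single pitcher's pitches sorted by game_date, the 85% / 15% split described above would be:

train, test = train_test_split_by_date(df, train_fraction=0.85)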

Model Selection

For the binary classification models, we trained several different types of models from the sklearn library, including Random Forests, Gradient Boosted Tree classifiers, Support Vector Machines, Linear SVC, Linear Discriminant Analysis models, and a Stochastic Gradient Descent classifier. For the multiclass classification models, we used all of those, but substituted XGBoost for sklearn's gradient boosted trees, and also added a Logistic Regression classifier and a K-Nearest Neighbors classifier.
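
Roughly speaking, the multiclass candidate set looked something like the dictionary below. This is a sketch with default settings rather than our exact configurations; each model was tuned separately.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Candidate models for the multiclass problem (defaults shown;
# hyperparameters were tuned per model type)
candidate_models = {
    'random_forest': RandomForestClassifier(),
    'xgboost': XGBClassifier(),
    'svc': SVC(),
    'linear_svc': LinearSVC(),
    'lda': LinearDiscriminantAnalysis(),
    'sgd': SGDClassifier(),
    'logistic_regression': LogisticRegression(),
    'knn': KNeighborsClassifier(),
}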

Hyperparameter optimization

For each model, we performed either a grid search or a randomized search across a range of hyperparameters, including different regularization strategies to prevent overfitting to the training set, with a minimum of three-fold cross-validation for each. We stored the results of each search in a pandas dataframe and sorted by rank of the validation accuracy score. Depending on the processing power required and how long that model type took to train, we then took the top 30–100 hyperparameter-tuned models of each type, tested their accuracy on the test set, and saved the 10 most accurate models of each model type for further analysis and for later input into an ensemble Voting Classifier.
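
As an illustration of the search-and-rank step, here is a sketch using a randomized search over a hypothetical Random Forest parameter grid; X_train and y_train are assumed to be the prepared training features and labels.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Randomized search with 3-fold cross-validation; the actual parameter
# distributions varied by model type.
param_distributions = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=30, cv=3, scoring='accuracy', n_jobs=-1,
)
search.fit(X_train, y_train)

# Collect every candidate in a dataframe, sorted by validation rank
results = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')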

Model Interpretation

Analysis of model results

Binary models:

The aforementioned cross-validated grid searches and randomized searches were performed on four different pitchers, chosen from among the starting pitchers with the largest sample of pitches in the 2018–2019 data. The pitchers selected were Jacob deGrom, Trevor Bauer, Max Scherzer, and Zack Greinke.

For each pitcher, the majority class from the training set was used as the naive-guess baseline against which model accuracy was compared. Specifically, whichever class of the fastball / not-fastball target variable made up the larger share of pitches in the training set was used as the prediction for every pitch in the test set. The accuracy score of each model was then compared against the accuracy of that naive guess. Among the four pitchers, after grouping all models together and taking the mean difference in model accuracy versus the naive guess, Jacob deGrom showed the highest increase in accuracy, at just under 15% better than the naive guess. The models for Max Scherzer were far less successful, with an average difference of about 3% better than the naive guess.
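
In code, that naive baseline amounts to something like the snippet below, assuming y_train and y_test hold the fastball / not-fastball labels as pandas Series.

from sklearn.metrics import accuracy_score

# Predict the training set's majority class for every pitch in the test set
majority_class = y_train.value_counts().idxmax()
naive_preds = [majority_class] * len(y_test)
baseline_accuracy = accuracy_score(y_test, naive_preds)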

When grouping by pitcher and comparing across the types of models used, Random Forests had the highest accuracy relative to the baseline naive guess, coming in at around 12% higher. Slightly below Random Forests, gradient boosted trees, LDA models, and Linear Support Vector Machines averaged about 9–10% above the naive guess. The Stochastic Gradient Descent classifier and sklearn's SVC performed much worse, averaging around 6% above the naive guess.

We realize that an analysis of just four pitchers is unlikely to be conclusive or statistically significant; given the time constraints of this project, however, we felt it was demonstrative enough of some of the differences among model types for these selected pitchers.

Multiclass models:

As you can see in the chart below, the choice of categorical variable encoding did not have much of an overall effect on model accuracy. Although the results displayed are for just one pitcher, similar results (very minor percentage differences) were observed for the handful of others we compared. It should be noted that while performing PCA generally decreased accuracy slightly, it also reduced the time required for hyperparameter optimization and model training. Unfortunately we didn't measure these time differences, something I will definitely remember to track in future projects. So depending on the ultimate goal, if sacrificing a small amount of accuracy for the sake of time is acceptable, one-hot encoding plus PCA may be the way to go.

The following chart shows the accuracy differences among model types for Trevor Bauer. As you can see, a baseline guess of his most frequent pitch type from the training set would result in a test accuracy of around 37%. The best hyperparameter-optimized model of each type shows a ~9–11% improvement in accuracy above that baseline. The ensemble VotingClassifier, using a hard vote from the other 8 model types, tied with the XGBoost classifier for the highest accuracy.
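
The ensemble itself used sklearn's VotingClassifier with a hard (majority) vote. A sketch of the construction, assuming tuned_models is a dict mapping each model type to its best tuned estimator and X_train, y_train, X_test, y_test are already prepared:

from sklearn.ensemble import VotingClassifier

# Hard vote: each tuned model casts one vote and the majority pitch type wins
voting_clf = VotingClassifier(
    estimators=list(tuned_models.items()),
    voting='hard',
)
voting_clf.fit(X_train, y_train)
ensemble_accuracy = voting_clf.score(X_test, y_test)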

Next Steps/ Further Research

Unfortunately, due to the time constraints of this school project, we were unable to dive any further into analysis of the models. We did perform some rudimentary feature importance analysis using ELI5 permutation importances, but our first few attempts to prune features that did not contribute to the model resulted in decreased accuracy scores, so we abandoned that path. I would like to dig deeper into feature importances, look for general trends in which features contribute the most to the models and which contribute the least, and use that to drive further refinement and perhaps spark new ideas for feature engineering.

Given more time, and perhaps more computing resources, I would like to train models for a much larger sample of pitchers. This would allow for further examination of overall improvements in accuracy versus baseline, and allow for further study and comparison of, for example, what types of pitchers are more predictable than others. It stands to reason that a pitcher who throws 5 different pitch types would probably be harder to predict than one who only throws 2, but what about the overall difference between starters and relievers? Perhaps pitchers who are less predictable have greater success, which could be reflected in other statistics such as earned run average (ERA) and/or wins above replacement (WAR).

Finally, if models can be demonstrated to have meaningful accuracy above baseline for a much larger sample of pitchers, a real-time prediction application could have some actual use cases. An MLB broadcast team could use the app to display predictions on screen to viewers at home. Alternatively, baseball managers could use the app-generated predictions in real time and signal the prediction to the hitter in advance of the pitch, particularly in situations where accuracy scores are the highest. Conversely, from the perspective of the pitcher, such data could be used to spot particular trends and situations where he may be more predictable, so he can take steps to randomize his pitch selection a bit more in those situations.

Overall, this was a fun project. The Statcast data, while not perfect, is an extremely valuable baseball resource, rife with insights waiting to be extracted. I enjoyed the challenge of getting creative with new features, especially the batter scouting report features. If you made it this far, I hope you enjoyed reading about the project! If you're interested, here again are the links to the github repo and the dashboard.
