Pitch Predict — Part 2

Using Machine Learning to Predict the Next Pitch

Published in

Analytics Vidhya

8 min read · Oct 31, 2019

This is part 2 of 3 in a series of posts covering the work from a recent Data Science project at Lambda School. The project GitHub repo can be found here. To gain some experience with Plotly Dash, I also created a dashboard app, here. Part 1 of this series can be found here, and part 3 here.

Batter Scouting Report

Most teams at the major league level prepare a scouting report on their opponents, so we wanted to recreate one using data from the Statcast pitch database. In addition to velocity, location, break distance and angle, and other data about the pitch itself, the Statcast data includes some useful features describing balls put into play: the launch speed and the angle/trajectory of the ball off the bat. It also includes estimated values for batting average, wOBA (weighted on-base average), and ISO (isolated power, a power statistic), based on aggregate data for balls hit into play with a similar speed and angle/trajectory.

Sometimes a batter gets unlucky and hits a line drive directly at a fielder, who makes the catch. Had the ball been a few feet to either side, it would have been a base hit or a multiple-base hit. The estimated statistics smooth out that variance and give a more accurate picture of how often a hitter's solid contact would be expected to result in a base hit or the batter reaching base.

For each hitter, we filtered the dataframe for all pitches that hitter faced and recorded the percentage of pitches faced in each pitch type category. Our thinking was that if a batter sees an unusually high percentage of a certain pitch type category, pitchers may have an underlying reason for attacking that hitter that way. We also aggregated the estimated batting average, wOBA, and ISO values for balls hit into play for each pitch category. Next, we computed the batter's ‘taken strike percentage’ for each pitch category (how often the pitch was in the strike zone and the batter chose not to swing), as well as how often the batter chased a pitch outside the strike zone. Lastly, from the ratio of balls put into play (rather than missed or fouled) to swings at each pitch type, we created a ‘ball in play swung’ percentage for each pitch category.
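As a rough illustration of these per-category rates, here is a minimal pandas sketch. It is not the project's actual `make_batters_df`; the column names `pitch_cat`, `in_strikezone`, `batter_swung`, `chased`, and `ball_in_play` are assumed 0/1 or label columns, and the denominators (pitches in the zone, pitches out of the zone, swings) are our assumption about how the percentages are defined. The estimated batting average, wOBA, and ISO values could be averaged per category in the same `agg` call.

```python
import pandas as pd


def batter_report(pitches: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (batter, pitch_cat) with the rates described above."""
    df = pitches.copy()
    # A taken strike: the pitch was in the zone and the batter did not swing.
    df["taken_strike"] = (df["in_strikezone"] == 1) & (df["batter_swung"] == 0)

    report = (
        df.groupby(["batter", "pitch_cat"])
        .agg(
            n_pitches=("in_strikezone", "size"),
            in_zone=("in_strikezone", "sum"),
            taken_strikes=("taken_strike", "sum"),
            chases=("chased", "sum"),
            swings=("batter_swung", "sum"),
            in_play=("ball_in_play", "sum"),
        )
        .reset_index()
    )

    # Share of all pitches this batter faced that fall in each category.
    total = report.groupby("batter")["n_pitches"].transform("sum")
    report["pitch_pct"] = report["n_pitches"] / total
    # Assumed denominators: pitches in the zone, pitches out of the zone, swings.
    report["taken_strike_pct"] = report["taken_strikes"] / report["in_zone"]
    report["chase_pct"] = report["chases"] / (report["n_pitches"] - report["in_zone"])
    report["bip_swung_pct"] = report["in_play"] / report["swings"]
    return report


# Tiny made-up sample: one batter, six pitches.
pitches = pd.DataFrame({
    "batter":        [1, 1, 1, 1, 1, 1],
    "pitch_cat":     ["fastball", "fastball", "fastball", "fastball",
                      "breaking", "breaking"],
    "in_strikezone": [1, 1, 0, 0, 1, 0],
    "batter_swung":  [0, 1, 1, 0, 1, 0],
    "chased":        [0, 0, 1, 0, 0, 0],
    "ball_in_play":  [0, 1, 0, 0, 0, 0],
})
report = batter_report(pitches)
print(report[["pitch_cat", "pitch_pct", "taken_strike_pct",
              "chase_pct", "bip_swung_pct"]])
```

A category with no swings or no pitches out of the zone yields NaN here, which fits naturally with the NaN handling for small samples.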

The goal of these batter scouting report features was to capture a hitter's strengths and weaknesses against the different pitch type categories. These would plausibly be known in advance of the matchup, so the features mimic a real-world scouting report that influences the pitcher's choice of pitch against a given hitter.

To prevent data leakage, where future (unknown) data about a hitter's tendencies would be used to create these aggregates, the batter scouting report was built iteratively, month by month. The initial seed statistics were computed from 2017 pitch data. Then, for each month of the regular season from 2018 through the end of August 2019, we calculated the batter scouting report for all pitches in that month based only on prior information. That month's data was then concatenated with the prior data (starting from 2017) and used as prior data for the next month's calculations. Also, to lessen the effect of outliers caused by small sample sizes, if a batter had faced fewer than 100 pitches of a pitch type, we assigned NaN values to that hitter's scouting report for that pitch type. Later, we imputed those NaN values using aggregate data based on the batter's position in the batting order, figuring that statistics from batters in the same batting order position were a close enough approximation.

import pandas as pd

# start and end date of each period (roughly one per month of the season)
start_dates = ['2018-03-29', '2018-05-01', '2018-06-01',
               '2018-07-01', '2018-08-01', '2018-09-01',
               '2019-03-28', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01']
end_dates = ['2018-04-30', '2018-05-31', '2018-06-30',
             '2018-07-31', '2018-08-31', '2018-10-01',
             '2019-04-30', '2019-05-31', '2019-06-30',
             '2019-07-31', '2019-08-31']


def pre_process_step2(pre_processed_step1, start_dates, end_dates):
    df = pre_processed_step1.copy()

    # initialize empty list to store dfs (concat them together later)
    df_list = []

    # iterate over each period
    for start, end in zip(start_dates, end_dates):
        # prior_df holds only pitches thrown before this period starts,
        # so the scouting report never sees current or future data
        prior_df = df[df['game_date'] < start]
        current_df = df[(df['game_date'] >= start) & (df['game_date'] <= end)]

        # add the batter scouting report (make_batters_df is not shown here)
        batters_df = make_batters_df(prior_df)
        current_df = pd.merge(current_df, batters_df, how='left', on='batter')

        # append the df to the list
        df_list.append(current_df)

    step2_df = pd.concat(df_list, sort=False)
    return step2_df
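The small-sample rule and the batting-order imputation described above could look roughly like the sketch below. The 100-pitch threshold comes from the text; everything else is an assumption for illustration: the helper names, the `n_pitches`, `chase_pct`, and `batting_order_slot` columns, and the choice of the mean as the aggregate used for imputation.

```python
import numpy as np
import pandas as pd

MIN_PITCHES = 100  # threshold stated in the text


def mask_small_samples(report, stat_cols, count_col="n_pitches",
                       min_pitches=MIN_PITCHES):
    """Replace stats built from fewer than `min_pitches` pitches with NaN."""
    out = report.copy()
    too_few = out[count_col] < min_pitches
    out.loc[too_few, stat_cols] = np.nan
    return out


def impute_by_batting_order(df, stat_cols, slot_col="batting_order_slot"):
    """Fill NaN stats with the mean over batters in the same lineup slot."""
    out = df.copy()
    for col in stat_cols:
        slot_mean = out.groupby(slot_col)[col].transform("mean")
        out[col] = out[col].fillna(slot_mean)
    return out


# Made-up example: batter 2 has only 40 pitches of this type on record.
demo = pd.DataFrame({
    "batter": [1, 2, 3, 4],
    "batting_order_slot": [1, 1, 2, 2],
    "n_pitches": [250, 40, 300, 120],
    "chase_pct": [0.30, 0.90, 0.20, 0.40],
})
masked = mask_small_samples(demo, ["chase_pct"])
filled = impute_by_batting_order(masked, ["chase_pct"])
print(filled)
```

In this toy data, batter 2's noisy 0.90 is discarded and replaced by the value from the other hitter in the same slot.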

At this point in the data pre-processing, we saved the dataframe to a pickle file and stored it on GitHub. This was a natural stopping point, because the remaining pre-processing is done after choosing a particular pitcher.

Pitcher Scouting Report

Due to time constraints for the project, we were not able to train models and compare predictions for every single pitcher in the dataset. We decided to focus our modelling on a few select pitchers with a large sample of pitches.

Once a pitcher had been chosen and his pitches filtered from the database of all pitches, we used an approach similar to the batter scouting report to build features describing his prior tendencies. Here again we wanted to keep future, unknown data from leaking into the aggregate statistics, so we used the same month-by-month iterative approach with 2017 data as the seed, or prior database. We split the data into two subsets based on the handedness of the batter, assuming that the pitcher approaches left- and right-handed batters differently enough for them to be considered separately. We calculated the pitcher's tendency to throw each pitch type category, both overall and stratified by count category (whether he was ahead, behind, or in a neutral count relative to the batter).
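One way to sketch these tendency tables is with `pd.crosstab`, normalizing each row so the pitch categories sum to 1. This is only an illustration: `stand` is the Statcast column for the batter's side, while `count_cat` and `pitch_cat` are assumed names for the engineered count category and pitch type category, and `pitcher_tendencies` is a hypothetical helper. In the month-by-month scheme it would be called on the prior data only.

```python
import pandas as pd


def pitcher_tendencies(prior: pd.DataFrame) -> pd.DataFrame:
    """Pitch category shares for each (batter side, count category)."""
    return pd.crosstab(
        [prior["stand"], prior["count_cat"]],
        prior["pitch_cat"],
        normalize="index",  # each row sums to 1
    )


# Made-up prior pitches for a single pitcher.
prior = pd.DataFrame({
    "stand":     ["R", "R", "R", "R", "L", "L"],
    "count_cat": ["ahead", "ahead", "behind", "behind", "neutral", "neutral"],
    "pitch_cat": ["breaking", "fastball", "fastball", "fastball",
                  "offspeed", "fastball"],
})
tend = pitcher_tendencies(prior)
print(tend)
```

Dropping `count_cat` from the row keys gives the overall tendencies per batter side.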

Game by game features

Next, on a game-by-game basis, we added features related to that particular game. These can be further categorized as relating to the overall game state or to the game flow (the results of recent pitches).

a) Batting order and pitch count

Using the at-bat number and the pitch number within each at-bat (along with the batter's ID), we reverse-engineered the batting order for each hitter throughout the game. We created a batting order slot feature (covering pinch hitters and pitchers), added a binary feature for whether the batter was a pitcher or a position player, and kept a running pitch count for the full game. If a pitcher has tendencies such as throwing more breaking balls later in the game, the second or third time through the batting order, we wanted our machine learning models to be able to pick up on that.
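A simplified sketch of the idea follows. It assumes the dataframe holds one game's pitches for a single starting pitcher who faced the lineup from the leadoff hitter, so the k-th plate appearance he faces maps to slot ((k - 1) mod 9) + 1; relievers and mid-game lineup quirks would need more care. `at_bat_number` and `pitch_number` are Statcast columns; the helper name and the output column names (including `times_through_order`, which is our own addition) are hypothetical.

```python
import numpy as np
import pandas as pd


def add_order_and_pitch_count(game: pd.DataFrame) -> pd.DataFrame:
    df = game.sort_values(["at_bat_number", "pitch_number"]).copy()
    # at_bat_number counts both teams' at-bats, so dense-rank it to get the
    # index of each plate appearance this pitcher faced (1, 2, 3, ...).
    pa_index = df["at_bat_number"].rank(method="dense").astype(int)
    df["batting_order_slot"] = (pa_index - 1) % 9 + 1
    df["times_through_order"] = (pa_index - 1) // 9 + 1
    # Pitches thrown in the game before the current pitch.
    df["pitch_count"] = np.arange(len(df))
    return df


# Made-up game: ten plate appearances, eleven pitches.
game = pd.DataFrame({
    "at_bat_number": [1, 1, 2, 3, 7, 8, 9, 13, 14, 15, 16],
    "pitch_number":  [1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})
ordered = add_order_and_pitch_count(game)
print(ordered.tail(3))
```

The tenth plate appearance wraps around to slot 1 on the second time through the order.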

b) Game flow features

Even if a pitcher historically throws each pitch type at a certain frequency, there is a large degree of variance from game to game, for various reasons. One major factor is that a pitcher may throw a certain pitch with more command and control on one day than on another. Maybe precipitation or humidity is affecting his grip and his ability to generate spin on his curveball, so in that game he throws a higher percentage of fastballs and change-ups. Or maybe he is struggling to throw strikes or get guys out with his fastball but is having more success with his slider and curveball on that particular day. To capture some of these variables, we created a few features based on the pitches he has thrown most recently.

For each of the last three pitches, we tracked the pitch type, the location, whether the batter swung or chased, and the result of the pitch (ball, strike, or hit into play). For both the previous five and the previous fifteen pitches, we created features for the pitch type category percentages of those recent pitches, and we also tracked the percentage of strikes thrown.
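These trailing features map naturally onto `shift` and `rolling` in pandas. The sketch below assumes the rows are one pitcher's pitches from a single game, already in the order thrown, and that `pitch_cat` and `is_strike` are the relevant (assumed) column names; `add_trailing_features` is a hypothetical helper. The same `shift(k)` pattern would carry the location, swing, chase, and result of the last three pitches.

```python
import pandas as pd


def add_trailing_features(game, categories, windows=(5, 15)):
    df = game.copy()
    # Pitch type of each of the last three pitches.
    for k in (1, 2, 3):
        df[f"prev{k}_pitch_cat"] = df["pitch_cat"].shift(k)

    strike = df["is_strike"].astype(float)
    for w in windows:
        # shift(1) ends every window at the previous pitch, so the current
        # pitch (the prediction target) never leaks into its own features.
        df[f"strike_pct_last{w}"] = (
            strike.shift(1).rolling(w, min_periods=1).mean()
        )
        for cat in categories:
            thrown = (df["pitch_cat"] == cat).astype(float)
            df[f"{cat}_pct_last{w}"] = (
                thrown.shift(1).rolling(w, min_periods=1).mean()
            )
    return df


# Made-up sequence of six pitches.
game = pd.DataFrame({
    "pitch_cat": ["fastball", "fastball", "breaking",
                  "fastball", "offspeed", "breaking"],
    "is_strike": [1, 0, 1, 1, 0, 1],
})
feats = add_trailing_features(game, ["fastball", "breaking", "offspeed"])
print(feats[["pitch_cat", "prev1_pitch_cat", "fastball_pct_last5",
             "strike_pct_last5"]])
```

With `min_periods=1`, early pitches use however many prior pitches exist, and the very first pitch of the game gets NaN.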

Finally, we wanted to capture any tendencies the pitcher may have, perhaps even subconsciously, on pitches thrown after giving up a walk, a base hit, a run, or a home run, or after striking a hitter out. We created binary features for each of those scenarios based on the previous at-bat.
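A minimal sketch of the previous at-bat flags, under the assumption that the Statcast `events` column is filled in only on the final pitch of each plate appearance and that the rows belong to one pitcher in one game. The helper and the `prev_ab_*` column names are hypothetical, and the run-scored flag is left out because it would need the score columns.

```python
import pandas as pd

HIT_EVENTS = ["single", "double", "triple", "home_run"]


def add_prev_at_bat_flags(game: pd.DataFrame) -> pd.DataFrame:
    df = game.sort_values(["at_bat_number", "pitch_number"]).copy()
    # Outcome of each plate appearance: the last non-null event in it.
    pa_result = df.groupby("at_bat_number")["events"].last()
    # Shift by one plate appearance, then map back onto every pitch.
    prev = df["at_bat_number"].map(pa_result.shift(1))
    df["prev_ab_walk"] = (prev == "walk").astype(int)
    df["prev_ab_strikeout"] = (prev == "strikeout").astype(int)
    df["prev_ab_hit"] = prev.isin(HIT_EVENTS).astype(int)
    df["prev_ab_home_run"] = (prev == "home_run").astype(int)
    return df


# Made-up game: a walk, then a home run, then a new at-bat.
game = pd.DataFrame({
    "at_bat_number": [1, 1, 2, 2, 2, 3],
    "pitch_number":  [1, 2, 1, 2, 3, 1],
    "events":        [None, "walk", None, None, "home_run", None],
})
flags = add_prev_at_bat_flags(game)
print(flags[["at_bat_number", "prev_ab_walk", "prev_ab_hit",
             "prev_ab_home_run"]])
```

Every pitch in an at-bat carries the same flags, since they describe the at-bat that ended before it began.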

Pitcher-Batter Prior Matchup Features

The final category of features that we engineered was based on the pitch type frequencies from all previous at-bats in which a particular hitter faced the pitcher. Again, we used the iterative month-by-month approach to prevent data leakage. For each pitch type category, we created a feature for the overall percentage of that pitch category thrown to that batter in all prior matchups. If the pitcher had never faced a particular batter in the prior database, we used his overall tendencies against batters of the same handedness.
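The matchup lookup with its handedness fallback can be sketched as below. As before, this is an illustration under assumptions rather than the project's code: `matchup_features` and the `matchup_*_pct` columns are hypothetical, `pitch_cat` is an assumed column name, and `prior` is taken to contain only one pitcher's pitches from before the current period.

```python
import pandas as pd


def matchup_features(prior, current, categories):
    # Pitch category shares against each batter, and against each batter side.
    vs_batter = pd.crosstab(prior["batter"], prior["pitch_cat"],
                            normalize="index")
    vs_side = pd.crosstab(prior["stand"], prior["pitch_cat"],
                          normalize="index")
    # Make sure every category has a column, even if never thrown.
    vs_batter = vs_batter.reindex(columns=categories, fill_value=0.0)
    vs_side = vs_side.reindex(columns=categories, fill_value=0.0)

    out = current.copy()
    for cat in categories:
        by_batter = out["batter"].map(vs_batter[cat])  # NaN if never faced
        by_side = out["stand"].map(vs_side[cat])
        out[f"matchup_{cat}_pct"] = by_batter.fillna(by_side)
    return out


# Made-up data: batter 20 has no history against this pitcher.
prior = pd.DataFrame({
    "batter":    [10, 10, 10, 10, 12, 12],
    "stand":     ["R", "R", "R", "R", "R", "R"],
    "pitch_cat": ["fastball", "fastball", "breaking", "breaking",
                  "fastball", "fastball"],
})
current = pd.DataFrame({"batter": [10, 20], "stand": ["R", "R"]})
matchups = matchup_features(prior, current, ["fastball", "breaking"])
print(matchups)
```

Batter 10 gets his own history, while batter 20 falls back to the pitcher's overall mix against right-handed hitters.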

Feature Engineering Recap

Overall, while the Statcast data includes a large number of features off the shelf, we created dozens more to hopefully augment our predictive models. Here is a quick high-level overview of those features:

Game state: count, count category, run differential, baserunner features
Strike zone related features: in_strikezone, batter_swung, chased
Pitch Type Category
Batter Scouting Report: % of pitches faced, estimated batting avg, wOBA, and ISO value, taken strike %, chase %, ball in play when swung %, for each pitch type category
Pitcher Scouting Report: Pitch type tendencies, overall and based on the count category, split by left/right handedness of the batter
Individual Game Features: batting order, pitch count
Game Flow Features: Trailing pitch tendencies (last 3, 5, and 15 pitches) and location info, batter swung, chased, result of the pitch, and trailing strike %
Prior at-bat result: walk, strikeout, base hit, home run, run scored
Pitcher-Batter Matchup History: Pitch tendencies in previous at-bats vs that batter

Up Next:

Part 3, covering model selection, training, and analysis of predictions.

Written by Josh Mancuso