Calculating xWOBA Using Collegiate Data

Jaron Richman
INST414: Data Science Techniques
Apr 29, 2024

With the growing presence of advanced analytics in professional baseball, college baseball is trying its hardest to keep up. More and more teams are purchasing a Trackman stadium unit, which collects pitch metrics such as velocity, spin, and movement, along with batted ball metrics such as exit velocity, launch angle, and distance. Having these Trackman systems allows teams to measure their players' performance and to share the data with professional teams to help with draft scouting.

One of the most common ways to evaluate hitters is xWOBA. It starts from wOBA, a statistic that assigns a weight to each batted ball outcome (out, single, double, triple, home run), and uses the exit velocity and launch angle of a ball in play to estimate the wOBA that ball would be expected to produce. For MLB players, xWOBA is publicly available, but there is no public leaderboard for NCAA baseball. Being able to evaluate hitters on this metric is important for recruiting players in the transfer portal, as coaches can use a player's xWOBA to determine whether they outperformed their actual stats from the previous season, underperformed them, or performed as expected. On the professional side, scouts and team analysts can do the same research to judge whether a player is likely to perform at a high enough level in the minor leagues to be worth drafting.

I collected the data through the private Trackman data sharing network I have access to. After downloading the data to my local drive, I imported it into R, making sure I only imported verified files and not the unverified or player positioning files that are also included in the sharing network. Once I had all the correct data in a single data frame, I cleaned the dataset of any misspelled terms. With the data clean, I filtered to the rows I wanted: only pitches where the ball was put into play, and only rows that contained a play result, exit velocity, and launch angle. After filtering, my dataset contained 328,830 rows.
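The file-selection step could look something like the sketch below. The file names and the assumption that unverified and player positioning files can be recognised by those words in their names are hypothetical; the post does not say how the files are labelled, and `is_verified` is a helper invented for this illustration.

```r
# Hypothetical file names; the naming pattern is an assumption, not taken from the post.
files <- c(
  "20230218-StadiumA-1.csv",
  "20230218-StadiumA-1_unverified.csv",
  "20230218-StadiumA-1_playerpositioning.csv",
  "20230219-StadiumB-1.csv"
)

# Hypothetical helper: TRUE for files whose names do not mark them as
# unverified or player positioning exports.
is_verified <- function(paths) {
  !grepl("unverified|playerpositioning", basename(paths), ignore.case = TRUE)
}

verified_files <- files[is_verified(files)]
print(verified_files)

# With real files on disk, the verified CSVs could then be stacked into one
# data frame, provided they all share the same columns:
# season <- do.call(rbind, lapply(verified_files, read.csv))
```

Filtering on the file name before reading anything keeps the unwanted files out of memory entirely, which matters when a season's worth of games is being combined.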

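library(dplyr)

data <- season_2023_24 %>%
  # Balls in play from 2023 onward, no bunts, and no missing result, exit speed, or angle
  filter(Date >= '2023-01-01',
         PitchCall == "InPlay",
         TaggedHitType != 'Bunt',
         complete.cases(PlayResult, ExitSpeed, Angle)) %>%
  mutate(
    # Fix misspelled play results first so the columns below see the cleaned values
    PlayResult = case_when(PlayResult == "OUt" ~ "Out",
                           PlayResult == "SIngle" ~ "Single",
                           PlayResult == "sacrifice" ~ "Sacrifice",
                           PlayResult == "homerun" ~ "HomeRun",
                           TRUE ~ PlayResult),
    # 1 for hits, 0 for outs, sacrifices, errors, and fielder's choices
    HitCheck = ifelse(PlayResult %in% c("Out", "Sacrifice", "Error", "FieldersChoice"), 0, 1),
    # wOBA weight for each outcome; unlisted results such as "Undefined" become NA
    woba = case_when(PlayResult %in% c("Out", "Sacrifice", "FieldersChoice", "Error") ~ 0,
                     PlayResult == "Single" ~ 0.883,
                     PlayResult == "Double" ~ 1.244,
                     PlayResult == "Triple" ~ 1.569,
                     PlayResult == "HomeRun" ~ 2.004)) %>%
  # Keep only the columns used for modeling (HitCheck is not carried forward)
  select(Batter, PlayResult, ExitSpeed, Angle, woba) %>%
  filter(PlayResult != "Undefined")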

After creating a new column with the wOBA weights for each outcome, the data was all cleaned and ready to be modeled.
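To make the weighting concrete, here is a small base R sketch that maps outcomes to the same weights used above and averages them per batter. The batters and plays are made up for illustration, and the result is wOBA on balls in play only, since that is all this dataset contains.

```r
# wOBA weights copied from the cleaning code above.
weights <- c(Out = 0, Sacrifice = 0, FieldersChoice = 0, Error = 0,
             Single = 0.883, Double = 1.244, Triple = 1.569, HomeRun = 2.004)

# Made-up plays for two hypothetical batters.
plays <- data.frame(
  Batter     = c("A", "A", "A", "A", "B", "B", "B", "B"),
  PlayResult = c("Single", "Out", "HomeRun", "Out", "Out", "Double", "Out", "Out"),
  stringsAsFactors = FALSE
)

# Look up the weight for each play by outcome name.
plays$woba <- unname(weights[plays$PlayResult])

# Average weight per batter: wOBA on balls in play.
woba_by_batter <- aggregate(woba ~ Batter, data = plays, FUN = mean)
print(woba_by_batter)
```

Batter A averages (0.883 + 0 + 2.004 + 0) / 4 = 0.72175, and batter B averages 1.244 / 4 = 0.311.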

Sample image of the top 10 rows in the dataset. The batter column has been removed to protect identities.

I decided to use a KNN regression model to estimate xWOBA, because I am not trying to classify the outcome of a batted ball; I am trying to find the average outcome of similar batted balls. ExitSpeed (exit velocity) and Angle (launch angle) are the predictor variables, and wOBA is the response. Once the model was trained, I applied it to the entire dataset to find not only the predicted xWOBA for individual batted balls, but also the average xWOBA for any player in my dataset. This gives immediate answers on whether a player got lucky or unlucky on a given result, as well as longer-term answers on whether a player is likely to repeat their performance as their career continues.
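The idea behind KNN regression can be sketched in a few lines of base R. This is not the code from the repository: the function name `knn_regress`, the choice of k = 5, the standardisation of the two predictors, and the toy data are all assumptions made for illustration, since the post does not state which package or settings were used.

```r
# Minimal base R KNN regression; an illustrative sketch, not the author's code.
knn_regress <- function(train_x, train_y, new_x, k = 5) {
  # Standardise both predictors with the training means and standard deviations
  # so that mph and degrees contribute on a comparable scale (an assumption here).
  centers <- colMeans(train_x)
  spreads <- apply(train_x, 2, sd)
  train_s <- scale(train_x, center = centers, scale = spreads)
  new_s   <- scale(new_x,   center = centers, scale = spreads)

  # For each new batted ball, average the wOBA of its k nearest training neighbours.
  apply(new_s, 1, function(point) {
    dists <- sqrt(rowSums(sweep(train_s, 2, point)^2))
    mean(train_y[order(dists)[seq_len(k)]])
  })
}

# Made-up batted balls: five hard-hit fly balls and five weak ground balls.
bip <- data.frame(
  Batter    = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B"),
  ExitSpeed = c(105, 104, 106, 103, 105, 70, 72, 68, 71, 69),
  Angle     = c(28, 27, 29, 26, 30, -10, -8, -12, -9, -11),
  woba      = c(2.004, 2.004, 2.004, 0, 2.004, 0, 0, 0.883, 0, 0),
  stringsAsFactors = FALSE
)

# Predict on the same rows the model was built from, as described above.
x <- as.matrix(bip[, c("ExitSpeed", "Angle")])
bip$xwoba <- knn_regress(x, bip$woba, x, k = 5)

# Average actual and expected wOBA per batter.
by_batter <- aggregate(cbind(woba, xwoba) ~ Batter, data = bip, FUN = mean)
print(by_batter)
```

In this toy data, every hard-hit ball gets an xWOBA of 1.6032 (four home runs and one out among its five neighbours), including the one that was actually caught. That is the behavior the model is meant to have: the prediction reflects what usually happens to a ball hit that way, not what happened on that play.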

Like any model, mine makes predictions that miss. Here are four examples of balls that were expected to be home runs based on exit velocity and launch angle, but were instead outs due to some condition the model does not account for.

Going across left to right, the columns are play result, exit velocity, launch angle, wOBA, and xWOBA.

On the flip side, the model also has plays that it expected to result in an out that instead ended as a double or better. Incorrect predictions can be caused by a multitude of factors, such as wind, stadium dimensions, or player positioning. Some stadiums have a short fence in certain parts of the outfield, while others have deeper dimensions. If the wind is blowing in a certain direction, that can have a major effect on the outcome of the play as well.

Going across left to right, the columns are play result, exit velocity, launch angle, wOBA, and xWOBA.
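Rows like the ones in the two tables can be pulled out by comparing the actual and expected values. The sketch below uses made-up numbers, and the cutoffs (an xWOBA of at least 1.5 for an "expected home run", at most 0.25 for an "expected out") are hypothetical thresholds chosen for illustration, not values from the post.

```r
# Made-up predictions with the same columns as the tables above.
preds <- data.frame(
  PlayResult = c("Out", "HomeRun", "Double", "Out", "Single", "Triple"),
  ExitSpeed  = c(104.2, 101.5, 78.3, 88.0, 95.1, 74.6),
  Angle      = c(27, 30, 48, 12, 10, 52),
  woba       = c(0, 2.004, 1.244, 0, 0.883, 1.569),
  xwoba      = c(1.71, 1.62, 0.09, 0.41, 0.67, 0.05),
  stringsAsFactors = FALSE
)

# Positive residual: the result beat the expectation; negative: it fell short.
preds$residual <- preds$woba - preds$xwoba

# Outs on balls the model expected to do heavy damage (hypothetical cutoff 1.5).
unlucky <- subset(preds, PlayResult == "Out" & xwoba >= 1.5)

# Doubles or better on balls the model expected to be outs (hypothetical cutoff 0.25).
lucky <- subset(preds, woba >= 1.244 & xwoba <= 0.25)

print(unlucky)
print(lucky)
```

Sorting by `residual` instead of using fixed cutoffs would give the largest misses in either direction without having to pick thresholds.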

The main limitations of the model are the same reasons it may predict an outcome incorrectly: with the data available, it is hard to factor in all of the variables that have an effect on the outcome of a play. If some of them could be included, the model's performance could improve.

Attached is a link to my GitHub repository with my entire code:

https://github.com/jaronrichman/INST414-Module-Assignments
