Modelling xwOBA (With KNN)

Thomas Nestico
10 min read · Feb 14, 2024

Introduction

Baseball analytics has developed over the years to better capture a player's value across all facets of the game. One of the most noteworthy developments was the creation of the all-encompassing offensive metric Weighted On-Base Average, better known as wOBA. This article will dive deeper into wOBA and its applications. Additionally, I will walk through my process of creating an "expected" wOBA metric (xwOBA), similar to the one defined by MLB and accessible via Baseball Savant.

What is wOBA?

wOBA was created by Tom Tango, Mitchel Lichtman, and Andrew Dolphin and introduced in their book "The Book: Playing the Percentages in Baseball", originally published in 2006. wOBA is defined as follows:

wOBA is a version of on-base percentage that accounts for how a player reached base instead of simply considering whether a player reached base. The value for each method of reaching base is determined by how much that event is worth in relation to projected runs scored (example: a double is worth more than a single).

wOBA is a valuable metric because it weights the events by which a batter reaches base. wOBA is a linear function of those outcomes, and the weights are the coefficients assigned to each one, based on that event's expected run value. This differs from a metric like SLG, which treats a single (1B) as worth exactly half a double (2B), a ratio that does not accurately reflect the expected run values of those events. Additionally, the coefficients are recalculated each season depending on how often each event occurs.

The following is the formula for wOBA during the 2023 MLB Season:

Figure 1: 2023 wOBA Formula
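
For readers without the image, the general form of the formula is written out below; the 2023 weights are approximately the published FanGraphs values.

$$ wOBA = \frac{w_{BB}\cdot uBB + w_{HBP}\cdot HBP + w_{1B}\cdot 1B + w_{2B}\cdot 2B + w_{3B}\cdot 3B + w_{HR}\cdot HR}{AB + BB - IBB + SF + HBP} $$

where uBB denotes unintentional walks and, for 2023, the weights are roughly w_BB = 0.696, w_HBP = 0.726, w_1B = 0.883, w_2B = 1.244, w_3B = 1.569, and w_HR = 2.004.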

In summary, wOBA is an all-encompassing offensive metric that weights outcomes by their expected run value. This allows wOBA to measure a batter's value more comprehensively than metrics such as AVG, OBP, and SLG, whose lack of proper weighting fails to fully capture the context of each event.

A popular metric commonly used to combine the effects of OBP and SLG is On-Base Plus Slugging (OPS), calculated by simply summing OBP and SLG. While it tracks run production reasonably well, it has an inorganic scale (one that overvalues slugging relative to getting on base) and is mathematical malpractice, as it sums two fractions with different denominators. For these reasons, OPS is frowned upon by many baseball analysts.

Here is a table summarizing wOBA Rules of Thumb:

Figure 2: wOBA Rules of Thumb
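
For readers without the image, the benchmarks follow the commonly cited FanGraphs rules of thumb, roughly:

  • Excellent: .400 and above
  • Great: .370
  • Above Average: .340
  • Average: .320
  • Below Average: .310
  • Poor: .300
  • Awful: .290 and below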

Expected wOBA (xwOBA)

Now that we have covered wOBA, we can look at expected wOBA (xwOBA) and its applications. MLB defines xwOBA as the following:

Expected Weighted On-base Average (xwOBA) is formulated using exit velocity, launch angle and, on certain types of batted balls, Sprint Speed.

Similar to wOBA, xwOBA tries to capture the offensive production of a batter based on what has occurred. One key difference is that xwOBA does not consider only the actual outcome; it predicts the outcome of an event based on metrics such as Exit Velocity and Launch Angle, providing a probability for each possible outcome. These probabilities are calculated using Statcast data from prior seasons. Since walks and HBP are binary, they are always assigned a probability of 1 or 0 depending on the outcome. This means that where wOBA and xwOBA differ is in how they treat batted ball events.

xwOBA is a descriptive statistic: it takes an event that has occurred and assigns a value (in this case, a wOBA value) to that event. It describes what should have happened given the features of the batted ball and removes defence from the equation, since neither the batter nor the pitcher can influence a batted ball once it is in play.

Recreating xwOBA

You may be asking this question:

“If xwOBA already exists, why should we try to recreate it?”

My answers to that question would be:

  1. I want to help others understand some more advanced topics in baseball analytics and machine learning (in this case, the KNN algorithm)
  2. I can easily compare my results against MLB's to gauge my approach and methodology
  3. I think it's a cool project to recreate

With that out of the way, we can proceed!

As mentioned previously, xwOBA is formulated using features of batted ball events and, in some cases, Sprint Speed. To simplify our modelling, we will limit our xwOBA to batted ball metrics. Thankfully, MLB has provided Exit Velocity and Launch Angle data for every batted ball event since 2015. With this data, we can begin to think about how to apply it to recreating xwOBA.

K-Nearest Neighbours

K-Nearest Neighbours (KNN) is a machine learning algorithm which aims to classify an unknown data point given the classifications of data points that are nearest to it.

The steps of the algorithm are as follows:

  1. Select the number of neighbours (k)
  2. Calculate the distance between the data point of interest and all other data points
  3. Select the k nearest data points (neighbours) to the data point of interest
  4. Classify the data point according to the classification that appears most often among those neighbours

In my opinion, a visual example makes this concept easiest to understand. Figure 3 illustrates KNN well:

Figure 3: K-Nearest Neighbours Summary

In this example, k was 4. The four nearest data points (neighbours) to the data point of interest had the following classifications:

  1. Yellow
  2. Yellow
  3. Green
  4. Brown

Since yellow appeared more often among the four nearest neighbours than any other class, the data point of interest is classified as yellow.
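
For readers who prefer code, here is a minimal sketch of the same majority-vote idea using scikit-learn's KNeighborsClassifier. The points and labels are made up purely to echo the figure; nothing here is real baseball data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with made-up class labels, loosely mirroring the figure above
X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0], [1.2, 1.4],
              [3.0, 3.2], [3.1, 2.9], [5.0, 5.1], [5.2, 4.9]])
y = np.array(["yellow", "yellow", "yellow", "green",
              "green", "brown", "brown", "brown"])

# k = 4, as in the example above
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X, y)

point = np.array([[1.3, 1.1]])     # the data point of interest
print(knn.predict(point))          # majority class among its 4 nearest neighbours
print(knn.classes_)                # column order used by predict_proba
print(knn.predict_proba(point))    # fraction of the 4 neighbours in each class
```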

Applying KNN to Baseball Data

Now that we have a clearer understanding of how the KNN algorithm works, we can think about how to apply it to baseball data. KNN is an intuitive algorithm that is very powerful when the features of each classification tend to be distinct from one another. In our xwOBA example, there are five batted ball outcomes we must consider:

  1. Field Out (including errors)
  2. Single
  3. Double
  4. Triple
  5. Home Run

As we are trying to recreate xwOBA, and are limited to batted ball metrics, we can use KNN and train a classification machine learning model with the following features:

  • launch_speed: How fast, in miles per hour, a ball was hit by a batter.
  • launch_angle: How high/low, in degrees, a ball was hit by a batter.

The reason for not including spray angle stems from a Tom Tango blog post. Since xwOBA is trying to describe the player rather than the play, spray angle can be ignored because it is mostly noise. Tango goes on to show that xwOBACON (xwOBA on contact) calculated without spray angle is a better predictor of next season's xwOBACON than the version that includes it. This observation is also why MLB's xwOBA does not include spray angle.

In order to train a classification model, we need a target. Since we are working with batted ball outcomes, we can classify each event using total bases. The target for the model is:

  • total_bases: the number of bases a batter gains on the batted ball (0 for an out, 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run).
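
As a quick sketch of how this target can be built (assuming Statcast-style data, as returned by the pybaseball package, where the events column carries values such as "single", "double", "triple", and "home_run"):

```python
import pandas as pd

# Map Statcast hit events to total bases; everything else in play
# (field outs, force outs, double plays, errors, sac flies, ...) counts as 0.
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}

def add_total_bases(batted: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of a batted ball dataframe with a total_bases target column."""
    batted = batted.copy()
    batted["total_bases"] = batted["events"].map(HIT_BASES).fillna(0).astype(int)
    return batted
```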

Now that we have the features and the target, we can proceed to training the model. Before we do that, we should visualize how exit velocity and launch angle impact total bases. This is illustrated in Figure 4.

Figure 4: Total Base Frequency by Exit Velocity and Launch Angle

As I mentioned previously, the KNN algorithm is powerful when the features of each class are distinct. The frequency plot illustrates that this is not the case for our features and classes. Fortunately for us, the model we train can return probabilities for each outcome rather than a single prediction. We can multiply these probabilities by the wOBA coefficients to calculate our own xwOBACON (xwOBA on contact) metric.

The formula to calculate xwOBACON for a single event is the expected value of wOBACON returned by the model:

Figure 5: xwOBACON Formula
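
Written out with the model's outcome probabilities and the wOBA weights from above, that expected value is:

$$ xwOBACON = P(1B)\cdot w_{1B} + P(2B)\cdot w_{2B} + P(3B)\cdot w_{3B} + P(HR)\cdot w_{HR} $$

An out carries a weight of 0, so its probability drops out of the sum.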

We now have a way to calculate xwOBACON with a machine learning model built on the KNN algorithm. To calculate xwOBA, we also need to consider the events that are not batted ball events but still affect wOBA. These events include:

  1. Unintentional Base on Balls (walk)
  2. Hit By Pitch
  3. Strikeouts

To calculate xwOBA for a single batter, we take the wOBA formula from before and substitute the batter's summed xwOBACON for the batted ball terms. This yields:
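
$$ xwOBA = \frac{w_{BB}\cdot uBB + w_{HBP}\cdot HBP + \sum xwOBACON}{AB + BB - IBB + SF + HBP} $$

where the summation runs over the batter's batted ball events (using the same notation as the wOBA formula above). Strikeouts add nothing to the numerator but still count in the denominator through at-bats, which is why they must be tracked.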

Let’s get to training and testing the model.

Training and Testing

Data Preparation

For this project, I gathered MLB batted ball data from the 2020 through 2023 seasons. The data from 2020 to 2022 will be used for training. The 2023 data will be used to calculate xwOBA and compare it against MLB xwOBA, which is accessible via Baseball Savant.

The model we are creating does not explicitly calculate xwOBA. Rather, it predicts total bases, to which we then apply the wOBA coefficients to get xwOBA. For this reason, training must only consider batted ball events, so all non-batted-ball data were filtered out of the 2020 to 2022 dataset.

To get an honest measure of how the model performs on unseen data, I applied a train/test split to the dataset.
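
Here is a sketch of this preparation step. It assumes the pybaseball package for pulling Statcast data and the standard Statcast column names; the date ranges and the 80/20 split are illustrative choices rather than the exact ones used for this project, and it reuses the add_total_bases helper sketched earlier.

```python
import pandas as pd
from pybaseball import statcast
from sklearn.model_selection import train_test_split

# Pull pitch-level Statcast data for the training seasons (illustrative date ranges)
seasons = [("2020-07-23", "2020-09-27"),
           ("2021-04-01", "2021-10-03"),
           ("2022-04-07", "2022-10-05")]
df = pd.concat([statcast(start_dt=start, end_dt=end) for start, end in seasons])

# Keep only batted ball events ("type" == "X") with valid launch speed and angle
batted = df[df["type"] == "X"].dropna(subset=["launch_speed", "launch_angle"])
batted = add_total_bases(batted)          # helper sketched in the previous section

X = batted[["launch_speed", "launch_angle"]]
y = batted["total_bases"]

# Hold out a test set so performance is measured on unseen batted balls
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```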

Parameter Tuning

The main hyperparameter in a KNN model is the number of neighbours, k. For this model, I selected k = 11, arrived at iteratively.
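
One reasonable way to do this iteratively is to compare cross-validated accuracy across a range of k values; the sketch below (reusing X_train and y_train from the data preparation sketch) is illustrative rather than the exact procedure used.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Compare mean 5-fold cross-validated accuracy for a range of odd k values
for k in range(3, 26, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy").mean()
    print(f"k = {k:2d}  mean CV accuracy = {score:.3f}")
```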

Performance

The accuracy of our model is 76%, meaning that on the test set, the model correctly predicted the total bases of 76% of the data points given their launch speed and launch angle. While this might not seem encouraging, remember that we are not primarily interested in a single classification but in the probabilities. As we visualized previously, there is a lot of overlap in outcomes with the selected features. In my opinion, an accuracy of 76% shows that we are on the right track, because it is unlikely that any model limited to these two features could predict a single number of total bases with much higher accuracy.

An interesting observation is that the model predicted zero triples in the test set. This may seem odd, but triples are rare, and when they are hit they tend to have batted ball features similar to other events, so they are rarely the most common class among a point's neighbours. The model also predicted far more outs than actually occurred, which makes sense because it does not factor in spray angle or defence. These observations are illustrated in the confusion matrix in Figure 6.

Figure 6: Model Confusion Matrix
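
For reference, here is a sketch of how these diagnostics can be produced with scikit-learn, continuing from the earlier sketches; the exact numbers above come from the full data pull and split used for this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Fit the final model with the selected k and score it on the held-out test set
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Rows are actual total bases (0, 1, 2, 3, 4); columns are predicted total bases
print(confusion_matrix(y_test, y_pred, labels=[0, 1, 2, 3, 4]))
```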

Comparing xwOBA

Using the previously defined formula for xwOBA, we can calculate xwOBA for each batter during the 2023 season. This can be done by using the model to predict batted ball probabilities while accounting for walks, hit by pitches, and strikeouts.
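
Here is a sketch of that calculation, continuing from the earlier code. The 2023 weights are approximate, batted_2023 stands for the 2023 batted ball data prepared the same way as the training set, and pa_2023 is a hypothetical per-batter table of AB, BB, IBB, SF, and HBP counts; the real aggregation depends on how those plate appearance counts are stored.

```python
import numpy as np

# Approximate 2023 wOBA weights (see the formula above); treat these as assumptions
W_BB, W_HBP = 0.696, 0.726
W_TB = np.array([0.0, 0.883, 1.244, 1.569, 2.004])    # weights for 0/1/2/3/4 total bases

# xwOBACON for each 2023 batted ball: probability-weighted sum of the wOBA weights
proba = knn.predict_proba(batted_2023[["launch_speed", "launch_angle"]])
batted_2023["xwobacon"] = proba @ W_TB                 # columns follow knn.classes_ = [0..4]

xwobacon = batted_2023.groupby("batter")["xwobacon"].sum()

# Fold in walks and HBP, then divide by the wOBA denominator (pa_2023 is hypothetical)
numerator = W_BB * (pa_2023["BB"] - pa_2023["IBB"]) + W_HBP * pa_2023["HBP"] + xwobacon
denominator = (pa_2023["AB"] + pa_2023["BB"] - pa_2023["IBB"]
               + pa_2023["SF"] + pa_2023["HBP"])
xwoba_tjstats = (numerator / denominator).rename("xwOBA (TJStats)")
```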

At this point, we know our model does a good job of classifying batted ball events, but we do not know whether we did an effective job of recreating MLB xwOBA. An easy way to understand the relationship between our model and MLB's is to plot the two in a scatter plot. To keep the comparison clear, the metric from the model we trained will be referred to as "xwOBA (TJStats)" and MLB xwOBA will be referred to as "xwOBA (MLB)".

Figure 7 is a scatter plot showing the relationship between xwOBA (TJStats) and xwOBA (MLB).

Figure 7: xwOBA Comparison

With an R² of 0.96, xwOBA (TJStats) matches up exceptionally well with xwOBA (MLB).

xwOBA Output

Here is a link to a spreadsheet with the predicted xwOBA generated by the model.

This spreadsheet includes all qualified batters during the 2023 MLB season and lists their wOBA, xwOBA (TJStats), xwOBA (MLB), and xwOBA differences.

Limitations

In machine learning, there will almost always be limitations when training a model. In our case, not considering Sprint Speed loses context on certain batted ball events; xwOBA (MLB) uses Sprint Speed on certain types of batted balls to better capture a batter's offensive ability. For xwOBA (TJStats), we limited ourselves to just two features, Launch Speed and Launch Angle, for simplicity. Despite this limitation, we trained a model whose xwOBA matches up very well with xwOBA (MLB).

Looking at the difference column in the output spreadsheet, the batters with the largest deviations between the two xwOBA versions are those at the extreme ends of the Sprint Speed leaderboards. For example, Jon Berti and Alek Thomas are undervalued by our model, and they rank in the 95th and 97th percentile for Sprint Speed, respectively. On the opposite end, our model overvalues Dominic Smith and Pete Alonso, who rank in the 14th and 18th percentile for Sprint Speed, respectively.

The use of KNN was most likely limiting, as it relies on plain Euclidean distance between data points and cannot learn how much weight each feature should carry. An algorithm such as Random Forest may have proven more robust and accurate.

Conclusion

The purpose of this article was to give readers a deeper dive into a very valuable baseball metric, wOBA. This deep dive included defining the metric, explaining its "expected" variant, and training a machine learning model to recreate MLB's outstanding work on this topic.

The choice of KNN was very deliberate. In my experience, KNN is one of the most interpretable machine learning algorithms, as it is intuitive and can be clearly illustrated with graphics.

To all readers, especially those who are new to baseball analytics and/or machine learning, I hope this article provided you with knowledge and enjoyment. Thank you for taking the time to read, and I hope you are now a wOBA truther (if you weren’t one already)!

Code Repository

Here is the link to the GitHub repo which contains all code used for this project: https://github.com/tnestico/xwoba
