Modelling tjStuff+ v2.0

Thomas Nestico
14 min read · May 4, 2024


You can find “Modelling tjStuff+ v1.0” here

Introduction

A few months back, I undertook the very interesting task of pitch modelling. More specifically, training a machine learning model which takes the physical characteristics of a pitch and then predicts the expected run value. This concept was not a novel one, as pitch models like this have grown in popularity in recent years. The most common vernacular used to describe such models is “Stuff”, a relatively simple term to describe the effectiveness of a pitcher’s arsenal.

I have learned a lot more about pitch modelling and machine learning since then and decided that an update to my “tjStuff+” model was due. While I have tinkered with the model in the interim, I am writing this article to cover my methodology in updating it (referred to as v2.0 from now on).

Let’s begin!

Data Selection

I trained my original tjStuff+ (referred to as v1.0 from now on) using pitch data from the MLB Regular Seasons from 2020 to 2022. This was done because all data post-2019 was captured using Hawk-Eye technology, and I was more confident that the tracked data was precise. I did not include 2023 in the training set because I wanted to use it as the test set. By doing so, I validated that my model was performing well as a predictive and descriptive metric compared to more conventional metrics such as ERA, FIP, and xFIP. I will continue to test on 2023 data, so it will not be included in the training set.

Data Preparation

The same data preparation from v1.0 was undertaken in v2.0.

The physical characteristics of a pitch are well defined and accurately measured; however, these measurements are not normalized between pitchers of different handedness. This means that metrics such as Horizontal Release Point and Horizontal Break for left-handed pitchers are scaled by a factor of -1 compared to right-handed pitchers. We can normalize these “mirrored” metrics so that, during training, pitches thrown from either hand are on the same scale, which should improve performance.
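As a minimal sketch of this normalization (the DataFrame layout and the `pitcher_hand` column name are my own assumptions, not the model's actual schema):

```python
import pandas as pd

def normalize_handedness(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror horizontal metrics for left-handed pitchers so that
    pitches from either hand share the same sign convention."""
    df = df.copy()
    mirrored = ["x0", "hb"]  # horizontal release point, horizontal break
    df.loc[df["pitcher_hand"] == "L", mirrored] *= -1
    return df

pitches = pd.DataFrame({
    "pitcher_hand": ["R", "L"],
    "x0": [-1.8, 1.8],  # feet, catcher's perspective
    "hb": [8.0, -8.0],  # inches, catcher's perspective
})
normalized = normalize_handedness(pitches)
```

After mirroring, the left-hander's release point and break carry the same sign as an equivalent right-hander's, so the model sees one scale for both.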

Feature Selection

Feature selection is one of the biggest changes from v1.0 to v2.0.

The metrics which remained unchanged are as follows:

start_speed

  • The speed of a pitch as it is released from the pitcher’s hand, measured in miles per hour

spin_rate

  • The rotation per minute of a pitch as it travels through the air

hb

  • Horizontal movement in inches from the catcher’s perspective

ivb

  • Induced Vertical movement in inches from the catcher’s perspective

x0

  • Horizontal Release Position of the ball measured in feet from the catcher’s perspective

z0

  • Vertical Release Position of the ball, measured in feet from the catcher’s perspective.

From my research, these metrics are fundamental in physically characterizing a pitch. Extension was the only direct feature which changed in v2.0. I received a lot of questions regarding the use of pitch location and other location-related metrics such as Vertical Approach Angle (VAA). As in v1.0, I did not include location-influenced metrics in my model, as my goal was to look specifically at the physical characteristics of the pitch itself and minimize the impact of location on my model.

With that out of the way, let’s cover which features did change:

Extension

Extension, the release extension of a pitch measured in feet, is a valuable feature in pitch design. It provides information about the pitcher’s delivery: a larger extension causes the ball to feel faster than it actually is due to the shorter distance it travels from the pitcher’s hand. However, during training and testing, outlier extension (for example, Logan Gilbert’s consistent 8 ft extension) caused the model to behave in unexpected ways. Even with regularization parameters in the XGBoosted Decision Tree, these extension outliers were simply too impactful on the model.

Removing extension seemed like the easiest option, but knowing its importance, I needed a better way to limit the influence of outlier extension. In v2.0, I applied a logarithmic transformation to extension. The new feature is:

log_extension

  • The natural logarithm of release extension of a pitch measured in feet
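A quick sketch of why the transformation helps (the example values are illustrative, not from the training data):

```python
import numpy as np

# Raw extensions in feet; ~8 ft (e.g. Logan Gilbert) is an extreme outlier
extension = np.array([5.5, 6.5, 8.0])
log_extension = np.log(extension)  # natural logarithm

# The log compresses the right tail: the 6.5 -> 8.0 gap shrinks
# relative to the 5.5 -> 6.5 gap, softening the outlier's pull.
raw_ratio = (extension[2] - extension[1]) / (extension[1] - extension[0])
log_ratio = (log_extension[2] - log_extension[1]) / (log_extension[1] - log_extension[0])
```

On the raw scale the outlier gap is 1.5x the typical gap; on the log scale that ratio drops, which is exactly the dampening effect the model needs.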

Primary Pitch Features

v1.0 had 3 features which compared a pitcher’s fastball to the rest of their arsenal (90th Percentile velocity, 90th percentile iVB, 90th Percentile HB). While there is some merit to comparing a pitcher’s secondaries to their fastball, there were a few oversights in the way I approached this:

  1. I didn’t limit it to just one type of fastball. This means that the values from which the features were calculated could come from different pitches. For example, if a pitcher threw both a 4-Seamer and a Sinker, the model would typically use the Sinker’s HB and the 4-Seamer’s velocity and iVB. I used the 90th percentile of each fastball’s velocity, iVB, and HB to account for usage rates (i.e., very low Sinker usage would not impact the features), but even moderate usage would cause this multiple-fastball situation.
  2. Fastballs aren’t always a pitcher’s primary offering. My model effectively treated fastballs as the primary offering, even if their usage was very low. It makes sense that we should be looking at a pitcher’s true primary pitch and then comparing their remaining pitches against that.
  3. If a pitcher didn’t have a fastball, the model would not know how to handle it. This situation typically occurred with position players pitching, as their pitch types are either unknown or assigned at random. This led to bizarre tjStuff+ grades, such as Isiah Kiner-Falefa recording a -404 tjStuff+ on his “Fastball”.

In v2.0, instead of comparing a pitcher’s arsenal to their fastballs, the features consider the metrics of the pitcher’s primary pitch. This should better capture the interaction between a pitcher’s offerings and avoid the oversights listed above.

The new features are as follows:

primary_start_speed_diff

  • Difference of pitch speed and mean speed of the pitcher’s primary pitch

primary_ivb_diff

  • Difference of pitch iVB and mean induced vertical break of the pitcher’s primary pitch

primary_hb_diff

  • Absolute difference of pitch HB and mean horizontal break of the pitcher’s primary pitch

I used the mean for each feature to capture the most common characteristics of a pitcher’s primary pitch. The median would be a suitable choice as well.
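A sketch of how these features could be derived with pandas (the column names and `pitcher_id` key are my own assumptions, not the model's actual schema):

```python
import pandas as pd

def add_primary_pitch_features(df: pd.DataFrame) -> pd.DataFrame:
    """Find each pitcher's most-used pitch type (their "primary pitch"),
    then express every pitch relative to that pitch's mean characteristics."""
    df = df.copy()
    # Most frequently thrown pitch type per pitcher
    primary = (df.groupby("pitcher_id")["pitch_type"]
                 .agg(lambda s: s.value_counts().idxmax()))
    # Mean velo / iVB / HB of each pitcher's primary pitch
    is_primary = df["pitch_type"] == df["pitcher_id"].map(primary)
    means = (df[is_primary]
               .groupby("pitcher_id")[["start_speed", "ivb", "hb"]].mean())
    df["primary_start_speed_diff"] = df["start_speed"] - df["pitcher_id"].map(means["start_speed"])
    df["primary_ivb_diff"] = df["ivb"] - df["pitcher_id"].map(means["ivb"])
    # Absolute difference for HB, per the feature definition above
    df["primary_hb_diff"] = (df["hb"] - df["pitcher_id"].map(means["hb"])).abs()
    return df

pitches = pd.DataFrame({
    "pitcher_id": [1, 1, 1],
    "pitch_type": ["FF", "FF", "SL"],
    "start_speed": [95.0, 93.0, 85.0],
    "ivb": [16.0, 14.0, 2.0],
    "hb": [8.0, 6.0, -4.0],
})
out = add_primary_pitch_features(pitches)
```

In this toy example the 4-Seamer (FF) is the primary pitch, so the slider is graded against the mean FF velocity, iVB, and HB.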

This leaves us with 10 features for v2.0:

  1. start_speed
  2. spin_rate
  3. log_extension
  4. hb
  5. ivb
  6. x0
  7. z0
  8. primary_start_speed_diff
  9. primary_ivb_diff
  10. primary_hb_diff

Target Selection

v1.0 and v2.0 have the same target, expected run value (xRV). Refer to the v1.0 article to understand how I calculated and assigned run values to each pitch in the training set.

Model Selection

v1.0 and v2.0 both use an XGBoosted Decision Tree Regression model to model the run expectancy of a pitch.

In the past few months I have done more research into XGBoosted decision trees. I learned more about parameter tuning, and I have applied it to v2.0 to better account for outliers and make the model more conservative. In doing so, it led to more accurate results.

The parameters used for the model are as follows:

params = {
    'objective': 'reg:squarederror',  # Regression task using squared error
    'max_depth': 6,                   # Maximum depth of each tree
    'learning_rate': 0.1,             # Learning rate
    'subsample': 0.75,                # Subsample ratio of the training instances
    'colsample_bytree': 0.75,         # Subsample ratio of columns when
                                      # constructing each tree
    'reg_lambda': 0.9,                # L2 regularization term
    'reg_alpha': 0.8,                 # L1 regularization term
    'random_state': 42                # Random seed for reproducibility
}

These parameters were selected through iterative training and testing. Similar to v1.0, I trained only one model for v2.0.

Feature Importance

Figure 1 summarizes the importance of each feature in v2.0.

Figure 1: Feature Importance Plot

Comparing this to v1.0, we can see that there is a much more even distribution of importance within the model. This makes sense, as we tuned the model to be more conservative. No single feature greatly outweighs another, and as we will see later, v2.0 is both more predictive and more descriptive than v1.0.

The most important features relate to the movement and velocity of the pitch. Intuitively this makes sense, because these features control how the pitch moves through space. The velocity difference is a substantial feature because it is a large part of sequencing and catching batters off guard with varying pitch speeds. The rest of the features have similar importance, which helps the model stay balanced.

Calculating tjStuff+

tjStuff+ v2.0 is calculated the same as v1.0:

The output of my model is expected run value, which means that for any given pitch, the model can predict how effective that pitch is at limiting runs based on its physical characteristics. We can use a standardization technique to assist in comparing pitchers and pitches to one another. This is where the calculation of tjStuff+ arises.

tjStuff+ is similar to the prospect tool grade scale. The prospect tool grade is a normal distribution which uses 50 as the average and 10 as the standard deviation. This means that a prospect with a “60 Grade” hit tool, has a hit tool 1 standard deviation above the mean, which would slot them approximately into the 84th percentile. Increase that to a “70 Grade” hit tool, and now the prospect sits at the 97th percentile of hit tools. tjStuff+ follows this same standardization, but uses 100 as the mean and 10 as the standard deviation.
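Concretely, the scaling can be sketched as a flipped z-score. The sign flip is my assumption that lower xRV (fewer expected runs allowed) is better for the pitcher:

```python
import numpy as np

def tj_stuff_plus(xrv: np.ndarray) -> np.ndarray:
    """Map expected run values onto a 100-mean, 10-SD scale.
    Sign is flipped under the assumption that lower xRV
    (fewer expected runs allowed) is better for the pitcher."""
    z = (xrv - xrv.mean()) / xrv.std()
    return 100 - 10 * z

# Toy xRV values: below-average, average, above-average runs allowed
scores = tj_stuff_plus(np.array([-0.02, 0.00, 0.02]))
```

On this scale a score of 110 sits one standard deviation above league average, mirroring a "60 Grade" on the prospect scale.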

Model Performance

The goal of the model is to assess a pitcher’s ability to limit runs based on the physical characteristics of their pitches. To assess the model’s performance, I will be looking at its descriptive and predictive performance.

Descriptive: I will calculate the correlations of v2.0, v1.0, FIP, wOBA, and K-BB% to ERA, wOBA, and K-BB%

Predictive: I will calculate the correlations of v2.0, v1.0, FIP, wOBA, and K-BB% to Next Season FIP, wOBA, and K-BB%

I will not be using ERA to test the predictiveness of tjStuff+ because ERA is not a good indicator of future performance. As Tom Tango states:

You really do NOT care about ERA for three important reasons:

- sequencing is irrelevant

- baserunning is irrelevant

- ER v R distinction is irrelevant

To further drive home this point, Figure 2 illustrates the correlations of current-season ERA to next-season metrics over time. As we can tell, ERA is not a valuable predictive metric.

Figure 2: ERA Predictive Correlations

Descriptiveness

I modelled tjStuff+ to be a predictive measure of a pitcher’s performance. However, since we can easily see how it operates as a descriptive measure, we will do so.

We will be looking at sample sizes of 100, 250, 500, and 750 pitches and the correlation between the specified metrics from the same season to evaluate the descriptiveness of tjStuff+ v2.0. This is illustrated in Figure 3.

Figure 3: Metric Descriptiveness

As mentioned previously, tjStuff+ was not modelled to be a descriptive measure. v2.0 is a better descriptive metric than v1.0, however all of FIP, wOBA, and K-BB% are significantly better at describing ERA, wOBA, and K-BB% than tjStuff+. This makes sense as those 3 metrics consider actual results.

Figures 4 and 5 illustrate how v2.0 and FIP correlate to specific metrics over time. Unlike v2.0, FIP is a fantastic descriptor of the metrics, even at very small sample sizes. v2.0 significantly improves as a descriptive metric as the sample of pitches grows. While its descriptive capabilities are limited, it can describe ERA, wOBA, and K-BB% moderately well, plateauing at approximately 500 pitches.

Figure 4: FIP Descriptive Correlations
Figure 5: tjStuff+ Descriptive Correlations

Predictiveness

I modelled tjStuff+ to be a predictive measure of a pitcher’s performance. We will be looking at sample sizes of 100, 250, 500, and 750 pitches and the correlation between the specified metrics from the following season to evaluate the predictiveness of tjStuff+ v2.0. This is displayed in Figure 6.

Figure 6: Metric Predictive Correlations

v2.0 performs well at small samples and better than v1.0 overall. K-BB% is the best predictive metric of the bunch at small samples, but past 250 pitches, v2.0 is the best predictor of wOBA. Being a strong predictor of wOBA is important, as it is an all-encompassing offensive metric which summarizes a pitcher’s ability to limit baserunners and total bases (TB). As we increase the pitch minimum to 750, v2.0 and K-BB% have very similar correlations with FIP, but v2.0 takes the edge in wOBA.

Overall, v2.0 and K-BB% are the best predictive metrics of the group, and at larger samples they perform similarly.

Figures 7 and 8 illustrate how v2.0 and K-BB% correlate to specific metrics over time. In small samples, K-BB% is the better predictive metric but at around 400 pitches, v2.0 catches up and even surpasses K-BB% as a wOBA predictor.

Figure 7: tjStuff+ Predictive Correlations
Figure 8: K-BB% Predictive Correlations

From this point forward, tjStuff+ will refer to v2.0.

With this analysis, tjStuff+ is valuable as a predictive measure to assess the performance of a pitcher in the future.

Stabilization

We determined that tjStuff+ is a valuable predictive measure through our analysis of future metric correlations. It is important to consider how quickly the metric stabilizes to understand when the information provided by tjStuff+ becomes meaningful.

To determine the stabilization points of tjStuff+, I grabbed all pitchers with at least 250 pitches thrown between 2020 and 2023 and calculated tjStuff+ on each of their pitches. Following that, I wrote a script which looks at each pitcher and determines the point at which their tjStuff+ does not deviate more than ±0.5 tjStuff+ over every 10 pitches, starting at 10 pitches.

For example, if a pitcher’s tjStuff+ over pitches 0–90 was 100 and their tjStuff+ over pitches 0–100 was 100.2, that pitcher’s tjStuff+ would be considered to have stabilized at pitch 100.

I decided to keep the threshold at ±0.5 tjStuff+ as it felt both reasonable and non-restrictive due to the scaling of the metric.
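One reading of this procedure as code (the exact comparison rule is my interpretation of the description above, not the original script):

```python
import numpy as np

def stabilization_point(tj_stuff: np.ndarray, step: int = 10, tol: float = 0.5):
    """Return the pitch count at which the running-average tjStuff+ first
    changes by no more than +/-tol between consecutive `step`-pitch
    checkpoints. Checkpoints start at `step` pitches."""
    checkpoints = np.arange(step, len(tj_stuff) + 1, step)
    running = np.array([tj_stuff[:n].mean() for n in checkpoints])
    for i in range(1, len(running)):
        if abs(running[i] - running[i - 1]) <= tol:
            return int(checkpoints[i])
    return None  # did not stabilize within the sample

# A pitcher whose early running average swings, then settles near 100
tj = np.concatenate([np.full(10, 106.0), np.full(10, 94.0), np.full(80, 100.0)])
point = stabilization_point(tj)
```

Here the running average moves 6 points between the first two checkpoints, then holds steady, so the metric is flagged as stabilized at pitch 30.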

Figure 9: tjStuff+ Stabilization Points

From Figure 9, we can see that the median stabilization point for tjStuff+ was 220 pitches, which is equivalent to approximately 3 starts or 10 relief appearances. This point is likely reached within the first few weeks of an MLB season. A limitation of looking at all pitches to find the stabilization point is that pitchers may have many different pitches with varying tjStuff+ and usage. Looking at individual pitch types may provide a more accurate stabilization point.

Figure 10 summarizes the stabilization points of different pitch types using the aforementioned method.

Figure 10: tjStuff+ Stabilization Points by Pitch Type

Understandably, tjStuff+ on individual pitch types stabilizes in fewer pitches than tjStuff+ on all pitches. This makes sense, since individual pitch types tend to have less variability in their physical characteristics, and therefore less variability in their tjStuff+, than the full set of pitches a pitcher has thrown.

Stickiness

An important aspect of a predictive statistic is that it is “sticky” year over year. Stickiness is the property of a statistic to exhibit consistency over time. We can calculate a statistic’s stickiness by computing the coefficient of determination (R²) of the statistic between two consecutive seasons. For this example, we are looking at Season N tjStuff+ vs. Season N+1 tjStuff+, which is illustrated in Figure 11.

Figure 11: tjStuff+ Stickiness

From 2020 to 2023, tjStuff+ has an R² of 0.66 across seasons. This value indicates that tjStuff+ is a sticky statistic. The stickiness of tjStuff+ is desirable, as it means that a player is likely to attain a similar tjStuff+ in consecutive seasons, which supports the use of tjStuff+ as a predictive statistic.
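As a sketch, the year-over-year R² can be computed as the squared Pearson correlation between paired seasons (the values below are toy numbers for illustration, not the actual data):

```python
import numpy as np

def stickiness_r2(season_n: np.ndarray, season_n_plus_1: np.ndarray) -> float:
    """Squared Pearson correlation between paired pitcher tjStuff+
    values in consecutive seasons."""
    r = np.corrcoef(season_n, season_n_plus_1)[0, 1]
    return float(r ** 2)

# Toy paired tjStuff+ values for the same four pitchers in seasons N and N+1
season_n = np.array([95.0, 100.0, 105.0, 110.0])
season_n1 = np.array([96.0, 101.0, 103.0, 111.0])
r2 = stickiness_r2(season_n, season_n1)
```

An R² near 1 means pitchers largely hold their grade from one season to the next; an R² near 0 means last season's grade tells you little.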

Results

Figure 12 illustrates the distribution of single-pitch tjStuff+.

Figure 12: Distribution of Single Pitch tjStuff+

The largest difference from v1.0 to v2.0 is the overall decrease in tjStuff+ across the board. This aligns with making the model more conservative. Additionally, v2.0 is less favourable to Sweepers, as the model considers their unique movement profile less important than v1.0 did.

Here is a link to a spreadsheet which compares v2.0 tjStuff+ with v1.0 tjStuff+ for the 2024 Season (through May 2, 2024).

Outlier Extension

Addressing outlier extension was a goal for the updated model. Two pitchers are known for their extreme extension: Alexis Díaz and Logan Gilbert. Both average 7.7 ft of extension and have great results. The update has greatly affected both of their tjStuff+ scores.

Alexis Díaz saw the largest increase in tjStuff+, while Logan Gilbert saw one of the largest drops. Digging deeper into the data, my best hypothesis is that the model saw Díaz’s outlier extension paired with his pitch shapes in the training set, whereas Gilbert’s extension saw a huge increase this season. The model is likely struggling to accurately evaluate Gilbert’s pitches because it has never seen outlier extension tied to his pitch shapes. This hypothesis seems to hold: if I force the extension on all of Gilbert’s pitches to be 7 ft, his tjStuff+ increases substantially from v1.0.

I do not want to restrict the model in this way. We will proceed forward with the knowledge that my model, although performing better than v1.0, cannot understand the effectiveness of Logan Gilbert and his huge extension.

Negative tjStuff+

A few position players, like Matt Mervis, have high-grading tjStuff+, which is likely due to the lack of training data on their pitches. While the model is likely overvaluing these players, we are no longer seeing any negative values in v2.0. Understandably, position players tend to grade out poorly. Seeing a position player get a -404 tjStuff+ was humorous, but having reasonable results for them makes me more confident in my model. Overall, position player tjStuff+ can essentially be ignored.

Limitations

No model is perfect, and tjStuff+ is no exception. There are a few limitations which I hope to address in future iterations of my model.

1) Handedness

Considering batter and pitcher handedness could help capture the advantage certain pitches have depending on the matchup. For example, Sweepers vs Same-handed batters and Changeups vs Opposite-handed batters.

2) Weather and elevation

Weather and elevation can greatly impact the movement of a pitch and the mechanics of a pitcher. We know that pitches behave differently in different environments, and considering and adjusting for these factors may more accurately assess a pitch’s value.

3) Count and Situation

Pitches which can secure strikeouts tend to grade out really well in tjStuff+. A strikeout is the best possible outcome for a pitcher, but this may lead to certain pitches being overvalued by the model. For example, sweepers and sliders will tend to grade out well because they are mostly thrown in situations where the pitcher is already ahead in the count. Pitches that are great for getting ahead in the count early could be just as valuable, but the model will likely not value them similarly.

Conclusion

I hope this article provided you with knowledge regarding pitch modelling as I work through trying to improve and refine my work. Models will never be able to explain everything, but as we saw, they can help us predict future performance to a moderate degree, and even better than more conventional metrics. I am still early in my journey of pitch modelling, and I hope to provide more insight and information into my methodology as I continue to learn.

Learning is awesome. Baseball is awesome. I hope you enjoyed!

Follow me on Twitter: https://twitter.com/TJStats
