XGBoost: Theory and Application

Sam Wirth
Published in Hoyalytics
Apr 18, 2023

Introduction

In this article, I will provide an explanation of the mathematical concepts behind XGBoost (eXtreme Gradient Boosting). I will then demonstrate a practical application of this algorithm to professional baseball data to determine if pitch characteristics like velocity, spin, or movement can predict the effectiveness of a pitch.

XGBoost is a gradient-boosting decision tree machine learning algorithm. Decision trees are supervised machine learning algorithms that predict the class (for a classification model) or the value (for a regression model) of a target variable through a set of decisions created by learning decision rules from past training data. Boosting is a technique that starts with an initial prediction model and adds new models over time to correct the errors made by the previous ones; this process continues until further improvements cannot be made. Gradient boosting means that each newly added model is fit to correct the errors of the models before it, with the additions guided by a gradient descent algorithm that minimizes the prediction error, and all of the models are then combined to make the final prediction. The whole process combines many weak learners (simple models only slightly better than random guessing), each added with a small learning rate so that no single step changes the model too quickly, into one strong model that gives the final predictions.
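As a toy illustration of this additive idea, here is a minimal sketch (using scikit-learn decision stumps on synthetic data, not XGBoost itself) of boosting with a squared-error loss, where each new shallow tree is fit to the current residuals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
for _ in range(100):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # each weak learner nudges the prediction
    trees.append(tree)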

XGBoost is well regarded as one of the premier machine learning algorithms for its high-accuracy predictions. It is also faster than many other algorithms, and significantly faster than other gradient-boosting implementations, because it can parallelize computation on a single machine: several processor cores work simultaneously on smaller pieces of the larger training problem. It does have drawbacks, however, the main one being its complexity. While XGBoost often outperforms a single decision tree, it sacrifices the intelligibility of the decision tree for higher accuracy. It is easy to follow the path of a single decision tree, but it may be impossible to logically trace the hundreds or thousands of trees used in an XGBoost model. Thus, although XGBoost often achieves higher accuracy than other models in both classification and regression problems, it gives up the intrinsic interpretability that those models possess, and it is often referred to as a “black-box” algorithm. Black boxes can be dangerous because they may boost their apparent accuracy through confounding variables: the model can pick up on third variables that affect both the features and the target, creating the appearance of a meaningful relationship where there is none.

The Theory Behind XGBoost

Now, I will walk through the theory behind a general XGBoost algorithm. The first step is identifying a training set of examples (x_i, y_i), with any number of features in x and y as the target variable, as well as a differentiable loss function L(y, F(x)). A loss function simply compares the actual value of the target variable with the predicted value of the target variable. We will also choose a learning rate that determines how much each newly added model contributes to the overall prediction. The learning rate is commonly a value between 0.1 and 0.3 and is meant to slow down the algorithm to prevent overfitting.
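For instance, one common choice is the squared-error loss (the steps below work for any differentiable loss):

L(y, F(x)) = \frac{1}{2}\,(y - F(x))^2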

Next, initialize the XGBoost model with a constant value:
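In the standard notation, writing \hat{f}_{(0)} for the initial model and N for the number of training examples, this initialization can be written as:

\hat{f}_{(0)}(x) = \underset{\theta}{\arg\min} \; \sum_{i=1}^{N} L(y_i, \theta)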

For reference, the mathematical expression argmin refers to the argument at which the expression that follows it is minimized. In the case of the XGBoost algorithm, θ ranges over candidate constant predictions, and the argmin picks the constant at which the loss function, and therefore the prediction error, is smallest (for squared-error loss this is simply the mean of the target variable). This constant serves as the first estimate for the algorithm; its error will be large at first but will get smaller and smaller with each additive iteration.

Then, for all m ∈ { 1, 2, 3, …, M }, compute the gradients and Hessians of the loss function with respect to the current model’s predictions:
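For each training example, these are the first and second derivatives of the loss with respect to the prediction, evaluated at the previous model (written here as \hat{g}_m and \hat{h}_m):

\hat{g}_m(x_i) = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = \hat{f}_{(m-1)}}, \qquad \hat{h}_m(x_i) = \left[ \frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f = \hat{f}_{(m-1)}}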

The gradients, often referred to as the “pseudo-residuals,” show how the loss function changes for a small change in the model’s current prediction. The Hessian is the derivative of the gradient, i.e., the rate at which that sensitivity itself is changing. The Hessian helps determine how quickly the gradient is changing, and therefore how large a correction the next model should make. Each of these is imperative for the gradient descent process.

Using these new matrices, another tree is added to the algorithm by completing the following optimization problem for each iteration of the algorithm:
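In one common formulation, the new base learner is chosen from the set of candidate trees Φ by a weighted least-squares fit to the ratio of gradients to Hessians, and the resulting tree is scaled by the learning rate α:

\hat{\phi}_m = \underset{\phi \in \Phi}{\arg\min} \; \sum_{i=1}^{N} \frac{1}{2}\, \hat{h}_m(x_i) \left[ -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} - \phi(x_i) \right]^2, \qquad p_m(x) = \alpha\, \hat{\phi}_m(x)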

This optimization problem uses a Taylor approximation, which makes it possible to apply traditional optimization techniques: the exact loss does not have to be recomputed for each individual base learner. In effect, it estimates where the algorithm currently sits in the gradient descent process. If the gradients are still large, meaning that the pseudo-residuals are large, the model still needs significant correction; if the gradients have flattened out, the algorithm is close to completion.
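Concretely, the approximation in question is the second-order Taylor expansion of the loss around the previous model’s prediction, which is why only the gradients and Hessians are needed:

L\!\left(y_i, \hat{f}_{(m-1)}(x_i) + \phi(x_i)\right) \approx L\!\left(y_i, \hat{f}_{(m-1)}(x_i)\right) + \hat{g}_m(x_i)\,\phi(x_i) + \frac{1}{2}\,\hat{h}_m(x_i)\,\phi(x_i)^2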

Notice the role the learning rate α plays in the optimization problem. p_m(x) is the new tree that is added to the model, and the learning rate scales how much this tree changes the model: a higher learning rate lets each newly added tree influence the overall prediction more, while a lower one makes the updates more gradual.

The model is then updated by adding the new trees to the previous model:
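In the notation above, the update is:

\hat{f}_{(m)}(x) = \hat{f}_{(m-1)}(x) + p_m(x)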

This process is repeated for every weak learner (each m ∈ { 1, 2, …, M }). Weak learners are used so that the model gains accuracy gradually as the loss function is minimized step by step. The final output of the XGBoost algorithm can then be expressed as the sum of the initial model and each individual weak learner:
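That is:

\hat{f}(x) = \hat{f}_{(0)}(x) + \sum_{m=1}^{M} p_m(x)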

Baseball Application

After explaining the mathematics behind a basic XGBoost machine learning algorithm, I will use such an algorithm to perform data analysis. In the game of baseball, there is a significant amount of data publicly available. Seemingly everything is tracked on the field, from how much a pitch spins to how fast a runner runs. We can use an XGBoost algorithm to extract insights from this data. A general question in the field of baseball analytics is how to determine how “good” any given pitch is. Is a pitch good because batters have made poor contact on it in the past? Is a pitch good because batters have swung and missed on it in the past? The answer to both of these is likely yes, but one can do better by building an algorithm to predict how well a pitch will do in the future, rather than describe how well it did in the past. For the purpose of this project, I define a successful pitch as one that generated a swing and miss from the batter.

I will look at every fastball thrown during the 2022 Major League Baseball season, training an XGBoost algorithm to determine the relationship between certain pitch movement characteristics and whether or not that pitch was a “success.” Then, I will use the algorithm to predict whose pitches will generate the most “successes” in the 2023 MLB season based on 2023 Spring Training data. The pitch movement characteristics are:

  • Velocity (release_speed): The velocity of the pitch in miles per hour.
  • Extension (release_extension): How far in front of the pitching rubber, in feet, the pitcher actually releases the ball. Since the rubber is 60.5 ft from home plate, more extension means the ball is released that much closer to the hitter.
  • Spin rate (release_spin_rate): Rate of spin on the ball after it was released by the pitcher, measured in RPM.
  • Induced vertical break (pfx_z): Vertical movement of the ball in inches caused by the spin of the pitch. For example, the induced vertical break of a fastball is positive, as the backspin of a fastball counteracts gravity, in a sense lifting it up from its normal path. On the other hand, the induced vertical break of a curveball is negative because its over-the-top spin augments the vertical drop caused by gravity. Each of these movement patterns can be explained by the Magnus effect.
  • Horizontal break (pfx_x): Horizontal movement of the ball in inches.

The 2022 leaders in fastball swinging strike rate were:

Note that rather than using horizontal break, I will use the absolute value of horizontal break. I did this because right-handed and left-handed pitchers will throw pitches that break in opposite directions, and I worried that this might confuse the model. Rather than looking at whose pitches move the most in a certain direction, I was interested in whose pitches move the most, in general. After splitting the 2022 data into training and testing groups with an 80/20 train-test split, I trained the XGBoost algorithm like so:

# X is a data frame of pitch movement characteristics
# y is the corresponding column of pitch results (1 = success, 0 = failure)

import xgboost as xgb
from sklearn.model_selection import train_test_split

# split data into train/test sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2)

# run algorithm without hyperparameter tuning
xgbr = xgb.XGBRegressor(objective = 'reg:squarederror')
xgbr.fit(xtrain, ytrain)

Without any adjustments, I obtained an RMSE of 0.289, indicating a fairly narrow spread of residuals (prediction errors). Next, I used a grid search to minimize the RMSE of the XGBoost algorithm by optimizing tree depth, learning rate, the number of estimators, and the fraction of features sampled for each tree. Grid search is a process for hyperparameter tuning that divides the specified domain of hyperparameters into a discrete grid and then calculates the RMSE of the model at each point. Whichever combination of hyperparameters yields the lowest RMSE is selected for the final algorithm.

# tune parameters with a grid search
import numpy as np
from sklearn.model_selection import GridSearchCV

params = { 'max_depth': [2, 3, 4, 5, 6],
           'learning_rate': [0.01, 0.05, 0.1, 0.3],
           'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
           'n_estimators': [100, 150, 300, 1000] }

xgbr = xgb.XGBRegressor(seed = 20)

clf = GridSearchCV(estimator = xgbr,
                   param_grid = params,
                   scoring = 'neg_mean_squared_error',
                   verbose = 1)
clf.fit(X, y)

Each of the hyperparameters is important in its own way:

  • max_depth: the maximum depth of each tree
  • learning_rate: the learning rate of the model
  • colsample_bytree: the fraction of columns (features) randomly sampled for each tree
  • n_estimators: the number of boosted trees (boosting rounds) in the model

# update algorithm with tuned hyperparameters
xgb1 = xgb.XGBRegressor(learning_rate = 0.05,
                        n_estimators = 150,
                        max_depth = 3,
                        colsample_bytree = 0.8,
                        objective = 'reg:squarederror',
                        seed = 20)
xgb1.fit(xtrain, ytrain)

With these new parameters, I achieved an RMSE of 0.286. The model outputs a probability between 0 and 1 representing the likelihood that a pitch results in a swing and miss. After running the algorithm on the test data set, rounding the predictions, and comparing them to their true values, I obtained an accuracy of 90.96%. This means the model correctly predicted the “success” of a pitch roughly 91% of the time.
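As a minimal sketch of that evaluation step (reusing the fitted xgb1 model and the test split from earlier; the variable names below are illustrative rather than taken from the original code):

import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score

# predicted swing-and-miss probabilities for the held-out pitches
preds = xgb1.predict(xtest)

# RMSE measures the spread of the residuals (prediction errors)
rmse = np.sqrt(mean_squared_error(ytest, preds))

# threshold at 0.5 (equivalent to rounding) and compare to the true outcomes
accuracy = accuracy_score(ytest, (preds >= 0.5).astype(int))

print(f"RMSE: {rmse:.3f}, accuracy: {accuracy:.2%}")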

Which pitch movement characteristics were most important in the model’s prediction of whether a pitch would result in a swing and miss or not? The answer can be found by running the following code to generate a feature importance graph of the model:

from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(xgb1)
plt.show()

Based on the feature importance plot, the induced vertical break of a pitch (pfx_z) is the most important metric in predicting whether or not a pitch is successful. The “F score” on the x-axis of plot_importance is XGBoost’s default importance measure: the number of times a feature is used to split the data across all of the trees in the model. It should not be confused with the F1-score from classification metrics, which combines precision and recall.

Precision and recall are still useful for judging the classifier itself. Precision measures the accuracy of positive predictions: the number of true positives divided by the total number of positive predictions (true positives plus false positives). Recall measures the proportion of actual positives that the model identifies correctly: the number of true positives divided by the number of true positives plus the number of false negatives. For reference, a “true positive” here is a pitch the model predicted to be a swing and miss that really was one.
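As a sketch, both metrics can be computed from the same rounded test-set predictions (reusing the illustrative preds array from the evaluation snippet above):

from sklearn.metrics import precision_score, recall_score

labels = (preds >= 0.5).astype(int)          # predicted swing-and-miss labels
precision = precision_score(ytest, labels)   # how many predicted whiffs really were whiffs
recall = recall_score(ytest, labels)         # how many actual whiffs the model caught
print(f"precision: {precision:.3f}, recall: {recall:.3f}")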

Applying the algorithm to 2023 spring training data, we can predict the likelihood of a swing-and-miss on any given fastball for each pitcher. Here is a list of the pitchers who are predicted to generate the most swings and misses on their fastballs:

  1. Felix Bautista
  2. Taj Bradley
  3. Ryan Helsley
  4. Eury Perez
  5. Peter Fairbanks
  6. Trevor Megill
  7. Justin Martinez
  8. Spencer Strider
  9. Beau Brieske
  10. Jhoan Duran

For a far more nuanced look at pitch success prediction, I recommend the work of Eno Sarris, which can be found here. If you are interested in other Spring Training stats, check out my Shiny App (shameless plug).

Conclusion

eXtreme Gradient Boosting (XGBoost) is a versatile gradient-boosting decision tree machine learning algorithm that can be used for both classification and regression problems. While it is far more complicated and hence harder to understand than a simple decision tree algorithm, it achieves superior accuracy. It has practical applications in a vast array of fields, one of them being baseball analytics, as shown in this article. The associated GitHub for this article can be found here.

Sam Wirth
Hoyalytics

Georgetown ’25. Double majoring in mathematics and economics.