Let's Model! Using Random Forest to Predict UCL Reconstruction

Jonah M Simon
Published in Analytics Vidhya · May 13, 2020

Overview

For my next analysis, I will attempt to identify underlying factors that are strongly correlated with ulnar collateral ligament reconstruction (UCL reconstruction, better known as Tommy John surgery), then build a model that uses these factors to predict which players are candidates for the surgery.

The goal of this analysis is to provide additional research into the causes behind this career-altering injury, allowing the baseball community to take preventive measures that help players reduce their chance of significantly injuring their ulnar collateral ligament.

To achieve this goal, I will use the well-established modeling technique known as the random forest algorithm. While I will provide an in-depth explanation of how the model works later in the article, random forest is one of the few modeling frameworks that produces highly accurate predictions from a simple, understandable computation.

As with most data science projects, I spent the bulk of my time gathering, processing, and engineering features from the data to improve its predictive power. I will begin this analysis by explaining the raw data gathered from a variety of sources.

The Data

To obtain the data for this analysis, I used two sources: 1) Baseball Savant’s database and 2) a list of players who underwent UCL reconstruction, which can be found here.

The linked UCL document contains a ton of information about the individuals who had the surgery. The creator of the document has a Twitter handle, @MLBPlayerAnalys, and I recommend giving a follow if you are interested in baseball research!

The Baseball Savant dataset contains a variety of metrics, ranging from standard innings pitched to more advanced measures such as horizontal break on breaking pitches. If you are interested in all the variables used in this analysis, a link to the csv can be found here. My goal was to cast a wide net when downloading Savant’s data, hoping to identify unexpected underlying factors that could lead to UCL reconstruction.

Another key note: the Savant data contains all pitchers from 2015 onward who faced at least 100 batters. Obviously, this list will contain players who have and have not had UCL reconstruction, emphasizing the importance of the tj dataset.

A sample of the raw, uncleaned data sets can be found below.

TJ Data — 11/42 Variables Shown
Baseball Savant Data — 11/34 Variables Shown

Processing

In order to process the data, I began by removing all unnecessary variables from the Tommy John frame (which I will refer to as tj), leaving the following four variables: Player, Position, Level, and Year.

From here, I needed to conduct multiple subsetting techniques in order to meet the following parameters:

  1. Players from 2015 onward. Since the Statcast data only goes back to 2015, the two datasets needed to cover the same years. As such, I removed all instances before 2015 from the tj data.
  2. Pitchers only. Since the tj data contains information on all players who have had the surgery, the data needed to be filtered to pitchers only.
  3. MLB level. The tj data contains both MLB and MILB data, so it needed to be filtered to only MLB.

Finally, I needed to add a feature indicating whether the player had Tommy John surgery. Since every player in the tj data had the surgery, this was straightforward: I named the variable “tj_history” and gave each player in the set a value of 1. The fully cleaned and updated tj dataframe can be found below.

Updated tj Data — All Variables Shown
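To make this concrete, here is a minimal dplyr sketch of the subsetting and feature step described above. The frame name tj follows the article; the exact labels “P” and “MLB” for the Position and Level columns are assumptions about the raw file, not taken from the article’s code.

```r
library(dplyr)

# Minimal sketch of the tj cleaning steps; "P" and "MLB" are assumed
# values for the Position and Level columns.
tj_clean <- tj %>%
  select(Player, Position, Level, Year) %>%  # keep the four variables
  filter(Year >= 2015,                       # match the Statcast window
         Position == "P",                    # pitchers only
         Level == "MLB") %>%                 # MLB level only
  mutate(tj_history = 1)                     # everyone here had the surgery
```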

Now that the tj data was cleaned, it was ready to be merged with the Baseball Savant data, a process very similar to a join one would conduct in an SQL query. Once the data was joined, I completed the following steps, sketched in code below:

  1. Finished the “tj_history” feature by assigning a 0 to every player who has not had UCL reconstruction (i.e., any player who appears in the savant set but not in the tj set)
  2. Replaced each NA with the mean of their respective column
  3. Altered variable names to aid interpretation
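A rough sketch of the merge and the follow-up steps, assuming the Savant frame is named savant and that Player is the join key; the renaming in step 3 is omitted since the article does not list the new names.

```r
# Join the Savant data to the cleaned tj data; players with no match
# get NA for tj_history, which step 1 converts to 0.
merged <- savant %>%
  left_join(select(tj_clean, Player, tj_history), by = "Player") %>%
  mutate(tj_history = ifelse(is.na(tj_history), 0, 1))

# Step 2: replace each remaining NA with the mean of its column
num_cols <- sapply(merged, is.numeric)
merged[num_cols] <- lapply(merged[num_cols], function(x)
  ifelse(is.na(x), mean(x, na.rm = TRUE), x))
```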

These steps concluded the data processing portion of my analysis. The fully cleaned data frame is below.

Merged Data Frame — 10/35 Variables Shown

Exploration

To better understand the newly formed data frame, I put together a few simple bar charts with ggplot2. The purpose of these charts is to give us an idea of some key features in the data, allowing us to make inferences before we model. Remember, a 1 indicates a player who has had Tommy John surgery since 2015, while a 0 indicates he has not.
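As an example, the first chart below can be produced in a few lines, where merged is the assumed name of the processed frame:

```r
library(ggplot2)

# Count of players with and without UCL reconstruction since 2015
ggplot(merged, aes(x = factor(tj_history))) +
  geom_bar() +
  labs(x = "tj_history (1 = UCL reconstruction)", y = "Number of players")
```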

So, as you can see, the dataset contains almost 2200 players who have not received the surgery, and just under 400 players who have received the surgery.

The chart above shows there has been little deviation in surgeries since 2015. 2018 was the worst year for UCL injuries, albeit by a small margin.

This chart shows the relationship between average fastball velocity and UCL surgeries. As shown above, the highest number of UCL surgeries occurred among pitchers who threw in the low 90s.

It is interesting to note that 2 of the 5 individuals who averaged over 100 MPH on their fastball tore their UCL. While this is a small sample size, it would be interesting to see the results with more data.

Random Forest Explained

Now that we have explored the data, we can move forward with modeling. Just so you understand what type of model we are using, I will give a brief overview of random forest.

The random forest algorithm is a type of machine learning technique that can be used for both regression and classification problems. In our case, we are utilizing the algorithm for classification.

Classification is simply determining which group an observation belongs to, or in this analysis, will a pitcher have UCL reconstruction? In other words, we are answering a “yes” or “no” question, compared to a quantitative prediction or projection question that would require regression.

Random forest has many similarities with a traditional decision tree. Decision trees are quite simple to understand, and one can get a good grasp of how a decision tree works from the image below.

Simple Decision Tree — Source

The key difference between random forests and decision trees is that a random forest uses many different decision trees. So, think of the random forest algorithm as a bunch of decision trees working together to come to a final solution.

Without getting too technical, random forests are incredibly effective because the randomness injected into each tree keeps the trees largely uncorrelated, so their individual errors tend to cancel out when their votes are aggregated. This reduces overall error and produces an accurate conclusion.
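To make the “trees voting together” idea concrete, here is a toy sketch of bagging with the rpart package. It is an illustration only, not the model used later in this article, and it skips the per-split variable sampling that distinguishes a true random forest from plain bagging.

```r
library(rpart)

# Grow several trees on bootstrap samples, then classify by majority vote.
grow_forest <- function(df, formula, n_trees = 25) {
  lapply(seq_len(n_trees), function(i) {
    boot <- df[sample(nrow(df), replace = TRUE), ]  # bootstrap sample
    rpart(formula, data = boot, method = "class")
  })
}

predict_vote <- function(trees, newdata) {
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}
```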

If you want to learn more about the random forest algorithm, there is a great article written here.

Variable Importance

Using Random Forest in R is incredibly efficient and straightforward. While there are multiple options one can use to run a random forest, I prefer the ‘randomForest’ package.

In order to gauge model accuracy, I split the data into a train and a test set. While random forest already bootstraps the data when building each tree, I always make sure to hold out a test set as good practice.
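A conventional way to do this in base R is below; the 70/30 proportion, the seed, and the dropped identifier column are my assumptions, as the article does not state them.

```r
set.seed(42)

# Treat the outcome as a class label and drop the identifier column
# before modeling ("Player" is an assumed column name; any other
# non-numeric identifiers would be dropped the same way).
model_df <- merged[, setdiff(names(merged), "Player")]
model_df$tj_history <- factor(model_df$tj_history)

# 70/30 train/test split
train_idx <- sample(nrow(model_df), size = floor(0.7 * nrow(model_df)))
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]
```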

Now that the data was split, I needed to determine the optimal value of mtry, the number of variables made available for splitting at each node of a tree. To do this, I used cross-validation, something I will not get into in this article. The results from running cross-validation are below.

Cross Validation

As you can see, cross-validation determined the optimal mtry = 5, or five variables considered per split.
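The article does not say which tool ran the cross-validation, so the caret sketch below is just one common way to arrive at such a result:

```r
library(caret)

# 5-fold cross-validation over a small grid of mtry values
ctrl  <- trainControl(method = "cv", number = 5)
tuned <- train(tj_history ~ ., data = train,
               method = "rf",
               tuneGrid = expand.grid(mtry = 2:8),
               trControl = ctrl)
tuned$bestTune  # the article's run landed on mtry = 5
```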

As mentioned in the overview portion of my analysis, a key feature of random forest is its ability to measure variable importance. The algorithm computes importance as it builds the trees, and the results can be plotted with the ‘varImpPlot’ function in the ‘randomForest’ package. The model’s variable importance is below.
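A minimal sketch of the fit and the importance plot, using the cross-validated mtry (ntree = 500 is simply the package default):

```r
library(randomForest)

# Fit on the training set; importance = TRUE stores the importance
# measures that varImpPlot draws.
rf_model <- randomForest(tj_history ~ ., data = train,
                         mtry = 5, ntree = 500, importance = TRUE)
varImpPlot(rf_model)
```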

Variable Importance from Random Forest

So, as shown above, the variable with the highest importance for predicting UCL reconstruction is breaking_avg_spin. In other words, pitchers who generate high spin rates on their breaking pitches appear to carry the highest risk of needing UCL reconstruction.

Another interesting finding is the high importance of zone_swing_percent. One might think that pitchers who struggle with command would carry a larger risk of injuring their UCL due to often-violent arm action. This model, however, suggests the opposite, as shown by the importance of the zone_swing_percent variable.

Model Accuracy

After running the model on the train set, the next step is to determine overall accuracy. While there are multiple accuracy measures one can use for random forest, I will use AUC, the area under the ROC curve.

AUC essentially measures how well a model separates the 1’s from the 0’s, a perfect fit considering this model’s primary goal is exactly that binary call (TJ or not).

To compute this accuracy metric, I needed to use the test set from the original data split. Next, I used R’s ‘predict’ function, built the ROC curve with the ‘ROCR’ package, and computed AUC. If you would like to know more about this process, I would be happy to go into more depth!
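In outline, that process looks like the following, where rf_model, test, and tj_history follow the naming in the earlier sketches:

```r
library(ROCR)

# Predicted probability of the positive class on the held-out test set
probs <- predict(rf_model, newdata = test, type = "prob")[, "1"]

# Build the ROC curve and extract AUC with ROCR
pred <- prediction(probs, test$tj_history)
perf <- performance(pred, "tpr", "fpr")
plot(perf)                              # ROC curve
performance(pred, "auc")@y.values[[1]]  # AUC
```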

The model’s AUC is as follows:

So, you’re probably wondering, what is a good AUC? A commonly used set of intervals is below.
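  0.90–1.00: excellent
  0.80–0.90: good
  0.70–0.80: fair
  0.60–0.70: poor
  0.50–0.60: little better than chance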


As you can see, this model tested “good” at predicting TJ surgeries, which is pretty cool considering the limited data and overall observations.

Conclusions

After completing this analysis, I wanted to highlight a few key features that could improve overall model accuracy.

  1. More data. While I had enough observations to create a successful model, adding more data will aid overall accuracy. The Statcast data only goes back to 2015, which ultimately limited the number of observations. When baseball begins again (hopefully soon!!), I will be able to add to the data, improving accuracy.
  2. Pitcher history. This is a challenge due to the limited data available to the public. However, having each individual’s pitching history, such as innings pitched in high school and/or at the collegiate level, would significantly improve accuracy.
  3. Physical features. Variables such as a player’s height, weight, and arm slot would be interesting to add to the model. While I am unsure about the degree to which each variable impacts the risk of UCL reconstruction, it would be interesting to see the variable importance computed by random forest.

… and that concludes this analysis! I hope you learned something and have some takeaways of your own. If you would like to see my code, I posted it on github and it can be found here.

Thoughts?

-jms
