The Pick Is In! NFL Draft Classification with Combine Metrics

Eric Au
Jul 15, 2022 · 13 min read



GitHub, LinkedIn

The NFL Combine has drawn scrutiny over whether there is real value in participating in it, or in using its results to make draft decisions.

The Gameplan — Business Understanding

Every year, the National Football League (NFL) holds a week-long showcase where college football players, otherwise known as prospects, perform physical drills and tests in front of team coaches, scouts, and general managers. In addition to this physical showcase, teams have the opportunity to interview players and get a better sense of the person behind the player. This event is more formally known as the ‘NFL Combine’.

How do we translate combine drill performance into making business decisions? Is there any value?

The physical drills conducted at the combine are intended to measure a player’s physical abilities such as speed, quickness, strength, and overall athleticism. However, if you’re a fan of football, you may be aware that there has long been debate about whether the combine holds any value when making draft decisions.

  • But what can NFL teams learn from these workouts?
  • What exactly do non-football athletic testing measurements contribute to prospect evaluation? Is there any value to the combine?

These are questions that many fans still ask, and that NFL teams try to answer in order to make the best possible decisions when drafting players.

Keeping this in mind, I wanted to take a deep dive into the combine and develop a supervised machine learning model that is able to classify whether someone was drafted or not based solely on combine data.

If you are a fan of football, some of the topics discussed may be familiar to you. This article is intended to accommodate any football fan or anyone interested in data science, as I will break down both the sport and the analytical sides.

For a more extensive detailed breakdown of my project, feel free to check out my Github repo!

The Measurables

In this analysis, I scraped player combine data from Pro-Football Reference using Beautiful Soup, covering the 2000–2022 combines. Here is an article by Michael O’Donnell with similar code to assist in your own scraping efforts. I know it helped me out tremendously!
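
If you want a feel for what that scraping looks like, here is a minimal sketch using requests, Beautiful Soup, and pandas. The URL pattern and the table id are assumptions based on Pro-Football Reference’s public combine pages, not the exact code from my repo.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for Pro-Football Reference's combine pages (one per year).
BASE_URL = "https://www.pro-football-reference.com/draft/{year}-combine.htm"

def scrape_combine_year(year: int) -> pd.DataFrame:
    """Scrape one year of combine results into a DataFrame."""
    response = requests.get(BASE_URL.format(year=year))
    response.raise_for_status()

    # Find the combine results table (the table id here is an assumption).
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", {"id": "combine"})

    # Let pandas parse the HTML table, then tag each row with its combine year.
    df = pd.read_html(str(table))[0]
    df["Year"] = year
    return df

# Stitch the 2000-2022 combines together, pausing politely between requests.
frames = []
for year in range(2000, 2023):
    frames.append(scrape_combine_year(year))
    time.sleep(2)

combine = pd.concat(frames, ignore_index=True)
```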

After formatting and preparing the dataset for actual cleaning and investigation, we have the following features/columns to work with:

  • Player Name
  • Position
  • School (College Attended)
  • Height & Weight
  • 40 Yard (Seconds)
  • Vertical (Inches)
  • Bench (Repetitions)
  • Broad Jump (Inches)
  • 3 Cone (Seconds)
  • Shuttle (Seconds)
  • Drafted (Team/Round/Year)

Probably the most well-known combine event is the 40 Yard drill.

As mentioned, this modeling analysis considers NFL combine data and nothing else; we want to examine the value of the combine itself.

As a result, a player’s college statistical performance was not incorporated. Doing so would mean gathering college statistics for every player in the combine and merging them with the combine data. While that sounds ideal, there is a lack of comparable data for measuring each unique position in football.

For example, tracking an offensive guard’s pass-protection stats is not as easy or interpretable as tracking a wide receiver’s catches and yardage totals. Additionally, further analysis would require splitting players into positional categories, since statistics are not comparable across positions.

The Field — Visualizing the Data

Draft Status

To prepare the dataset for further analysis, I assigned a target classification to each player: 1 for Drafted and 0 for Undrafted. These ‘classes’ were relatively balanced, with approximately 2/3 of all players in the dataset having been drafted.

Percent of drafted players grouped by Offense, Defense, and Special Teams categories. 2/3 of all players in the dataset were drafted between 2000–2022.
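
As a rough sketch (continuing with the scraped DataFrame from earlier, and assuming the raw draft column is named “Drafted (tm/rnd/yr)” and is blank for undrafted players), the labeling can be as simple as:

```python
# Players with a non-empty draft entry (team/round/pick/year) are labeled 1,
# everyone else 0. The raw column name is an assumption about the scraped table.
combine["Drafted"] = combine["Drafted (tm/rnd/yr)"].notna().astype(int)

# Check the class balance: roughly two-thirds of players were drafted.
print(combine["Drafted"].value_counts(normalize=True))
```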

Breakdown of Positions

When examining the positions in the draft, it is clear that the skill positions such as wide receiver and running back dominate the dataset. Certain niche positions such as kickers and long snappers are far less prevalent.

Count of players by position in the NFL Combine. Wide receivers far surpass any other position, while long snappers are the least frequent.

Top Colleges

Out of curiosity, I wanted to understand the distribution of drafted players grouped by college. We always expect the big football schools such as Alabama and Ohio State to produce the most talent, but what did this actually look like?

I used Tableau Public to visualize this aspect of the data; dragging and dropping features into bins made this quick and easy.

Alabama is no stranger to producing top draft-ready talent each year.

Combine Performance

When examining the median performance in drills sorted by position, there are some clear conclusions when it comes to distinguishing between drafted and undrafted players. Drafted players tend to perform better at the combine drills than undrafted players.

The below chart illustrates the performance of drafted and undrafted players for the Shuttle, 40 Yard, 3 Cone, and Bench drills where green represents drafted and red represents undrafted.

Drafted players tend to perform better when taking the median performance for combine drills.

Data Cleaning

In total there were over 7,600 combine records for players since 2000. A lot of this data was missing, notably for the various combine drills. This was expected, as many college players choose to skip combine drills for a variety of reasons. Generally, a “can’t-miss” prospect is one whom scouts deem likely to succeed at the professional level based on college performance and physical stature alone. So what purpose is there in participating in the combine if the drills can only hurt you when you don’t perform well?

Would you want to take part in a public workout display if you knew it could potentially hurt your draft status if you didn’t do well?

Whatever the reason was for not participating in the combine, I knew I couldn’t afford to just drop all these rows of missing data. One way to handle this problem was to impute the missing values, or assign a value based on the average measurement of each combine drill. The average measurement would be grouped by position since wide receivers and cornerbacks tend to be faster in the 40 yard than defensive tackles and offensive linemen.

While imputing the average values for each drill grouped by position, there were still values missing for kickers. It turns out kickers don’t participate in the 3 Cone and Shuttle drills (which says something about how relevant speed and agility testing is for such a niche position as kicker). To avoid any form of data leakage, these 76 remaining missing values (NaNs) for kickers were ultimately filled with 0s.
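
A minimal sketch of that imputation, assuming the drill column names below and a position column named “Pos”, could look like this:

```python
# Drill columns (names assumed from the scraped table).
drill_cols = ["40yd", "Vertical", "Bench", "Broad Jump", "3Cone", "Shuttle"]

# Fill each missing drill result with the mean for that player's position,
# since positions have very different physical baselines.
combine[drill_cols] = (
    combine.groupby("Pos")[drill_cols]
    .transform(lambda col: col.fillna(col.mean()))
)

# Kickers skip the agility drills entirely, so no position mean exists for them;
# the remaining NaNs are filled with 0.
combine[drill_cols] = combine[drill_cols].fillna(0)
```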

The Drive — Data Modeling

Now that the dataset was cleaned and imputed, I performed a train-test split in which 25% of the entire dataset was held out for testing. The remaining 75% was used to ‘train’ the model. Once the model was trained, the held-out 25% was fed in to determine how well the model actually performs on unseen data.
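
In scikit-learn terms, that split looks roughly like the following. The dropped column names are assumptions about the cleaned DataFrame, and stratifying on the target is my choice to preserve the drafted/undrafted ratio in both splits:

```python
from sklearn.model_selection import train_test_split

# Features: everything measured at the combine; target: draft status.
X = combine.drop(columns=["Player", "Drafted", "Drafted (tm/rnd/yr)"])
y = combine["Drafted"]

# Hold out 25% of players for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```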

Following this splitting procedure, I then prepared a preprocessing and modeling pipeline. The preprocessing simply transforms the dataset so that the various models implemented can perform at their best. For this analysis, preprocessing consisted of standard scaling, one-hot encoding, and min-max scaling.

  • Standard Scaling - Features measured on different scales do not contribute equally to the model’s fit and learning, and can end up introducing bias. To combat this, each feature is ‘scaled’ by subtracting the mean and dividing by the standard deviation, which puts all features on a comparable footing for modeling.
  • One-Hot Encoding (OHE) - Because we have categorical features, or features that are not numeric (i.e., college and position), we have to translate these features into numerical values for the model to analyze. Effectively, this process creates a new column for each unique college and position and assigns a value of 1 (applicable to that player) or 0 (not applicable) in each row of the data.
  • Min-Max Scaling - Rescales each feature to a fixed range, typically 0 to 1, using that feature’s minimum and maximum values. This keeps every feature on the same bounded scale without changing the relative ordering of values.

Passing the Data Through Pipelines

To streamline the above preprocessing, I developed a pipeline which takes in the data and performs all the above transformations in one shot. Once the data is preprocessed, it goes through another pipeline which ‘fits’ or ‘trains’ a candidate model, so we can determine which model performs best at classifying drafted versus undrafted players.
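
Here is a minimal sketch of such a pipeline using scikit-learn’s ColumnTransformer. Which columns get which scaler is an illustrative choice on my part (and assumes height has already been converted to a numeric value), and logistic regression is just a placeholder for whichever candidate model is being fit:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Illustrative split of columns across the three transforms.
drill_cols = ["40yd", "Vertical", "Bench", "Broad Jump", "3Cone", "Shuttle"]
size_cols = ["Ht", "Wt"]          # assumes height is numeric (e.g. inches)
categorical_cols = ["Pos", "School"]

preprocessor = ColumnTransformer(transformers=[
    ("standard", StandardScaler(), drill_cols),
    ("minmax", MinMaxScaler(), size_cols),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocess, then fit a classifier; swap the final step for any candidate model.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
```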

Each model has a different method (often very mathematical) of classifying data, along with hyper-parameters that determine how that model performs. Think of a single hyper-parameter as a knob on a radio: to get the strongest signal, we need to find the sweet spot. Now imagine that for each model, we have several of these knobs to tune to find the optimum result.

Tuning a radio knob is a lot like tuning a model’s hyper-parameters to find the optimum result.

This process can take a long time to figure out depending on the number of parameters the model can use and outcomes the model can produce. It can take even longer with a larger dataset. I effectively implemented a gridsearch (which I’ll talk about later) to find the best hyper-parameters.

But before we figure out how to tune the model, how do we measure performance? Typically, we tend to use the term accuracy. In the data science world, accuracy is the proportion of the overall dataset that the model classifies correctly. Because roughly two-thirds of the players in this dataset were drafted, a model that simply classified everyone as drafted would achieve about 63% accuracy! Thus, we need a way to measure performance other than accuracy.

  • The goal is to determine how accurately the model can classify not just those who were drafted, but also those who went undrafted.

We’ll need to strike a balance between these two classes. The main metric for this analysis is F1-Score which takes into consideration a balance between what the model classifies as False Positives and False Negatives. Technically speaking, this is a balance between precision and recall.

  • False Positives are players the model labeled as ‘Drafted’ who were actually ‘Undrafted’.
  • False Negatives are players the model labeled as ‘Undrafted’ who were actually ‘Drafted’.

We want a “harmonized” balance between False Positives and False Negatives, and this is where the F1-Score matters.
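
Concretely, precision, recall, and F1 can be pulled straight from scikit-learn once a model has made predictions on the test set (continuing with the fitted pipeline sketched above):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_pred = model.predict(X_test)

# Precision: of the players labeled "drafted", how many really were drafted?
# Recall: of the players who really were drafted, how many did the model find?
# F1 is the harmonic mean of the two, penalizing an imbalance in either one.
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-Score: ", f1_score(y_test, y_pred))
```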

The Modeling

As mentioned, each model performs differently when it comes to the mathematical concepts applied to classification. Without going through each individual model, what we should understand is that each model utilizes a supervised learning algorithm.

Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features such that we can predict the output values for new data based on those relationships which it learned from the previous data sets.

For each model, the training dataset gets passed into the pipeline, and a gridsearch is performed to find the optimum F1-Score. Remember those hyper-parameters? Plugging in (or tuning) different hyper-parameters for each model can understandably take some time, especially when you don’t know which combination of hyper-parameters produces the best results.

A gridsearch assists with this process as follows (a short sketch of the search appears after this list):

  • A range of hyper-parameters is defined in the gridsearch, which is then passed through each pipeline model.
  • More hyper-parameters mean more combinations. Therefore, we should be mindful when selecting a range of parameters, since the search can take a long time, especially with a larger dataset.
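
Here is a minimal sketch of that search with scikit-learn’s GridSearchCV, using a random forest as one of the candidate models. The hyper-parameter names and ranges are purely illustrative, not my tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Reuse the preprocessor from earlier; swap in one of the candidate models.
rf_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Each range is one "knob"; the grid is every combination of them.
param_grid = {
    "clf__n_estimators": [100, 300, 500],
    "clf__max_depth": [None, 5, 10],
    "clf__min_samples_leaf": [1, 3, 5],
}

# Score every combination with cross-validated F1 and keep the best one.
grid = GridSearchCV(rf_pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```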

Finally, once the gridsearch finds the optimum hyper-parameters and they are passed through each model’s pipeline, the following outputs are produced (sketched in code after this list):

  • A confusion matrix illustrating how the model performed when classifying whether someone was drafted or undrafted
  • A classification report summarizing performance metrics including F1-Score, accuracy, precision, and recall.
  • An ROC curve and AUC score measuring how well the model can distinguish between drafted and undrafted players.
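
Each of those outputs maps to one or two scikit-learn calls. A rough sketch, using the best estimator found by the gridsearch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (RocCurveDisplay, classification_report,
                             confusion_matrix, roc_auc_score)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))

# Classification report: precision, recall, F1, and accuracy for both classes.
print(classification_report(y_test, y_pred))

# ROC/AUC: how well the predicted probabilities separate the two classes.
y_proba = best_model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.show()
```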

The Pick Is In — Evaluation of the Model

After training and fine-tuning multiple models, the following table summarizes the results.

The best model was the 8th model, XGBoost, which yielded the highest area under the curve, slightly higher than the Random Forest and Bagging Decision Tree classifiers.

The 7th model, Gradient Boosting, performed the best overall in terms of F1-Score, but only slightly better than the XGBoost model.

Based on the above, we will say that XGBoost performed the best overall when it comes to classifying between players who are drafted and undrafted. It is no surprise XGBoost performed well, as this supervised learning algorithm has gained popularity within the last six years for its speed and performance.

XGBoost stands for Extreme Gradient Boosting, and it is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is the leading machine learning library for regression, classification, and ranking problems. The model is built upon the idea of “gradient boosting”: improving a single weak model by combining it with a number of other weak models to generate a collectively strong model.
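
Dropping XGBoost into the same pipeline looks roughly like this; the specific hyper-parameter values shown are illustrative, not my tuned ones:

```python
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

xgb_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", XGBClassifier(
        n_estimators=300,      # number of boosted trees (weak learners)
        learning_rate=0.1,     # how strongly each tree corrects the previous ones
        max_depth=4,           # depth of each individual tree
        eval_metric="logloss",
        random_state=42,
    )),
])

xgb_pipeline.fit(X_train, y_train)
print("Test F1:", f1_score(y_test, xgb_pipeline.predict(X_test)))
```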

To help visualize how the XGBoost model performed on the test dataset, below is a confusion matrix with 4 quadrants classifying actual and predicted classes of undrafted and drafted players.

Confusion matrix of players classified by the XGBoost model in the testing set.

While the XGBoost Model performed with an overall F1-Score of 80%, the model still struggles when it comes to classifying a player’s draft status.

There is still a significant number of False Positives (players classified as drafted when they were not), which works out to a precision of 72% for correctly classifying drafted players. Similarly, the precision when classifying undrafted players is 66%.

Feature Importance

Feature importance provides a score that indicates how useful or valuable each feature was in the construction of the model. The more an attribute is used to make key decisions within the model, the higher its relative importance.
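
As a sketch, those scores can be pulled out of the fitted pipeline and matched back to the (one-hot-encoded) feature names like this:

```python
import pandas as pd

# Post-preprocessing feature names: numeric columns plus one-hot columns
# for every school and position.
feature_names = xgb_pipeline.named_steps["preprocess"].get_feature_names_out()
importances = xgb_pipeline.named_steps["clf"].feature_importances_

# Rank the features by how much the model relied on them.
top_features = (
    pd.Series(importances, index=feature_names)
    .sort_values(ascending=False)
    .head(15)
)
print(top_features)
```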

The XGBoost model determined that the 40 Yard and Bench were the most important metrics when it came to classification of drafted and undrafted players.

We make the following observations on the above:

  • The model suggests that the most important combine measurables that classify draft status are the 40 Yard and Bench. The Shuttle drill follows thereafter, though not as important of a feature.
  • The schools with the most importance for draft classification are Notre Dame, Ohio State, Purdue, and Boston College. Alabama is surprisingly behind these other schools even though they have been powerhouses recently when it comes to producing NFL talent.
  • Positions with the most importance are, surprisingly, OLB (Outside Linebacker), RB (Running Back), and OT (Offensive Tackle), followed by QB (Quarterback).

It should also be noted that the model does not suggest that these are the most important features for determining success on the field. The model has simply determined that these features are most important when classifying draft status.

Conclusions

Recall that the purpose of this analysis is to place a value on the NFL Combine and the various drills that are performed.

While the best model performed with an overall F1-Score of 80%, there is still evidence to suggest that combine metrics alone are not a strong enough signal to reliably distinguish whether a player will be drafted or undrafted.

It is clear that classification of draft status is not an exact science. That said, there is additional data that could potentially be incorporated into the model. As mentioned earlier, college statistics were not factored into the analysis; further analysis would have to be split into positional categories, as statistics are not comparable across positions.

At the pro level, there are efforts to obtain more data to measure players’ performance. NFL Next Gen Stats has been collecting data to assist in these efforts, though it has not yet been applied at the collegiate level.

Model Value & Limitations

Notwithstanding, our model suggests there is at least some value to the combine, since it was built on combine measurements alone.

The best indicators from the model suggest that the 40 Yard, Bench Press, and Shuttle are the most important combine drills when it comes to classifying draft status. It also goes without saying that we expected player height and weight to factor in heavily in a physical sport like football. We also observed that, on average, players who are drafted tend to perform better in each combine drill than those who were not drafted.

We can confidently conclude that combine metrics alone do not provide a sure-fire determination of whether someone will be drafted. There are other factors, not captured by the combine, that help explain draft status.

The combine should be used as a guide when predicting whether someone will be drafted; with combine data alone, the model reaches an F1-Score of 80%. Possible inaccuracies in the model could be explained by the value of traditional scouting and a player’s college career. However, additional data and collection methods would be needed to account for these inaccuracies.

Finally, it is also important to note the limitations of the model and that the model does not predict whether a player will be successful at the professional level. The model does not take into account any other ‘intangible’ measures such as the player’s overall character, demeanor, or work ethic, all of which have value and factor into draft status.


Eric Au

Data Science Professional | Data Analytics | Machine Learning | Passion for sports analytics, travel, and learning