Do Synthetic Fields Cause Higher Injury Rates?

Who knew fake grass could be so controversial

Published in Sports Analytics on Jan 31, 2020

Turf — an often-debated topic. There’s a lot going for it — durability, water conservation, and cost-effectiveness to name a few things. But it also wears a badge of notoriety when it comes to twisted ankles and busted knees.

Many soccer players have refused to play on turf, including several MLS players and the U.S. Women’s Soccer Team. The NFL is evidently interested in the topic as well. Several recent studies (Mack et al., 2018; Loughran et al., 2019) and a systematic review in 2015 on football players found a higher incidence rate of injuries on synthetic turf.

There’s some evidence that the injuries may be related to synthetic surfaces not releasing cleats the same way as natural surfaces, but not a lot of research has been done to investigate whether players move or perform differently across different fields, and whether movement, and other factors, contribute to injury (very likely due to lack of publicly available data).

The NFL is a bit late to the game in terms of analyzing player tracking data considering the advances in soccer and basketball. Although the data-gathering began in 2014, it wasn’t available league-wide till 2018. But to their credit, the league has since dedicated generous efforts to source talent from the analytics community to help in its pursuit of deriving insights from the data. Four analytics competitions have already come and gone since the decision in 2018: the Big Data Bowl (2019, 2020) and the 1st and Future competition (2019, 2020), all now hosted on the Kaggle data competition platform.

I know, it’s confusing. One league, one dataset, two different data science competitions

Last year, Lucas (business partner at Torneo and data magician/coder extraordinaire) and I participated in the inaugural Big Data Bowl. We were fortunate to have been selected as one of the three finalists in the Open Entry category to be flown into Indianapolis to present our findings on speed efficiency. We decided to participate again this year, but this time in the competition hosted by 1st and Future.

Spoiler: We didn’t win, but got an unofficial honorable mention. Working with the tracking data again was a lot of fun though.

Analytical Imperative

The Kaggle community was tasked with the challenge of figuring out what factors contribute to a higher risk of non-contact lower-limb injury, including player movement metrics and environmental factors such as the turf type.

Does synthetic turf change the way that players move, and therefore lead to an elevated rate of player injuries?

Hypothesis

We did some background research and decided to tackle this problem from a physics/physiological perspective. The knee and ankle are both hinge joints. Hinge joints tend to be characterized as stable joints that can support a lot of load in one plane of movement, but are not as mobile as ball and socket joints like the shoulder and hips. For that reason, lateral forces produced by changes in direction may contribute to risk of injury.

We speculated that plays with sharp turns, especially high-speed sharp turns, may have a positive relationship with non-contact injuries.

It turns out that we weren’t the only ones who thought about investigating lateral forces during change of direction. John Miller (one of the three winners in the competition) separated velocity into the forward component and the lateral component to calculate change in direction. Elijah (another of the three winners) attempted to quantify how “zig-zaggy” the player’s path was, and looked at extreme changes in player orientation (where the player was looking) combined with speed.

We also wanted to see the interaction effects of environmental and contextual factors such as precipitation, temperature, days of rest, and type of play.

What We Found

Based on the dataset provided, playing on synthetic fields does seem to increase the likelihood of injury. From the perspective of pure incidence rate, risk of injury is 1.7x higher on synthetic fields than on natural fields (we excluded plays with QBs because there were no injuries, and plays in rainy conditions because rain seems to increase incidence rate of injuries)

Testing the synthetic variable along with other variables in a logistic regression also showed that synthetic fields increase the likelihood of injury
That said, other factors seemed to be more powerful predictors of injury: maximum velocity during a play, maximum angle change during a turn, maximum centripetal acceleration (a function of the angle of the turn and the speed during the turn), and temperature during the match
Movement factors which increase risk of injury do not seem to be different on natural vs. synthetic fields

High-level Methodology

Summarize characteristics of each play based on factors which may have an impact on risk of injury
Use modelling to understand interaction factors that contribute to player injury, and therefore predict the likelihood of injury for any given play

Deriving Variables

Numerous variables of interest were generated and explored. The first category encompasses player movement information, which we summarized on a player-play level. The second category is environmental and contextual information, which was available in the provided data set on a player and game level.

Characterizing Player Movement

The NFL provided player, play, and game-level information, including player tracking data for every 10th of a second. Although we were given the speed, distance, direction and orientation variables, we decided to mostly work with the coordinate data as it seemed to be the most reliable information.

1. Deriving Basic movement data

We calculated basic movement data using the player’s x and y coordinates:

Distance traveled using the Euclidean distance between all points
Speed using the difference in time from the current to the last frame and the calculated Euclidean distance, smoothed with a rolling window of 5 centered on the current frame. We also applied a cap to the speed because there seemed to be some data issues where the speed was faster than what’s humanly possible
Acceleration using the speed and the difference in time from the current to the last frame
Jerk using the acceleration and the difference in time from the current to the last frame
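
A minimal sketch of these four calculations in Python with pandas. The column names (`time`, `x`, `y`), the value of `MAX_SPEED`, and the choice to cap the speed before smoothing are all assumptions made for illustration; the text above does not specify them.

```python
import numpy as np
import pandas as pd

MAX_SPEED = 13.0  # yards/s; assumed cap, the article does not give its value


def add_basic_movement(track: pd.DataFrame) -> pd.DataFrame:
    """Derive distance, speed, acceleration and jerk for one player-play.

    Assumes hypothetical columns `time` (seconds), `x` and `y` (yards),
    already sorted by time.
    """
    out = track.copy()
    dt = out["time"].diff()
    # Euclidean distance between consecutive frames
    out["dist"] = np.hypot(out["x"].diff(), out["y"].diff())
    # Frame-to-frame speed, capped at the assumed maximum
    raw_speed = (out["dist"] / dt).clip(upper=MAX_SPEED)
    # Rolling mean over 5 frames, centered on the current frame
    out["speed"] = raw_speed.rolling(window=5, center=True, min_periods=1).mean()
    out["accel"] = out["speed"].diff() / dt
    out["jerk"] = out["accel"].diff() / dt
    return out


# Toy example: a player moving along x at a constant 5 yards/s, sampled at 10 Hz
t = np.arange(20) * 0.1
demo = add_basic_movement(pd.DataFrame({"time": t, "x": 5.0 * t, "y": 0.0}))
```

In the toy example the smoothed speed stays at 5 yards/s, so acceleration and jerk come out as (numerically) zero.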

2. Identifying drastic changes in direction

The data set provided us with the direction that the player is going and the orientation that they’re facing; however, we wanted to filter out the noise from small movements and focus on the parts of a route that involved drastic changes in direction. To achieve this, we simplified player paths using a modified Ramer-Douglas-Peucker algorithm, which basically goes through every point in a path and adjusts its location to eliminate deviations in the path smaller than a certain threshold.

Demonstrating the Ramer-Douglas-Peucker Algorithm

After we simplified player routes, we calculated angles for every frame in the route (where the route is mostly straight, the angle would be 0 because of the path simplification). Drastic changes in direction were then identified by marking angles of greater than ~42 degrees.

We also calculated the approach speed and acceleration for every turn by averaging the respective metrics over the 0.5 seconds before each turn.
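
To make the idea concrete, here is a sketch of the textbook Ramer-Douglas-Peucker algorithm followed by a turn-angle calculation. It is not the modified version described above: the textbook algorithm drops points rather than keeping every frame, so angles are only computed at the surviving vertices. The threshold `epsilon=0.5` and the toy path are assumptions.

```python
import numpy as np


def rdp(points: np.ndarray, epsilon: float) -> np.ndarray:
    """Textbook Ramer-Douglas-Peucker: return the indices of the points to keep."""
    def perp_dist(p, a, b):
        # Distance from p to the line through a and b (or to a if a == b)
        ab = b - a
        norm = np.hypot(ab[0], ab[1])
        if norm == 0:
            return np.hypot(*(p - a))
        return abs(ab[0] * (p[1] - a[1]) - ab[1] * (p[0] - a[0])) / norm

    def recurse(lo, hi):
        if hi <= lo + 1:
            return [lo, hi]
        dists = [perp_dist(points[i], points[lo], points[hi])
                 for i in range(lo + 1, hi)]
        k = int(np.argmax(dists))
        if dists[k] <= epsilon:
            return [lo, hi]  # everything in between is treated as noise
        split = lo + 1 + k
        return recurse(lo, split)[:-1] + recurse(split, hi)

    return np.array(recurse(0, len(points) - 1))


def turn_angles(vertices: np.ndarray) -> np.ndarray:
    """Absolute change of heading (degrees, 0 to 180) at each interior vertex."""
    seg = np.diff(vertices, axis=0)
    heading = np.degrees(np.arctan2(seg[:, 1], seg[:, 0]))
    delta = np.diff(heading)
    return np.abs((delta + 180) % 360 - 180)


# Toy path: a slightly noisy run along x, then a 90-degree cut upfield
path = np.array([(0, 0), (1, 0.1), (2, -0.1), (3, 0.1), (4, 0), (5, 0),
                 (5.1, 1), (4.9, 2), (5, 3), (5, 4)], dtype=float)
keep = rdp(path, epsilon=0.5)
angles = turn_angles(path[keep])
sharp_turns = int(np.sum(angles > 42))  # threshold quoted in the text
```

On this toy path only the start, the corner and the end survive the simplification, leaving a single 90-degree turn that counts as sharp.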

3. Identifying lateral acceleration

Another set of variables that we created was aimed at identifying the lateral acceleration during a turn. Because velocity has both a direction and an intensity, an object can be accelerating even if speed is constant. Both skaters and cyclists are familiar with this type of acceleration.

The equation for centripetal acceleration perfectly embodies the intensity of a turn. The faster the speed during a turn, and the smaller the radius of the turn (i.e. sharper turn), the higher the centripetal acceleration.

To estimate the radius of the turn, we used three points in the path to approximate the osculating circle of the curve, whenever the angle we calculated in the previous step is greater than 0.
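
As a worked illustration: centripetal acceleration is a_c = v^2 / r, and the radius of the circle through three points of a triangle with side lengths a, b, c and area K is r = abc / (4K). The sketch below applies both formulas. The function names and the toy points are made up for this example, and the choice of which three frames to use is left open.

```python
import math


def circumradius(p1, p2, p3):
    """Radius of the circle through three points; inf if they are collinear."""
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # Cross product = twice the signed area of the triangle
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    if cross == 0:
        return math.inf
    return a * b * c / (2.0 * abs(cross))


def centripetal_acceleration(speed, p1, p2, p3):
    """a_c = v^2 / r, with r approximated from three points on the path."""
    return speed ** 2 / circumradius(p1, p2, p3)


# Three points on a circle of radius 5 yards, taken at 6 yards/s
r = circumradius((5, 0), (0, 5), (-5, 0))
a_c = centripetal_acceleration(6.0, (5, 0), (0, 5), (-5, 0))
straight = centripetal_acceleration(6.0, (0, 0), (1, 0), (2, 0))
```

Here `r` is 5 yards and `a_c` is 36 / 5 = 7.2 yards/s², while the collinear points give an infinite radius and therefore zero centripetal acceleration.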

The final step in looking at player movement was to summarize these metrics on a play level:

Basic movement variables: mean, min, max of speed, acceleration, and jerk
Change in direction variables: mean and max of change in direction, change in orientation (calculated with data provided), and calculated angle of turn; sum of sharp turns, sum of total turning points
Directional change intensity variables: mean and max of centripetal acceleration, sum of “dangerous” centripetal accelerations (above 10 yards/s/s), mean and max of approach velocity and acceleration for what we identified as “sharp turns”
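
A sketch of this play-level roll-up with a pandas `groupby`. The frame-level column names and the toy values are hypothetical; only the two thresholds (42 degrees and 10 yards/s/s) come from the text.

```python
import pandas as pd

# Toy frame-level data for two plays (values are made up)
frames = pd.DataFrame({
    "play_key": ["p1", "p1", "p1", "p2", "p2"],
    "speed": [2.0, 6.0, 4.0, 1.0, 3.0],
    "turn_angle": [0.0, 50.0, 0.0, 0.0, 30.0],
    "cent_acc": [0.0, 12.0, 0.0, 0.0, 4.0],
})

SHARP_TURN_DEG = 42   # sharp-turn threshold from the text
DANGEROUS_ACC = 10    # yards/s/s, "dangerous" threshold from the text

frames["sharp_turn"] = frames["turn_angle"] > SHARP_TURN_DEG
frames["dangerous_acc"] = frames["cent_acc"] > DANGEROUS_ACC

# One row per play, one column per summary statistic
play_summary = frames.groupby("play_key").agg(
    speed_mean=("speed", "mean"),
    speed_min=("speed", "min"),
    speed_max=("speed", "max"),
    angle_max=("turn_angle", "max"),
    cent_acc_max=("cent_acc", "max"),
    sharp_turns=("sharp_turn", "sum"),
    dangerous_accs=("dangerous_acc", "sum"),
)
```

Named aggregation keeps the output column names explicit, which helps once the list of summary variables grows to several dozen.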

Environmental and Contextual Factors

We also cleaned the play-level data provided to flag matches with rain, calculate rest between matches, flag very cold (<50 degrees Fahrenheit) and very hot (>80 degrees Fahrenheit) temperatures, and flag matches played on synthetic fields.
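
These flags could be derived roughly as follows. The column names, the free-text weather matching and the use of calendar dates are assumptions for this sketch; the competition data may encode these fields differently.

```python
import pandas as pd

# Toy game-level rows (hypothetical columns and values)
games = pd.DataFrame({
    "player_key": [1, 1, 2, 2],
    "game_date": pd.to_datetime(
        ["2019-09-08", "2019-09-15", "2019-09-08", "2019-09-12"]),
    "weather": ["Rain", "Sunny", "Light rain showers", "Clear"],
    "temperature_f": [45, 72, 85, 60],
    "field_type": ["Synthetic", "Natural", "Synthetic", "Natural"],
}).sort_values(["player_key", "game_date"])

# Flag any weather description that mentions rain
games["rain"] = games["weather"].str.contains("rain", case=False)
# Temperature cut-offs quoted in the text
games["temp_cold"] = games["temperature_f"] < 50
games["temp_hot"] = games["temperature_f"] > 80
games["synthetic"] = games["field_type"].str.lower().eq("synthetic")
# Days since the same player's previous game (NaN for the first game)
games["days_rest"] = games.groupby("player_key")["game_date"].diff().dt.days
```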

Analysis

Modelling

Although there are a lot of sophisticated modelling approaches out there (tree-based models, neural networks, gradient boosting), we decided to use a logistic model for its interpretability. To us, it was more important to be able to make actionable recommendations by understanding the directional impact of variables, rather than getting (potentially) higher predictive power with a black-box model that’s difficult to explain.

Three main steps were used to prepare the data for modelling:

Create summary statistics for each play using player tracking data, and merge with contextual factors (days rest, rain, temperature, etc.)
Standardize variables, which puts the coefficients on a comparable scale and can reduce multicollinearity when a regression model includes interaction terms
Make the data set more balanced by using the Synthetic Minority Over-Sampling Technique (SMOTE). This had to be done because there were so few incidences of injury (104 out of 267,005 plays). Without balancing, even if the model predicts all plays as “no injury”, the accuracy will still be at 99.96% — which is a false indication of how good it is at predicting outcomes
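
The standardization and balancing steps can be sketched with NumPy alone. `smote_sketch` below is a simplified illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the implementation used in the analysis; a maintained version is available in the imbalanced-learn package. The toy data and the parameters `k` and `n_new` are assumptions.

```python
import numpy as np

# Baseline from the text: predicting "no injury" for every play
baseline_accuracy = 1 - 104 / 267_005  # about 0.9996


def standardize(X):
    """Z-score each column (zero mean, unit variance)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 20 "injury" plays with 3 features (made-up values)
X_min = standardize(np.random.default_rng(1).normal(size=(20, 3)))
synthetic = smote_sketch(X_min, n_new=100)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the injury plays.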

For training the model, we split the data into train (70%) and test/holdout (30%) sets. We opted to use k-fold validation on the train data set to tune the model, then ran the model on the test/holdout set to evaluate generalizability.
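
A sketch of such a split in plain NumPy. Stratifying the holdout by class is an assumption added here so that the rare injury plays appear in both sets; the text does not say whether the original split was stratified. The toy labels are made up.

```python
import numpy as np


def stratified_holdout(y, test_frac=0.3, seed=0):
    """Split row indices into train/test, preserving the class ratio."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train, test = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)


def k_fold(indices, k=10, seed=0):
    """Yield (fit, validate) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(indices), k)
    for i in range(k):
        fit = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield fit, folds[i]


# Toy labels: 990 non-injury plays and 10 injury plays
y = np.array([0] * 990 + [1] * 10)
train_idx, test_idx = stratified_holdout(y)
folds = list(k_fold(train_idx, k=10))
```

The holdout indices are set aside once and never touched during tuning; only the training indices are cycled through the ten folds.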

We used a two-stage elimination process to determine which variables to use for the logistic model.

First, we ran three statistical tests (Welch’s T-test, Wilcoxon Rank-sum, Mann-Whitney U Test) on the player movement variables to determine which ones to keep. Since p-values are not very reliable on large data sets, we decided to use a slightly different approach for statistical testing. For each variable, we tested 10,000 random samples of non-injury plays against injury plays, and stored the resulting p-values in a vector. We then computed the % of tests that had a p-value of under 0.05. The variables that scored more than 70% on the Mann-Whitney U test simulation were eliminated
We then calculated and plotted the correlations between the variables that we had left, and picked out variables to reduce correlation
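
The repeated-sampling test in the first stage can be sketched with `scipy.stats.mannwhitneyu`. The subsample size (equal to the number of injury plays by default), the toy distributions and the reduced number of draws are assumptions; the text only specifies 10,000 samples and the 0.05 cut-off.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def share_significant(injury, non_injury, n_draws=10_000, sample_size=None,
                      alpha=0.05, seed=0):
    """Fraction of random non-injury subsamples whose Mann-Whitney U test
    against the injury plays gives p < alpha."""
    rng = np.random.default_rng(seed)
    size = sample_size or len(injury)  # assumed: match the injury sample size
    p_values = np.empty(n_draws)
    for i in range(n_draws):
        sample = rng.choice(non_injury, size=size, replace=False)
        result = mannwhitneyu(injury, sample, alternative="two-sided")
        p_values[i] = result.pvalue
    return float(np.mean(p_values < alpha))


# Toy variable whose injury-play values are clearly shifted (made-up data)
rng = np.random.default_rng(42)
injury_vals = rng.normal(1.5, 1.0, size=100)
non_injury_vals = rng.normal(0.0, 1.0, size=5_000)
share = share_significant(injury_vals, non_injury_vals, n_draws=200)
```

Subsampling the large non-injury group down to a comparable size keeps the sheer volume of plays from driving every p-value towards zero.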

We eliminated 23 of the 39 independent variables by trimming out correlated variables

Model evaluation was done with a confusion matrix and the Receiver Operating Characteristic (ROC) curve, both of which take into account the sensitivity (true positive rate) and specificity (true negative rate) of the model. On our 10-fold cross-validation, the model scored 0.7530 on the ROC AUC, which is fairly acceptable performance.

The model did less well on the test/holdout set, scoring 0.65 on the ROC AUC and correctly predicting only 16 (52%) of the 31 injuries. This means that the model does not generalize well, although further tuning and testing different types of models might improve the performance.
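
For readers less familiar with these metrics, here is a small NumPy sketch of how they are computed. The toy labels and scores are made up; only the 16-of-31 figure comes from the results above.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (ties count one half)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_pred):
    """True positive rate and true negative rate from binary predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels and scores, not the competition data
y = [1, 1, 0, 0, 0]
s = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y, s)  # 5 of the 6 positive-negative pairs are ranked correctly
sens, spec = sensitivity_specificity(y, [p >= 0.5 for p in s])

holdout_sensitivity = 16 / 31  # injuries caught on the holdout set
```

The last line reproduces the holdout figure: 16 / 31 is roughly 0.52, which is the sensitivity of the model on the injuries it had not seen before.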

Confusion matrix and ROC AUC score for the holdout/test dataset

Regardless of model performance, we can still interpret the results by running a logit model on the variables.

Coefficients and confidence intervals for the variables used in our logistic model — orange bars indicate a positive relationship with injury, while navy bars indicate a negative relationship

Although synthetic turf (the opposite of natural_field) does seem to have a positive correlation with injuries, it seems that there are other variables that are more important:

v_sm_max: maximum velocity during a play — the faster the top-speed during a play, the more likely the player will be injured
angle_0.25_max: maximum turning angle in a play (simplified routes) — the higher the maximum change in direction during a play, the more likely the player will be injured
cent_acc_0.25_mean: average centripetal acceleration for turning angles in simplified routes — drastic changes in direction combined with high speeds increase likelihood of injury
a_min: maximum deceleration (minimum acceleration) during a play; lower maximum deceleration tends to increase likelihood of injury
temp_cold: players are less likely to be injured in colder temperatures (<50 degrees Fahrenheit)

Movement on Synthetic vs. Natural Fields

So, are synthetic fields more dangerous because players somehow move differently on the field? We compared the distribution of the most significant player movement variables in our logistic model, and found no discernible or statistically significant difference between natural and synthetic fields.

Comparison of player movement variables between natural and synthetic fields

This doesn’t necessarily mean that players don’t move differently on synthetic fields. It just means that we can’t show that players do move differently with our set of data and the variables that we looked at. In particular, micro-movements, including the cleat-turf interaction on different types of fields, can’t really be captured with player coordinate data.

Recommendations and Future analysis

Since synthetic turf does seem to contribute to injuries, it would not be wise to install synthetic turf in new arenas. For existing synthetic turf, one can exercise precautionary measures such as ensuring that the turf is well-maintained with a proper amount of infill, and players should ensure that they’re wearing the correct type of cleats
The league should ensure that players receive extensive training for, and are executing, the correct techniques for change of direction movements
The data set that we analyzed had a small sample of injuries to work with — only 105 out of 250k plays led to injuries. As such, it might make sense to conduct repeated research over a longer period of time to confirm findings before replacing synthetic turf with natural fields in existing arenas
Not all synthetic surfaces are the same — future research should include different types of synthetic surface for determining injury risk
Information such as previous injuries for players may be helpful in further distilling factors that contribute to injuries
Other modelling methods can be tested for improving predictive power

Important Assumptions

We were provided with a data set of 105 injuries, of which 29 could not be identified with a specific play. In this case, we assumed that the play that caused the injury was the player’s last play. Different assumptions may yield differing results, since 29/105 is a large proportion of plays
For calculating change in direction and change in orientation, we assumed that if a player changes direction or orientation from 355 degrees to 5 degrees, the change is likely not -350 degrees, but 10 degrees. With this methodology, we can avoid potential exaggerated changes in direction due to the way that data is recorded
We filtered out any tracking data before the play started and after the play ended
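
The wrap-around assumption for direction and orientation changes can be written as a one-line modular calculation. The function name is made up for this sketch.

```python
def angular_change(prev_deg, curr_deg):
    """Signed smallest change between two headings, in the range [-180, 180)."""
    return (curr_deg - prev_deg + 180) % 360 - 180


wrap_forward = angular_change(355, 5)    # 10, not -350
wrap_backward = angular_change(5, 355)   # -10, not 350
no_wrap = angular_change(90, 120)        # 30
```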

*Note: the analysis discussed here is slightly different than the submission we made to the Kaggle competition, as we made some modifications to the methodology after the competition ended in the process of re-running the analysis