Do MMA fighters with longer arms win more fights? (Part 3)

Thomas Manandhar-Richardson
Published in Nerd For Tech
9 min read · Mar 18, 2021

Analysing the clean data

In Part 1 of the trilogy, I showed you how to scrape MMA fighter data from UFCstats.com. In Part 2, I showed you how I would clean it to make it suitable for analysis. Finally, here in Part 3 we’ll get down to analysis.

Before we get started: If you’ve been following from the start, you’ll have scraped your own MMA data. Because you’ll have scraped it on a different day to me, you’ll have slightly different data: some fighters will have won fights and others will have lost. Some new fighters will be added to the table and some might have been taken out. Your numbers will be slightly different to mine. If they’re radically different, odds are you’ve done something wrong. If you are sure you’ve done it all right and still get radically different results, I want to know! MMA changes with time, so it may well happen eventually.

Let’s get started

First, load the packages

library('ggplot2')
library('magrittr')
library('tidyr')
library('dplyr')
library('broom')
library('modelr')
library('purrr')
library('GGally')

This is a lot of packages! ggplot2 makes report-ready graphs, magrittr gives us pipes (%>%), tidyr and dplyr are for easy data manipulation, broom allows us to convert multiple regression outputs into easy to read tables, modelr allows us to easily extract residuals from regression model outputs, purrr is used in code for plotting multiple histograms and GGally is used for making the correlation plots.

Read in your data using the code below:

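# Read the cleaned data saved at the end of Part 2
data = read.csv('UFC_data_cleaned.csv')
# Drop the row-number column X; %<>% pipes and assigns the result back to data
data %<>% select(-X)
data %>% str()
data %>% head(20)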

str() tells me we have 12 variables and 3598 fighters, and head(20) gives us a quick look at what the data looks like.

Exploratory analyses: histograms

Next we use a bit of code that I found online somewhere years ago and have been using ever since. It makes histograms for all numeric variables in your dataset. It’s very flexible, and can be copied and pasted into any project and will probably work. I use it in all my data exploration.

data %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = 'free') +
  geom_histogram(fill = 'navy')

We take our data, keep the numeric variables, gather all of these variables into a single column, then call ggplot and tell it to make a separate histogram for each facet. In this case, by facet we mean variable. If anyone has a simpler way of doing this, I’d love to see!
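One possible variation, offered as a sketch rather than a definitive answer: tidyr now describes gather() as superseded by pivot_longer(), so the same reshaping can be written with it. The toy data frame below is made up purely for illustration; with the real data you would start from data instead.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the cleaned UFC table
toy <- data.frame(
  name           = c('A', 'B', 'C'),
  height_inches  = c(68, 70, 74),
  armspan_inches = c(69, 72, 77)
)

# Keep the numeric columns, then stack them into key/value pairs
long <- toy %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = 'key', values_to = 'value')

# One histogram per original column, as in the gather() version
p <- ggplot(long, aes(value)) +
  facet_wrap(~key, scales = 'free') +
  geom_histogram(fill = 'navy', bins = 30)
```

Printing p draws the plot. Setting bins explicitly is optional; it just silences the message ggplot2 prints when it picks a default.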

So it seems that extremely large values are giving us strange-looking histograms! There seems to be a minority of fighters with high numbers of wins, losses and, as a result, total fights. Let’s look at these:

data %>% filter(total_fights >= 100)

We filter for all fighters who have 100 or more fights.

Some fighters have a seriously large number of fights!

Exploratory analyses: correlation analyses

Let’s explore how our variables might be related with a correlation plot

data %>%
  select(height_inches, weight_lb, armspan_inches, losses, wins) %>%
  ggcorr(label = T, label_size = 6, label_round = 2,
         low = 'red', high = 'green',
         method = c('pairwise', 'spearman'),
         name = 'r', legend.size = 12)

We select() only the variables we want, then make a correlation plot using ggcorr(). label = T adds the correlation coefficients to the plot, setting label_size to 6 gave me a nice-looking plot (a different number might be better for you) and label_round = 2 rounds the correlations to 2 decimal places, as is the usual standard. low and high specify the colours we want negative and positive correlations to take; I set red for negative correlations and green for positive, as it seems pretty intuitive to me. The method argument has 2 parts. The first, 'pairwise', controls how missing values are handled: each correlation uses every fighter who has values for both variables involved. The second is very important: we tell it to plot Spearman correlations, not the default Pearson’s correlations. This is because Pearson’s correlations are much more affected by extreme values (like the 770lb fighter we saw in Part 2!). Spearman’s correlations are computed on the ranks of the values rather than the values themselves, so a single extreme value has far less influence. name sets the title of the legend, and legend.size is self-explanatory. There’s lots to play around with here, so I encourage you to change things and see how they affect the plot!
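To see why the choice of Spearman matters, here is a minimal sketch in base R. The heights and weights are invented toy values, not rows from the scraped data; the last one mimics an extreme 770lb outlier.

```r
# Hypothetical toy values: the last fighter is an extreme 770lb outlier
height_inches <- c(64, 66, 68, 70, 72, 74, 76)
weight_lb     <- c(125, 135, 145, 155, 170, 185, 770)

pearson  <- cor(height_inches, weight_lb, method = 'pearson')
spearman <- cor(height_inches, weight_lb, method = 'spearman')

# Spearman only uses the ordering, which is perfectly increasing here,
# so it equals 1; Pearson is pulled below 1 by the outlier
print(c(pearson = pearson, spearman = spearman))
```

Taller fighters are always heavier in this toy set, so the rank correlation is perfect, while the single outlier bends the straight-line fit that Pearson’s correlation measures.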

Two things jump out at me:

First, height, weight and armspan are all very highly interrelated. This makes sense: taller, heavier people have longer arms. These are all measures of a sort of ‘bigness’. This means that if we want to see if armspan influences the number of wins, the raw correlation isn’t a good measure, because armspan is confounded with height and weight. We need to build models that look at the effects of armspan while removing the influence of height and weight, which we do with multiple regression below.

Second, number of wins and number of losses are highly correlated at r = 0.63. This is because most fighters who have many fights will accumulate both wins and losses over time. This could pose a problem for our analysis: fighters might have lots of wins simply because they have lots of fights. We want to look at number of wins while taking into account number of fights.

Feature engineering: creating win_percentage as a measure of fighting ability

One way to do this is to calculate the percentage of wins. We define a new variable called win_percentage as the number of wins + half the number of draws, divided by total number of fights. This way, a draw counts as half a win.

data$win_percentage <-
  (data$wins + (0.5 * data$draws)) / data$total_fights
data %>% ggplot(aes(x = win_percentage)) + geom_histogram(fill = 'navy')

It seems that win_percentage (which, despite its name, is a proportion between 0 and 1) is roughly normally distributed, but with bumps at 0 (i.e. fighters who’ve lost all their fights) and 1 (fighters who’ve won all their fights).
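As a small worked example of the formula, here it is applied to four made-up fight records. The guard for fighters with zero recorded fights is my own assumption; I haven’t checked whether the cleaned data contains any such rows, but without it 0 / 0 would give NaN.

```r
# Hypothetical toy records, not real fighters
toy <- data.frame(
  wins         = c(10, 0, 3, 0),
  draws        = c(0, 0, 1, 0),
  total_fights = c(12, 4, 4, 0)
)

# A draw counts as half a win; fighters with no fights get NA instead of NaN
toy$win_percentage <- ifelse(
  toy$total_fights > 0,
  (toy$wins + 0.5 * toy$draws) / toy$total_fights,
  NA_real_
)

print(toy)
```

The third fighter has 3 wins and 1 draw from 4 fights, so the result is (3 + 0.5) / 4 = 0.875.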

Armspan, height and weight

We’ve seen above from the correlations that armspan, height and weight are all interrelated. One thing we might wonder is whether height and weight are uniquely related to armspan. It might be that height and armspan are linked, and height and weight are linked, but armspan and weight are not linked. It’s not clear to me why being heavier would also mean longer arms. Perhaps the correlation between weight and armspan is spurious and just due to their shared relationship with height. To separate these out, we run a multiple regression testing if height and weight each predict armspan.

For linear and multiple regression in R we use the lm() function, and display the results with summary().

lm(armspan_inches~ height_inches + weight_lb, data = data) %>% summary()

We get this (I’ve highlighted important bits in yellow):

The p values indicate that both height and weight are uniquely related to armspan. This means that if we took a group of people who were all the same height, the heavier ones would probably have longer arms, and if we took a group of people who were all the same weight, the taller ones would probably have longer arms. As the coefficients show: holding weight constant, each extra inch of height is associated with 0.92 inches more armspan. Holding height constant, each extra 1lb of weight is associated with 0.017 inches more armspan. The adjusted R squared tells us that together, height and weight explain 79% of the variability in armspan, which is a lot! If we know someone’s height and weight, we can tell reasonably well what their armspan would be.

This analysis also tells us that, to be sure that armspan is really related to fighting ability, we need to control for both height and weight. Otherwise we might conclude that armspan is related to fighting ability, when in reality it’s height and/or weight that are responsible for the effect.
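To get a feel for what “uniquely related” means, here is a sketch using simulated data. Everything in it is made up: in this toy world armspan and weight both depend on height only, which is the opposite of what the real data showed. It illustrates how multiple regression would expose a correlation that exists only because of a shared third variable.

```r
set.seed(42)
n <- 2000

height_inches  <- rnorm(n, mean = 70, sd = 3)
# In this made-up world, armspan depends on height only
armspan_inches <- height_inches + rnorm(n, sd = 1)
# Weight also depends on height only
weight_lb      <- 5 * height_inches - 180 + rnorm(n, sd = 10)

# The raw correlation between armspan and weight is strong...
raw_cor <- cor(armspan_inches, weight_lb)

# ...but once height is in the model, weight adds almost nothing
fit <- lm(armspan_inches ~ height_inches + weight_lb)
weight_coef <- coef(fit)['weight_lb']
height_coef <- coef(fit)['height_inches']

print(raw_cor)
print(coef(fit))
```

Here the weight coefficient comes out close to zero while the height coefficient stays close to 1, even though armspan and weight are strongly correlated on their own.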

Do fighters with longer arms win more fights?

We are ready to tackle our main question. We fit a multiple regression, where the response variable (or dependent variable, they mean basically the same thing) is win_percentage and our predictors (or independent variables) are armspan, height and weight.

Another reason we have to control for weight is that fighters compete in weight classes. If Conor McGregor has a win percentage of 81%, that only tells us he’s a good fighter compared to fighters of a similar weight. It’s unknown whether he’d be that good against heavyweights. As such, it only makes sense to compare fighters to other fighters of a similar weight. Controlling for weight makes it so that the effect of armspan represents the effect of longer arms assuming all fighters have equal weight (well, kind of, it’s complicated).

main_model =
  lm(win_percentage ~ armspan_inches + height_inches + weight_lb,
     data = data)
tidy(main_model, conf.int = T) %>%
  select(-statistic) %>%
  rename(Predictor = term,
         Coefficient = estimate,
         `Standard error` = std.error
  )

This time, instead of just printing our model using summary(), let’s tidy it up with… tidy()! This turns our regression table into a dataframe, so we can manipulate it easily. conf.int = T means we want 95% confidence intervals for our coefficients. We use rename() to rename the columns to make the table easier to read.

You should get something like this. We look at the p.value column and see that only the p value for armspan is < 0.05. The others are not. This suggests that only armspan has a statistically significant effect on win percentage: height and weight do not have independent effects on the percentage of fights a fighter wins. You can also see this in that the confidence intervals for the effect of armspan do not contain 0, whereas the confidence intervals for the other effects do contain 0.
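The “does the interval contain 0?” check can also be done in code. This is a minimal sketch using base R’s confint() on simulated data; the variable names mirror the article’s, but the numbers, the effect sizes and the two-predictor model are invented for illustration.

```r
set.seed(1)
n <- 500

armspan_inches <- rnorm(n, mean = 72, sd = 3)
height_inches  <- 0.9 * armspan_inches + rnorm(n, sd = 1.5)
# Made-up outcome: armspan has a real effect, height has none
win_percentage <- 0.5 + 0.02 * (armspan_inches - 72) + rnorm(n, sd = 0.05)

toy_model <- lm(win_percentage ~ armspan_inches + height_inches)

# 95% confidence intervals, one row per model term
ci <- confint(toy_model, level = 0.95)

# An interval excludes 0 when both ends lie on the same side of it
ci_table <- data.frame(
  lower = ci[, 1],
  upper = ci[, 2],
  excludes_zero = ci[, 1] > 0 | ci[, 2] < 0
)

print(ci_table)
```

An interval that excludes 0 corresponds to a p value below 0.05 for that term, which is why the two ways of reading the table agree.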

However, looking at the coefficient, it seems that the effect of armspan is small. Increasing armspan by 1 inch is associated with an increase in win_percentage of 0.005, or 0.5 percentage points. Still, that might be big if fighters vary by 10 or 20 inches in their armspan. To get a feel for the real world size of the effect, let’s take all the fighters with a weight of 155lb and find the longest and shortest armspan.

data %>%
  filter(weight_lb == 155) %>%
  summarise(max(armspan_inches, na.rm = T),
            min(armspan_inches, na.rm = T)
  )

summarise() lets us collapse a dataset down to one or more summary values, in this case max() and min() of armspan. Because some fighters have NA as their armspan, we have to tell max and min to remove these NAs when they do their thing, which we do with na.rm = T.
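A quick illustration of why na.rm matters, using a made-up vector of armspans rather than the real column:

```r
# Hypothetical armspans with one missing value
armspan_inches <- c(70, NA, 80, 64)

# Without na.rm, a single NA makes the result NA
max_with_na <- max(armspan_inches)

# With na.rm = TRUE, the NA is dropped before computing
longest  <- max(armspan_inches, na.rm = TRUE)
shortest <- min(armspan_inches, na.rm = TRUE)

print(c(max_with_na = max_with_na, longest = longest, shortest = shortest))
```

T is shorthand for TRUE in R, so na.rm = T and na.rm = TRUE do the same thing here.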

We find that the longest armspan at 155lb is 80 inches, and the shortest is 64 inches, a considerable difference. We’d expect the longer armed fighter to have a win percentage 16 x 0.005 = 0.08, or 8 percentage points, higher! So while for most fighters the influence of armspan will be small, in some extreme cases it may make a large difference.
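The same back-of-the-envelope calculation in R, using the rounded figures quoted in this article. The helper function expected_gap() is a hypothetical name of my own, and your coefficient and armspans may differ slightly if you scraped on a different day.

```r
# Rounded values quoted in the article
armspan_coef <- 0.005  # change in win_percentage per inch of armspan
longest      <- 80     # longest armspan at 155lb, in inches
shortest     <- 64     # shortest armspan at 155lb, in inches

# Expected difference in win_percentage, holding height and weight constant
expected_difference <- armspan_coef * (longest - shortest)
print(expected_difference)

# Hypothetical helper: expected difference for any armspan gap in inches
expected_gap <- function(coef_per_inch, gap_inches) {
  coef_per_inch * gap_inches
}

gaps <- expected_gap(armspan_coef, c(1, 5, 10, 16))
print(gaps)
```

This treats the regression line as exact, so it is best read as a rough guide to the size of the effect rather than a prediction for any two particular fighters.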

Conclusion

So it seems that fighters with longer arms really do win more fights! And it’s not just because they’re bigger. In fact, it seems that the only reason taller fighters win more fights is because of their longer arms: there is no unique effect of height on win percentage when armspan is also in the model.

Thanks for following me for this long! I hope you’ve learned a thing or 2 from this trilogy, whether that’s web scraping, data cleaning, data analysis or making graphs. If you do other analyses on this data, let me know! I’d love to see what you get up to. Thanks!

Thomas Manandhar-Richardson
Nerd For Tech

Data scientist at https://peak.ai/. Interested in AI explainability, AB testing, causal inference and recommenders