Simple Statistical Tests to Compare Categories

Published in Geek Culture, Jul 29, 2021

Testing whether the heights and weights of NHL players differ significantly across positions, and between playoff and non-playoff teams.

In my previous article (here), I demonstrated some simple exploratory data analysis on NHL players’ heights and weights. This left me wondering: are there big differences in heights and weights across the different positions in hockey? Also, during the 2021 NHL playoffs, a lot of noise was made about how big the two finalist teams were, especially at the defenseman position. So I want to see whether there is any truth to the idea that playoff teams are taller and heavier than non-playoff teams.

The code used below can be found here.

Testing for Difference in Height and Weight Across Positions

Let’s get the boring part out of the way. We need to load the data, then convert the heights from feet and inches to cm and the weights from lbs to kg.

import pandas as pd

## We have to reload the data
data = pd.read_csv("nhl_bio_2020.csv")
data.head()

## Keep only the columns we need; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
data = data[['first_name', 'last_name', 'height', 'weight', 'position_name', 'team_name']].copy()

## Need to perform data manipulation to set weight in kg and height in cm
data['weight'] = round(data['weight'] * 0.453592, 1)
## The first character is the feet; characters 2 to 4 hold the inches, with the closing quote stripped
data['height'] = data['height'].str[0:1].astype(int)*30.48 + data['height'].str[2:5].replace('"','', regex=True).astype(int)*2.54
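
The slicing above relies on the feet always sitting in the first character of the string. As a slightly more defensive variant, here is a minimal sketch that pulls out both numbers with a regular expression. It assumes heights are stored as strings such as 6' 2"; the sample values and the helper name height_to_cm are made up for illustration and are not part of the original code.

```python
import pandas as pd

def height_to_cm(heights: pd.Series) -> pd.Series:
    """Convert strings like 6' 2" to centimetres (hypothetical helper)."""
    # Capture the feet and the inches as two groups of digits
    parts = heights.str.extract(r"(\d+)'\s*(\d+)").astype(float)
    return (parts[0] * 30.48 + parts[1] * 2.54).round(2)

# Made-up sample values, not taken from the NHL data set
sample = pd.Series(["6' 2\"", "5' 11\"", "6' 10\""])
converted = height_to_cm(sample)
print(converted)
```

Rows that do not match the pattern come back as NaN instead of raising an error, which makes malformed entries easy to spot.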

Next, I want to understand and visualize the average heights and weights by position, to have an idea of what I’m working with.

To do this, I use the .groupby() method to group my data by position name, and use the .mean() method to calculate the mean of my metric values by position.

Intuitively, I expect goaltenders to be the tallest and leanest group, as they need to cover a lot of net while being quick and athletic. In the weight category, I expect defensemen to be heavier, since they usually play a more physical game than forwards.

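## Select the numeric columns first: recent pandas versions raise an error when .mean() meets text columns
by_position = data.groupby('position_name')[['height', 'weight']].mean()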
print(by_position)
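
Means alone hide how many players are in each group and how spread out they are, and both matter for the tests later on. The sketch below shows one way to get count, mean and standard deviation in a single call with .agg(). The toy table is invented for illustration only; with the real data you would run the same groupby on data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "position_name": ["Goalie", "Goalie", "Center", "Center", "Defenseman", "Defenseman"],
    "height": [190.5, 188.0, 182.9, 185.4, 187.9, 190.5],
    "weight": [88.0, 90.0, 86.2, 89.8, 93.0, 95.2],
})

# One row per position, with count / mean / std for each metric
summary = toy.groupby("position_name")[["height", "weight"]].agg(["count", "mean", "std"])
print(summary)
```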

Avg height (cm) and weight (kg) of NHL players by position

Unsurprisingly, goaltenders are the tallest players, but not by quite the margin I expected. They are less than 3 cm taller on average than the average defenseman, and only 5 cm taller on average than the average center, the shortest position according to this data.

In terms of weight, I expected goaltenders to be even more slender given the small difference in heights. Still, there is very little variation from one position (category) to another.

It’s time to build a chart!

by_position = by_position.sort_values('height')
by_position.plot.barh(y=['height', 'weight'])

The chart shows just how close the averages are

The above code uses the .sort_values() method to order the positions by height, from shortest to tallest. Horizontal bar charts are great for visualizing small differences. In this case, there are not a whole lot of differences though!

Next up, we import the stats package from scipy, which will help us compute Welch’s t-tests. The t-test is a hypothesis testing tool that compares the means of two samples; Welch’s version does not assume that the two samples have equal variances. We first have to verify the assumption of normality, and then we can proceed to our tests. In this case, I will test three hypotheses for each of the height and weight variables:

Are forwards’ heights and weights any different from defensemen’s?
Are forwards’ heights and weights any different from goaltenders’?
Are defensemen’s heights and weights any different from goaltenders’?
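
The normality check mentioned above is not shown in this article. One common option is the Shapiro-Wilk test from scipy.stats. The sketch below runs it on synthetic heights that are generated on the spot (not the NHL data), so the numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for one position's heights in cm; not real NHL data
heights = rng.normal(loc=186.0, scale=5.0, size=200)

# Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution
stat, pval = stats.shapiro(heights)
print("Shapiro-Wilk p-value:", pval)

if pval < 0.05:
    print("Normality is rejected at the 5% level")
else:
    print("No evidence against normality at the 5% level")
```

With the real data, the same call would be applied to each group's height and weight columns in turn.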

from scipy import stats

data_defensemen = data[data['position_name'] == 'Defenseman']
data_goalies = data[data['position_name'] == 'Goalie']
forwards = ['Right Wing', 'Center', 'Left Wing']
data_forwards = data[data['position_name'].isin(forwards)]

## Test 1
## equal_var=False makes ttest_ind run Welch's t-test rather than Student's t-test
ttest, pval = stats.ttest_ind(data_defensemen['height'], data_forwards['height'], equal_var=False)
print('p_value is ', pval)

if pval < 0.05:
    print("There is significant evidence that NHL defensemen are taller than forwards")
else:
    print("There is no statistical difference between the size of NHL defensemen and forwards")

## Test 2
ttest, pval = stats.ttest_ind(data_defensemen['height'], data_goalies['height'], equal_var=False)
print('p_value is ', pval)

if pval < 0.05:
    print("There is significant evidence that NHL goalies are taller than defensemen")
else:
    print("There is no statistical difference between the size of NHL goalies and defensemen")

## Test 3
ttest, pval = stats.ttest_ind(data_forwards['height'], data_goalies['height'], equal_var=False)
print('p_value is ', pval)

if pval < 0.05:
    print("There is significant evidence that NHL goalies are taller than forwards")
else:
    print("There is no statistical difference between the size of NHL goalies and forwards")

I first filter the data set three times to obtain data for forwards, defensemen and goalies. I grouped players at the Right Wing, Center and Left Wing positions into a single “Forwards” category. A less verbose way to perform these tests would be to generate the pairs of positions and loop through them. I had only three, so I simply repeated the code.
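
For reference, the looped version could look something like the sketch below. It uses itertools.combinations to generate each unordered pair of groups once. The three arrays are synthetic stand-ins generated here so the snippet runs on its own; with the real data, the dictionary values would be the height (or weight) columns of the three filtered data sets.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the three filtered data sets; not real NHL data
groups = {
    "forwards": rng.normal(185.0, 5.0, 400),
    "defensemen": rng.normal(187.0, 5.0, 250),
    "goalies": rng.normal(189.0, 5.0, 80),
}

results = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    # equal_var=False requests Welch's t-test
    tstat, pval = stats.ttest_ind(a, b, equal_var=False)
    results[(name_a, name_b)] = pval
    verdict = "different" if pval < 0.05 else "not significantly different"
    print(f"{name_a} vs {name_b}: p = {pval:.4g} ({verdict})")
```

Storing the p-values in a dictionary keyed by the pair of names keeps them available for later use, for instance to repeat the same loop on weights.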

The results from the Welch’s t-test are surprising:

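Each test is significant! I think this is explained by the small variance among players within each position, which, combined with fairly large samples, makes even small differences in means detectable. So these tests suggest that there is indeed a difference in heights between the positions.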
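
To see why low variance makes small gaps significant, it helps to compute Welch’s statistic by hand: the difference in means is divided by sqrt(s1²/n1 + s2²/n2), so a small spread and a large sample both shrink the denominator and inflate t. The sketch below does this on synthetic samples (invented numbers, not the NHL data) and compares the result with scipy’s own output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic samples in cm, invented for illustration
a = rng.normal(185.0, 5.0, 400)
b = rng.normal(187.0, 5.0, 250)

# Squared standard error of each mean, without pooling the variances
var_a = a.var(ddof=1) / a.size
var_b = b.var(ddof=1) / b.size
t_manual = (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (var_a + var_b) ** 2 / (var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```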

I did the same tests for the weights, and:

It seems the only statistically significant result is that NHL defensemen are heavier than forwards.

Testing for Difference in Height and Weight for Playoff and Non-Playoff Teams

Now, this was fun, but old-school fans and analysts always come back to the idea that size is really important and helps teams win. Is this true? I assume that if size were a factor in being a good team, then playoff teams would be taller and heavier, right? (16 out of 31 NHL teams qualified for the playoffs.)

playoff_teams = ["New York Islanders", "Montréal Canadiens", "Toronto Maple Leafs", "Winnipeg Jets", "Edmonton Oilers", "Vegas Golden Knights", "Colorado Avalanche",
                "Washington Capitals", "Boston Bruins", "Florida Panthers", "Pittsburgh Penguins", "Carolina Hurricanes", "Nashville Predators", "Tampa Bay Lightning",
                "St. Louis Blues", "Minnesota Wild"]

data_playoffs = data[data['team_name'].isin(playoff_teams)]
data_not_playoff = data[~data['team_name'].isin(playoff_teams)]

## Welch's t-test on heights (stats was imported from scipy earlier)
ttest, pval = stats.ttest_ind(data_playoffs['height'], data_not_playoff['height'], equal_var = False)
print('p_value is ', pval)

if pval < 0.05:
    print("There is significant evidence that playoff teams are taller than non playoff teams")
else:
    print("There is no statistical difference between the height of playoff teams and non playoff teams")

## Welch's t-test on weights
ttest, pval = stats.ttest_ind(data_playoffs['weight'], data_not_playoff['weight'], equal_var = False)
print('p_value is ', pval)

if pval < 0.05:
    print("There is significant evidence that playoff teams are heavier than non playoff teams")
else:
    print("There is no statistical difference between the weight of playoff teams and non playoff teams")

First, I entered a list of the 2021 NHL playoff teams and built two data sets: data_playoffs and data_not_playoff. Their names are pretty self-explanatory.
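
A list typed by hand can silently drop a whole team if a name differs from the data by even one character, such as a stray space or a missing accent, because .isin() simply finds no match. A quick sanity check, sketched here on a toy roster that is invented for illustration, is to look for names in the list that never appear in the team_name column.

```python
import pandas as pd

# Toy roster, invented for illustration
toy = pd.DataFrame({"team_name": ["Montréal Canadiens", "Boston Bruins", "Ottawa Senators"]})
candidate_teams = [" Montréal Canadiens", "Boston Bruins"]  # note the stray leading space

# Names in the hand-typed list that never occur in the data
unmatched = sorted(set(candidate_teams) - set(toy["team_name"]))
print("Names with no match in the data:", unmatched)

# Stripping whitespace before matching resolves this particular mismatch
cleaned = [name.strip() for name in candidate_teams]
still_unmatched = sorted(set(cleaned) - set(toy["team_name"]))
print("After stripping:", still_unmatched)
```

An empty list means every team in the hand-typed list was found in the data.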

Then, I performed the Welch’s t-test on those samples. The results? Non-significant.

This suggests that playoff teams are neither taller nor heavier than non-playoff teams.

Written by Alexistats