Pythagorean Expectation in Football

Beendict Udemeh
7 min read · Jan 22, 2024

Exploring the basics of coding football data in R

Introduction

Welcome to football analytics! Ever heard of the Pythagorean expectation? It estimates a team’s win percentage from just two numbers: the goals it has scored and the goals it has conceded. So it’s not just about the game, it’s about the numbers behind the victory too: the formula shows how a team’s results over a season relate to its goal record.

Here’s the formula for Pythagorean expectation in football:

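WPC = GS² / (GS² + GC²)

where WPC is Win Percentage, GS is Goals Scored, and GC is Goals Conceded. In the code later in this guide, GS and GC appear as GF (Goals For) and GA (Goals Against).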
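To make the formula concrete, here is a minimal sketch in R. The function name `pyth_expectation` and the goal totals are hypothetical, chosen only for illustration; they are not taken from the dataset.

```r
# Pythagorean expectation: scored^2 / (scored^2 + conceded^2)
# pyth_expectation is a hypothetical helper name, not part of any package
pyth_expectation <- function(scored, conceded) {
  scored^2 / (scored^2 + conceded^2)
}

# Hypothetical team: 60 goals scored, 40 conceded
# 3600 / (3600 + 1600) is roughly 0.69
pyth_expectation(60, 40)

# Equal goals scored and conceded gives exactly 0.5
pyth_expectation(45, 45)
```

A team that scores more than it concedes lands above 0.5, and the gap widens quickly because both totals are squared.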

The Pythagorean expectation formula is a cornerstone of sports analytics. Predicting results is valuable, but our focus here is on coding the data to calculate Pythagorean expectations and compare them with actual win percentages.

No worries if you’re new to football! This guide is made easy to follow. We’ll focus on key metrics: GF (Goals For), GA (Goals Against), G (Games Played), and W (Wins). Let’s dive in.

Tool

R Programming

Data Importation and Cleaning

Using open-source data for this project, I’ve loaded the necessary packages and imported the dataset into RStudio. Let’s start analyzing.

library(tidyverse)
library(dplyr)
library(ggplot2)
library(lubridate)

#Import data
epl18 <- read.csv("EPL17-18.csv")
str(epl18)

We’re working with a dataset on English football, spanning 4 divisions for the 2017/2018 season. It’s got 2036 observations, each with 7 variables: Division, Game Date, Home Team, Away Team, Full Time Home Goals, Full Time Away Goals, and Full Time Result. Let’s kick off the analysis.

Here’s what each variable in the dataset signifies:

  • Div: The division in which the team competes.
  • Date: The date when a game was played.
  • HomeTeam: The team playing at their home ground.
  • AwayTeam: The team playing away from their home ground.
  • FTHG (Full Time Home Goals): The number of goals scored by the home team by the end of the game.
  • FTAG (Full Time Away Goals): The number of goals scored by the away team by the end of the game.
  • FTR (Full Time Result): The result of the game at full time (Home Win, Away Win, or Draw).

These details provide a comprehensive view of each game, allowing for in-depth analysis. At first look, the data seems well-structured. Just a heads up: we’ll convert the ‘Date’ column from character to date format for later calculations. Apart from that, our data is clean and good to go.

head(epl18)
tail(epl18)

Quick note on our ‘Date’ column: it mixes 2-digit and 4-digit years and is stored as character. I’ve set up a function to change 2-digit years to 4-digit, and then convert the whole column to date format.

#Function to convert 2-digit year to 4-digit year
convert_year <- function(Date) {
  if (grepl("\\b\\d{2}\\b$", Date)) { # If the date ends with a 2-digit year
    return(sub("(\\b\\d{2}\\b$)", paste0("20", "\\1"), Date)) # Add "20" to the start of the year
  } else {
    return(Date) # If the year is already 4-digit, return the date as is
  }
}

# Apply the function to the date column
epl18$Date <- sapply(epl18$Date, convert_year)

# Convert the date column to a Date type
epl18$Date <- dmy(epl18$Date)

# Checking new date format
head(epl18)
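The same year fix can also be sketched in vectorized form with base R, without `sapply`. This is only an illustration: the sample strings are hypothetical, and it assumes day/month/year order with `/` as the separator, which the dataset may or may not use.

```r
# Hypothetical sample values mixing 2-digit and 4-digit years
dates_chr <- c("11/08/17", "12/08/2017", "01/01/18")

# Prefix "20" only when the string ends in a slash followed by exactly two digits
dates_fixed <- sub("/([0-9]{2})$", "/20\\1", dates_chr)

# Parse day/month/year with base R
dates_parsed <- as.Date(dates_fixed, format = "%d/%m/%Y")
dates_parsed
```

Because `sub()` works on whole vectors, there is no need to loop over the column one value at a time.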

Coding Metrics to analyze data.

#creating value for wins,draws and losses
epl18[,'hwinvalue']=ifelse(epl18$FTR=='H',1,ifelse(epl18$FTR=='D',0.5,0))
epl18[,'awinvalue']=ifelse(epl18$FTR=='A',1,ifelse(epl18$FTR=='D',0.5,0))
epl18[,'count'] = 1

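In the code block above, I’m turning match results into numbers so we can run calculations. We’re adding three new columns: ‘hwinvalue’ and ‘awinvalue’, which score 1 for a win, 0.5 for a draw, and 0 for a loss from the home and away side’s point of view, and ‘count’, which is 1 for every game so that summing it gives the number of games played.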
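Here is the same encoding applied to three hypothetical results, one of each kind, as a quick way to see what the nested `ifelse()` calls produce.

```r
# Hypothetical full-time results: home win, draw, away win
ftr <- c("H", "D", "A")

hwinvalue <- ifelse(ftr == "H", 1, ifelse(ftr == "D", 0.5, 0))
awinvalue <- ifelse(ftr == "A", 1, ifelse(ftr == "D", 0.5, 0))

data.frame(ftr, hwinvalue, awinvalue)

# Each game hands out exactly one win value in total
all(hwinvalue + awinvalue == 1)
```

Counting a draw as half a win for each side is what lets win percentage be compared directly with the Pythagorean expectation.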

Football fans know a season has two halves. We’re splitting our dataset accordingly: games played before 1 January 2018 are the ‘first half’, and games played on or after that date are the ‘second half’.

#creating dataframe for games played in 2017
games17 <- epl18[epl18$Date < as.Date("2018-01-01"), ]
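A small sketch of how the cut-off behaves at the boundary, using three hypothetical dates. The second-half filter used later in this guide is `>=`, so a game on 1 January 2018 belongs to the second half and no game is counted twice.

```r
# Hypothetical dates around the cut-off
toy <- data.frame(Date = as.Date(c("2017-12-30", "2018-01-01", "2018-01-02")))
cutoff <- as.Date("2018-01-01")

first_half  <- toy[toy$Date <  cutoff, , drop = FALSE]
second_half <- toy[toy$Date >= cutoff, , drop = FALSE]

# The two halves together account for every row
nrow(first_half) + nrow(second_half) == nrow(toy)
```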

Next up, we’re splitting the first half-season games into home and away. Keep your eyes peeled, it’s a bit tricky.

#creating aggregation for 2017 homegames
home17 <- games17 %>%
  group_by(HomeTeam, Div) %>%
  summarise(count = sum(count),
            hwinvalue = sum(hwinvalue),
            FTHG = sum(FTHG),
            FTAG = sum(FTAG)) %>%
  ungroup() %>%
  rename(team = HomeTeam,
         ph = count,
         FTHGh = FTHG,
         FTAGh = FTAG) %>%
  arrange(team)


#creating aggregation for 2017 awaygames
away17 <- games17 %>%
  group_by(AwayTeam, Div) %>%
  dplyr::summarise(count = sum(count),
                   awinvalue = sum(awinvalue),
                   FTHG = sum(FTHG),
                   FTAG = sum(FTAG)) %>%
  ungroup() %>%
  rename(team = AwayTeam,
         pa = count,
         FTHGa = FTHG,
         FTAGa = FTAG) %>%
  arrange(team)

First, we’re summing up home goals, away goals, win values, and game counts, grouped by team and division. Then, in the home table, we’re renaming the columns:

  • ‘count’ becomes ‘ph’ (games played at home)
  • ‘FTHG’ (Full Time Home Goals) becomes ‘FTHGh’: the goals the team scored at home
  • ‘FTAG’ (Full Time Away Goals) becomes ‘FTAGh’: the goals the team conceded at home

The away table gets the matching names ‘pa’, ‘FTHGa’ (goals conceded away), and ‘FTAGa’ (goals scored away). These names help us tell home and away figures apart once the tables are combined. Finally, we sort the teams in ascending order.
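To see what the home aggregation produces, here is a minimal sketch on a hypothetical four-game match list. It uses base R’s `aggregate()` rather than the dplyr pipeline above, purely so it runs without extra packages; the team names, division label, and scores are all invented.

```r
# Hypothetical match list: three teams, four games, one division
toy <- data.frame(
  Div      = "Div1",
  HomeTeam = c("A", "B", "C", "A"),
  AwayTeam = c("B", "C", "A", "C"),
  FTHG     = c(2, 1, 0, 3),
  FTAG     = c(1, 1, 2, 0),
  FTR      = c("H", "D", "A", "H")
)
toy$hwinvalue <- ifelse(toy$FTR == "H", 1, ifelse(toy$FTR == "D", 0.5, 0))
toy$count <- 1

# Base R counterpart of group_by(HomeTeam, Div) followed by summarise(sum(...))
home <- aggregate(cbind(count, hwinvalue, FTHG, FTAG) ~ HomeTeam + Div,
                  data = toy, FUN = sum)

# Same renaming as in the home table above
names(home) <- c("team", "Div", "ph", "hwinvalue", "FTHGh", "FTAGh")
home
```

Team A played two home games in this toy data, so its row shows ‘ph’ of 2, with ‘FTHGh’ holding the 5 goals it scored at home and ‘FTAGh’ the 1 goal it conceded there.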

Finally, we’re combining the dataframes to capture the complete picture of the season’s first half.

#merging the home and away aggregates for 2017
games17ha <- merge(x=home17,y=away17,by=c('team', 'Div'))

head(games17ha)
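One detail of `merge()` worth keeping in mind: by default it keeps only the rows whose keys appear in both tables. The sketch below uses hypothetical aggregates to show the difference between the default and `all = TRUE`. Whether any team in the real dataset lacks home or away games in a given half is not something checked here.

```r
# Hypothetical aggregates: team "C" has no away games in this half
home <- data.frame(team = c("A", "B", "C"), Div = "Div1", ph = c(2, 1, 1))
away <- data.frame(team = c("A", "B"),      Div = "Div1", pa = c(1, 2))

# Default merge keeps only teams present in both tables (inner join)
inner <- merge(x = home, y = away, by = c("team", "Div"))

# all = TRUE keeps every team and fills the missing side with NA
outer <- merge(x = home, y = away, by = c("team", "Div"), all = TRUE)

nrow(inner)  # 2
nrow(outer)  # 3
```

Comparing the number of rows before and after the merge is a cheap way to confirm that no team was dropped.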

We’ve prepped our data and are now ready to calculate the key metrics for the Pythagorean expectation.

#creating calculated columns for Wins, GamesPlayed, GoalFor,GoalAgainst
games17ha[,'W'] = games17ha[,'hwinvalue'] + games17ha[,'awinvalue']
games17ha[,'G'] = games17ha[,'ph'] + games17ha[,'pa']
games17ha[,'GF'] = games17ha[,'FTHGh'] + games17ha[,'FTAGa']
games17ha[,'GA'] = games17ha[,'FTAGh'] + games17ha[,'FTHGa']

head(games17ha)
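A quick sanity check for the GF and GA columns: within a closed set of games, every goal scored by one team is conceded by another, so the totals should match. The sketch below shows this on a hypothetical match list built with base R. On the real data the check only holds if every team in the games is still present in the merged table.

```r
# Hypothetical match list: three teams, four games
matches <- data.frame(
  HomeTeam = c("A", "B", "C", "A"),
  AwayTeam = c("B", "C", "A", "C"),
  FTHG     = c(2, 1, 0, 3),
  FTAG     = c(1, 1, 2, 0)
)
teams <- sort(unique(c(matches$HomeTeam, matches$AwayTeam)))

# GF = goals scored at home + goals scored away
GF <- sapply(teams, function(tm) {
  sum(matches$FTHG[matches$HomeTeam == tm]) + sum(matches$FTAG[matches$AwayTeam == tm])
})

# GA = goals conceded at home + goals conceded away
GA <- sapply(teams, function(tm) {
  sum(matches$FTAG[matches$HomeTeam == tm]) + sum(matches$FTHG[matches$AwayTeam == tm])
})

data.frame(team = teams, GF, GA)

# Goals scored across all teams equal goals conceded across all teams
sum(GF) == sum(GA)
```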

Win percentage and Pythagorean Expectation for first half of Season.

#calculating win percentage and pythagorean expectation
games17ha[,'wpc17'] = games17ha[,'W']/games17ha[,'G']
games17ha[,'pyth17'] = games17ha[,'GF']^2/(games17ha[,'GF']^2 + games17ha[,'GA']^2)

Next, we’ll repeat the process for the season’s second half, tweaking the data to calculate the key metrics. Let’s go.

games18 <- epl18[epl18$Date >= as.Date("2018-01-01"), ]

Next, we’re dividing the second half-season games into home and away, just as we did for the first half. Let’s keep going!

#creating aggregate for 2018 homegames
home18 <- games18 %>%
  group_by(HomeTeam, Div) %>%
  summarise(count = sum(count),
            hwinvalue = sum(hwinvalue),
            FTHG = sum(FTHG),
            FTAG = sum(FTAG)) %>%
  ungroup() %>%
  rename(team = HomeTeam,
         ph = count,
         FTHGh = FTHG,
         FTAGh = FTAG) %>%
  arrange(team)


#creating aggregation for 2018 awaygames
away18 <- games18 %>%
  group_by(AwayTeam, Div) %>%
  dplyr::summarise(count = sum(count),
                   awinvalue = sum(awinvalue),
                   FTHG = sum(FTHG),
                   FTAG = sum(FTAG)) %>%
  ungroup() %>%
  rename(team = AwayTeam,
         pa = count,
         FTHGa = FTHG,
         FTAGa = FTAG) %>%
  arrange(team)

Now, we’re merging the home and away dataframes to get a complete view of the second half of the season games.

#merging home and away aggregates for 2018
games18ha <- merge(x=home18,y=away18,by=c('team', 'Div'))

head(games18ha)

Just like before, we’ve set up our data and are all set to crunch the key Pythagorean expectation metrics. Let’s get to it.

#creating calculated columns for Wins, GamesPlayed, GoalFor,GoalAgainst
games18ha[,'W'] = games18ha[,'hwinvalue'] + games18ha[,'awinvalue']
games18ha[,'G'] = games18ha[,'ph'] + games18ha[,'pa']
games18ha[,'GF'] = games18ha[,'FTHGh'] + games18ha[,'FTAGa']
games18ha[,'GA'] = games18ha[,'FTAGh'] + games18ha[,'FTHGa']

Win percentage and Pythagorean Expectation for second half of Season.

#calculating win percentage and pythagorean expectation
games18ha[,'wpc18'] = games18ha[,'W']/games18ha[,'G']
games18ha[,'pyth18'] = games18ha[,'GF']^2/(games18ha[,'GF']^2 + games18ha[,'GA']^2)

Summary and Insights

#merging 2017 and 2018 summary files
Newepl <- merge(x=games17ha,y=games18ha,by=c('team', 'Div'))
ggplot(data = Newepl,aes(x = pyth17,y = wpc17,color = Div)) + geom_point() + ggtitle("First Half of Season")
ggplot(data = Newepl,aes(x = pyth18,y = wpc18,color = Div)) + geom_point() + ggtitle("Second Half of Season")

Conclusion

There seems to be a positive correlation between Pythagorean expectation and win percentage in both halves of the season, as most dots line up diagonally from the bottom left to the top right. This suggests that teams with higher Pythagorean values also tend to have higher win percentages.

When we say there’s a positive correlation between Pythagorean expectation and win percentage, we mean that as one goes up, the other tends to go up as well.
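For readers who want a number to go with the picture, base R’s `cor()` gives the Pearson correlation coefficient. The values below are hypothetical and only illustrate the call; they are not results from the dataset.

```r
# Hypothetical values for five teams, not taken from the dataset
pyth <- c(0.25, 0.40, 0.50, 0.65, 0.80)
wpc  <- c(0.30, 0.42, 0.48, 0.60, 0.72)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(pyth, wpc)
r

# With the merged table from this guide, the equivalent call would be
# cor(Newepl$pyth17, Newepl$wpc17)
```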

Remember, while the Pythagorean expectation provides a useful estimate, it’s not perfect. Real-life factors like luck, injuries, and who you’re playing against can make a team’s actual wins differ from what the formula predicts.

In an upcoming project, I’ll dive into how the Pythagorean expectation can help predict outcomes, reveal team insights, and even explore the nitty-gritty of correlation figures and confidence levels. It’s going to be exciting! 😊

Beendict Udemeh

Data Analyst || Sports and Performance || Problem Solver. I’ve helped businesses make smarter decisions. Give me your data and I'll tell you a story.