Distribution of Premier League Goals

A brief look into the numbers of English Premier League football

Published in

Analytics Vidhya

17 min readAug 10, 2021

About a year ago I handed in my master’s thesis. It was a captivating subject that I very much enjoyed working on. The proceedings of my thesis began about a year before the hand-in when I encountered a near-perfect way to blend together my studies, Mathematical-Finance, and my life-long passion, football (soccer, for the American readers). I had previously come across several papers describing how to make a simple betting strategy from statistical models of the distribution of football goals[1,2,3] and had been implementing some hobby projects based on some of these papers. However, during my studies, I had recently taken a course in advanced stochastic processes and option pricing techniques that caught my attention. The culmination came when a friend of mine stumble upon a recent article that combined the two topics into a simple risk-neutral framework for pricing in-play football bets using the Poisson process[4] by viewing the goals in a match as the underlying assets. In my thesis, I presented an extension of this simple framework and in doing so I made some initial analyses and model checking for models used to statistically represent goals in football games. Some of my findings from that part of my thesis, I’ll try to present in this piece.

What sets the work I did in my thesis apart from most previous academic (or recreational) papers on goal distribution in football, is the fact that not only did I examine the distribution of goals of the matches, I also examined the goal distribution throughout the matches by also looking at the goal times during the football matches, basically, for the mathematically-interested readers, I investigated how certain point/counting processes could be used to describe a typical football match and possible shortcomings thereof. In fact, I tried to push the model checking procedure away from simple goal distribution, and as a consequence of that, I tried to confront the long-standing assumption, still used by many experts, that goals in football are Poisson distributed.

For the interested readers, I’ll quickly sum up a few mathematical definitions, mainly based on work from [5,6]. You may skip this part if you’re not into the mathematical theory of this subject, but it will be useful to at least skim it through.

Definitions

First, we’ll need to know what a point process is in mathematical terms.

A simple point process ψ = {tₙ : n ∈ ℕ} is a sequence of strictly increasing points 0 < t₁ < t₂ < ⋅⋅⋅ < tₙ <⋅⋅⋅
with lim_{n→∞} tₙ = ∞. We say that tₙ is the arrival time of the nth arrival (event). Furthermore, we sometimes allow a point t₀ at the origin and define t₀ := 0. If the tₙ’s are random variables defined on some probability space
(Ω, 𝔉; ℙ), then ψ is termed a simple random point process.

For the use in describing football goals, I will only consider the case where I define t₀ := 0, and hence write ψ = {tₙ: n ∈ ℕ₀}.
Now, let ψ be a simple point process. Then the nth interarrival time of ψ is then given by

Furthermore, by definition, we have that

Next, I’ll define a counting process based on this:

Let ψ be a simple point process. By defining N₀ := 0, we let Nₜ denote he number of points in the interval (0; t], i.e. Nₜ = max {n: tₙ ≤ t }.
N = (Nₜ)_{t≥0} is then referred to as the counting process for ψ.

Moreover, we have that the (random) variable Nₜ may also be expressed in terms of the indicator function:

The fundamental relationship between the counting process N and the point process ψ is that for each n and t the following holds by simple reasoning:
Nₜ ≥ n ⇔ tₙ ≤ t.

The whole notion of intensity processes in relation to point/counting processes is very central to the theory, but out of the scope of this introductory post, but for a short summary, the intensity process characterizes the evolution of the counting process conditioned on the history of the process.

Lastly, I’ll introduce an important class of point processes.

A random simple point process ψ for which the interarrival times {Tₙ: n ∈ ℕ} form an independent and identically distributed (i.i.d.) sequence is called a renewal process. We then refer to tₙ as the nth renewal epoch and
F(x) = P(T ≤ x) , x ≥ 0, denotes the common interarrival time distribution.

The Poisson Process

We are just not going to get away with not delving into the Poisson process when dealing with football modeling, and thus we can’t avoid talking about the Poisson distribution either. It is simply such an integral part of football goal modeling (and in fact also of the theory of point processes). The (homogeneous) Poisson process, named after the French mathematician Siméon Denis Poisson, is defined as the point process ψ with independent interarrival times {Tₙ: n ∈ ℕ} which follows an exponential distribution with parameter λ, and furthermore:

From the memoryless property of the exponential distribution, we have that the Poisson process is, in fact, also a renewal process. Basically, we have that the homogeneous Poisson process describes a series of events, in which the occurrence of the next event is independent of when the last event took place. We also have that the increments of the counting process are Poisson distributed, i.e. Nₜ - Nₛ ∼ Poi(λ(t-s)). This fact also brings us to the clear connection between the distribution of goals and the point process describing the goal times during the match; namely, set s = 0, and t = 1, where 1 is equal to the end of the football match (a quick side note: the end of a football match from a point process perspective is rather hard to quantify beforehand due to the somewhat random nature of stopping time in football), then we have that
N₁ - N₀ = N₁ ∼ Poi(λ), that is, from a football point-of-view, λ describes a team’s average number of goals during the match. The nature of a point process, and thus the Poisson process, in relation to modeling football goals, is that we know the expected number of goals (or equivalently, the expected time between goals) beforehand, but that the actual number of goals is random.

The Weibull Process

It is possible to extend the homogenous Poisson process to the inhomogeneous case, i.e. such that it is no longer stationary, which basically means that the chance of scoring a goal is not the same throughout the match, but such that the overall distribution of goals is still Poisson. In mathematical terms, we’re introducing an integral to the λ-term above (as well as other technical stuff, which we won’t care about here). What we arrive at with this extension is

Finally, note that due to Nₜ having the Poisson distribution with parameter

we have that the variance of Nₜ is equal to the expected value of Nₜ, i.e.

All we have to do now to make use of the inhomogeneous Poisson process is to select a characterization of the λ-process (which may be random or deterministic). In my thesis, I focused on the Weibull Process characterization, which takes the form of

with α, β ∈ ℝ₊. Please note that what is actually Weibull distributed in a Weibull process is the arrival time of the first goal, whereas the arrival of t₂, t₃, ⋅⋅⋅, and the interarrival times, {Tₙ} for n ≥ 2, are not Weibull.

The Weibull Renewal Process

Finally, I’ll present an extension to the renewal aspect in regards to football goals. The general renewal assumption basically states that the intensity (chance of scoring) starts over when an event happens, i.e. a team's chance of scoring restarts when it scores a goal. I want to show another Weibull-related process, namely the Weibull renewal process, which is a renewal process ψ with interarrival times {Tₙ: n ∈ ℕ} distributed according to a Weibull distribution; Tₙ ∼ Weibull(α, β), n = 1, 2, ⋅⋅⋅ . Whereas the Poisson and Weibull processes portray a deterministic chance of scoring a goal throughout the game, the Weibull renewal process has a random intensity, which can be described by

where Zₜ is the time elapsed since the last event, which is where the randomness comes into play. The emerging distribution of Nₜ that originates from a Weibull renewal process is known as the Weibull count model[7] and is given by

It is a fairly complicated formula that I won’t go into details with here (see the source for a description).

Goals Analysis

I am now in a position where I can start my analysis of English Premier League (EPL) goals. First, let me get my hands dirty and get some data. My go-to language is the R programming language[9] along with the tidyverse[10], but it should be fairly easy to perform this analysis in other languages as well.

Obtaining the Data

You can easily obtain historic score data straight from the web, e.g. at https://www.football-data.co.uk. However, obtaining the goal times are a bit more difficult and you’ll likely have to scrape it yourself (please be aware that most sites that contain such data have some sort of scraping policy that you should adhere to). I have created a small R-function that can be used to obtain data from https://football.fandom.com/wiki/Category:Premier_League.[8]
A quick pro-tip; you can use the memoise package when dealing with scraping tasks from R.

get_fandom_epl_football_data <- function(season = "2018-19", matchweek = 1) {
  match_url <- glue::glue("https://football.fandom.com/wiki/{season}_Premier_League:_Match_day_{matchweek}")
  
  raw_list <- 
    rvest::read_html(match_url) %>% 
    rvest::html_table()
  
  games <- raw_list[2:11] %>% 
    dplyr::bind_rows() %>% 
    dplyr::filter(
      !is.na(X2), 
      stringr::str_detect(X3, pattern = "Report", negate = TRUE)
    ) %>% 
    dplyr::bind_cols(
      raw_list[2:11] %>% 
        dplyr::bind_rows() %>% 
        dplyr::filter(
          !is.na(X2), 
          stringr::str_detect(X3, pattern = "Report")
        ) %>% 
        dplyr::select(X6 = X2, X7 = X4)
    ) %>% 
    tidyr::separate(
      col = "X3", 
      into = c("home_score", "away_score"), 
      sep = "–"
    ) %>% 
    dplyr::mutate(
      home_score = as.numeric(home_score), 
      away_score = as.numeric(away_score)
    ) %>% 
    dplyr::rename(
      datetime = X1, 
      home = X2, 
      away = X4, 
      match_info = X5, 
      home_comments = X6, 
      away_comments = X7
    )
  
  return(games)
}

The output should look something like this:

Screenshot by the author; RStudio console output of the tibble containing the scraped EPL matches.

Before I can make my analysis, I need to clean the data. A process I won’t go into details with here, but generally, there are two caveats with this data:

I need to extract the goal-scoring minutes from the home_comments and away_comments columns, e.g. using stringr::str_match_all('[0-9]+') or similar, and also deal with how we want a stoppage-time goal represented (shown as e.g. 45+1, 90+3 in the data). In my case, I chose to represent second-half stoppage-time goals by simply adding the numbers, e.g. 93, and first-half stoppage-time goals by adding the numbers and adding 100, e.g. 146. In my later analysis, I have filtered out most of these stoppage-time goals, due to not all football matches even have that many playable minutes, but I wanted to be able to visualize them all if needed.
Unfortunately, the times for a red card are also shown in the comments columns, and thus one should perform a sanity check by comparing the score columns and the dimension of the extracted goal-times and manually (programmatically?) fix it — sorry!

Distribution of Goals

Let us now investigate the empirical distributions of EPL goals and see if I can obtain similar characteristics with the distributions of the proposed models, i.e. the probability mass function (pmf) arising from the counting processes via N₁ - N₀. I have included all EPL matches from August 2004 through May 2019 in my analysis.

Please note that if each individual match was to be played over and over, we may see characteristics that are very different from the patterns shown here. Such scenarios can obviously never be observed in the real world, therefore, the only data I have available is the collective of these matches played only once. Therefore, to make general assumptions of an EPL game, I implicitly assume that all games follow the same patterns and have the same characteristics as the masses. This is likely not applicable, and I should be weary in making too strong conclusions for each individual game based on the general patterns.

Histogram of the number of goals in all EPL matches between August 2004 and May 2019 split by home and away team.

It is clear from the histograms that the home teams tend to score more goals than the away team, which also coincides with general football knowledge of a home advantage. I also calculated the sample mean and variance for the EPL goals:

Sample means and variances of the EPL football data for each side.

When dealing with how well the Poisson distribution describes the empirical distribution, it is clear that there is overdispersion in the empirical data, i.e. that the sample variances are larger than the sample means. This is not an apparent feature of the Poisson distribution.

The Weibull count model has two parameters, the scale parameter α, and the shape parameter β. It encapsulates the Poisson distribution with β = 1. I should thus expect a better, or at least not worse, fit from the Weibull count model. I note that β < 1 corresponds to overdispersion and β > 1 corresponds to underdispersion.

Maximum likelihood estimates for the parameters of the Weibull count model for each side.

(Left) Goal histograms with fitted Poisson pmf. (Right) Goal histograms with fitted Weibull count model pmf.

From the plots with the fitted probability mass functions, I clearly see that the Poisson distribution has obvious flaws, but still provides a somewhat decent approximation, which is likely why many still use it to this day. On the other hand, we see a much better fit for the Weibull count model in regards to the goal distributions.

Goal Differences

In addition to presenting the two distributions individually, I’ll also present the goal difference distribution. By examining the goal differences, we can get an impression of possible correlations in the scores. If no correlation exists the goal differences should simply exhibit the same patterns as that of the difference of the individual distributions.

Let us now see how the empirical goal differences stack up with the proposed distributions. First, let’s note that the distribution of the difference between two Poisson distributed random variables follows a Skellam distribution. The distribution of the difference between two Weibull count model distributed random variable have not been specified to the best of my knowledge, but I can use Monte Carlo simulation to determine the distribution.

Histograms of goal differences with theoretical probability mass functions of the differences of two random variables with Poisson and Weibull count model distribution, respectively.

It looks like the Skellam fit is slightly more accurate than what appears from the initial Poisson fits of the scores, and contrary, the Weibull Count difference seems slightly worse than what we encountered before.

Goal Intensity

Let us now examine the empirical goal intensity by analyzing the distribution of EPL goal times and compare this to the theoretical distribution using a specified intensity of the two proposed Weibull-based counting processes. This is where the goal minutes of the EPL matches will be used.

Since I haven’t obtained records of the exact game length of every match, I removed all entries of goals scored in the first half’s stoppage time and all goals scored after the 93rd-minute mark in this analysis in order to obtain fairly consistent data.

Histograms of goal times for each side in all EPL matches between August 2004 and May 2019 with a binwidth of 3 and overlain kernel density estimations (orange line).

I’ll start by presenting the results for the Weibull process. The end goal is to evaluate how well a theoretical goal time distribution originating from a Weibull process can be fitted to the observed goal times, with the added restriction that it must also be consistent with the score distribution above. In other words, assuming that the end of the game is at t = 1, I must have that λ₁ = αβ1^{β - 1} = 1.536491 for the home team, and λ₁ = αβ1^{β - 1}= 1.138246 for the away team. By solving this optimization problem I get the following parameters:

Fitted parameters of the Weibull intensity in the Weibull process for each side for the goal time distribution.

Fitted theoretical goal time distributions (dotted line) of mean-valued restricted Weibull process intensities to the empirical goal time distributions (solid orange line) using a root-mean-squared error optimization function.

This yields promising results for the Weibull process, i.e. the Weibull intensities are able to show similar features as the empirical goal time densities, and also have parameters that are consistent with the parameters obtained from the (Poisson) score distributions.

Next up, is the Weibull renewal process. Again, I want to enforce the consistency of the parameters in accordance with the score distributions in the fitting process, which in turn means that I’ll use the parameters presented for the Weibull count model score distribution. Doing so yields not-so-promising results for the Weibull renewal process.

The parameters of the Weibull renewal process leading to the score distributions presented above are not consistent with the empirical goal time distributions. In fact, the shape parameters of a Weibull renewal process that leads to overdispersed score distributions cannot be persistent with an increasing goal intensity distribution.

(Left) Theoretical goal time distributions of Weibull renewal processes consistent with the Weibull count model score distributions against the empirical goal time distributions. (Right) Theoretical goal time distributions of Weibull renewal processes with non-consistent dispersion parameters against the empirical goal time distribution.

Despite this seemingly major drawback, I’m not ready to discard the Weibull renewal process just yet. As mentioned earlier, one could imagine a specific game played over and over thousands of times that would result in underdispersed score distributions and still have increasing goal intensities. In such a case, the Weibull renewal process could be an appropriate modeling choice.

Doing so, along with some other technical restrictions on the mean value and the score distribution, I show that the Weibull renewal process can also display fairly promising characteristics for the goal intensities.

Waiting Times of Goals

Lastly, I want to explore the distributions of waiting times for the goals in the EPL and compare them to the theoretical distributions of the two types of processes. For simplicity, I only focus on the distributions of the first two goals’ waiting times.

Histogram of waiting times of the first goal in gray and waiting times of the second goal in pink for each side in all English Premier League matches played between August 2004 and May 2019. The histograms have a binwidth of 3 and overlain kernel density estimations in black and purple for the waiting times of the first and second goals, respectively.

The empirical distribution of waiting times across teams seem to be fairly similar, with a slight difference in the waiting times of the first goal, where the mass of the density for the home side appears to be further to the left, i.e., on average, a faster first goal is observed for the home team. Both sides show similar patterns for the waiting times of the second goal, with much of the mass centered at the beginning. This is not surprising since a football match is roughly the same length every time, therefore, for the second goal to be scored with a waiting time of ~90 minutes, would mean that the first goal needs to be scored very fast. This time truncation of the point processes also comes into play when we need to evaluate the theoretical densities. An example of this is the (Weibull) renewal process which should have equal (Weibull) densities of each waiting time, however, this is if the point process is continued in perpetuity. This is not the case and instead, I have used Monte Carlo simulation the simulate the emerging distributions when including this time truncation, aka the end of the football game, in the point processes.

(Left) Dotted lines: Simulated theoretical densities, for each side, of the waiting times in Weibull processes with parameters given above. The waiting time density of the first goal is shown in black and the second goal in purple. Solid lines: Kernel density estimations of the empirical waiting times of the first, in black, and second goal, in purple, respectively, for each side. (Right) Same but for the Weibull renewal process with non-consistent dispersion parameters.

The general patterns show many similarities between the theoretical and the empirical densities for both models. On closer inspection, I note that the estimated kernel densities of the waiting times of the first goal (in black) and the theoretical from the Weibull process seem very consistent across sides with the general shape being fairly well explained. There are some differences in the beginning, especially for the home team, and in the end for both teams. The endings are likely because of the simulation of the theoretical densities, where I ended every game at the 93rd-minute mark, in real life, there is some significant variance in the ending minute. The reason for the slight right skewness at the beginning of the densities is most likely due to the celebratory period right after a goal is scored.

Looking at the Weibull renewal process, I notice that for the waiting times of the first goal, the process tends to overestimate the number of fast goals and underestimate the number of late goals. It is, however, noteworthy that the large probability mass centered to the left in the observed distribution for the waiting times of the second goal, i.e. a fast second goal, is not captured well by the Weibull renewal process. This is due to the fact that the renewal assumption implies that a team completely starts over after scoring a goal, when in fact a team will generally ride a wave of confidence right after scoring a goal.

Conclusion

My analysis has, in terms, expanded the model checking procedure when it comes to finding distributions and/or point processes to model football goals. My findings indicate that both types of processes, and hence distributions, that I have focused on seem to have useful properties, in regard to describing EPL football goals, that were not previously highlighted. But it also shows some serious limitations for modeling (in-play) football. Especially, the distribution of goals for the Weibull process and the renewal assumption embedded in the Weibull renewal process should be highlighted. EPL football goals are simply not consistent with the Poisson distribution, and even though the pause after a goal (celebration and kickoff) makes it practically impossible to score in the minute after another goal, the renewal assumption does not seem to be present in football goals. Despite these drawbacks, both (types of) models should not be totally abanded and can still teach us a great deal about modeling football goals. In the words of many great statisticians, let me finish with a fantastic quote, generally accredited to George Box, that states:

All models are wrong, but some are useful.

References

[1] Mark J. Dixon and Stuart G. Coles, “Modelling association football scores and inefficiencies in the football betting market” (1997), Journal of the Royal Statistical Society: Series C (Applied Statistics) 46.2, pp. 265–280.

[2] Michael J. Maher, “Modelling association football scores” (1982), Statistica Neerlandica 36.3, pp. 109–118.

[3] Georgi Boshnakov, Tarak Kharrat, and Ian G. McHale, “A Bivariate Weibull Count Model for Forecasting Association Football Scores”(2017), International Journal of Forecasting 33.2, pp. 458–466.

[4] Peter Divos, et al., “Risk-Neutral Pricing and Hedging of In-Play Football Bets” (2018), Applied Mathematical Finance 24.4, pp. 315–335.

[5] Karl Sigman, “IEOR 6711: Notes on the Poisson Process” (2009), IEOR Columbia - Columbia University, Lecture Notes.

[6] Tomas Björk, “An Introduction to Point Processes from a Martingale Point of View” (2011), KTH, Lecture Notes.

[7] Blake McShane, et al., “Count Models Based on Weibull Interarrival Times” (2008), Journal of Business & Economic Statistics 26.3, pp. 369–378.

[8] FANDOM, “The Football Database Wiki” (Accessed 2021), https://football.fandom.com/wiki/Football_Wiki.

[9] R Core Team, “R: A language and environment for statistical computing” (2021), R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[10] Wickham et al., “Welcome to the tidyverse” (2019), Journal of Open Source Software, 4(43), 1686.