Working With the FiveThirtyEight Soccer Data

Published in Analytics Vidhya, Dec 11, 2019 (7 min read)

Recently, FiveThirtyEight made some of the data behind its articles available to the public. In this post, I explore the 538 club soccer data using Python.

Background

For the uninitiated, FiveThirtyEight (from here on, 538) is a publication whose coverage focuses on sports and politics, with some additional coverage of other areas such as economics and entertainment. What sets 538 apart is its in-depth use of data to support arguments and predictions. 538 is owned by ABC News, which is, in turn, owned by Disney.

I will be looking at how top clubs’ overall strength changed over time, and at how scoring differs across the top five leagues in Europe. To accomplish this, I will be using both pandas and seaborn.

In soccer, there are three possible results for each team: a win, a draw or a loss. A win is worth three points, a draw gives each team one point, and a loss is worth zero points. Over the course of a season, each team plays every other team in its league twice, and the league champion is whichever team earns the most points over the entire season.
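To make the points system concrete, here is a minimal sketch that adds up a season's points. The 38-match record is invented for illustration, and `season_points` is a hypothetical helper, not something from the 538 data.

```python
# Points awarded per result under the usual league scoring rule.
POINTS = {"win": 3, "draw": 1, "loss": 0}


def season_points(results):
    """Return total league points for a sequence of 'win'/'draw'/'loss' strings."""
    return sum(POINTS[result] for result in results)


# Made-up 38-match season: 25 wins, 8 draws, 5 losses.
example_season = ["win"] * 25 + ["draw"] * 8 + ["loss"] * 5
total = season_points(example_season)  # 25 * 3 + 8 * 1 + 5 * 0
print(total)
```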

Metrics

538 uses several different metrics to track performance and make predictions. The main one is SPI (Soccer Power Index). SPI is built from two ratings: the number of goals a team would be expected to score, and the number it would be expected to concede, against an average team at a neutral venue. According to 538’s description of the model, SPI is then the percentage of available points (three for a win, one for a draw) the team would be expected to take against that average opponent, given its expected goals for and against. A top European team would have an SPI above 70, meaning it would be expected to take more than 70% of the available points against an average team at a neutral location. SPI lets us compare two teams on general quality, but it does not take into account factors such as injuries or location. Throughout the season, a team’s SPI changes with its performance, specifically with a weighted measure of goals and the overall result.
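To get a feel for how a single rating can come out of expected goals for and against, here is a rough sketch. It assumes goals for each side follow independent Poisson distributions, which is my simplification and not 538's actual model. The function name `rating_sketch` and the input values are made up for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def rating_sketch(goals_for, goals_against, max_goals=15):
    """Return (win percentage, percentage of available points) against an
    average opponent, ASSUMING independent Poisson goal counts.
    This is an illustration only, not 538's published calculation."""
    p_win = 0.0
    p_draw = 0.0
    for scored in range(max_goals + 1):
        for conceded in range(max_goals + 1):
            p = poisson_pmf(scored, goals_for) * poisson_pmf(conceded, goals_against)
            if scored > conceded:
                p_win += p
            elif scored == conceded:
                p_draw += p
    # Three points for a win and one for a draw, out of three available.
    points_share = (3 * p_win + p_draw) / 3
    return 100 * p_win, 100 * points_share


# Made-up inputs: an evenly matched side and a much stronger side.
even = rating_sketch(1.0, 1.0)
strong = rating_sketch(2.5, 0.5)
print(even, strong)
```

Even under this simplified model, the win percentage and the share of points differ noticeably once draws are counted, so it matters which of the two a rating is meant to represent.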

The Data

After loading up the data, the first thing I did was check to see how big the data sets are and what values they contain.

As you can see, there is a lot of information included in the data.

In terms of size, the data covers 32,290 matches and 629 clubs.
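A check along these lines might look like the sketch below. The rows are made-up placeholders standing in for the real matches file (which would be read with `pd.read_csv` on the downloaded CSV), and the column names are assumptions about the file layout.

```python
import io

import pandas as pd

# Made-up stand-in rows; with the real data this would be something like
# pd.read_csv("spi_matches.csv"). The column names are assumptions.
sample_csv = io.StringIO(
    "date,league,team1,team2,spi1,spi2,score1,score2\n"
    "2019-08-10,League A,Club A,Club B,80.0,60.0,3,1\n"
    "2019-08-17,League A,Club B,Club C,61.0,55.0,1,1\n"
    "2019-08-24,League B,Club D,Club A,70.0,81.0,0,2\n"
)
matches = pd.read_csv(sample_csv, parse_dates=["date"])

# Size and contents of the table.
n_matches, n_columns = matches.shape
column_names = matches.columns.tolist()

# A club can appear as either team1 or team2, so count unique names across both.
n_clubs = pd.concat([matches["team1"], matches["team2"]]).nunique()

print(f"{n_matches} matches, {n_clubs} clubs, columns: {column_names}")
```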

Questions

Next, I needed to come up with some questions that I wanted to explore. After some thinking, I decided I wanted to look into how a team’s SPI changed over time. I also wanted to compare different leagues based on SPI, projected goals scored, and actual goals scored.

Question 1: How does SPI change over time?

For this question, I decided to look at the top ten clubs by SPI (as of December 2, 2019), and then also look at a slice of mid-range clubs. Here is the list of top ten clubs:

I expect to see a lot of these clubs staying near the top over time, although a few might see some fluctuation.
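Picking clubs by rank can be sketched as follows. The table holds twelve placeholder clubs with invented ratings, and the `name`, `spi` and `rank` column names are assumptions, so treat this as an illustration rather than my exact code.

```python
import pandas as pd

# Hypothetical rankings table: twelve placeholder clubs with made-up ratings.
rankings = pd.DataFrame({
    "name": [f"Club {letter}" for letter in "ABCDEFGHIJKL"],
    "spi": [93.1, 91.4, 90.2, 88.7, 87.5, 86.9, 85.0, 84.2, 83.6, 82.8, 79.3, 71.0],
})

# Order by rating and attach a 1-based rank.
rankings = rankings.sort_values("spi", ascending=False).reset_index(drop=True)
rankings["rank"] = rankings.index + 1

# The ten highest-rated clubs.
top_ten = rankings.head(10)

# A slice further down the table is selected by rank in the same way
# (ranks 11 to 12 here only because the toy table is so short).
lower_slice = rankings[rankings["rank"].between(11, 12)]

print(top_ten["name"].tolist())
print(lower_slice["name"].tolist())
```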

SPI over time for the top 10 clubs in Europe

While this graph is fairly messy, there are some general conclusions we can draw from it. Since the lowest SPI value is still above 65, we can conclude that the current top teams in Europe have been consistently good for the last several years (the match data goes back about three and a half years). Also, the best of the best, i.e. Man City, Barcelona and Real Madrid, have stayed at an elite level, as shown by their SPI never dropping below 85.
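A graph like this needs one row per club per match. Here is a sketch of one way to build that table and plot it, assuming the matches table stores each side's rating in `spi1` and `spi2` (assumed column names) and using made-up rows. `spi_over_time` and `plot_spi` are hypothetical helpers.

```python
import pandas as pd


def spi_over_time(matches, clubs):
    """Return one row per (date, team) with the SPI listed for that match."""
    home = matches[["date", "team1", "spi1"]].rename(
        columns={"team1": "team", "spi1": "spi"}
    )
    away = matches[["date", "team2", "spi2"]].rename(
        columns={"team2": "team", "spi2": "spi"}
    )
    stacked = pd.concat([home, away], ignore_index=True)
    stacked = stacked[stacked["team"].isin(clubs)]
    return stacked.sort_values(["team", "date"]).reset_index(drop=True)


def plot_spi(spi_long):
    """Draw one line per club; plotting libraries are imported here so the
    reshaping above can run without them."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.lineplot(data=spi_long, x="date", y="spi", hue="team")
    ax.set(xlabel="Match date", ylabel="SPI")
    plt.show()


# Made-up matches between placeholder clubs.
matches = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-10", "2019-08-17", "2019-08-24"]),
    "team1": ["Club A", "Club B", "Club D"],
    "team2": ["Club B", "Club C", "Club A"],
    "spi1": [80.0, 61.0, 70.0],
    "spi2": [60.0, 55.0, 81.0],
})
spi_long = spi_over_time(matches, ["Club A", "Club B"])
print(spi_long)
```

Calling `plot_spi(spi_long)` then draws the lines, with one hue per club.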

Since ten teams were a lot to take in, I decided to use just five mid-level clubs to make the graph easier to interpret.

While there are 629 total clubs, I decided to define ‘mid-range’ as clubs that are considered fairly strong in their domestic league, but not strong enough to compete with the top teams on the continental stage. I also decided to find a group made up only of European clubs, so we can compare them more fairly to the top ten, who are all from Europe. After some trial and error, I decided to use the teams ranked from 159 to 163. Here they are:

I expect that there will be more variation than the top ten clubs, because, logically, an average team would have more unpredictable results, so their expected result should also be more unpredictable.

SPI over time for selected mid-range clubs.

This time, there is clearly more variation, with Genk occasionally breaking the 70-SPI threshold while also falling below 55 at times. There is also more separation between these five teams than between the top ten teams. There is some consistency, however. Udinese stayed within a range of approximately 13 SPI points, and if you take out a couple of outliers, that range drops to about 10 SPI points. This kind of consistency is on par with several elite teams, although the elite teams’ ranges sit about 20 points higher on average. Another noteworthy comparison is that the peak rating for several of the mid-range clubs is equal to or higher than the minimum rating for RB Leipzig, the tenth-ranked team at the top. This demonstrates that the top ten clubs in Europe aren’t impervious to outsiders, and one of the mid-range teams could find themselves near the top if they experience a good year or two.
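The ranges and the peak-versus-minimum comparison can be computed rather than read off the graph. Below is a sketch using invented ratings for two placeholder clubs, with assumed `team` and `spi` column names.

```python
import pandas as pd

# Made-up ratings: one placeholder mid-range club and one placeholder top club.
spi_long = pd.DataFrame({
    "team": ["Mid Club"] * 3 + ["Top Club"] * 3,
    "spi": [55.0, 62.5, 70.5, 68.0, 75.0, 80.0],
})

# Lowest and highest rating per club, and the spread between them.
summary = spi_long.groupby("team")["spi"].agg(["min", "max"])
summary["range"] = summary["max"] - summary["min"]

# Does the mid-range club's peak reach the top club's lowest rating?
mid_peak_reaches_top = summary.loc["Mid Club", "max"] >= summary.loc["Top Club", "min"]

print(summary)
print(mid_peak_reaches_top)
```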

Question 2: Comparing different leagues

For this question, I first tried using only the top 50 ranked teams (as of December 2, 2019). While I thought these teams would work well, I found that they came from too many different leagues. It was difficult to glean any useful information from the graphs because there was so much packed into one pair plot.

My first attempt at plotting the top 50 teams.

As you can see, this pair plot takes too much close reading to be effective. However, there is some information to be found here. For instance, we can see that, generally speaking, teams score fairly close to their expected goals. We can infer this because the graphs generally hold the same shape, especially around the extremes. Something to note is that the projected goals range from 1 to 2.75, while the actual goals range from 1 to 3. This is fairly significant, although it is less relevant to this question, as I am only looking to compare different leagues rather than evaluate how accurate the predictions are.

To make a more cohesive visualization, I decided to limit the teams to members of the top 5 leagues, as opposed to the top 50 teams overall. While this adds more teams in total, it also produces a much clearer pair plot that is easier to read.
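As a sketch of that filtering step, the code below averages each team's rating, projected goals and actual goals within a chosen set of leagues, then hands the result to seaborn's `pairplot`. The rows are made up, the league names are placeholders, and the column names (`spi1`, `proj_score1`, `score1`, and so on) are assumptions about the matches file. `team_averages` and `plot_league_pairs` are hypothetical helpers.

```python
import pandas as pd


def team_averages(matches, leagues):
    """Average SPI, projected goals and actual goals per team,
    keeping only matches from the given leagues."""
    cols = ["league", "team", "spi", "proj_score", "score"]
    home = matches[["league", "team1", "spi1", "proj_score1", "score1"]].copy()
    away = matches[["league", "team2", "spi2", "proj_score2", "score2"]].copy()
    home.columns = cols
    away.columns = cols
    stacked = pd.concat([home, away], ignore_index=True)
    stacked = stacked[stacked["league"].isin(leagues)]
    return stacked.groupby(["league", "team"], as_index=False)[
        ["spi", "proj_score", "score"]
    ].mean()


def plot_league_pairs(per_team):
    """Pair plot with one hue per league; seaborn is imported here so the
    data preparation above can run without it."""
    import seaborn as sns

    return sns.pairplot(per_team, vars=["spi", "proj_score", "score"], hue="league")


# Made-up matches; "League C" is deliberately left out of the selection.
matches = pd.DataFrame({
    "league": ["League A", "League A", "League C"],
    "team1": ["Club A1", "Club A2", "Club C1"],
    "team2": ["Club A2", "Club A1", "Club C2"],
    "spi1": [80.0, 62.0, 50.0],
    "spi2": [60.0, 82.0, 45.0],
    "proj_score1": [2.0, 1.0, 1.5],
    "proj_score2": [1.0, 2.0, 1.5],
    "score1": [3, 0, 1],
    "score2": [1, 2, 1],
})
per_team = team_averages(matches, ["League A", "League B"])
print(per_team)
```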

As you can see, these graphs are still cluttered, but it is easier to understand because there are only 5 different hues among the data points. I also included a column for the leagues separately, which makes it easier to see a direct comparison between leagues.

From this pair plot, it is clear that Italian teams score fewer goals on average, while German teams score more on average.
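A comparison like this can also be backed with a per-league average. The sketch below uses invented per-team numbers and placeholder league names, with assumed column names, so it only illustrates the calculation.

```python
import pandas as pd

# Made-up per-team averages for two placeholder leagues.
per_team = pd.DataFrame({
    "league": ["League A", "League A", "League B", "League B"],
    "team": ["Club A1", "Club A2", "Club B1", "Club B2"],
    "proj_score": [1.2, 1.4, 1.6, 2.0],
    "score": [1.0, 1.4, 1.5, 2.1],
})

# Mean projected and actual goals per league, lowest-scoring league first.
league_means = (
    per_team.groupby("league")[["proj_score", "score"]].mean().sort_values("score")
)
lowest_scoring = league_means.index[0]

print(league_means)
print(lowest_scoring)
```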

There are a couple of conclusions that could be drawn from this. For instance, Italian teams scoring less than teams in other leagues could mean that Italian teams are better defensively. It could also mean that Italian teams are just worse offensively. It could even be a mixture of both: Italian teams are really good on defense and really mediocre on offense. The reverse may be true for the German teams: they are better on offense and worse on defense, or at least comparatively so.

To me, the Spanish league is the most interesting. Almost all of the teams are clustered fairly low for expected goals and actual goals, but there are two teams that are among the top scoring teams overall. This could be a result of the top teams skewing the data by dominating all the other teams so much, or they could just be outliers that don’t impact the other teams so much.

There is a similar phenomenon in Germany and France, although there the dominance in each league comes from one team as opposed to two. Another thing to note is that Italy is the only league where no team is noticeably better at scoring goals than the rest of the league.

Conclusion

My goal in working with this data was to get some more practice with pandas and seaborn, so from that perspective, this project was a complete success. In the future I look forward to exploring more of the data used by 538 in conjunction with the statistics and programming skills that I am currently learning.