Messi vs Ronaldo (vs the world), data science edition

Elior Cohen
Oct 12, 2017


Nowadays it’s easy to forget that data science is not all about machine/deep learning.
While AI is awesome, data science is for the most part a practice that exists to better understand real phenomena.

Besides being a data scientist, I am also a sports fan.
One thing that drives me crazy is the misuse of data and statistics in sports.
Very often you see assumptions built on irrelevant facts, and players/teams compared using very weak statistics.

For a while now, I have wanted to create a measure for comparing goals in a soccer match.
Simply counting who has the most goals is just plain wrong.
A goal scored in the 90th minute when the scoreboard shows 1–1 is far more valuable than a goal scored in the same minute when leading 4–0.

I have put a lot of time and effort into coming up with a way to measure the significance of a goal, and finally established what I call Relative Goal Value v1.0 (referred to from now on as RGV1).
The elements RGV1 takes into consideration are:
1. The time the goal was scored
2. The team the goal was scored against
3. Home / away goal
4. The score of the game when the goal was scored
I have chosen not to treat penalties differently.

In this post, I’ll explain the RGV1 scoring system and use it to compare Lionel Messi to Cristiano Ronaldo and to the top 50 scorers (by RGV1) in the 5 major leagues.

RGV1 Scoring System (TL;DR)

Before we use RGV1 to compare players’ goal scoring, let’s understand what it is about.
This is the TL;DR version. Assuming most readers will not want to go into the equations, this part explains the essence of the scoring system; the full equations are at the end of the post.

Disclaimer: While RGV1 tends to track the points won for the team, it is not directly tied to them. RGV1 DOES NOT measure how many points a player won for the team, but rather calculates a more refined value for each goal.

The scoring is built in the following manner:

The most important and most complex element is the game state value.
The range of the game state value depends on the current score and the time left to play.
When the game is tied, the value of a goal rises exponentially from 1 to 3 according to the minute of the game.
When leading, the value of a goal drops exponentially as time advances, and the range depends on how large the lead is.
When trailing, the value behaves as it does when leading, but on a smaller scale.

The logic behind the game state value is that:
- Goal scored on a tie > goal scored when behind > goal scored when leading
- On a tie, the later the goal, the higher the value (a goal scored on a tie in the 20th minute is worth less than a goal scored on a tie in the 90th minute)
- When leading, increasing the lead earlier is better
- When trailing, decreasing the opponent’s lead earlier is better
Before deciding on these 4 points and how they relate to one another, I consulted many friends, some of them experts in the field, in order to be as accurate as possible.

Below is a plot of the game state value:

Then, the game state value is multiplied by the team quality multiplier, which ranges from roughly 0.68 to 1, depending on the opponent’s standing at the end of the season (a measure of team quality).
Finally, this is multiplied by 1 for an away goal or 0.9 for a home goal.
A perfect score of 3 is achieved by scoring a winning goal in the 90th minute of an away game against the team that finished the season in the top spot.
The lowest possible score is achieved by scoring in a home game when leading by 3+ in the 90th minute against the team that finished the season last.
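As a rough illustration of how the three factors combine, here is a minimal Python sketch. The function name `rgv1` and its parameter names are my own for this example and are not taken from the original implementation; the game state value and team quality multiplier are treated as already-computed inputs.

```python
def rgv1(game_state_value: float, team_quality: float, is_home: bool) -> float:
    """Combine the three factors described above into a single goal score."""
    # Home goals are discounted to 0.9; away goals keep the full value.
    home_or_away = 0.9 if is_home else 1.0
    return game_state_value * team_quality * home_or_away


# Late goal while tied, scored away against the eventual league winner.
best = rgv1(game_state_value=3.0, team_quality=1.0, is_home=False)

# The same goal scored at home is worth slightly less.
at_home = rgv1(game_state_value=3.0, team_quality=1.0, is_home=True)

print(best, round(at_home, 2))
```

Because every factor is a plain multiplier, a goal can only reach the maximum of 3 when all three factors are at their highest at once.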

Before we go on to the comparison, some examples of scores:
1. In La Liga’s 2016–2017 season, the goal with the highest score is Lionel Messi’s goal at the Bernabeu in the 92nd minute, when the game was tied 2–2 (a perfect score of 3)
2. In La Liga’s 2016–2017 season, the goal with the lowest score is Tiago’s goal for Atletico Madrid at home against Granada in the 87th minute, when leading 6–1 (a score of 0.231)

Below is the distribution of the RGV1 scores of all players in La Liga for 2009–2016:

Messi vs Ronaldo

Now let’s get to the interesting part.
A lot has been said about these two, and while in other areas of the game it is quite clear who is better at what, their goal scoring is constantly compared.
The data we’ll be comparing covers only La Liga goals, starting from 2009 (when Ronaldo arrived at Real Madrid).

First, let’s see what their overall RGV1 distributions look like

Well, not so surprising… In numbers, this plot is (Messi/Ronaldo)
Mean: 0.950 / 0.943 (higher is better)
Standard deviation: 0.547 / 0.485
25th percentile: 0.461 / 0.578
50th percentile: 0.854 / 0.861
75th percentile: 1.232 / 1.246
Minimum: 0.226 / 0.233
Maximum: 3.000 / 2.855

Looking at Ronaldo’s and Messi’s most important goals (maximum RGV1), interestingly, both were scored in April, one year apart.
Messi: the winning goal in the 92nd minute at the Bernabeu, when the game was tied 2–2 against Real Madrid, who won the league title that season.
Ronaldo: the winning goal in the 85th minute at the Camp Nou, when the game was tied 1–1 against Barcelona, who won the league title that season.

Moving forward, let’s see their overall contribution, meaning the sum of all RGV1 from 2009 to 2016

Messi has accumulated a total RGV1 of 271.629 in 266 appearances and Ronaldo a total of 260.228 in 254, making Messi’s average RGV1 per appearance 1.021 and Ronaldo’s 1.024.
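The per-appearance averages are simply each total divided by the number of appearances. A short Python check using only the figures quoted above (the variable names are mine):

```python
# (total RGV1, appearances) as quoted in the text
totals = {"Messi": (271.629, 266), "Ronaldo": (260.228, 254)}

# Average RGV1 per appearance for each player
per_appearance = {name: total / apps for name, (total, apps) in totals.items()}

for name, value in per_appearance.items():
    print(f"{name}: {value:.4f}")
```

The two averages differ only in the third decimal place, which is what makes the comparison so close.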

Let’s now look at RGV1 per season, starting with the total RGV1 per season.

It is interesting to see in the graph that the seasons are split evenly between them, each one taking the top spot for 4 seasons.

Now, it is tempting to look at the average RGV1 per season.
But the truth is that this is a bad metric: if the two had scored exactly the same goals, but one of them had scored an extra goal with a low value, he would have a worse average even though he performed better.
Instead we will look at a ‘fixed average’, which is each player’s total RGV1 divided by the average of the two players’ goal counts in the same season.
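To make the difference between the two averages concrete, here is a small Python sketch. The season numbers are invented purely for illustration, and the function name `fixed_average` is my own.

```python
def fixed_average(total_rgv1: float, goals_a: int, goals_b: int) -> float:
    """Total RGV1 divided by the mean goal count of the two players."""
    mean_goals = (goals_a + goals_b) / 2
    return total_rgv1 / mean_goals


# Hypothetical season: player A scored the same goals as player B,
# plus one extra low-value goal worth 0.3.
a_total, a_goals = 30.3, 31
b_total, b_goals = 30.0, 30

# Plain average penalises A for the extra goal.
plain_a = a_total / a_goals
plain_b = b_total / b_goals

# Fixed average uses a shared denominator, so A comes out ahead.
fixed_a = fixed_average(a_total, a_goals, b_goals)
fixed_b = fixed_average(b_total, a_goals, b_goals)

print(f"plain: A={plain_a:.3f} B={plain_b:.3f}")
print(f"fixed: A={fixed_a:.3f} B={fixed_b:.3f}")
```

With a shared denominator, the ranking always follows the total RGV1, while the number stays on the familiar per-goal scale.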

Here too we can see that the lead changes hands evenly; Ronaldo displays better stability throughout the years, while Messi’s peak performance exceeds Ronaldo’s.

Since the most critical aspect of RGV1 scoring is the game state value, let’s see how each player’s goals are distributed across game states and minutes.
First, by the scoreboard status

It is simply amazing to see that across 8 seasons, Messi and Cristiano have scored the same number of goals as each other both when 1 behind and when the game is tied.
Notice that they both score when the game is tied more than in any other score situation, which says a lot about their contribution to their teams at the most important point of the game.

Now let’s look at how their goals are distributed across minutes:

Here we can see that Ronaldo’s distribution is quite uniform, while Messi prefers the second half.

I have to say that when I started this project, I knew they were both phenomenal goal scorers, but I had hoped to see one of them stand out.
As the data tells us, there is not much difference between the two, and the mystery of who is the better goal scorer is left unsolved….

But how do they stack against the rest of the goal scorers?

Messi and Ronaldo Against the World

Without further ado, let’s look at the totals of the top 50 RGV1-ranked scorers in the period from 2009–2010 to 2016–2017

Notice that apart from Messi and Ronaldo, there are only pure strikers in the top 15.
It is pretty clear that these two stand out from the crowd, as Ibrahimovic, who is the closest, has a total of 182.788, which is roughly 78 RGV1 points behind Ronaldo and roughly 90 behind Messi.

This plot also shows that counting goals and counting RGV1 are two different things. For example, Lewandowski has scored many more goals than Di Natale, while Di Natale has created more value for his club.
Another great thing to see is that Ibrahimovic, Higuain and Cavani have scored lots of goals and also accumulated high RGV1, showing how significant they are for their clubs.
You be the judge, but I believe RGV1 reflects a player’s value for his club better than goal counts do.

Let’s see how the top 10 have performed throughout the years:

This plot could be missing players who had individual great years, since it shows the top 10 over all the seasons combined.
Below is a graph for each season separately, plotting Messi and Ronaldo against the top 25 RGV1 scorers of that season.

Judging by the graphs, it is simply amazing what phenomenal scorers Messi and Ronaldo truly are and how consistent their dominance has been.
Lots of scorers have emerged in those 8 years, but none have managed to reach Messi’s and Ronaldo’s peak performance or to sustain their performance for such a long period.

To wrap it up, I have included below a table for each season, with the top 10 players of that season.

Before going to the data itself, here is the number of times players have appeared in the top 5 of a season:
Messi: 7
Ronaldo: 6
Ibrahimovic: 4
Milito, Lewandowski, Suarez, Cavani, van Persie: 2

2009–2010

2010–2011

2011–2012

2012–2013

2013–2014

2014–2015

2015–2016

2016–2017

RGV1 Scoring System

If you got this far, I salute you.
This part is dedicated to the equations of the RGV1 scoring system.

Let’s remind ourselves what RGV1 is made of:
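Going by the description earlier in the post (the game state value multiplied by the two multipliers), the composition can be written as follows. This is a reconstruction from the text rather than a copy of the original figure:

```latex
\mathrm{RGV1} = \mathrm{GameStateValue} \times \mathrm{TeamQualityMultiplier} \times \mathrm{HomeOrAwayGoalMultiplier}
```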

Where TeamQualityMultiplier ranges from roughly 0.68 to 1, and is calculated in the following manner for a 20-team league table:

Where s decreases linearly from 1 to 0 with the final league position, scaled to the number of teams in the league, so that the team that finished first gets 1 and the team that finished last gets 0.
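The linear score s can be sketched in Python as below. The function name `standing_score` and the exact expression `(n_teams - position) / (n_teams - 1)` are my reading of the description; the step that turns s into the final multiplier is not reproduced here.

```python
def standing_score(position: int, n_teams: int = 20) -> float:
    """Map a final league position (1 = champion) to s between 1 and 0."""
    if not 1 <= position <= n_teams:
        raise ValueError("position must be between 1 and n_teams")
    # First place gives 1.0, last place gives 0.0, linear in between.
    return (n_teams - position) / (n_teams - 1)


# 20-team league (e.g. La Liga) and an 18-team league
print(standing_score(1), standing_score(20))
print(standing_score(1, n_teams=18), standing_score(18, n_teams=18))
```

Scaling by the number of teams keeps s comparable between leagues of different sizes.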

Next is the HomeOrAwayGoalMultiplier which is set to 0.9 for home games, and 1 for away games.

Last but certainly not least is the game state value.
Game state value behaves differently when the game is tied and when one team is ahead.
Equation for tied situation:

Where m increases linearly from 0 to log(3) with the minute the goal was scored in, so that the 1st minute gives 0 and the last minute gives log(3)
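Since the tied-game value is described as rising exponentially from 1 to 3 while m goes from 0 to log(3), one form consistent with that description is exp(m). The Python sketch below assumes exactly that, and also assumes that the last minute is the 90th and that stoppage-time goals are treated as 90th-minute goals. The function name is mine.

```python
import math


def tied_game_state_value(minute: int, last_minute: int = 90) -> float:
    """Assumed tied-game value: exp(m), with m linear from 0 to log(3)."""
    # Assumption: stoppage-time goals count as last-minute goals.
    minute = min(max(minute, 1), last_minute)
    m = math.log(3) * (minute - 1) / (last_minute - 1)
    return math.exp(m)


for minute in (1, 20, 45, 90, 92):
    print(minute, round(tied_game_state_value(minute), 3))
```

Under these assumptions, a goal in the 1st minute is worth 1 and a goal in the last minute is worth 3, matching the range given in the TL;DR section.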

When the game is not tied, the following equation is used:

Where m increases linearly from 0 to 1, so that the 1st minute gives 0 and the last minute gives 1.
The other variable, diff, is set to a fixed value depending on whether the goal scorer’s team is leading or behind:
- Behind by 1 -> diff = log(3)
- Leading by 1 -> diff = 0.85
- Behind by 2 -> diff = 0.6
- Behind by 3 or leading by 2 -> diff = 0.3
- Leading by 3 -> diff = 0.15
- Behind by 3+ or leading by 3+ -> diff = 0
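The diff values above amount to a small lookup table. Here is a Python sketch of it, under the assumption that "3+" means a margin of more than 3 goals (so that it does not clash with the "behind by 3" and "leading by 3" entries). The function name and the sign convention for the goal difference are my own.

```python
import math


def diff_value(goal_difference: int) -> float:
    """Look up diff for a non-tied game.

    goal_difference is the scorer's team score minus the opponent's score
    before the goal: negative when behind, positive when leading.
    """
    if goal_difference == 0:
        raise ValueError("tied games use the tied-game equation instead")
    table = {
        -1: math.log(3),  # behind by 1
        1: 0.85,          # leading by 1
        -2: 0.6,          # behind by 2
        -3: 0.3,          # behind by 3
        2: 0.3,           # leading by 2
        3: 0.15,          # leading by 3
    }
    # Assumption: any margin larger than 3, either way, maps to 0.
    return table.get(goal_difference, 0.0)


print(diff_value(-1), diff_value(1), diff_value(5))
```

Note that for the same margin, being behind always maps to a higher diff than leading, which matches the ordering of goal importance given in the TL;DR section.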

Some domain knowledge has been put into these equations as you can tell.

Last Words

I hope you enjoyed this read, and I also hope that with time better statistics and measurements will enter the world of soccer.
I would love to keep exploring soccer data, but unfortunately gold-standard data like Opta’s is very hard to get or very expensive.
With such data, amazing things can be done, especially today with the explosion of data science and AI.
Today most clubs employ data analysts, but the gap between an analyst and a scientist can make all the difference, which makes this rather sad.

I wonder what would happen if all those clubs had full time data scientists…
