Using data to scout football players

Sergio Pessoa
Published in LatinXinAI
8 min read · Feb 15, 2023
Photo by Jannes Glas on Unsplash

In 2017, Manchester City made a big signing in the winter transfer window: the Brazilian forward Gabriel Jesus. The striker had been the best player of the 2016 Brazilian National Championship and the top scorer of the winning team, Palmeiras. He quickly showed his quality in the 2016–17 season, scoring 7 goals and providing 4 assists in 8 games as a starter. In the next season, he scored 13 goals, provided 4 assists, and helped Manchester City win their fifth English league title.

After six seasons at City and four Premier League titles, Gabriel was sold to Arsenal for £45 million, a profit of more than £15 million for the Manchester club. We can say, without a doubt, that he was a very good signing.

Given that background, we can ask ourselves: how can we find a player like Gabriel Jesus in another league? Data can help us. My idea in this project is to use player stats from the 2017–18 Premier League season and build an algorithm that finds players similar to Gabriel Jesus in the Spanish La Liga in the same season.

We are going to gather stats that measure the performance of a striker and use them to calculate the Euclidean distance between Gabriel Jesus’ stats and those of La Liga’s strikers. After that, we can compare the two players’ radars and see if they are similar.

Data Source

Football has a wide variety of stats nowadays, and some of them are very hard to collect. That’s why I’ll use the free data from WyScout, one of the biggest companies in football data. On their website, there is free data for the five major European leagues from the 2017–18 season. The data consists of:

  • Events, the file with everything that happened in a match during the 90 minutes and extra time.
  • Matches, the file with match information such as home team, away team, and formations.
  • Players, the file with player information such as first name, second name, and role.

With that, I can build a dataset of all match events with the team’s names and players’ names, which is needed to find the stats of the players.

Stats

The first challenge with the data is creating stats that accurately capture the quality of a player. As stated by David Sumpter in Soccermatics, we need metrics that have context and comparison. Context means that the stats need to express things that are important to the game. As an example, we can’t say that a player with a high pass completion percentage is a good passer: we don’t know what kind of passes he makes. But with the context of passes into the final third of the field, we have much more information to judge whether he is a good passer or not.

Comparison is the other thing our metrics need. We can’t simply analyse Gabriel Jesus’ stats alone, for example. We need to see whether the numbers he produced are good in comparison with other players from the same league. One way to do that is to use percentiles.
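To make the idea of a percentile concrete, here is a minimal sketch. The function name `percentile_rank` and the sample values are hypothetical and are not taken from the project's code or data; the definition used (share of league values less than or equal to the player's value) is one common convention among several.

```python
from bisect import bisect_right


def percentile_rank(value, league_values):
    """Return the percentage of league values that are <= `value`."""
    if not league_values:
        raise ValueError("league_values must not be empty")
    ordered = sorted(league_values)
    # bisect_right counts how many sorted values are <= value.
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical goals-per-90 figures for five strikers (not real data).
league_goals_p90 = [0.10, 0.25, 0.30, 0.45, 0.60]
print(percentile_rank(0.45, league_goals_p90))  # 80.0
```

A player at the 80th percentile is at or above 80% of the league on that stat, which is the kind of comparison the radars rely on.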

Also, we need to normalise all these metrics by the number of minutes each player played. To do that, we multiply the season total by 90 and divide it by the number of minutes played in the season. This expresses each stat as a per-90-minutes rate, that is, what the player produces per full match played, so players with different amounts of playing time can be compared fairly.
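As a small illustration of the per-90 normalisation, the sketch below applies the formula directly. The helper name `per_90` and the minutes figure are assumptions made for the example, not values from the dataset.

```python
def per_90(season_total, minutes_played):
    """Convert a season total into a per-90-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return season_total * 90 / minutes_played


# Hypothetical example: 13 goals in 1,800 minutes (illustrative minutes).
print(per_90(13, 1800))  # 0.65
```

In practice it is also common to drop players below some minimum number of minutes, since per-90 rates become unstable for very small samples; whether the project applied such a filter is not stated.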

Based on the book Soccermatics, I decided to use these stats:

  • Passes into Final Third of the Field;
  • Receptions inside the Final Third of The Field;
  • Attacking duels won;
  • Non-penalty expected goals;
  • Major stats (goals, assists, key passes).

Passes into the final third, assists, and key passes are used to see how much impact a player has with his passes. I calculated passes into the final third myself, counting only completed passes. Key passes are a stat defined by WyScout and provided in their data, and assists are the number of passes a player made in the season that directly preceded a goal.

Receptions inside the final third are a good metric for analysing the positioning of a forward. Attacking duels won help show whether the player is a good presser. I calculated these stats myself as well.
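As a rough sketch of how completed passes into the final third could be counted from event data, the code below assumes each event is a dictionary with `eventName`, `playerId`, `tags` (a list of `{"id": ...}`), and `positions` (start and end points with `x` from 0 to 100 toward the opponent's goal), and that tag id 1801 marks an accurate event. These field names and the tag id are assumptions about the public dataset's layout and should be checked against its documentation; the function names are hypothetical.

```python
FINAL_THIRD_X = 100 * 2 / 3  # assumed: x is a percentage of pitch length
ACCURATE_TAG = 1801          # assumed tag id for an accurate (completed) event


def is_pass_into_final_third(event):
    """True for a completed pass that starts outside and ends inside the final third."""
    if event.get("eventName") != "Pass":
        return False
    tag_ids = {tag["id"] for tag in event.get("tags", [])}
    if ACCURATE_TAG not in tag_ids:
        return False
    positions = event.get("positions", [])
    if len(positions) < 2:
        return False
    start_x, end_x = positions[0]["x"], positions[1]["x"]
    return start_x < FINAL_THIRD_X <= end_x


def count_passes_into_final_third(events):
    """Count qualifying passes per player id."""
    counts = {}
    for event in events:
        if is_pass_into_final_third(event):
            player = event["playerId"]
            counts[player] = counts.get(player, 0) + 1
    return counts


# Hypothetical events (not real data).
events = [
    {"eventName": "Pass", "playerId": 10, "tags": [{"id": 1801}],
     "positions": [{"x": 55, "y": 40}, {"x": 72, "y": 35}]},  # counted
    {"eventName": "Pass", "playerId": 10, "tags": [{"id": 1802}],
     "positions": [{"x": 60, "y": 50}, {"x": 80, "y": 50}]},  # not accurate
    {"eventName": "Pass", "playerId": 11, "tags": [{"id": 1801}],
     "positions": [{"x": 70, "y": 50}, {"x": 85, "y": 50}]},  # starts inside
]
print(count_passes_into_final_third(events))  # {10: 1}
```

The resulting season totals would then be normalised per 90 minutes like every other stat.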

Non-penalty expected goals (xG) are one of the most important football stats nowadays. To compute expected goals, I built a machine learning model using logistic regression and fit it on every shot taken in the 2017–18 Premier League. I tested the model with 4 combinations of features:

  • Angle;
  • Distance;
  • Distance and Angle;
  • Distance**2, Distance * Angle, Angle**2;

And it resulted in the following ROC-AUC:

So, I decided to use the distance and angle model, which is simpler than the polynomial one and reaches the same AUC score. You can see the code with details in my GitHub and see how to create an expected goals model here.

The last stat, goals, is the most self-explanatory measure of a striker's quality.

All these stats are normalised by the minutes played by each player. I created a summary for the English and Spanish leagues, containing the stats for each league for the 2017–18 season.
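To illustrate the two features behind the chosen model, here is a minimal sketch that derives distance and goal angle from a shot location and plugs them into a logistic function. It assumes shot coordinates are percentages (x = 100 on the opponent's goal line, y = 50 at the centre of the goal), a 105 m × 68 m pitch, and a 7.32 m wide goal. The function names are hypothetical, and the coefficients are parameters that would come from fitting; none are provided here.

```python
import math

PITCH_LENGTH_M = 105.0  # assumed pitch length
PITCH_WIDTH_M = 68.0    # assumed pitch width
GOAL_WIDTH_M = 7.32     # standard goal width


def shot_features(x_pct, y_pct):
    """Return (distance in metres, goal angle in radians) for a shot location."""
    # Distance from the goal line and from the centre line of the goal.
    x_m = (100.0 - x_pct) * PITCH_LENGTH_M / 100.0
    y_m = abs(50.0 - y_pct) * PITCH_WIDTH_M / 100.0
    distance = math.hypot(x_m, y_m)
    # Angle subtended by the two goalposts as seen from the shot location.
    angle = math.atan2(GOAL_WIDTH_M * x_m,
                       x_m ** 2 + y_m ** 2 - (GOAL_WIDTH_M / 2) ** 2)
    return distance, angle


def xg_probability(distance, angle, intercept, coef_distance, coef_angle):
    """Logistic model: probability of a goal given fitted coefficients."""
    z = intercept + coef_distance * distance + coef_angle * angle
    return 1.0 / (1.0 + math.exp(-z))


# A shot from the penalty spot: 11 m from the goal line, centred.
distance, angle = shot_features(100.0 - 11.0 / 1.05, 50.0)
print(round(distance, 2), round(math.degrees(angle), 1))
```

A player's non-penalty xG is then the sum of these probabilities over all of his non-penalty shots.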

Gabriel Jesus stats
Premier League 2017–18 Summary
La Liga 2017–18 Summary

To visualize a player’s performance, I used Player Radars. They are graphs that show the value of each variable of interest, and they can also show the percentile each value falls in within the league.

Gabriel Jesus 2017–18 Player Radar

Above we can see Jesus’ Player Radar for the 2017–18 season. The dotted lines are the percentile lines. We can see that Jesus is one of the best in goals and non-penalty xG, and very good in passes and receptions in the final 3rd. He also has a decent performance in key passes.

Player Scout

To scout a player similar to Gabriel Jesus in La Liga, we need a similarity measure. I considered cosine similarity and the Inverse Euclidean Distance, and chose the latter because it gave better results after testing both. The formula for the Inverse Euclidean Distance is shown below:

Euclidean Distance
Inverse Euclidean Distance
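As a sketch of the computation, the code below takes the Euclidean distance between two stat vectors as the square root of the sum of squared differences, and assumes the inverse is defined as 1 / (1 + d), a common choice that may differ from the exact form used in the project. It also includes a min-max rescaling of the scores to a 0 to 100% range. The function names, player labels, and stat values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two stat vectors."""
    if len(a) != len(b):
        raise ValueError("stat vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inverse_euclidean(a, b):
    """Assumed form 1 / (1 + d): equals 1 for identical vectors, tends to 0 as d grows."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def similarity_percentages(target, candidates):
    """Min-max rescale inverse-distance scores to the 0-100 range."""
    scores = {name: inverse_euclidean(target, stats)
              for name, stats in candidates.items()}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {name: 100.0 for name in scores}
    return {name: 100.0 * (score - low) / (high - low)
            for name, score in scores.items()}


# Hypothetical per-90 stat vectors (not real data).
target = [0.65, 0.20, 1.10]
candidates = {
    "Player A": [0.65, 0.20, 1.10],
    "Player B": [0.50, 0.30, 0.90],
    "Player C": [0.10, 0.05, 0.20],
}
ranking = sorted(similarity_percentages(target, candidates).items(),
                 key=lambda item: item[1], reverse=True)
print([name for name, _ in ranking])  # ['Player A', 'Player B', 'Player C']
```

Note that with min-max rescaling the least similar candidate always receives 0% and the most similar 100%, so the percentages rank players within the candidate pool rather than measure absolute similarity. Stats on very different scales will also dominate the distance unless they are put on a common scale first; whether the project did so is not stated.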

After calculating that for all the La Liga strikers, I rescaled the values using the maximum and the minimum of the Inverse Euclidean Distance score, so that the similarity score ranges between 0% and 100%. The most similar players are shown below:

Most Similar Players to Gabriel Jesus in La Liga 2017/18

As you can see, the most similar player was Wissam Ben Yedder, followed by Santi Mina and Rodrigo. Let’s take a closer look at Ben Yedder:

W. Ben Yedder 2017–18 Player Radar

Like Gabriel Jesus, W. Ben Yedder is one of the top players in non-penalty xG in his league. He is also very good in goals and passes to the final 3rd, performs similarly to Gabriel Jesus in key passes, is better in assists, and is worse in receptions in the final 3rd. Both are at the bottom of their leagues in attacking duels. We can see the comparison below:

Gabriel Jesus and W. Ben Yedder comparison

The graph above confirms this: they are players with similar overall performance and differences in a few stats.

That season, Ben Yedder was a Sevilla player. He played 3 years at the club and scored 70 goals, a great performance. After that, Monaco signed him and he became the joint top scorer of the 2019–20 Ligue 1 with 18 goals, the same number as Mbappé and more than Neymar, proving to be a great signing.

Conclusion

This project suggests that player stats are a very good tool for finding good players in another league. One way to do that is to take the stats of a player who is performing well in one league and look for a similar profile in another, as done here using Euclidean distance.

Here, we tried to find a player similar to Gabriel Jesus, a very well-performing striker from the Premier League, and found W. Ben Yedder, a very good player hidden at a smaller club in another league. This suggests that he would probably have been a great signing for a Premier League club, as he was for Monaco two years later.

The project was also a good way to study some of the techniques used in football analysis with stats, such as modeling expected goals with machine learning and creating Player Radars.

If you want to see the code in detail, check my GitHub. And if you want to reach me, that’s my LinkedIn. Thanks!

Sergio Pessoa
LatinXinAI

Data Science and Sports Analytics enthusiast. Computer Engineer graduated @UFPE. Data Scientist @atletico