2022 World Cup Passing Networks

Dominic Graziano
INST414: Data Science Techniques
7 min read · Dec 16, 2023

Overview

This project seeks to provide a metric for evaluating soccer players that goes beyond xG, xA, and overall goal-contribution metrics. Given how big the sport is around the world and how vast the player pool is, how do club front offices choose which players to target for transfers, and who should play in a given game? In this project I attempt to take into account every player who participates in the build-up to a goal and assign each one a value based on their order in the passing sequence. Managers and coaching staff could use this to see which players contributed to goals beyond the scorer, which in theory helps with player selection, both for games and for transfer targets. I chose to use data from the 2022 World Cup and focused on the two finalists, Argentina and France. The question this project sets out to answer is which player contributed most to goals, by this metric, over the course of the 2022 World Cup. The project was completed in Python using the packages Matplotlib, Pandas, NetworkX, Statsbombpy, Mplsoccer, and Math.

Data Collection

In searching for data I found StatsBomb, a soccer analytics company that hosts an open data repository on GitHub. The data is stored as a series of JSON files and can be accessed with statsbombpy, a package the company created. To get the data needed for the passing networks, I queried the data for the World Cup's competition id, then found all of the matches in which either Argentina or France played. From these I extracted all of the goals for either team into a dataframe, which had 41 rows. I also created a dataframe of every pass in those matches, which came to 13,442 rows.
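As a rough illustration of the collection steps described above (not the project's actual code), the sketch below uses statsbombpy's `sb.matches` and `sb.events` calls. The helper names `involves_team` and `fetch_goals_and_passes` are hypothetical, the competition and season ids are left as parameters to be looked up from `sb.competitions()`, and the column names (`home_team`, `away_team`, `type`, `shot_outcome`, `team`) are assumed to match the dataframes statsbombpy returns.

```python
TEAMS = {"Argentina", "France"}


def involves_team(home_team, away_team, teams=TEAMS):
    """Return True if either side of the fixture is a team of interest."""
    return home_team in teams or away_team in teams


def fetch_goals_and_passes(competition_id, season_id, teams=TEAMS):
    """Download events for the selected matches and split out goals and passes.

    Hypothetical helper: requires network access plus the pandas and
    statsbombpy packages, which are imported here so that involves_team
    stays usable without them.
    """
    import pandas as pd
    from statsbombpy import sb

    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    keep = [
        involves_team(home, away, teams)
        for home, away in zip(matches["home_team"], matches["away_team"])
    ]
    match_ids = matches.loc[keep, "match_id"]

    # One events dataframe per match, stacked into a single frame.
    events = pd.concat(
        [sb.events(match_id=match_id) for match_id in match_ids],
        ignore_index=True,
    )

    shots = events[events["type"] == "Shot"]
    goals = shots[(shots["shot_outcome"] == "Goal") & shots["team"].isin(teams)]
    passes = events[events["type"] == "Pass"]
    return goals, passes
```

The row counts produced by a sketch like this may differ from the 41 goals and 13,442 passes reported above, depending on how shootout goals and own goals are treated.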

Data Cleaning and Reformatting

The data needed little cleaning in terms of missing values or wrong data types, but some key reformatting steps were required before the passing networks could be visualized. The first, which is vital to the visualizations, was unpacking the location column, which held the x and y coordinates of a goal or pass in the form [x, y]. I split this into separate X and Y columns.
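A minimal sketch of the unpacking step, written with plain Python records so it runs without pandas. The function names `split_location` and `add_xy` are hypothetical and this is not the original code.

```python
def split_location(location):
    """Return (x, y) from an [x, y] location, or (None, None) if it is missing."""
    if isinstance(location, (list, tuple)) and len(location) >= 2:
        return float(location[0]), float(location[1])
    return None, None


def add_xy(rows):
    """Add 'X' and 'Y' keys to each event record, derived from its 'location'."""
    for row in rows:
        row["X"], row["Y"] = split_location(row.get("location"))
    return rows


events = add_xy([
    {"type": "Pass", "location": [61.0, 40.1]},
    {"type": "Shot", "location": [108.0, 40.0]},
    {"type": "Half Start", "location": None},
])
```

With a dataframe, the same function could be applied row by row, for example with `df["location"].apply(split_location)`, and the resulting pairs assigned to the two new columns.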

I also had to convert the timestamp to the minute of the game, so that I could work backwards from a goal and get the passes that led up to it. This was more complicated than expected because StatsBomb restarts the timestamp for every period, meaning that the first minute of the first half and the first minute of the second half have the same timestamp. I initially didn't realize this, which led to some problems, but accounting for the period allowed accurate times to be calculated regardless of the period.
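One way to make the per-period timestamps comparable is to add an offset for the start of each period. The sketch below assumes timestamps of the form `HH:MM:SS.mmm` and nominal period starts of 0, 45, 90, and 105 minutes; both the offsets and the function names are assumptions, not the original implementation.

```python
# Assumed nominal kickoff minute of each period (two halves, then extra time).
PERIOD_START_MINUTE = {1: 0, 2: 45, 3: 90, 4: 105}


def match_seconds(period, timestamp):
    """Seconds since kickoff, given a timestamp that restarts every period."""
    hours, minutes, seconds = timestamp.split(":")
    elapsed = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return PERIOD_START_MINUTE[period] * 60 + elapsed


def match_minute(period, timestamp):
    """Whole match minute (0-based) for the same period and timestamp."""
    return int(match_seconds(period, timestamp) // 60)


# Stoppage time at the end of one period overlaps the nominal start of the
# next, so (period, match_seconds) is a safer sort key than the minute alone.
first_half = match_minute(1, "00:01:30.000")
second_half = match_minute(2, "00:01:30.000")
```

Here the same timestamp maps to a different match minute depending on the period, which is the distinction the raw timestamp loses.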

I also thought it was necessary to account for goals scored from penalty kicks, as the visualizations would not be able to show a build-up to the kick. To find them I queried on the x and y coordinates of the penalty spot and created an additional column called penalty, which was either yes or no. This allowed me to filter out the penalty goals later on.
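The coordinate check could look something like the sketch below. It assumes StatsBomb's 120 by 80 pitch, where the penalty spot in the attacking direction sits at (108, 40); the tolerance value and the function name `penalty_flag` are hypothetical choices.

```python
PENALTY_SPOT = (108.0, 40.0)  # assumed StatsBomb pitch coordinates
TOLERANCE = 0.5               # hypothetical allowance for small recording offsets


def penalty_flag(x, y, spot=PENALTY_SPOT, tolerance=TOLERANCE):
    """Return 'yes' if the shot location matches the penalty spot, else 'no'."""
    if x is None or y is None:
        return "no"
    on_spot = abs(x - spot[0]) <= tolerance and abs(y - spot[1]) <= tolerance
    return "yes" if on_spot else "no"


goals = [
    {"player": "A", "X": 108.0, "Y": 40.0},
    {"player": "B", "X": 101.3, "Y": 35.7},
]
for goal in goals:
    goal["penalty"] = penalty_flag(goal["X"], goal["Y"])

open_play_goals = [goal for goal in goals if goal["penalty"] == "no"]
```

A location-only check would also flag an open-play shot taken from exactly that spot. If the events data exposes a shot type field, comparing against that may be a more direct test.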

Analysis and Visualizations

The method used to gain these insights is network analysis, which models relationships as edges between nodes. In this case the nodes are individual players and the edges are the passes between them. To create the passing networks I not only used NetworkX but also visualized the build-up with scatterplots on top of a soccer pitch. One problem I encountered was automating the link between passes and their respective goals: I tried to find the passes before each goal but after the other team's last possession. I could not overcome this problem and went in a different direction, manually finding the ranges of passes that led to each goal and assigning them to a numbered variable, Arg_goalX or Fr_goalX. The code for this consisted of getting each match's goals and then finding the range of passes based on the index.

For example, I first found the goals for the Argentina vs Poland match, which then gave the index ranges of passes for each of the goals.
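One possible way to automate this step, offered as a sketch and not as the approach used in this project, is to walk backwards from the goal and stop at the first event that belongs to the other team. The function name `build_up_passes` and the record layout (dicts with `team`, `type`, and `player` keys in chronological order) are assumptions.

```python
def build_up_passes(events, goal_index):
    """Collect the scoring team's passes between the opponent's last event and the goal.

    events: chronological list of dicts with 'team' and 'type' keys.
    goal_index: position of the goal-scoring shot in that list.
    """
    scoring_team = events[goal_index]["team"]
    sequence = []
    for event in reversed(events[:goal_index]):
        if event["team"] != scoring_team:
            break  # the other team touched the ball: the build-up starts after this
        if event["type"] == "Pass":
            sequence.append(event)
    sequence.reverse()  # restore chronological order
    return sequence


events = [
    {"team": "France", "type": "Pass", "player": "X"},
    {"team": "Argentina", "type": "Pass", "player": "A"},
    {"team": "Argentina", "type": "Carry", "player": "B"},
    {"team": "Argentina", "type": "Pass", "player": "B"},
    {"team": "Argentina", "type": "Shot", "player": "C"},
]
passes_before_goal = build_up_passes(events, goal_index=4)
```

In real event data, defensive events by the opponent (a pressure or a duel, for instance) would also end the walk, so the stopping rule would likely need to be restricted to events that actually change possession.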

From this point I created a more detailed visualization using Mplsoccer, a package that draws a soccer pitch. The code uses the x and y locations of both the player passing the ball and the player receiving it. In my first iteration I forgot that players sometimes carry the ball, which leaves gaps between passes. Instead of going back to the events for more data on the carries, I wrote code that links the end point of one pass to the starting point of the next.
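The gap-filling idea can be sketched as follows: whenever one pass ends at a different point from where the next one starts, a carry segment is inserted between them. The function name `build_segments`, the record layout, and the coordinates are hypothetical, and the plotting itself is left out.

```python
def build_segments(passes, shot_location=None):
    """Return ('pass' | 'carry', start, end) segments covering the whole build-up.

    passes: chronological list of dicts with 'start' and 'end' (x, y) tuples.
    shot_location: optional (x, y) of the shot, to link the final carry.
    """
    segments = []
    for i, current in enumerate(passes):
        segments.append(("pass", current["start"], current["end"]))
        if i + 1 < len(passes):
            next_start = passes[i + 1]["start"]
        else:
            next_start = shot_location
        # A gap between where the ball arrived and where it was next played
        # is treated as the receiver carrying the ball.
        if next_start is not None and next_start != current["end"]:
            segments.append(("carry", current["end"], next_start))
    return segments


segments = build_segments(
    [
        {"start": (30.0, 40.0), "end": (50.0, 20.0)},
        {"start": (60.0, 20.0), "end": (90.0, 35.0)},
    ],
    shot_location=(100.0, 38.0),
)
```

Each segment could then be drawn on the pitch, with pass segments as solid lines and carry segments as dotted lines, for example through Mplsoccer's `Pitch.lines` and `Pitch.scatter` methods.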

The Mplsoccer code then generated a visualization for each goal.

In each goal's visualization, a solid white line is a pass between players and a dotted line is where a player dribbled the ball. A red circle marks where a player received a pass and the blue circle marks where the ball was shot for a goal. The names next to each event identify the passer and the receiver.

To explain one such visualization step by step: Otamendi started the build-up by passing to Lucero, who then carried the ball upfield. Lucero passed the ball back to Messi, who carried it to the top of the box and then passed to Lucero, who dribbled and shot for a goal where the blue circle is located.

For the network analysis visualizations, I used the player passing the ball and the recipient of the ball as the nodes. The edges could then be weighted, working from the shot in reverse order through the passes. After each network was generated, I appended each player and their weighted value to a dataframe.
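The exact weights are not stated here, so the sketch below picks one plausible scheme purely for illustration: the pass immediately before the shot gets weight 1, the one before it 1/2, then 1/3, and so on. Both this weighting and the function name `weighted_pass_edges` are assumptions.

```python
def weighted_pass_edges(passes):
    """Weight each (passer, recipient) edge by how close the pass was to the shot.

    passes: chronological list of (passer, recipient) tuples ending with the assist.
    Assumed scheme: a pass k steps back from the shot is worth 1 / k.
    """
    edges = {}
    for steps_back, (passer, recipient) in enumerate(reversed(passes), start=1):
        weight = 1 / steps_back
        # Repeated passes between the same pair accumulate on one edge.
        edges[(passer, recipient)] = edges.get((passer, recipient), 0) + weight
    return edges


# The pass sequence from the example goal described above.
edges = weighted_pass_edges([
    ("Otamendi", "Lucero"),
    ("Lucero", "Messi"),
    ("Messi", "Lucero"),
])
```

The resulting dictionary can be loaded into a directed NetworkX graph with `DiGraph.add_weighted_edges_from`, passing `(passer, recipient, weight)` triples.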

The NetworkX code generated a network visual for each goal.

These steps were repeated for each of the Argentina goals, and the values were then summed to get totals across all of the goals. Note that I dropped the data for the goal scorer, only including whoever got the assist and moving backwards from there. The analysis produced a total for each player.
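The summing step might be sketched as below, using placeholder player names. Crediting each edge's weight to the passer and skipping edges where the passer is the goal scorer is only one reading of the description above, and the function name `player_contributions` is hypothetical.

```python
from collections import Counter


def player_contributions(goals):
    """Sum edge weights per passer across goals, leaving out each goal's scorer.

    goals: list of dicts with a 'scorer' name and an 'edges' mapping of
    (passer, recipient) -> weight for that goal's build-up.
    """
    totals = Counter()
    for goal in goals:
        for (passer, _recipient), weight in goal["edges"].items():
            if passer != goal["scorer"]:
                totals[passer] += weight
    return totals


totals = player_contributions([
    {"scorer": "C", "edges": {("A", "B"): 0.5, ("B", "C"): 1.0}},
    {"scorer": "A", "edges": {("A", "B"): 0.5, ("B", "A"): 1.0}},
])
ranking = totals.most_common()  # highest total contribution first
```

Sorting the totals in this way gives the kind of ranking discussed next.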

Overall, the totals showed that Messi had the most contributions, which is expected for Argentina as he is involved in many passes throughout the sequences. It also makes sense that multiple midfielders are involved in the build-up. It was interesting to see some defenders show up, such as Lucero playing out wide and Otamendi, who is a center back. Another interesting note is that Enzo Fernandez ranks third highest; after the World Cup he was involved in a very expensive transfer between clubs, worth $132 million. Another notable transfer that followed was that of Alexis Mac Allister. This suggests that the insights generated from this passing analysis have some relevance to player values.

Limitations

There may be some problems with the analysis presented. I only did the network analysis portion for the Argentinian goals, as the French possessed the ball more, which led to very cluttered and hard-to-read visualizations. I believe it was more relevant to do this for Argentina anyway, as they won the World Cup, so the analysis answers who helped the most on their road to winning it. I don't see limitations in the data that was collected, though I could have gone into more detail by getting all of the events, such as dribbles, rather than filling the gaps with code. An additional limitation is that I excluded the actual goal scorer, as I wanted this metric to go beyond the scope of popular metrics such as xG, xA, and G+A. Including these data points would definitely reshape the final assessment. It is also notable that I removed the penalty kicks, because I thought the passing sequence leading up to a penalty being awarded was not as relevant as the sequences leading to a goal from a shot.

The code for this project can be found here
