Network Analysis of Soccer Player Positions

Luke Walsh
INST414: Data Science Techniques
May 2, 2024

In soccer, it can be important to understand the dynamics of the game: not just the relationships between players, but the unseen networks that may be present in their teams, leagues, and positions. Seeing these relationships can support decision making in games like EA Sports FC and even in real life, where knowledge of positions can be helpful. This analysis focuses on positions and on whether clearly defined networks of positions emerge for players on each team. Our main question is: In the top five leagues in Europe, what are the most predominant positions that players play?

There are several possible stakeholders for this analysis. The first is players of the EA Sports FC video game, which is where this data comes from. Knowing which positions are the most common and where their players play can help a user decide whom to choose when building a team. It can also be helpful in real-world situations like player or team development, where less commonly played positions may need to be filled or strengthened.

Answering this question requires a dataset with categorical variables such as the player’s name, the player’s club, the club’s league, and the player’s position. These matter because they let us build the network needed to see the most common positions. We will be able to connect the players to the clubs, the clubs to the leagues, and finally the players to the positions. This should create a network that shows which positions are the most prominent.

To get a subset of this data, I looked at EA Sports FC data. Every year, this video game collects extensive information about players all over the world and builds profiles for them. These profiles contain exactly the information I need. I found this data through a GitHub user who scraped this year’s player information and put it into a dataset. The dataset included not only each player’s league and team, but also dozens of skill ratings, which were not needed for this project.

The data cleaning process for this dataset was pretty straightforward. We only needed a few columns. The original dataset had over 30 columns, but we only needed 4: the player’s name, the player’s team, the player’s league, and the position that the player plays. The rest of the columns could be dropped, since we didn’t need any of their values. I also decided to focus only on players in the five biggest leagues in Europe. These were the Premier League (England), La Liga (Spain), Bundesliga (Germany), Serie A (Italy), and Ligue 1 (France). There are many more players in the dataset, but players from these leagues are usually the most prominent. Once these steps were taken, we had the final dataset that could be analyzed.
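The cleaning steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the column names (name, club, league, positions, overall), the league spellings, and the three sample rows are made up for the example, and the real scraped file may label things differently.

```python
import csv
import io

# Made-up sample rows with hypothetical column names; the real export may differ.
SAMPLE_CSV = """name,club,league,positions,overall
Player A,Club One,Premier League,CB,80
Player B,Club Two,La Liga,"CM,CDM",78
Player C,Club Three,Major League Soccer,ST,70
"""

# The four columns the analysis needs (names are assumptions).
KEEP_COLUMNS = ["name", "club", "league", "positions"]

# League labels as written in this article; the dataset may spell them differently.
TOP_FIVE_LEAGUES = {"Premier League", "La Liga", "Bundesliga", "Serie A", "Ligue 1"}


def clean_rows(csv_text):
    """Keep only the four needed columns and only players in the top five leagues."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {col: row[col] for col in KEEP_COLUMNS}
        for row in reader
        if row["league"] in TOP_FIVE_LEAGUES
    ]


cleaned = clean_rows(SAMPLE_CSV)
for row in cleaned:
    print(row)
```

With the sample rows, Player C is dropped because their league is outside the five, and the unused overall column disappears from the remaining rows.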

This analysis uses two main types of nodes. One is the player names, and the other is the player positions. We can also include player clubs and club leagues, but these are less important. The main relationships, or edges, that we will be focusing on are between a player and a position. When these two share an edge, it means that the player plays that position. An edge between a player and a league or a player and a team means that the player plays in that league or for that team. As a result, when you focus on a player in the analysis, they should have edges connecting them to a position and a team.
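To make the node and edge structure concrete, here is a rough sketch that builds the network as a plain adjacency dictionary. The records and field names are hypothetical, and the choice to link clubs to leagues (rather than players to leagues) is one possible reading of the description above, not necessarily how the original graph was built. Each node is stored as a (kind, label) pair so that a club and a player with the same name cannot collide.

```python
from collections import defaultdict

# Made-up records using hypothetical field names.
players = [
    {"name": "Player A", "club": "Club One", "league": "Premier League", "positions": "CB"},
    {"name": "Player B", "club": "Club One", "league": "Premier League", "positions": "CM,CDM"},
    {"name": "Player C", "club": "Club Two", "league": "La Liga", "positions": "CB"},
]


def build_graph(records):
    """Return an undirected graph as a dict mapping each node to its set of neighbors."""
    graph = defaultdict(set)

    def add_edge(u, v):
        # Undirected edge: record the link in both directions.
        graph[u].add(v)
        graph[v].add(u)

    for rec in records:
        player = ("player", rec["name"])
        club = ("club", rec["club"])
        add_edge(player, ("position", rec["positions"]))  # player plays this position
        add_edge(player, club)                            # player plays for this club
        add_edge(club, ("league", rec["league"]))         # club belongs to this league
    return graph


graph = build_graph(players)
print(len(graph[("position", "CB")]))  # number of players attached to CB
```

The more players that share a position, the more neighbors that position node collects, which is what makes it stand out in the drawn graph.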

Before we look at the analysis, let’s define what importance means in this project. In our case, this has no numerical value, but an important node or relationship may be one that dominates the graph that we are looking at. Take a look at the Gephi graph below and see if any of the nodes stand out to you.

There are five nodes in the graph above that stand out. These are nodes of high importance, meaning that a significant number of edges connect to them. Let’s take a closer look at what these nodes could be. Before we dig into the graph, let’s find the positions with the highest degree centrality and see if they line up with the nodes in the graph.

Positions with highest degree centrality:
CB: 0.12686357243319268
GK: 0.10576652601969058
ST: 0.08045007032348804
CDM,CM: 0.048663853727144865
CM,CDM: 0.04050632911392405
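For readers unfamiliar with the measure, degree centrality is usually defined as the number of edges a node has divided by the number of other nodes in the graph (n - 1). The article’s values are consistent with that kind of normalized score, but I am assuming that definition here; the sketch below uses a tiny made-up graph, not the real data.

```python
def degree_centrality(graph):
    """Degree of each node divided by (n - 1), where n is the number of nodes."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Toy undirected graph (made up): two positions and four players.
toy_graph = {
    "CB": {"Player A", "Player B", "Player C"},
    "GK": {"Player D"},
    "Player A": {"CB"},
    "Player B": {"CB"},
    "Player C": {"CB"},
    "Player D": {"GK"},
}

centrality = degree_centrality(toy_graph)

# Rank only the position nodes, highest centrality first.
positions = ["CB", "GK"]
ranked = sorted(positions, key=lambda p: centrality[p], reverse=True)
for position in ranked:
    print(f"{position}: {centrality[position]}")
```

In this toy graph CB touches 3 of the 5 other nodes and GK touches 1 of 5, so CB ranks first. Read the same way, a value of about 0.127 for CB in the real graph would mean CB is directly connected to roughly 12.7% of all other nodes.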

Now that we have these, let’s mark the five nodes on the graph that stand out and see if they match up.

As it turns out, they do match. In the graph above, you can see that the nodes with high traffic are indeed the ones with the highest degree centrality. One question that viewers of the previous graphs may ask is why both "CM,CDM" and "CDM,CM" appear. Aren’t they the same position? The first position listed is actually the player’s main position, while the second is a position they are also able to play. The "CM,CDM" players mainly play CM but can play CDM, and vice versa for the "CDM,CM" players.
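The structure of these combined labels can be illustrated with a small helper that separates the main position from the alternates. The function name and the sample labels are hypothetical, and the analysis above keeps each combined label as its own node rather than splitting it; counting by main position would be a different analysis and is shown here only to clarify how the labels read.

```python
from collections import Counter


def split_positions(label):
    """Split a label like 'CM,CDM' into (main position, list of alternate positions)."""
    parts = [p.strip() for p in label.split(",") if p.strip()]
    if not parts:
        return None, []
    return parts[0], parts[1:]


# Made-up labels in the same comma-separated style as the dataset.
labels = ["CB", "CM,CDM", "CDM,CM", "CM, CDM"]

main, alternates = split_positions("CM,CDM")
print(main, alternates)

# Count how many labels have each position as the main one.
main_counts = Counter(split_positions(label)[0] for label in labels)
print(main_counts)
```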

The most important nodes we see in this analysis are CB (Center Back), GK (Goalkeeper), ST (Striker), CDM (Central Defensive Midfielder)/CM (Central Midfielder), and CM/CDM. Given this evidence, we can reasonably conclude that these are the most predominant positions in the top five leagues in Europe. With the use of cluster analysis, network analysis, and centrality calculations, we were able to find these top positions.

The biggest limitation of this analysis was that I did not have all of the data that I wanted to use. Originally, I wanted to include country as part of the network graph, but the country values in the dataset weren’t all collected correctly, so I wasn’t able to use them. I also cut down a lot of the dataset by choosing only five leagues. If I had expanded that, the answer could have been different.

GitHub: https://github.com/ltwalsh/walshINST414module2

Data Sources: https://github.com/prashantghimire/sofifa-web-scraper?tab=readme-ov-file and sofifa.com
