Some fun with FIFA ‘19 Dataset

Meena Aier
Data Girl
9 min read · Mar 10, 2019

I have a very loving husband. However, I do believe that in his order of priorities, I’m firmly placed second. His first love has been, is and will always be football (soccer for all you crazy Americans and Canadians out there). I also have far too many friends and work colleagues who seem to have a rather dramatic love for the game. So as my first-ever post, I figured I might as well kick off with something football related (yes, I can be quite punny).

Thanks to the wonder that is Kaggle, I will be carrying out some preliminary analysis with a rather comprehensive FIFA ’19 player dataset. A quick detour before we dive into the data fun. FIFA is a massively popular video game that most football lovers will be familiar with. It simulates the games and tournament tracks (UEFA Champions League, English Premier League etc.) that you’d tend to follow in the real world. There are multiple modes to choose from. For instance, one could choose the “career mode”, where one builds a team from scratch and plays games with it, with the aim of winning tournaments. As gamers win or lose games, they get various opportunities to trade players and ultimately build a highly effective, winning team.

This dataset is packed with a number of different attributes for over 18,000 players. Attributes include: basic stats such as player name, associated club, age, weight, overall rating, potential rating etc., as well as a number of player-specific characteristics related to pay, market value, skill, playing position, movement, power and mentality.

First, let’s look at the distribution of players by their preferred playing position, and layer it with an understanding of how an average club is likely to assemble its team based on these positions. Some facts are, of course, immediately obvious: of the 18,000+ players, strikers are the most common, followed closely by goalkeepers and center backs. Every team needs strikers, at least one goalkeeper and a unit that can effectively cover the center and the back of the pitch (on average, each club has 3 goalkeepers, 3–4 strikers and 2–3 center backs on its team). Beyond these first three, the number of players specializing in each position drops off. The interesting bit is towards the tail end of the chart, where we have fewer than 25 left attacking midfielders, right forwards or left forwards. Clubs typically have strikers play these angles, and players who specialize in these positions are exceedingly rare. Is this rarity also an indication of, perhaps, exceptional skill? Fun fact: Messi (yes, the Messi) specializes as a right forward. Another fun fact: Eden Hazard plays on the left wing, as a forward. Of course, correlation isn’t causation, but this does make you wonder about the different approaches pursued by footballers.

The bars represent the number of players specializing in each position; the red line depicts the average number of players in each position per club.
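For readers who want to reproduce the two series in the chart above, here is a minimal sketch. It runs on a handful of made-up rows rather than the real Kaggle file, and the names `players` and `position_summary` are illustrative assumptions, not the code behind the chart.

```python
from collections import Counter

# Toy (club, position) rows; made-up stand-ins for the real dataset.
players = [
    ("Club A", "ST"), ("Club A", "ST"), ("Club A", "GK"), ("Club A", "CB"),
    ("Club B", "ST"), ("Club B", "GK"), ("Club B", "CB"), ("Club B", "RF"),
]

def position_summary(rows):
    """Return {position: (player_count, average_players_per_club)}."""
    counts = Counter(position for _, position in rows)
    n_clubs = len({club for club, _ in rows})
    # most_common() orders positions from most to least frequent.
    return {pos: (n, n / n_clubs) for pos, n in counts.most_common()}

summary = position_summary(players)
for pos, (n, per_club) in summary.items():
    print(f"{pos}: {n} players, {per_club:.1f} per club")
```

The first number corresponds to the bars and the second to the red line. Dividing by the total number of clubs treats a club with no player in a position as contributing zero to that average.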

Player market value by position is another basic stat that could usefully inform our larger analysis. The distribution of market value by player position highlights some interesting trends. The cumulative market value of all players is a little more than €39 billion. Of this, 11.3% comes from strikers (unsurprising, given that it is the most popular position to specialize in), followed by right and left midfielders (7.11% and 7.09% respectively). This is interesting: though only a little more than 1,000 players specialize as right midfielders (with a roughly similar number of left midfielders), together the two groups command a higher market value than strikers (about 14.2% versus 11.3%). Goalkeepers follow close behind at just over 7% of total value. Given how many goalkeepers there are, this indicates that, on average, goalkeepers have a lower market value than other players (though of course there are exceptions, the most notable outlier being David De Gea of Manchester United).

Each slice and color shade represents a percentage of the total player market value. For instance, if the total market value for all players is roughly €39 billion, strikers account for a little more than 11% of that value.
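The percentage shares can be sketched in a few lines. This assumes market values arrive as strings such as "€110.5M" or "€600K"; that format, the toy rows and the helper names `parse_euro` and `value_share_by_position` are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy (position, market value) rows; the values are made up.
rows = [("ST", "€60M"), ("ST", "€20M"), ("RM", "€15M"), ("GK", "€5M")]

def parse_euro(text):
    """Convert strings such as '€110.5M' or '€600K' to a number of euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    body = text.replace("€", "").strip()
    if body and body[-1] in multipliers:
        return float(body[:-1]) * multipliers[body[-1]]
    return float(body or 0)

def value_share_by_position(rows):
    """Return each position's share of total market value, in percent."""
    totals = defaultdict(float)
    for position, value in rows:
        totals[position] += parse_euro(value)
    grand_total = sum(totals.values())
    return {pos: 100 * total / grand_total for pos, total in totals.items()}

shares = value_share_by_position(rows)
for pos, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{pos}: {share:.1f}% of total value")
```

Dividing a position's total by its player count instead of by the grand total gives the average value per player, which is the comparison behind the goalkeeper observation.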

Alright, now that we have some basic summary statistics on hand, let’s take a look at how a player’s current score correlates with his market value and wages. At a glance, there seems to be an exponential relationship between a player’s current score and his market value. In other words, returns on score improvements can be exponential: small score improvements can dramatically alter a player’s market value. Our initial hypothesis that specializing in a niche position is an indicator of higher skill seems to be reinforced here, since some of the highest-valued players tend to have uncommon playing positions. For instance, Neymar, the highest-valued player, prefers to play as a left winger, and only about 2% of all players are classified as left wingers. Similarly, Messi, valued at €110.5 million, plays a position that is shared by only 0.09% of the player base. The other interesting bit about this graph is the impact of extremes, especially on the higher end. The average player value is €5.5 million, and the average score is roughly 66. It goes to show just how extraordinary players on the higher end are, and how aggressively their market value grows as their scores surpass the average.

Each color represents a player position. Players’ current scores (measured from 0 to 100) are plotted on the x/horizontal axis, while their market values (in millions of Euros) are plotted on the y/vertical axis.
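One quick way to check whether a curve like this is exponential rather than linear is to compare equal steps in score: if each step multiplies the value by roughly the same factor, growth is exponential; if each step adds roughly the same amount, it is linear. The sketch below uses made-up averages (not figures from the dataset), and `step_ratios` is a hypothetical helper.

```python
# Hypothetical average market value (millions of euros) at each score.
avg_value_by_score = {60: 0.5, 70: 2.0, 80: 8.0, 90: 32.0}

def step_ratios(table):
    """Ratio of each value to the previous one, in order of score."""
    scores = sorted(table)
    return [table[b] / table[a] for a, b in zip(scores, scores[1:])]

ratios = step_ratios(avg_value_by_score)
print("ratio per 10-point step:", ratios)

# A constant ratio per 10 points implies a constant ratio per single point.
per_point = ratios[0] ** (1 / 10)
print(f"implied growth per point: {100 * (per_point - 1):.1f}%")
```

With real data the ratios will not be identical, so a regression of the logarithm of value on score would be the more careful version of the same check.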

FIFA has a “potential score” feature to quantify how much a player could potentially improve. For certain players already at the top of their game (e.g. Messi), the current overall score and potential score are the same. For others, the difference between the two can be sizeable. I just wanted to take a quick look at how the delta (i.e. the difference between current and potential scores) correlates with a player’s market value. A very preliminary bivariate linear regression indicates that the relationship between these two variables is, in fact, negative. In other words, the larger the difference between a player’s current and potential scores, the lower his market value. Intuitively speaking, this does make sense. If a player (a young one at that) starts off fairly unpolished but shows a lot of promise, his market value in the short term is likely to reflect his current ability rather than that promise. However, I would emphasize that this is a preliminary analysis, and this is the simplest regression model out there. Looking at this linear trend line and the outliers, I’m fairly certain that the overall story is quite a bit more complex and that market value likely decreases exponentially as the delta grows. It’s likely that certain deltas actually yield favourable market value results, especially if a footballer happens to be young and already at the top of his game (e.g. Neymar, with a current score of 92 and potential of 93, commands a €118.5 million market value). I will return to this question with a more detailed analysis in a future post.

The x/horizontal axis plots the difference between current overall score and potential score for each player. The y/vertical axis plots market value in millions of euros. The trend line is the output of a rudimentary bivariate regression.
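A bivariate regression of this kind can be sketched with ordinary least squares written out by hand. The four rows below are invented for illustration, and `fit_line` is a hypothetical helper, so the printed coefficients say nothing about the real dataset.

```python
# Made-up (overall, potential, market value in millions of euros) rows.
players = [(91, 91, 80.0), (85, 87, 40.0), (70, 78, 5.0), (60, 75, 1.0)]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Delta = potential score minus current overall score.
deltas = [potential - overall for overall, potential, _ in players]
values = [value for _, _, value in players]

slope, intercept = fit_line(deltas, values)
print(f"value = {intercept:.1f} + ({slope:.2f}) * delta")
```

A negative slope means that a larger gap between potential and current score goes with a lower market value, which is the direction described above.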

Next, I wanted to understand if there were any interesting correlations between various player attributes, especially those related to skills, movement, power, and mentality. The correlation matrix below visualizes the strength of correlations between various player characteristics. The darker (i.e. more purple) a box, the stronger the positive correlation. On the other end of the spectrum, the more blue/red a box, the stronger the negative correlation. There are several interesting observations to be made. For instance, there aren’t that many significantly negative correlations. The most obvious one is that strength is negatively correlated with agility and balance. Players who are more agile, however, tend to do much better with long shots, volleys, crossing, dribbling, and short passes. They also have a greater degree of control over the ball, and can position, curve and finish their shots more accurately than others. Similarly, when it comes to mental characteristics, players with a greater degree of composure have more accurate long shots, can deliver shots with greater power and generally do better with passes (long and short), likely on account of being better able to concentrate and read the ball’s movement. Aggression, on the other hand, doesn’t have much of a strong positive correlation with other characteristics except for interceptions and short passes, which is fairly intuitive.

Each box represents the strength of a correlation between two variables. The more purple and bigger a box, the higher the degree of positive correlation. The more red/blue and bigger a box, the more extreme the negative correlation. Grey boxes indicate a slight correlation. All of the values here (except for the completely white boxes) are statistically significant and meaningful.
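Each box in a matrix like this is a correlation coefficient between two attribute columns. As a rough sketch, assuming Pearson correlation (the post does not say which coefficient was used) and using invented ratings for four hypothetical players:

```python
import math

# Made-up ratings for four hypothetical players (not real dataset rows).
attributes = {
    "Strength": [90, 80, 60, 50],
    "Agility": [50, 60, 80, 90],
    "Balance": [55, 58, 85, 88],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return {(name_a, name_b): correlation} for every pair of columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

matrix = correlation_matrix(attributes)
for (a, b), r in matrix.items():
    if a < b:  # print each pair once
        print(f"{a} vs {b}: {r:+.2f}")
```

Values near +1 correspond to the darkest purple boxes and values near -1 to the strongest red/blue ones. Testing whether a coefficient is statistically significant is a separate step that this sketch leaves out.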

Having examined these player characteristic correlations, it is clear that there are certain skills, movements and mentalities that cluster together. It is also highly likely that groups of these characteristics have a big and significant impact on a player’s overall score, as well as his wages and market value. While I will be devoting at least one future blog post to examining these clusters and predictive models, for now, I will leave you with a basic model to determine the impact of a specific set of variables on a player’s market value.

I was curious to see if and how variables such as age, overall score, potential score, and attacking and defensive work rates impact a player’s market value. Before I dive into the results, a quick note on attacking and defensive work rates. Each player puts in a certain level of effort while participating in attacks as well as defense. For instance, strikers will likely have high attacking work rates, whereas center backs and defensive midfielders will likely have high defensive work rates. Each player’s attacking and defensive work rates are classified as high, medium or low.
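Because work rates are categories rather than numbers, a regression needs them turned into 0/1 indicator variables, with one level left out as the reference. Here is a minimal sketch of that encoding. Treating "Low" as the reference level is an assumption (the post does not say which level served as the baseline), and `work_rate_dummies` is a hypothetical helper.

```python
LEVELS = ("Low", "Medium", "High")

def work_rate_dummies(attacking, defensive, baseline="Low"):
    """Encode two work-rate levels as 0/1 indicators, leaving out the baseline."""
    row = {}
    for side, level in (("attacking", attacking), ("defensive", defensive)):
        if level not in LEVELS:
            raise ValueError(f"unknown work rate: {level!r}")
        for candidate in LEVELS:
            if candidate != baseline:
                row[f"{side}_{candidate.lower()}"] = int(level == candidate)
    return row

# A player with a high attacking and a medium defensive work rate.
encoded = work_rate_dummies("High", "Medium")
print(encoded)
```

With this encoding, a coefficient on `defensive_high` reads as the difference in the outcome between a high and a low defensive work rate, all else equal.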

Now, returning to the model. I opted for a very simple linear regression, just to determine basic trends and impacts. For the stats geeks out there: this yielded a model with a pretty decent level of explanatory power (adjusted R-squared of 0.68) and some significant relationships.

For instance, overall score seems to have the largest impact on a player’s market value. On average, a one-point increase in overall score increases a player’s market value by €1.5 million (while holding the other variables constant). A high attacking work rate adds, on average, about €620,000 to a player’s market value. A one-point increase in potential score nudges the market value upwards by about €87,000. On the flip side, age has a negative impact on a player’s market value, which is understandable. The older a player gets, the fewer prime years he has left ahead, which caps his value. In this case, as a player gets older by a year, his market value decreases by about €440,000. What is not as intuitive is the impact of defensive work rates on market value: a medium defensive work rate reduces a player’s market value by about €370,000, while a high defensive work rate reduces it by more than half a million euros. This result is a little surprising and could lead one to believe that defensive positions aren’t as highly valued as attacking positions. This could potentially be true. However, I will add a disclaimer here for stats geeks: it’s likely that a different functional form (perhaps exponential) will fit this data better, and also yield error terms that are more homoscedastic. I will be exploring these functional forms in a future blog post.

This graph depicts the impact of various factors on the player’s market value. The blue-colored lines are factors that positively impact market value — the red color lines negatively impact market value. Only statistically significant variables are included — i.e. only factors that have a meaningful relationship with a player’s market value are depicted.
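To make "holding other variables constant" concrete, the approximate effects quoted above can be combined to compare two otherwise identical players. The sketch below uses only the rounded figures from the text. The high defensive work rate is left out because only a lower bound ("more than half a million") is given, and since no intercept is reported, only differences can be computed, not absolute values. The names `EFFECTS` and `value_difference` are illustrative.

```python
# Approximate effects on market value in euros, as quoted in the text.
EFFECTS = {
    "overall_point": 1_500_000,
    "potential_point": 87_000,
    "age_year": -440_000,
    "high_attacking_work_rate": 620_000,
    "medium_defensive_work_rate": -370_000,
}

def value_difference(**changes):
    """Predicted change in market value for the given change in each factor."""
    return sum(EFFECTS[name] * amount for name, amount in changes.items())

# Player B versus player A: 2 points higher overall, 1 year older,
# and a high attacking work rate where A's is at the reference level.
diff = value_difference(overall_point=2, age_year=1, high_attacking_work_rate=1)
print(f"€{diff:,.0f}")
```

In a linear model the effects simply add up, which is also why the caveat about a better-fitting functional form matters for players at the extremes.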

Do defensive work rates also negatively impact a player’s current wages? The answer is no, which is more intuitive: high attacking work rates and high defensive work rates have the highest impact on a player’s current wages (about €2,000 each). A medium attacking work rate delivers a greater positive impact on weekly wages (an increase of about €1,500) than a medium defensive work rate, which increases weekly wages by about €950. Improvements in overall and potential scores also bump up wages (€1,500 and €500 respectively). Age, once again, has a negative impact, but not one as drastic as on market value: as a player gets a year older, his weekly wages tend to fall by about €200 on average. Of course, the same disclaimer applies here as well. A different functional form will likely yield more robust predictive results, a topic I will address in a future blog post.

This graph depicts the impact of various factors on the player’s current wages. The blue-colored lines are factors that positively impact wages; the red-colored line negatively impacts wages. Only statistically significant variables are included, i.e. only factors that have a meaningful relationship with a player’s wages are depicted.

Well, that’s that for this week! I hope you had fun dissecting this data with me, and getting to learn a little bit more about football along the way (even if you happen to be a crazy football fanatic!). If you liked this post, please let me know. More importantly, if you have suggestions (constructive criticism, tips and tricks etc.) please feel free to leave me a comment or get in touch!
