How do the attributes of football players determine their effectiveness?

Kingsley Obeng
10 min read · Aug 20, 2020


A data science analysis of Premier League footballers' attributes, as rated in the EA Sports FIFA series, and their correlation with Fantasy Premier League scores, with an insight into the technical work with machine learning models.

Introduction

I was always intrigued to know whether there are certain attributes of a player that determine how successful goalkeepers, defenders, or even strikers will be. Looking at Messi and Ronaldo, for example, the current best players in the world: are there particular personal attributes that guarantee their effectiveness, their success? With two such different players performing exceedingly well in almost the same position, I could merely identify speed as a common factor. However, the fastest man in the world, Usain Bolt, despite his interest and professionalism in sport, could not make it into a professional football league. So what exactly are the factors that determine a player's "success"?

The fact that in December 2019 Magnus Carlsen, the current world chess champion, was also No. 1 in the Fantasy Premier League implies that the topic can be approached algorithmically. If a rational mind can select the optimal team to succeed, maybe we can use data science and general data analysis to predict players' scores before the season begins, mimicking the success of AlphaZero, the computer program developed by the artificial intelligence research company DeepMind to master the game of chess.

Overall, my aim was to establish whether there is a correlation between footballers' attributes and their final fantasy score at the end of the season. I fed the data into several models to see whether a significant correlation exists and to select the most accurate model for this regression task. Accuracy is measured by Mean Absolute Error and R-Squared.

Methodology

Assumptions

Let me first define "success". To keep it concrete and on a mathematical scale, I opted for the Fantasy Premier League scores. For the sake of this analysis, I am assuming that the points awarded to each player are calculated as rationally as possible. I also assume that all players in the Premier League have the same chance of being effective no matter which club they play for. Finally, I am adding the midfield data to the strikers and calling them all forwards. This is a conscious simplification: the midfielders who score highly are the offensive midfielders, who register assists and goals much like a striker.

Datasets

The datasets used for this project are based on the 2018–2019 season: the EA Sports FIFA19 attributes and the Fantasy Premier League scores of 2018–2019, which line up with each other. After all the modeling, I used the FIFA20 footballers' attributes to predict a final fantasy score with my best model. All my datasets are from kaggle.com, where it is stated that the FIFA19 data was scraped from the publicly available website https://sofifa.com.

Cleaning & Merging Data

  • Some players' names are spelled differently across the datasets, so a few names had to be changed manually before a common key could be built.
  • One dataset might state only the first letter of the first name plus the surname, another the full name. For example, the player "Kepa Arrizabalaga" was "Kepa" in one dataset and "Arrizabalaga" in another.
  • The club names needed to be synchronized.
  • Positions are grouped in the simplified manner described above.
  • The personal attributes height and weight needed to be converted into the metric system (a small conversion sketch follows below).
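For the last point, a minimal conversion sketch, assuming height is stored as a feet'inches string and weight as a "lbs" string (the exact formats in the Kaggle dump may differ):

```python
def to_metric(height_str, weight_str):
    """Convert FIFA-style height (e.g. "5'11") and weight (e.g. "154lbs") to cm and kg."""
    feet, inches = height_str.split("'")
    height_cm = round(int(feet) * 30.48 + int(inches) * 2.54, 1)
    weight_kg = round(int(weight_str.replace("lbs", "")) * 0.453592, 1)
    return height_cm, weight_kg

print(to_metric("5'11", "154lbs"))  # -> (180.3, 69.9)
```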

The main aim was to merge the datasets as completely as possible without including any wrong values or merging the wrong rows. So after creating a new unique key from surname and club and merging the datasets, the reward was 387 legitimate players successfully matched. Given the sources, this is a good enough result to move on to the next stage of the analysis.
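As a rough sketch of how this kind of surname-and-club merge can be done with pandas (file and column names below are placeholders, not the exact labels in the Kaggle files):

```python
import pandas as pd

# Placeholder file names for the two sources.
fifa = pd.read_csv("fifa19_attributes.csv")      # EA Sports FIFA19 attributes
fantasy = pd.read_csv("fpl_scores_2018_19.csv")  # Fantasy Premier League totals

# Build a common key from surname and club, so spelling differences
# in the first names do not prevent a match.
for df in (fifa, fantasy):
    df["surname"] = df["Name"].str.split().str[-1].str.lower()
    df["key"] = df["surname"] + "_" + df["Club"].str.lower().str.strip()

# An inner join keeps only players present in both sources (387 in this project).
merged = fifa.merge(fantasy, on="key", suffixes=("_fifa", "_fpl"))
```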

Analysis of Steps

After cleaning and merging the data, I performed the same analysis steps for all positions:

  • Selection of relevant columns
  • Selection of relevant rows
  • Model Building

Selection of relevant columns

I dropped several columns that did not have a meaningful correlation with the points that the players achieved, defined as all columns with an absolute correlation of less than 25% with the Points column.

Presentation of python code for the exclusion of irrelevant attributes
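A minimal sketch of such a filter, assuming the merged table sits in a DataFrame called merged with the season total in a "Points" column (both names are placeholders):

```python
# Correlation of every numeric attribute with the season total.
corr_with_points = merged.corr(numeric_only=True)["Points"].abs()

# Keep only columns whose absolute correlation with "Points" is at least 0.25.
relevant_cols = corr_with_points[corr_with_points >= 0.25].index.tolist()
filtered = merged[relevant_cols]
```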

Delving further into the columns retained, it is important to reduce them by testing for multicollinearity: the model needs relevant, and only relevant, columns to fit the data appropriately. This was accomplished by dropping attributes that had above 75% correlation with other attributes. For example, in the case of goalkeepers, there was an 85% correlation between GKDiving and Reactions, meaning that only one of the two needed to be kept. Through this process, the attributes Reactions, Vision, Composure, GKDiving, GKHandling, and GKPositioning were dropped.
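One common way to perform this reduction, sketched here with the placeholder DataFrame from the previous step:

```python
import numpy as np

# Pairwise absolute correlations among the candidate attributes (excluding the target).
corr = filtered.drop(columns=["Points"]).corr().abs()

# Look only at the upper triangle so every pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one attribute from every pair correlated above 0.75.
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
reduced = filtered.drop(columns=to_drop)
```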

Selection of relevant rows

The only rows omitted were those where "Points", the overall fantasy score of a player, was nil or too low (0 < points < 10) to have a significant impact.
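In pandas terms this amounts to a simple filter (placeholder names as before; the exact threshold varies by position later in the article):

```python
# Keep only players whose season total reaches the threshold
# (10 here; the goalkeeper analysis further down uses 2 instead).
reduced = reduced[reduced["Points"].fillna(0) >= 10]
```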

Model Building

The final table of attributes and corresponding scores was then fed into models to determine a mathematical pattern leading to the results. Firstly, the data was split into a training set and a test set. With the help of scikit-learn (sklearn), a Python library, this split can be made and the data fitted to the following five model types:

  • Logistic Regression
  • Decision Tree Regressor
  • SVM Regressor
  • Naive Bayes
  • Random Forest

All the above models are, again, from sklearn. They are fitted to the training data and then evaluated for accuracy of fit. Naturally, it made sense to try out several models to establish the most accurate one. Below is an example of the code for the first model.

Python code implementing sklearn library to fit, train and predict
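A minimal sketch of that fit/train/predict pattern, assuming the cleaned table is available as a DataFrame named reduced with the season total in "Points" (placeholder names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X = reduced.drop(columns=["Points"])
y = reduced["Points"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression treats every point total as a discrete class,
# which is how a classifier can be applied to this scoring problem.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("R²: ", r2_score(y_test, y_pred))
```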

All datasets and analyses can be found in my GitHub repository linked at the end of this article.

Feature Engineering & Results

Goalkeepers

Starting my detailed analysis with the goalkeepers was frustrating, as this position carried the fewest players. Since most teams field at most two goalkeepers in a season, the paucity of data for this position was always something to keep in mind.

Having dissected the goalkeepers from the main 387 players, it was time to apply some feature engineering.

One can see from the plot below that there are unfortunately a few zero-pointers in the dataset:

A plot of the distribution of points, with a kernel density estimate, for goalkeepers
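A sketch of how such a distribution plot can be drawn with seaborn ("goalkeepers" is a placeholder for the goalkeeper slice of the merged table):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of season totals for the goalkeepers, with a kernel density estimate.
sns.histplot(goalkeepers["Points"], kde=True, bins=20)
plt.xlabel("Fantasy points, 2018-19 season")
plt.title("Distribution of goalkeeper points")
plt.show()
```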

To yield representative results, players who had been awarded 2 points or less as a season total were excluded. The attributes used had the following correlations, as depicted by sns.heatmap:

Correlation heatmap (Seaborn Python library) showing attributes incorporated into the goalkeeping analysis
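The heatmap itself can be produced along these lines (again using the placeholder goalkeeper DataFrame):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the goalkeeper attributes kept after the column selection.
gk_corr = goalkeepers.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(gk_corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Goalkeeper attribute correlations")
plt.show()
```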

From the diagram above we can summarize, from the absolute correlation values, that GKPositioning (0.68), GKDiving (0.67), GKHandling (0.64), GKReflexes (0.63), and Age (0.57 absolute) are the attributes that determine the overall success of a goalkeeper the most.

Model Results of MAE and R² of the goalkeepers

The table above illustrates the results obtained by testing the different models. Since an R-Squared of 1.0 can be classed as overfitting, the best fit is from the Naive Bayes (Multinomial) model.

Defenders

For the defenders, I decided to remove the rows with fewer than 10 points. The describe function of the defenders' dataset and the plot below both give a clear indication of a skewed distribution.

Description and plot of the points scored by defenders

The remaining attributes and their correlations with "Points" can be seen below:

Correlation heatmap showing attributes incorporated into the analysis of defenders

From the heatmap above we can conclude that Reactions (0.51), StandingTackle (0.48), SlidingTackle (0.47), and Interceptions (0.45) were the dominating factors determining the points of a defender.

Overall, the selected models produced the following MAE and R-Squared results:

Model Results of MAE and R² of the defenders

As displayed above, the random forest model performed with the lowest MAE and the R-Squared value closest to 1.

Forwards

Merging the midfielders with the forwards (FWD) made this group the biggest dataset.

The proportion of the different positions
A plot of the distribution of points, with a kernel density estimate, for forwards

With more data points, the distribution can be seen to be more gradual, allowing for more accurate results.

Correlation heatmap showing attributes incorporated into the analysis of forwards

The combined offensive midfield and striker position seems to involve more attributes, which makes sense, as several more attributes compared to the other positions, such as Dribbling, Finishing, and ShotPower, would be expected to have an impact.

Model Results of MAE and R² of the forwards

Interpretation of Results (MAE & R-Squared)

Generally, there are two main factors of interest when analyzing the fit of models: Mean Absolute Error (MAE) and R-Squared. The MAE measures the average absolute error between paired observations, here between predicted and actual points. The lower the differences, the better the model performs.

"R-Squared can be a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model." [Investopedia, 18 March 2020]

The coefficient R², the value returned by these models, is defined as 1 - u/v, where u is the residual sum of squares, ((y_true - y_pred) ** 2).sum(), and v is the total sum of squares, ((y_true - y_true.mean()) ** 2).sum().

The best possible score is 1.0 and it can be negative because the model can be arbitrarily worse. A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.
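Written out as a pair of small helper functions (equivalent to sklearn's r2_score and mean_absolute_error), the two metrics look like this:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - u/v, the same definition sklearn's .score() uses for regressors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1 - u / v

def mae(y_true, y_pred):
    """Mean Absolute Error: the average absolute gap between prediction and truth."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).mean()

print(r_squared([2, 4, 6], [2, 4, 6]))  # 1.0 for a perfect fit
print(r_squared([2, 4, 6], [4, 4, 4]))  # 0.0 for the constant-mean model
```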

Predictions

Using the FIFA20 attributes, I was naturally intrigued to find out how the points would be predicted by my best model. Looking at a shortlist of forwards gave me the following predictions.

Extract of code showing the predicted scores of FIFA20 attributes
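A hedged sketch of how such a prediction table can be produced; "fifa20", "feature_cols", "best_model", and the shortlist spellings are placeholders for the FIFA20 attribute table, the columns the forwards model was trained on, that trained model, and the names as they appear in the dataset:

```python
# Score the FIFA20 players with the trained forwards model.
fifa20_pred = fifa20.copy()
fifa20_pred["PredictedPoints"] = best_model.predict(fifa20_pred[feature_cols])

# Shortlisted forwards; the exact spelling in the dataset may differ.
shortlist = ["K. De Bruyne", "E. Hazard", "David Silva", "N. Kanté"]
print(fifa20_pred.loc[fifa20_pred["Name"].isin(shortlist), ["Name", "PredictedPoints"]])
```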

We can deduce from the predictions above that the attributes of De Bruyne, Hazard, D. Silva, and Kante would be predicted to yield 99, 90, 42, and 122 points respectively.

Overall Conclusion

From the analysis of the various positions, Random Forest proved to have the most reliable results, ranking second only for the goalkeepers, probably due to the paucity of data. Generally, the SVM Regressor gave a very low R² value and was therefore the least convincing model, but it still yielded a good MAE score, showing that it is, to some extent, capable of fairly accurate predictions.
However, we have to be careful not to draw too many conclusions from such a small data sample. In data science we would ideally analyze the data more thoroughly, for example via deep learning, but here the paucity of data hinders further analysis of that kind.
We can thus conclude that it is, to some extent, possible to predict Fantasy Premier League scores based on players' attributes, because most players perform in such a way that the scores cluster around the mean, which also makes MAE an appropriate metric.

For those thinking of using this as a future team predictor for your fantasy football team at the beginning of the season :), let me add a few points that would need to be incorporated. It would be important to look at a "price-performance" attribute relating the price of a player to their predicted points. A weighting factor for the club associated with the player would also need to be considered, as well as how injury-prone a player is, but all that is beyond the scope of this project :)

Please find the GitHub file here with the details of the python code and analyses:

https://github.com/Kingz2020/Capstone-soccer-data-project

References

Please find below a few useful references used in this article. They cover both thematic and technical aspects.

https://en.wikipedia.org/wiki/Mean_absolute_error


Kingsley Obeng

Studied actuarial science, worked as an actuary and software developer. Now getting into Data Science! My family and music are my hobbies.