Soccer data and predicting L’Équipe ratings
Data Mining, visualization and Machine Learning around soccer data
Soccer is not as statistics-focused as basketball or baseball can be. But there are some important data in soccer: goals, assists and possession are the most common, but many others can describe what we are watching: tackles, crosses, passes, dribbles, etc.
I have gathered per-game data for players and teams in the French Ligue 1. I also have the L’Équipe ratings (L’Équipe is a famous French sports newspaper) for the 2016/2017 season. The goal of this project is to visualize soccer data, look for interesting insights and try to predict the L’Équipe ratings (they are given by journalists, so they are often subjective).
2016/2017 (not yet Conforama) Ligue 1 season
For each player and each game, I have these features:
'Crosses', 'Long Ball', 'ThroughBall', 'AerialsWon', 'Assist', 'BlockedShots', 'Clearances', 'Crosses', 'Day', 'Dispossessed_game', 'Dribbles', 'Fouled', 'Fouls', 'Goal', 'Goal Opponent', 'Goal Team', 'Interceptions', 'KeyPasses', 'Offsides_won_game', 'Opponent', 'PassAccuracy', 'Passes', 'Penalty Missed', 'Home/Away',
'Red Card', 'Shot On Post', 'Shots', 'Shots On Target', 'Team',
'TotalTackles', 'Touches', 'UnsTouches', 'Yellow Card', 'Age',
'Position', 'In LineUp'
I use Pandas DataFrames for my preprocessing. I have about 10,000 rows (20 teams, 35 games, on average 14 players per game, subs included). Here are some of the features for a few rows.
Before training my model, I have to preprocess my data: fill NaN values, convert numerical values to float, and convert categorical data to int (with sklearn LabelEncoder).
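The three preprocessing steps above can be sketched roughly as follows. The toy values, the choice of filling missing numbers with 0 and the names `numeric_cols`, `categorical_cols` and `encoders` are assumptions for illustration, not the exact code used on the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real per-player, per-game data (made-up values).
games = pd.DataFrame({
    "Goal": ["1", None, "0"],
    "PassAccuracy": ["85.5", "78", None],
    "Team": ["Monaco", "PSG", "Monaco"],
    "Position": ["FW", "DC", "MC"],
})

numeric_cols = ["Goal", "PassAccuracy"]
categorical_cols = ["Team", "Position"]

# Parse numeric columns as floats; missing or unparseable entries become NaN,
# which are then filled with 0 (an assumption, another fill value may suit better).
games[numeric_cols] = (
    games[numeric_cols]
    .apply(pd.to_numeric, errors="coerce")
    .fillna(0.0)
    .astype(float)
)

# Encode each categorical column as integers, keeping one encoder per column
# so the integer codes can be mapped back to labels later.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    games[col] = encoders[col].fit_transform(games[col].astype(str))
```

Keeping the fitted encoders around makes it possible to call `encoders["Team"].inverse_transform(...)` when labelling plots.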
After preprocessing my data and getting a well-formatted data frame, I looked at the data to understand it a little bit more. I wanted to see if there were correlations between features and which features were really important for the ratings. The first feature I focused on was the player position. There were too many positions (6 types of midfielder, 5 types of attacking player, 4 types of defender), so I mapped those categories into defender, attacking, midfielder and goalkeeper, then plotted the position distribution and the ratings distribution depending on the position:
As you can see, in proportion, it’s hard to get a high rating as a defender compared to attacking and midfield players. In the same way, midfielders get fewer really bad ratings than the other positions. If you don’t want to get bashed by L’Équipe, be a midfielder!
As a PSG fan, I am really disappointed by this season but objectively, Monaco had a great season with great young players who have sometimes amazed me (Mbappé, Silva, Lemar, Mendy …). So I wanted to check if we can see that trend in the ratings. I plotted the ratings distribution for the best French teams and the mean rating for every L1 team.
So indeed, Monaco was really the best team this year … And Paris is not even second for L’Équipe!
Intuitively, the team and the position seem to be two important factors for the L’Équipe ratings (in addition to the number of goals, assists and shots).
By computing the correlation matrix, we can see that (obviously) some features are correlated: shots on target/shots & goals (yeah, I know we can’t score with shots that are not on target) and key passes & crosses. But those features are really different, so I kept all of them.
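Here is a minimal sketch of that position mapping. The detailed position codes (`DC`, `AMC`, ...), the `Rating` column name and the toy values are hypothetical, since the real labels are not listed here.

```python
import pandas as pd

# Hypothetical detailed position codes; the real dataset's labels may differ.
POSITION_GROUPS = {
    "GK": "goalkeeper",
    "DC": "defender", "DL": "defender", "DR": "defender",
    "DMC": "midfielder", "MC": "midfielder", "ML": "midfielder", "MR": "midfielder",
    "AMC": "attacking", "AML": "attacking", "AMR": "attacking", "FW": "attacking",
}

# Made-up ratings, one row per player and game.
ratings = pd.DataFrame({
    "Position": ["GK", "DC", "MC", "FW", "AMC", "DL"],
    "Rating": [6, 5, 6, 7, 4, 5],
})

# Collapse the detailed positions into the four broad groups.
ratings["PositionGroup"] = ratings["Position"].map(POSITION_GROUPS)

# Share of each rating inside each group (every row sums to 1), which makes
# groups of different sizes comparable "in proportion".
rating_share = pd.crosstab(
    ratings["PositionGroup"], ratings["Rating"], normalize="index"
)
```

A table like `rating_share` can be passed to `rating_share.plot(kind="bar")` to compare the rating distributions by position.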
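The correlation check can be sketched like this, on synthetic data where shots on target are drawn from shots and goals from shots on target. The generated numbers and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stats: each stage is a random subset of the previous one,
# so the three columns are correlated by construction.
shots = rng.poisson(2.0, size=500)
shots_on_target = rng.binomial(shots, 0.6)
goals = rng.binomial(shots_on_target, 0.3)

stats = pd.DataFrame({
    "Shots": shots,
    "Shots On Target": shots_on_target,
    "Goal": goals,
    "Fouls": rng.poisson(1.5, size=500),  # independent of the others
})

# Pearson correlation between every pair of columns.
corr = stats.corr()

# Keep only the upper triangle so each pair appears once, then list the
# pairs whose absolute correlation is above the threshold.
threshold = 0.5
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > threshold]
```

With Seaborn, `sns.heatmap(corr, annot=True)` gives the usual visual version of the same matrix.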
Learning and evaluating
I have benchmarked Random Forest and Ridge Regression models with the super useful scikit-learn.
I began with a simple Ridge Regression (linear regression where the loss function is the linear least squares function and the regularization is given by the L2-norm). To evaluate it, I use 10-fold cross validation (test on a partition of 1/10 of the dataset, ten times, and average those ten test errors). The cross-validation mean squared error (MSE) is 0.83, which is not really good knowing that the L’Équipe ratings are between 0 and 10, but it isn’t too important. The R2 (coefficient of determination) score is 0.26. So the relationship isn’t really linear, but this regression can still help us find the most important variables for the ratings.
As you can see, the most important features are pretty obvious for a soccer fan: missing a penalty, getting a red/yellow card and conceding goals are important negative factors, whereas goals, assists and shots on target/on the post are positive factors.
The second model that I tried is the famous bagging model Random Forest. To evaluate this model, I used the Out of Bag (OOB) score in order to validate the model during training (as accurate as using a test set of the same size as the training set). The OOB score is 0.34, so this is way better than the regression (R2 for this model is 0.91).
Both of my models were fine-tuned but I couldn’t get a better score; I think I would need more data. To confirm this, I computed the OOB score for different sizes of the dataset (30%, 40%, etc.), and we can see that the OOB score gets lower as the dataset gets smaller.
For now, it’s not a true success, even if the RF OOB score isn’t too bad, but I hope to get more data next year and improve this model.
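As a rough sketch of this evaluation, here is a Ridge regression scored with 10-fold cross validation in scikit-learn. The data comes from `make_regression` instead of the real ratings, and `alpha=1.0` and the standardization step are assumptions (standardizing is one common way to make coefficient sizes comparable across features).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the ratings.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Standardize, then fit a linear model with an L2 penalty.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 10 folds: each tenth of the data is used once as the test partition.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    model, X, y, cv=cv, scoring=["neg_mean_squared_error", "r2"]
)

# scikit-learn reports the negated MSE, so flip the sign back.
cv_mse = -scores["test_neg_mean_squared_error"].mean()
cv_r2 = scores["test_r2"].mean()

# Refit on all rows to read the coefficients (one per standardized feature).
model.fit(X, y)
coefficients = model.named_steps["ridge"].coef_
```

Sorting `coefficients` by value then gives a ranking of the strongest positive and negative features.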
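A minimal sketch of the OOB evaluation with scikit-learn follows, again on synthetic data and with assumed hyperparameters. For `RandomForestRegressor`, `oob_score_` is an R2 computed on the rows each tree did not see during training, while a score computed on the training rows themselves is typically much higher.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the feature matrix and the ratings.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# oob_score=True asks the forest to predict each row using only the trees
# whose bootstrap sample did not contain that row.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

oob_r2 = forest.oob_score_     # R2 on out-of-bag predictions
train_r2 = forest.score(X, y)  # R2 on rows the trees were trained on (optimistic)
```

Comparing `oob_r2` and `train_r2` is a quick way to see how much the forest overfits the training rows.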
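The dataset-size experiment can be sketched as below: fit one forest per fraction of the rows and record its OOB score. The synthetic data, the fractions and the forest settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the feature matrix and the ratings.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Shuffle once so every fraction is a random subset of the rows.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))

oob_by_fraction = {}
for fraction in (0.3, 0.5, 0.7, 1.0):
    n_rows = round(fraction * len(X))
    subset = order[:n_rows]
    forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    forest.fit(X[subset], y[subset])
    # OOB R2 of the forest trained on this fraction of the data.
    oob_by_fraction[fraction] = forest.oob_score_
```

If the score is still climbing at the largest fraction, that suggests more data would help.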
Soccer data for 2016/2017 for some European leagues
I have also gathered average statistics for players and teams in England, France, Italy, Spain and Germany.
I did some preprocessing to gather the data and get well-formatted Pandas data frames. The aim of this part is to see if there are some interesting insights about this season.
Moreover, I want to improve my data visualization skills and use Matplotlib and Seaborn to plot graphs. I want to see if the plots show the same trends that I saw in games this season: player contributions, wonderkids, etc.
I watched almost all PSG games this season and one player was scoring really often: Edinson Cavani. He scored many of his team’s goals, so I was wondering how his contribution compares to that of other great strikers.
Indeed, Cavani scored 42% of PSG’s goals (we really need another striker!), the same share as Aubameyang with Dortmund. The MSN (Messi, Suarez, Neymar) has struck again: Messi & Suarez scored 57% of their team’s goals. Another interesting stat: Falcao, Monaco’s best striker, scored only 19% of his team’s goals.
This year was Mbappé’s breakout year. He’s only 18 years old and has incredible stats for his age, but how does he compare to other young players?
Indeed, Mbappé is impressive at 18 years old: he’s the 4th best scorer per game among players aged 23 or less (more than 0.5 goals/game). With 3 French players, France is the most represented country in this top 10.
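Such a contribution can be computed with a grouped sum, as in this sketch. The players, teams and goal counts are made up, and the team total is taken as the sum of its players' goals, which is an assumption (it ignores own goals scored by opponents).

```python
import pandas as pd

# Made-up goal counts, only to show the computation.
players = pd.DataFrame({
    "Player": ["Striker A", "Winger B", "Midfielder C", "Striker D", "Winger E"],
    "Team": ["Team X", "Team X", "Team X", "Team Y", "Team Y"],
    "Goals": [20, 10, 10, 15, 45],
})

# Total goals of each player's team, aligned row by row with the players.
team_goals = players.groupby("Team")["Goals"].transform("sum")
players["GoalShare"] = players["Goals"] / team_goals

# Top scorer of each team with the share of the team's goals.
top_rows = players.groupby("Team")["Goals"].idxmax()
top_scorers = players.loc[top_rows, ["Team", "Player", "GoalShare"]]
```

`top_scorers` can then be plotted as a bar chart to compare strikers across teams.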
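A ranking like this one can be sketched as a filter followed by a sort. The column names, the toy rows and the minimum number of games are assumptions (a minimum avoids ranking players with very few appearances).

```python
import pandas as pd

# Made-up season averages, only to show the computation.
players = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Age": [18, 22, 25, 23],
    "Goals": [15, 10, 30, 2],
    "Games": [30, 40, 30, 3],
})

MAX_AGE = 23
MIN_GAMES = 10  # assumed cut-off, not taken from the original analysis

# Keep young players with enough games, then rank by goals per game.
young = players[(players["Age"] <= MAX_AGE) & (players["Games"] >= MIN_GAMES)].copy()
young["GoalsPerGame"] = young["Goals"] / young["Games"]
top_young = young.sort_values("GoalsPerGame", ascending=False).head(10)
```

Counting nationalities in the result (with a hypothetical `Country` column) would be a `value_counts()` call on `top_young`.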