Uncovering Hidden Gems — Supervised Machine Learning for Soccer Transfer Bargains

Daniel Adams
INST414: Data Science Techniques
May 3, 2024

Introduction

As in any sport, the quality of a team’s players is a significant factor in its ability to win. Soccer is no different, so quality players come at a significant premium. While teams can develop players through their academy systems at relatively low cost, most obtain players through the transfer market. The transfer market in soccer is very different from the trading systems of the American sports leagues. Since soccer is a global game, the transfer market is a global market in which players from any nation’s domestic league can be acquired. Another key difference is that a club acquires a player by paying the player’s current club a sum of money known as a transfer fee. As a result, a massive portion of a club’s expenses is allocated to transfers. Even with that much money set aside, many clubs struggle to find quality players at a reasonable price.

Question

As mentioned, paying the right price for a player is critical for a team that wants to succeed while balancing its finances. The best clubs in the world tend to have the most money, so they have no trouble paying enormous transfer fees for the best players. This creates a “rich get richer” scenario in the transfer market. Even so, other clubs still aspire to become the best and keep trying to sign the best players they can afford. These clubs will certainly not be signing the highest-profile players, but can they still sign hidden gems that go unnoticed?

Stakeholder and Approach

The stakeholders for this analysis are clubs with financial constraints. Since the analysis focuses on attacking players, the specific stakeholders are financially constrained clubs looking for a striker, the main goal-scoring position on a soccer field. To find undervalued players, the analysis uses a linear regression machine learning model that attempts to predict a striker’s transfer value from the player’s goals and assists, known together as goal contributions. Linear regression was chosen because the transfer value and goal contributions are both numerical, and the value being predicted is continuous. The supervised labels are each striker’s current transfer value; these were taken from the dataset and act as the ground truth. The model’s inputs are each striker’s goals and assists, which were also sourced from the dataset.
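
To make the approach concrete, here is a minimal sketch of ordinary least squares, the fitting criterion behind linear regression, written in plain Python so it runs without extra libraries. It is not the code from the analysis (that lives in the GitHub repository linked at the end). The rows, the function names fit_ols, solve, and predict, and the example numbers are all hypothetical.

```python
# Hypothetical striker rows: (goals, assists, market value in millions of euros).
# These numbers are invented for illustration and are not from the dataset.
ROWS = [
    (10, 2, 21.0),
    (5, 5, 18.0),
    (20, 1, 33.0),
    (8, 8, 27.0),
    (15, 4, 31.0),
]


def solve(a, b):
    """Solve the square linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows):
    """Return [intercept, goal_coef, assist_coef] minimising squared error."""
    # Design matrix with a leading 1.0 for the intercept term.
    X = [[1.0, float(g), float(a)] for g, a, _ in rows]
    y = [v for _, _, v in rows]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]
    return solve(xtx, xty)


def predict(beta, goals, assists):
    """Predicted value in millions of euros for one striker."""
    return beta[0] + beta[1] * goals + beta[2] * assists


beta = fit_ols(ROWS)
print("predicted value (millions):", round(predict(beta, 12, 3), 1))
```

In practice a library model such as Sklearn’s LinearRegression fits the same kind of least squares line; the hand-written version is only meant to show what “predicting value from goals and assists” amounts to.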

Data Cleaning

Overall, the dataset acquired from Kaggle.com was relatively clean. To avoid outlier values, the dataset was limited to players with transfer values between 10,000,000 and 100,000,000 euros. The transfer values were then divided by 1,000,000 to make them easier to read. The goals and assists columns also needed work. The dataset provided goals and assists per game, so these values were floating point numbers. To get each player’s goal contribution totals, both the “goals” and “assists” columns were multiplied by the player’s number of appearances.
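
The cleaning steps above could look roughly like the sketch below. The field names (value, goals_per_game, and so on) are assumptions rather than the actual Kaggle column names, the sample players are invented, and two details are my own guesses: the value bounds are treated as inclusive, and the totals are rounded to whole numbers.

```python
MIN_VALUE = 10_000_000   # lower bound on transfer value, in euros
MAX_VALUE = 100_000_000  # upper bound on transfer value, in euros

# Hypothetical raw rows; field names are assumed, not taken from the dataset.
raw_players = [
    {"name": "Player A", "value": 25_000_000,
     "goals_per_game": 0.5, "assists_per_game": 0.1, "appearances": 30},
    {"name": "Player B", "value": 150_000_000,
     "goals_per_game": 0.9, "assists_per_game": 0.3, "appearances": 35},
    {"name": "Player C", "value": 2_000_000,
     "goals_per_game": 0.2, "assists_per_game": 0.0, "appearances": 12},
]


def clean(players):
    """Filter out extreme values, rescale to millions, and total up goals and assists."""
    cleaned = []
    for p in players:
        # Drop players outside the 10M to 100M euro range (bounds assumed inclusive).
        if not (MIN_VALUE <= p["value"] <= MAX_VALUE):
            continue
        cleaned.append({
            "name": p["name"],
            # Express the value in millions of euros for readability.
            "value_millions": p["value"] / 1_000_000,
            # Per-game rate times appearances gives a season total.
            # Rounding to a whole number is an assumption of this sketch.
            "goals": round(p["goals_per_game"] * p["appearances"]),
            "assists": round(p["assists_per_game"] * p["appearances"]),
        })
    return cleaned


for row in clean(raw_players):
    print(row)
```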

Data

The dataset contains player appearances, goals, assists, market value, and other data from the 2022–2023 season. Once cleaned, it was ready for analysis by the supervised learning model. The data used includes:

  • Name: The name of the player in order to identify prediction values
  • Goals: Used by the linear regression model to predict player value
  • Assists: Used by the linear regression model to predict player value
  • Current Value: Ground truth label used to compare accuracy of the linear regression model

Data Analysis

The analysis aimed to find players whose predicted value is higher than the transfer value given by the ground truth. If the linear regression model values a player above his actual market value, the stakeholder has found a player who performs above his price. These players are considered hidden gems among transfer targets, since their attacking output exceeds the price tag the market has placed on them. Players that fit these criteria are listed below:

A striker I would like to highlight from this list is Dominic Solanke. During the timeframe of this dataset, Solanke’s club, AFC Bournemouth, was in the second tier of English football. As a result, the market severely undervalued Solanke, as many figured he could not perform at the same level in the first tier. Bournemouth went on to achieve promotion to the first tier and have performed well this season. In 39 appearances this season, Solanke has 20 goals and 4 assists. 19 of these goals came in the English Premier League (first tier), making him the 5th highest scoring striker in the league. Looking back at his transfer value last year, any club that recognized Solanke as a hidden gem would have reaped significant rewards this year. Solanke’s transfer value currently stands at 35 million euros, nearly double what it was a season ago and just over 2 million euros away from the model’s prediction. While any potential suitors have missed their opportunity to sign Solanke at a bargain fee, this linear regression model suggests that there are currently hidden gems spread throughout Europe.
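
The hidden gem rule used in this section, a predicted value above the market value, can be sketched in a few lines of Python. The players and numbers here are invented placeholders, hidden_gems is a hypothetical helper name, and sorting by the size of the gap is my own addition for readability rather than something the analysis is known to do.

```python
# Hypothetical strikers with market ("actual") and model ("predicted") values,
# both in millions of euros. None of these figures come from the dataset.
players = [
    {"name": "Striker A", "actual": 18.0, "predicted": 30.5},
    {"name": "Striker B", "actual": 40.0, "predicted": 28.0},
    {"name": "Striker C", "actual": 12.0, "predicted": 15.0},
]


def hidden_gems(rows):
    """Keep players the model values above the market, largest gap first."""
    gems = [
        {**p, "gap": p["predicted"] - p["actual"]}
        for p in rows
        if p["predicted"] > p["actual"]
    ]
    return sorted(gems, key=lambda p: p["gap"], reverse=True)


for gem in hidden_gems(players):
    print(gem["name"], "undervalued by", gem["gap"], "million euros")
```

A positive gap only says that a player’s goals and assists are high relative to his price; it says nothing about the other factors the market may be pricing in.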

Model Errors

As shown in the above statistics, the model’s predictions were significantly off from the ground truth values. Sklearn’s .score() function gave the model a score of 4.02%. For a linear regression, this score is the coefficient of determination (R²) rather than an accuracy percentage, meaning goals and assists explained only about 4% of the variation in transfer values. The 20 players shown above are therefore far from the only ones whose predicted value differed from the market’s. I believe the model missed on so many predictions because a player’s value is determined by more than just goal contributions. In the Solanke example, my domain expertise allowed me to understand that playing in the second division significantly lowered the value the market placed on him. The linear regression model does not know this, so it gives a cold prediction based simply on attacking output. In other words, the model is not swayed by a player’s league or nationality; it lacks the biases that people apply to players.
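
For regression models, Sklearn’s .score() returns the coefficient of determination, R², rather than a share of correct predictions. The sketch below computes R² by hand to show what a number like 4.02% means. The function name r_squared and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual variation / total variation)."""
    mean = sum(actual) / len(actual)
    # Squared error left over after using the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error from always guessing the mean value.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented market values in millions of euros.
actual = [10.0, 20.0, 30.0, 40.0]

# Predicting the average for everyone explains none of the variation.
print(r_squared(actual, [25.0, 25.0, 25.0, 25.0]))

# Predicting every value exactly explains all of it.
print(r_squared(actual, actual))
```

A score near 0 therefore means the model does little better than guessing the average transfer value for every striker, which matches the weak fit described above.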

Limitations

A key limitation of the analysis is that the data used for the linear regression model is from the past. The analysis therefore does not directly answer the stakeholder’s question, as the opportunity to sign these players at a bargain fee may have passed. However, it still shows the stakeholder that this method can be used to find future hidden gems. Another significant limitation is scope. Restricting the linear regression to strikers was a deliberate choice, since I felt more data would be necessary to make good predictions for midfielders, defenders, and wingers. That limitation does not undermine the striker value predictions, as this position is concerned primarily (sometimes only) with putting the ball in the net.

GitHub Code:

https://github.com/dadams16/INST414AdamsModules/tree/main
