Machine Learning Project: Predicting Football Players’ Market Value
If you want to predict something such as price or sales, regression can be a good solution. In this post, I am going to apply a machine learning algorithm to predict football players’ market value.
- Define The Problem
- Web Scraping
- Data Cleansing
- Exploratory Data Analysis
- Building a Model
- Evaluating Model’s Result
- Conclusion
Before diving in, I would like to share my GitHub repo; if you are interested, you can find the Jupyter Notebook code there.
Define The Problem
Every data science project starts with a problem / question. In my project, my business need scenario is:
Predict football players’ market value and determine who is overvalued or undervalued.
First, I need to import the libraries and set some default display options.
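A minimal sketch of that setup might look like this (the exact option values are my choice for illustration, not necessarily the notebook’s):

```python
import pandas as pd

# Widen pandas' default display so a wide player data frame isn't truncated
pd.set_option("display.max_columns", 100)               # show up to 100 columns
pd.set_option("display.width", 200)                     # allow wider console output
pd.set_option("display.float_format", "{:.2f}".format)  # two decimals for floats
```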
Web Scraping
Web scraping is a technique for collecting data from the internet and parsing it into a meaningful form. If there is no direct way to download the data, you need to extract it yourself into a usable structure such as a data frame. If you would like to learn more about web scraping, you can visit my other post, Web Scraping Using Python BeautifulSoup.
In this project I used the sofifa dataset. The main page of sofifa is shown below:
The main page lists player characteristics data, and each page shows 60 players, so I needed to build a loop to get all players from sofifa. I used the BeautifulSoup library to scrape the data.
After the scraping process, I checked the number of rows and dropped the duplicates. In the end, I had 19316 unique players from the sofifa web page.
Number of All Rows
30415
Number of Rows Without Duplicated
19316
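The paging logic can be sketched as follows. `BASE_URL` is illustrative (the real sofifa query string carries more parameters), and the BeautifulSoup parsing itself is only indicated in comments:

```python
# Build the list of paginated sofifa URLs, 60 players per page.
# BASE_URL is illustrative; the real query string carries more parameters.
BASE_URL = "https://sofifa.com/players?offset={}"
PER_PAGE = 60
TOTAL_ROWS = 30415  # rows scraped before dropping duplicates

urls = [BASE_URL.format(offset) for offset in range(0, TOTAL_ROWS, PER_PAGE)]
# Each URL would then be fetched and parsed with BeautifulSoup, e.g.:
#   soup = BeautifulSoup(requests.get(url).text, "html.parser")
#   rows = soup.select("table tbody tr")

print(len(urls))  # number of pages to request
```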
Data Cleansing
Data cleansing is a critically important step in any machine learning project. Clean data means more accurate results. If you obtain data with web scraping, you have to spend a lot of time cleaning it before feeding it to machine learning algorithms.
After scraping, the data frame looks like the one below. It is too complicated for machine learning algorithms :)
In my project, I applied these data cleansing steps:
- Splitting combined columns
- Removing unwanted characters (such as \n)
- Converting the height column to cm
- Converting the weight column to kg
- Converting Value, Wage and Release_Clause to decimal money values (handling the €, M and K characters)
- Deleting rows with blank columns
- Converting the International Reputation column (5 ★) to an integer (5)
- Converting all numeric columns to integer or float
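The money conversion, for example, can be sketched with a small helper (`parse_money` is a hypothetical name, not the notebook’s):

```python
def parse_money(raw: str) -> float:
    """Convert strings like '€110.5M' or '€565K' to a plain euro amount.

    Hypothetical helper for illustration; the notebook's exact cleaning
    code may differ.
    """
    value = raw.replace("\u20ac", "").strip()  # strip the euro sign
    if value.endswith("M"):
        return float(value[:-1]) * 1_000_000   # millions of euros
    if value.endswith("K"):
        return float(value[:-1]) * 1_000       # thousands of euros
    return float(value) if value else 0.0
```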
Finally, I cleaned the data; the converted data frame is shown below.
Exploratory Data Analysis
Exploratory Data Analysis helps you:
- understand your data,
- explore the structure of the data,
- recognise relationships between variables.
In short, Exploratory Data Analysis tells us almost everything about the data.
Let’s have a look at some brief information and graphics.
Player’s Position Distribution
Mean Value of Each Position
Building a Model
In this project I used a regression model. What is a regression model?
Regression models are used to predict a continuous value; they are among the most common machine learning algorithms. Here we predict a target variable Y (Value) based on input variables X. Linear regression assumes a linear relationship between the target variable and the predictors, hence the name.
At the beginning of the modeling, I split my data frame into Xb and yb for the baseline model. Xb includes all numeric columns (59 features), yb is my target column (Value), and I create an OLS model:
import statsmodels.api as sm

# OLS takes (endog, exog) arrays directly; only the formula API accepts data=
player_modelb = sm.OLS(yb, Xb)
resultsb = player_modelb.fit()
print(resultsb.summary())
According to my baseline model, the R-squared value is 0.954. That looks very good, but the model is complex (59 features) and has strong multicollinearity problems (the condition number is very high).
Evaluating Model’s Result
The model needs some feature engineering. First, let’s have a look at the target column’s distribution.
According to the distribution plot, the target is positively skewed, so I take the log of my target column.
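A quick sketch with synthetic euro values shows why the log transform helps (the numbers are made up; the notebook works on the real Value column):

```python
import numpy as np
import pandas as pd

# Positively skewed euro values; log1p compresses the long right tail
values = pd.Series([200_000, 500_000, 2_000_000, 20_000_000, 100_000_000])
log_values = np.log1p(values)  # log(1 + x) is safe even if a value is 0

print(values.skew(), log_values.skew())  # skew drops sharply after the transform
```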
The second step is reducing the number of features. A correlation matrix helps with that. Let’s find the columns that are highly correlated with the target (Value).
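On a toy data frame, the selection idea looks like this (the columns other than Value are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the player data frame: Overall drives Value, Noise does not
df = pd.DataFrame({"Overall": rng.normal(70, 5, 500)})
df["Value"] = df["Overall"] * 1_000_000 + rng.normal(0, 2_000_000, 500)
df["Noise"] = rng.normal(0, 1, 500)

# Absolute correlation of every column with the target, strongest first
corr_with_value = df.corr()["Value"].abs().sort_values(ascending=False)
selected = corr_with_value[corr_with_value > 0.5].index.drop("Value")
print(list(selected))
```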
Then I decided to use these columns to predict Value:
- Overall
- Age
- Int_Reputation
- Growth
- Release_Clause
- Height
- Weight
Let’s have a look at the heatmap of these selected columns and the target column.
And a pairplot for all selected columns and the target column.
Then I created a model with the selected features and checked some parameters. The OLS regression results look great: there are fewer features (6), which makes the model easier to explain, the R-squared value is still high (0.941), and there is no multicollinearity, since the condition number (117) is low.
Now I have to split my data into train, test and validation sets, run the model again, and finally compare Ridge, Lasso and polynomial regression results.
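The two-step split can be sketched like this (the 60/20/20 proportions and the dummy data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for the selected features and the log target
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Two-step split: 60% train, then the remaining 40% halved into val and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```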
We get the best result with the degree-2 polynomial regression, and linear regression is second. Now I am going to run cross-validation. The Lasso regression result tells us there was an overfitting problem, while Ridge regression performs similarly to linear regression.
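A sketch of such a comparison on synthetic data (the alphas and the quadratic ground truth are my choices, not the notebook’s):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = 2.0 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 1, 300)  # quadratic ground truth

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "poly2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
# Mean cross-validated R² for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```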
Conclusion
We can compare actual values vs predicted values.
If we invert these logarithmic values, we can compare real values (million €).
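Since the target was log-transformed, the inverse is a simple exponential. A sketch (the prediction values here are made up):

```python
import numpy as np

log_pred = np.array([16.0, 17.5])  # hypothetical model predictions on the log scale
euros = np.expm1(log_pred)         # inverse of log1p: back to plain euros
print(euros / 1_000_000)           # values in million €
```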
In linear models, we have to check 5 assumptions.
Assumption 1: Regression is linear in parameters and correctly specified.
We can visually check this assumption by plotting predictions vs actuals. We should observe that the points are approximately symmetric about a line through the origin with slope 1.
Assumption 2: Residuals should be normally distributed with zero mean.
Assumption 3: Error terms must have constant variance (homoscedasticity).
Assumption 4: Errors are uncorrelated across observations. We can check the correlation between observations.
Assumption 5: No independent variable is a perfect linear function of any other independent variable (no perfect multicollinearity).
When you check your OLS regression results, the condition number tells you whether there is strong multicollinearity.
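The idea behind the condition number can be demonstrated on synthetic design matrices (made up here; the real check reads Cond. No. from the OLS summary):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)

# Independent predictors: low condition number
X_ok = np.column_stack([np.ones(200), x1, x2])
# Near-duplicate predictor: the condition number explodes
X_bad = np.column_stack([np.ones(200), x1, x1 + 1e-6 * x2])

print(np.linalg.cond(X_ok), np.linalg.cond(X_bad))
```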
Finally, let’s find which players are most overvalued and which are most undervalued.
The biggest negative difference is 7.48 million € for Sergio Ramos: his actual market value is 32.5 million €, but the model predicted 39.98 million €.
The biggest positive difference is 39.04 million € for M. Icardi: his actual market value is 53 million €, but the model predicted 13.96 million €.
In this post, we tried to predict football players’ market value from FIFA 2020 characteristics data using machine learning algorithms.
Thanks for reading my post and I hope you like it. Feel free to contact me if you have any questions or if you’d like to share your comments.