Data Analysis on Football Players using Python & Machine Learning

Aug 11, 2023

I’ve been independently learning to code in Python and have performed an exploratory data analysis on a dataset of football players. This includes data manipulation with pandas, plotting graphs with matplotlib and creating a machine learning model with scikit-learn.

This is a link to the data used. (The data was taken midway through the season, so the numbers will not be up to date.)

Here is what I found and how I used Python to help answer my questions:

Importing the dataset

After importing the relevant libraries, I loaded the dataset using the pandas read_csv() function.

#importing libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

#creating a dataframe from .csv file
df = pd.read_csv('C:\\Users\\Rahul\\Desktop\\2022-2023 Football Player Stats.csv', encoding='latin-1')

#checking data for errors
df.info()
df.head()

Here are the first few rows of the dataset
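Beyond df.info() and df.head(), two further checks that are often useful are counting missing values and counting duplicated rows. The snippet below is a minimal sketch on a few made-up rows (the names and numbers are placeholders, not values from the dataset); on the real data the same calls would be made on df.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV (values are illustrative only)
csv_text = """Player,Squad,Comp,Age,Goals
Player A,Team X,Premier League,24,5
Player B,Team Y,La Liga,,2
Player A,Team X,Premier League,24,5
"""
sample = pd.read_csv(io.StringIO(csv_text))

# count missing values in each column
missing_per_column = sample.isna().sum()

# count rows that exactly duplicate an earlier row
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

If either count is non-zero, it is worth deciding how to handle those rows before aggregating, since sums and averages can be affected by them.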

Now that the data is imported, we can start answering some questions about the data.

Who scored the most goals in the Premier League?

Using the .query() function, I created a dataframe that contains only Premier League players. I then created another dataframe that sorts the players by goals scored and shows only the relevant columns.

#querying data to create dataframe filtered to Premier League players
PL= df.query('Comp == "Premier League"')

#sorting players by goals scored and keeping only the relevant columns
Top_Scorers = PL.sort_values(by='Goals',ascending=False)[['Player','Nation','Pos','Squad','Comp','Age','Goals']]

Top_Scorers.head(3)

As the data shows, the top scorer was Erling Haaland (25), followed by Harry Kane (17) and Ivan Toney (14).
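The same filter-then-sort step can also be written with a boolean mask and .nlargest(). This is only an alternative sketch on made-up players and goal counts, not the code used for the results above.

```python
import pandas as pd

# Made-up players and goal counts, for illustration only
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Comp": ["Premier League", "La Liga", "Premier League", "Premier League", "Serie A"],
    "Goals": [7, 12, 3, 9, 10],
})

# boolean mask: same effect as query('Comp == "Premier League"')
pl_players = players[players["Comp"] == "Premier League"]

# nlargest orders by Goals (highest first) and keeps the first n rows
top_two = pl_players.nlargest(2, "Goals")[["Player", "Goals"]]
print(top_two)
```

.nlargest() saves sorting the whole dataframe when only the top few rows are needed.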

Which team has collectively scored the most goals in the Premier League?

To answer this I needed to group the players based on their team and sum up the goals scored. This is achieved with the .groupby() function.

#grouping teams together
Top_Scoring_Teams = PL.groupby(['Squad'])['Goals'].sum().reset_index().rename(columns={'Goals':'Total Goals'})

#sorting by goals scored
Top_Scoring_Teams.sort_values(by='Total Goals',ascending=False,inplace=True)

Top_Scoring_Teams.head(3)

The data shows that Manchester City (52) had the most goals followed by Arsenal (44) and Tottenham (40).

My friend Greg only watches football matches with lots of goals. Which leagues should Greg watch or avoid to see the most goals?

Answering this question can be done with the same grouping and sorting functions from the last question. I have plotted this to a bar chart to visualise the data.

#grouping leagues together
League_Goals = df.groupby(['Comp'])['Goals'].sum()

#creating bar chart
matplotlib.style.use('seaborn-v0_8')
League_Goals.plot.bar(rot=0,title='Goals By League',xlabel='League',ylabel='Total Goals',fontsize='small')

Based on these findings, Greg should avoid La Liga, as it has the fewest goals, and should watch Ligue 1, as it has the most.
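One caveat with total goals is that leagues do not all have the same number of clubs (the Bundesliga, for example, has 18 rather than 20), and mid-season totals can cover different numbers of matches. A rough way to account for league size is to divide total goals by the number of squads. The sketch below uses made-up leagues, squads and goals; the Comp, Squad and Goals column names are the ones used earlier.

```python
import pandas as pd

# Made-up squads and goal counts, for illustration only
sample = pd.DataFrame({
    "Comp": ["League A", "League A", "League A", "League A",
             "League B", "League B", "League B"],
    "Squad": ["A1", "A1", "A2", "A3", "B1", "B2", "B2"],
    "Goals": [10, 5, 8, 7, 12, 9, 6],
})

# total goals and number of distinct squads per league
league_summary = sample.groupby("Comp").agg(
    total_goals=("Goals", "sum"),
    squads=("Squad", "nunique"),
)

# goals per squad puts leagues of different sizes on a comparable scale
league_summary["goals_per_squad"] = (
    league_summary["total_goals"] / league_summary["squads"]
)
print(league_summary)
```

In this toy example League A has more goals in total, but League B has more goals per squad, so the ranking can change once league size is taken into account.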

What is the average number of minutes played in the season for all players?

The .mean() function can be used here to take an average of the minutes played column.

#calculating average minutes played
avg_mins= df['Min'].mean()

#printing and rounding number to 2 decimal places
print(f'The average minutes played was {round(avg_mins,2)}')

The average minutes played was 760.45 across all players.
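Minutes played tend to be skewed, with many players getting very little time and a smaller number of regular starters getting a lot, so the median can be a useful companion to the mean. A small sketch on made-up minutes (not values from the dataset):

```python
import pandas as pd

# Made-up minutes for six players, for illustration only
minutes = pd.Series([0, 45, 90, 300, 1200, 2700], name="Min")

mean_minutes = minutes.mean()      # pulled upwards by the two regular starters
median_minutes = minutes.median()  # middle value of the sorted minutes

print(f"Mean: {round(mean_minutes, 2)}, median: {round(median_minutes, 2)}")
```

On the real data the equivalent call would be df['Min'].median().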

Is there a correlation between age and minutes played?

I began answering this by creating a scatterplot to show Minutes Played by Age, expecting to immediately see a clear correlation.

#creating a dataframe just showing players, age and minutes played
age_mins= df[['Player','Age','Min']].rename(columns={'Min':'Minutes Played'})

#using the new dataframe to create a scatter plot
age_mins.plot.scatter(x='Age',y='Minutes Played',title='Minutes Played by Age')

There were simply too many data points to see any clear trend, and I couldn’t easily see which areas were the most dense.

To remedy this, I took the averages of each age and made another scatterplot.

#grouping by age and averaging the number of minutes played
avg_age_mins= age_mins.groupby(['Age'])['Minutes Played'].mean().reset_index().rename(columns={'Minutes Played':'Average Minutes Played'})

#creating scatter plot
x= avg_age_mins['Age']
y= avg_age_mins['Average Minutes Played']

plt.scatter(x,y)
plt.title("Average Minutes Played by Age")
plt.xlabel("Age")
plt.ylabel("Average Minutes Played")

#fitting a trendline
z= np.polyfit(x,y,2)
p= np.poly1d(z)

plt.plot(x,p(x),"r--")

plt.show()

After taking averages of minutes played at each age, we see a reverse ‘U’ shaped trend. There are three main groups here: Young players (ages 15–22), Experienced players (23–34) and Senior players (35+).

Young players usually come on as substitutes so will not get much play time unless they are exceptionally good.

Most starting players fall into the Experienced players age group, where they get to play the most minutes.

Senior players typically do not have the stamina for a full 90 minutes so will play less. The average for Senior players does appear to be driven up by some older goalkeepers and defenders but exploring that is beyond the scope of this analysis.
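A reverse ‘U’ shape is also a reminder that a single correlation coefficient can mislead here: Pearson correlation only measures a straight-line relationship, so a curve that rises and then falls can score close to zero even when the pattern is strong. The sketch below uses synthetic data (a perfect curve peaking at age 28, which is an assumption made purely for illustration) to show this, and to show how the peak age can be read off a quadratic fit.

```python
import numpy as np
import pandas as pd

# Synthetic inverted-U data (not taken from the dataset): minutes peak at age 28
ages = np.arange(18, 39)                    # ages 18 to 38, symmetric around 28
minutes = 1500 - 10 * (ages - 28) ** 2
toy = pd.DataFrame({"Age": ages, "Average Minutes Played": minutes})

# Pearson correlation captures only the straight-line part of a relationship
pearson_r = toy["Age"].corr(toy["Average Minutes Played"])

# fit a*x^2 + b*x + c; when a is negative the curve peaks at x = -b / (2a)
a, b, c = np.polyfit(ages, minutes, 2)
peak_age = -b / (2 * a)

print(f"Pearson r: {pearson_r:.3f}")
print(f"Peak age from quadratic fit: {peak_age:.1f}")
```

Real data will be noisier and will not be symmetric, so the coefficient would not be exactly zero, but the same reasoning explains why a curved trendline suits this question better than a straight one.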

Creating a machine-learning model to predict minutes played based on age

Using scikit-learn, I created a machine-learning model which I trained with this dataset. The model is based on a regression function; we can use the regression output to predict outcomes based on our input value.

In this case the input is Player Age and the model outputs Minutes Played, which is how many minutes the model predicts a player of that age would play.

Because there is a reverse ‘U’ shape trend in the data, I used the PolynomialFeatures class from scikit-learn to add a squared age term, which lets the linear regression model fit a curve.
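To make the transformation concrete, here is a small sketch of what PolynomialFeatures with degree 2 produces for a single Age column. The three ages are arbitrary example inputs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# three example ages as a single-column 2D array, the shape scikit-learn expects
ages = np.array([[18], [28], [41]])

poly_demo = PolynomialFeatures(degree=2, include_bias=True)
expanded = poly_demo.fit_transform(ages)

# each row becomes [1, age, age squared]
print(expanded)
```

A linear regression fitted on these three columns is a quadratic in age, which is what allows it to follow a curve that rises and then falls.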

#setting variables:
age_mins.sort_values(by='Age',inplace=True)
X= age_mins['Age'].values.reshape(-1,1)
Y= age_mins['Minutes Played']

#splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2)

#adding polynomial features to model
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree= 2,include_bias=True)
#fitting the transformer on the training ages, then applying it to the test ages
x_poly = poly.fit_transform(X_train)
x_test_trans = poly.transform(X_test)

#running machine learning model
from sklearn.linear_model import LinearRegression
linear_regression_model = LinearRegression()
linear_regression_model.fit(x_poly,Y_train)
y_pred_test = linear_regression_model.predict(x_test_trans)

#generating output from model for set ages
prediction_array = np.array([[18],[24],[28],[32],[41]])
poly_array = poly.transform(prediction_array)
prediction= linear_regression_model.predict(poly_array)

ML_Result = pd.DataFrame(prediction)
Test_ages = [18,24,28,32,41]
ML_Result.insert(0,"Age",Test_ages,True)
ML_Result.rename(columns={0: 'ML Predicted Minutes Played'},inplace=True)
ML_Result.head()

After creating and training the model, I gave the model some input values to see how many minutes it predicts these age groups will play. I took these numbers to print a table for ages 18, 24, 28, 32 and 41.

The model shows a reverse ‘U’ shaped relationship like before. We can see that playtime is lower at age 18, rises and plateaus around 28, and has fallen by 41, an age at which any player who hasn’t retired would be expected to get little playtime.
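A natural next step is to score the model on the held-out test set. The sketch below is one way this could be done, using scikit-learn's make_pipeline, r2_score and mean_absolute_error. It runs on synthetic data (an assumed curve peaking at 28 plus random noise), so the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (not the real dataset): an inverted U peaking at 28, plus noise
rng = np.random.default_rng(0)
ages = rng.integers(17, 41, size=400)
minutes = 1500 - 10 * (ages - 28) ** 2 + rng.normal(0, 100, size=400)

X = ages.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, minutes, test_size=0.2, random_state=0  # fixed seed for repeatable splits
)

# the pipeline applies the polynomial transform and the regression in one object
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2: share of variance explained; MAE: typical error in minutes
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2 on test set: {r2:.2f}")
print(f"Mean absolute error: {mae:.0f} minutes")
```

Because age is only one of many things that decide how much a player plays, scores on the real data would likely be much lower than on this tidy synthetic example.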

Thank you for reading

This project was a great way for me to explore what I’m capable of doing in Python. I hope to carry on learning and to do even more as a data analyst as I continue to develop my knowledge of tools like Python, SQL and Power BI.