If I Were A Football Data Analyst part I: Data Cleaning

7 min readApr 3, 2023

Japan controversial win against Spain, which was proven by goal-line technology, and ‘sport bra’ worn by South Korean footballer Hwang Hee Chan, which were spotted after the second goal against Portugal, were some of the notorious events in the World Cup 2022. Furthermore, match ball with motion sensor to monitor the gameplay, semi-automated offside technology (SAOT), portable assistive technology platforms, face-recognition-based payment system were also the cutting-edge innovations debuted last year.

As a once-every-four-years mediocre football fan, last year World Cup event (World Cup 2022) made me realized that technology advancement, specifically Machine Learning (ML) and Artificial Intelligence (AI), had just merged seamlessly with sporting events. Those predictive models aim to improve player predictions and strategic tactics while striving to minimize injury risks. To be effectively processed in Machine Learning, collection of data should be transformed first in Data preprocessing. In this stage, we need to perform data cleaning, data integration, data transformation, and data reduction.

Last month, I took part in #DataCleaningChallenge held by Promise Chinonso & Victor Somadina about FIFA 2021 and used this chance to envision myself as a football data analyst😬. Let’s look at the data and see what we can do with it! Dataset can be seen here.

First thing first, import the library and load the dataset.

# Import library
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('fifa21 raw data v2.csv')
df

There is 18979 rows and 77 columns, but you cannot see all the columns:( . Tip! : To see all the columns, you can expand the output display

pd.set_option("display.max_columns",80)
df.head()

Then, look at the brief summary of data frame by using df.info(). Summary of some columns in the data frame are shown below. As we can see, there are null values in loan date and hitcolumns. To see the statistical summary, use df.describe() . Note: I prefer the transpose table ones, so i use df.describe().T .

Fig 3. Statistical summary of data frame

Next, drop any duplicates using df=df.drop_duplicates().
After removing any duplication, we need to rename the columns into proper variable names. You can go with camelcase (e.g. LongName) or lowercase separated by underscore (_) (e.g. long_name). Avoid space for variable names. First, we need to see all column names by using df.columns.

Then, transform all names into lowercases separated by (‘_’). Don’t forget check the names after some changes.

# Rename column names into more proper column names

# change all column names into lowercases
df=df.rename(columns = str.lower)

# Replace space with underscore ('_')
df.columns = df.columns.str.replace(' ', '_')

# Remove any special character
df.rename(columns = {'↓ova':'ova'}, inplace = True)

Alright! Let’s start data preprocessing by from the left to the right columns.

id column is, in fact, unnecessary, so drop it. df=df.drop(['id’],axis=1 . 1 signifies columns, while 0 signifies rows. Next to id , there are name and longname. I prefer name than longname, so drop longname. I want to look at the unique characters in name first using this function.

# Check for unique character
def unique_characters( df, col ):
    result = []
    for index, row in df.iterrows():
        for char in row[col]:
            if char not in result:
                result.append(char)
    return result

print(unique_characters(df,'name'))

Unique characters in namecan be seen in Fig 5. As we can see, there are numerous non-ascii characters. To convert non-ascii characters into ascii characters, I use unidecodeand encodemethod. Fig 6 shows unique characters in the namecolumns. No more non-ascii characters!

Fig 5. Unique characters in ‘name’ column

from unidecode import unidecode
df['name'] = df['name'].apply(unidecode)
df['name'] = df['name'].str.encode('ascii', 'ignore').str.decode('ascii')

Fig 6. Unique characters in ‘name’ column

Regarding photourl and playerurl, just let them be for now. Remove non-ascii characters in nationality and clubcolumn, like we just did to name. Remove the new line characters \n as well fromclubcolumns by using replace. df=df.replace('\n','', regex = True) .

Fromcontractcolumn, we want to create 3 columns from it — when the contract start, end, and contract duration. Contract start can be obtained fromjoinedcolumn. Contract duration (in years) is the difference between start end date and start date. Note: Before performing the substration, I type cast those variables into datetime.Since some of the players are loan players, we can fill contract duration with 0 by using fillna. New columns created will be put in the rightmost column. Don’t forget to drop contract column thereafter.

# Rename joined column into start_cont (when contract starts)
df.rename(columns = {'joined':'start_cont'}, inplace = True)

# From the contract column, create start_contract and end_contract
# Joined 

df['end_cont'] = df['contract'].str[7:]
df['end_cont']=pd.to_datetime(df['end_cont'], errors = 'coerce')
df['start_cont']=pd.to_datetime(df['start_cont'], errors = 'coerce')

# Calculate contract duration from end_cont and start_cont column
df['cont_dura'] = round((df['end_cont']-df['start_cont'])/np.timedelta64(1,'Y'),0)

df['cont_dura'] = df['cont_dura'].fillna(0)
df['cont_dura'] = df['cont_dura'].astype(int)

Moving on to the heightcolumn. Look at the unique values in the column by using df['height'].unique(). As can be seen in Fig 7, data are not in the equal units. Thus, we need to create the function to convert it. I prefer height in centimeter(cm) unit. Note: I replace some characters to simplify the conversion.

# create function to convert height
def convert_cm(x):
    if type(x) == float or type(x) == int:
        return x
    if 'cm' in x:
        if len(x)>1:
            return float(x.replace('cm',''))
        return 1.0
    if '.' in x:
        if len(x)>1:
            ht_ = ht.split('.')
            ft_=float(ht_[0])*12
            in_ = float(ht_[1])
            ht__ = ft_ + in_
            return ht__ *2.54

# Lets convert the heights to centimeters and include it as a new column
heights_in_cm = []

for height in df.height:
    if "." in height:
        height =  height.split('.')
        feets, inches = int(height[0]), int(height[1])
        ft_ = feets * 30.48
        in_ = inches *2.54
        heights_in_cm.append( round(float(ft_ + in_),2) )
    elif "cm" in height:
        height = height.strip('cm')
        heights_in_cm.append(height)
        
df['height'] = pd.Series(heights_in_cm, dtype = float)
df['height'] = df['height'].astype(int)

The same approach applies forweightcolumns, since some values are in kg, while some are in lbs. Create the functions and apply it to the columns.

Fig 8. Unique values in ‘weight’ columns

# Create function to convert the weight
weights_in_kg = []

for weight in df.weight:
    if "kg" in weight:
        weight =  weight.strip('kg')
        weights_in_kg.append(weight)
    elif "lbs" in weight:
        weight = float(weight.strip('lbs'))
        weight = weight/2.205
        weights_in_kg.append(weight)

# apply the function and convert it to integer
df['Weight'] = pd.Series(weights_in_kg, dtype = float)
df['Weight'] = df['Weight'].astype(int)

Let’s take a look at the currecy number format in Fig 9. To make further calculation easier, we need to remove ‘€’ symbol and create function to convert those letters into numbers. For instance, ‘K’ into 1000, ‘M’ into 1000000, and ‘B’ into 1000000000. This conversion is done in value,wage, release_clause, and hitscolumns.

# remove € symbol
df[['value','wage','release_clause']] = df[['value','wage','release_clause']].replace('€','', regex = True)

# create function to convert letters to numbers
def string_to_float(x):
    if type(x) == float or type(x) == int:
        return x
    if 'K' in x:
        if len(x) > 1:
            return float(x.replace('K', '')) * 1000
        return 1000.0
    if 'M' in x:
        if len(x) > 1:
            return float(x.replace('M', '')) * 1000000
        return 1000000.0
    if 'B' in x:
        return float(x.replace('B', '')) * 1000000000
    else:
        return x

df['value'] = df['value'].apply(string_to_float)
df['wage'] = df['wage'].apply(string_to_float)
df['release_clause'] = df['release_clause'].apply(string_to_float)

df['hits'] = df['hits'].apply(string_to_float)
df['hits'] = df['hits'].fillna(0)

Fig 10 displays the rating format for w/f,sm,ir. We don’t necessarily need the ‘★’ symbol, so we replace column with the string that matches the regex (for number only). For instance, df['w/f/] = df[‘w/f’].str.replace(‘\D+’,’’).

So far, we have ‘cleaned’ the data. As for the finishing touch, we need to type cast the columns into the proper datatypes and sort the columns into proper order.

As for the type cast, we need to refer to our final data frame and compare it with the result from df.info()(Fig.2). Here are some columns that need to be type casted:
1. load_date_end (from object → to datetime)
2. value (from object → to int)
3. wage (from object → to int)
4. release_clause (from object → to int)
5. w/f (from object → to int)
6. sm (from object → to int)
7. ir (from object → to int)
8. hits (from object → to int)

# Change the datatype of those columns
df['loan_date_end'] = df['loan_date_end'].apply(pd.to_datetime)
df[['value','wage','release_clause','w/f','sm','ir']]= df[['value','wage','release_clause','w/f','sm','ir']].astype(int)

Lastly, sort the columns to make it easy to read by using reindex() . Don’t forget to export the cleaned data file. The cleaned data can be seen here.

df = df.reindex(columns=['name', 'age', 'nationality', 'club', 'best_position', 
      'start_cont', 'end_cont', 'cont_dura','ova', 'pot', 'preferred_foot',
      'height', 'weight','positions', 'bov', 'value', 'wage','release_clause',
      'loan_date_end',  'attacking', 'crossing', 'finishing',
      'heading_accuracy', 'short_passing', 'volleys', 'skill', 'dribbling',
      'curve', 'fk_accuracy', 'long_passing', 'ball_control', 'movement',
      'acceleration', 'sprint_speed', 'agility', 'reactions', 'balance',
      'power', 'shot_power', 'jumping', 'stamina', 'strength', 'long_shots',
      'mentality', 'aggression', 'interceptions', 'positioning', 'vision',
      'penalties', 'composure', 'defending', 'marking', 'standing_tackle',
      'sliding_tackle', 'goalkeeping', 'gk_diving', 'gk_handling',
      'gk_kicking', 'gk_positioning', 'gk_reflexes', 'total_stats',
      'base_stats', 'w/f', 'sm', 'a/w', 'd/w', 'ir', 'pac', 'sho', 'pas',
      'dri', 'def', 'phy', 'hits',  'photourl', 'playerurl'])

# export the cleaned data to csv file
df1.to_csv('FIFA2021_CleanedData.csv')

References:
https://www.prescouter.com/2022/12/technology-2022-fifa-world-cup-qatar/#:~:text=The%20semi%2Dautomated%20offside%20technology,of%2050%20times%20per%20second.

If I Were A Football Data Analyst part I: Data Cleaning

Written by Rania Hana Rafifa