Imagine you are a cook who wants to serve the tastiest dish. You would of course make sure beforehand that the recipe is spot-on, or else you might end up with twice the salt (duplicated data), a missing key ingredient (missing data), or a dish baked in the oven instead of fried (incorrect data)!
This is exactly what data cleaning is: making sure our data is cleared of errors. Dirty, uncleaned data can distort the final results, making them inaccurate and/or misleading.
Generally, data cleaning involves procedures such as fixing errors, handling missing values, identifying duplicates, adjusting formats and detecting outliers.
Now let's start cleaning this dataset of FIFA 21 Player Stats. Mind you, the raw data is rather messy (or Messi, no pun intended)!
The project uses Python's Pandas library. The code below can also be found on my GitHub.
1. Importing the dataset
import pandas as pd
df = pd.read_csv('fifa21_raw_data.csv')
2. Check the information of the data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 photoUrl 18979 non-null object
1 LongName 18979 non-null object
2 playerUrl 18979 non-null object
3 Nationality 18979 non-null object
4 Positions 18979 non-null object
... ...
75 PHY 18979 non-null int64
76 Hits 18979 non-null object
dtypes: int64(55), object(22)
memory usage: 11.1+ MB
We can view the name, non-null count and dtype of each column with .info(). There are a total of 77 columns, the majority of which are player stats, as well as 18979 entries (rows).
df.shape
(18979, 77)
Another easy way to identify the number of rows and columns is by using .shape; similarly, we obtain 18979 rows and 77 columns.
3. Browse through the 77 columns
pd.set_option('display.max_columns', None) #To be able to browse through all 77 columns.
df.head() #To display only the first 5 rows
Browsing through the 77 columns is essential as it gives an initial impression and a better understanding of the data set. It lets us see what the data set consists of, and decide which columns should be kept and which should be removed.
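Since the introduction mentions missing values and duplicates, a quick optional check at this stage doesn't hurt. The snippet below is only a sketch of extra inspection steps, not something the rest of the walkthrough relies on:
#To count missing values per column (optional check)
df.isnull().sum()
#To count fully duplicated rows (optional check)
df.duplicated().sum()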
4. Remove unnecessary columns
#To remove all columns between column index 22 to 76
df.drop(df.iloc[:, 22:77], inplace=True, axis=1)
After browsing through, my verdict is that we are not very interested in certain stats, so we will drop (remove) these unwanted columns.
For the column removal, instead of inputting the name of each column individually, we can use the column index. The column indexes can be identified from the df.info() output above.
In this case, the columns we would like to remove range from column index 22 to 76.
We can repeat the above steps a few more times to remove other unnecessary columns, and the final result is:
Great, we went from having 77 columns, to only 16 columns.
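For reference, here is a sketch of what one such repeat drop might look like, assuming photoUrl, LongName and playerUrl were among the removed columns (they appear in the df.info() output above but not in the final column list); the exact drops in the original notebook may differ:
#Sketch: dropping a few more unwanted columns, this time by name
df.drop(['photoUrl', 'LongName', 'playerUrl'], axis=1, inplace=True)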
5. Modifying the height and weight columns
The Height and Weight columns are recorded in feet-inches and lbs; we would like to convert them to the metric units of cm and kg.
#To convert the Height from feet and inches into cm, which allows easier calculation for aggregation functions
df['Height_cm'] = (df['Height']
    .str.extract(r'(\d+)\'\s*(\d+)"') #Applying a regular expression to capture feet and inches as separate groups
    .astype(float)
    .mul([12*2.54, 2.54]) #First factor converts feet to cm, second converts inches to cm
    .sum(axis=1)
)
#To rearrange the columns, so that the column "Height_cm" sits beside "Height"
df = df[['Nationality', 'Positions', 'Name', 'Age', '↓OVA', 'POT',
'Team & Contract', 'Height','Height_cm', 'Weight', 'foot', 'BOV', 'BP',
'Joined', 'Value', 'Wage']]
df.head()
Next, we can modify the Weight column.
#To strip the numerical value from the lbs in the Weight column, then convert it to kilograms
df['Weight_kg'] = df['Weight'].str.extract(r'(\d+)').astype(float).mul(0.453592).round(2)
#To rearrange the columns, so that the column "Weight_kg" sits beside "Weight"
df = df[['Nationality', 'Positions', 'Name', 'Age', '↓OVA', 'POT',
'Team & Contract', 'Height','Height_cm', 'Weight', 'Weight_kg', 'foot', 'BOV', 'BP',
'Joined', 'Value', 'Wage']]
df.head()
Perfect, we have successfully converted the Height and Weight from imperial to metric units and placed the results in new columns.
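To sanity-check the conversions (an optional sketch), we can glance at the ranges of the new columns and confirm they look like plausible heights in cm and weights in kg:
#Optional check: summary statistics of the converted columns
df[['Height_cm', 'Weight_kg']].describe()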
6. Splitting text into different columns
Notice how cluttered the Team & Contract column is. We would like to clean out the \n characters, then split the column in two: one with the team name, another with the contract years. The final result should be as follows:
We could easily achieve this by just using the slicing operator [ : ].
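Before slicing, the newline characters have to come out. Here is a minimal sketch of that cleanup, assuming the \n characters simply wrap the team name and contract years (the exact step used in the original notebook is not shown):
#Assumed cleanup sketch: remove the newline characters around the team name and contract years
df['Team & Contract'] = df['Team & Contract'].str.replace('\n', '')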
df['Team_Name'] = df['Team & Contract'].str[:-11] #To keep everything except the last 11 characters, i.e. the team name
df['Contract_Year'] = df['Team & Contract'].str[-11:] #To keep only the last 11 characters, i.e. the contract year range
Next, we can split the Contract_Year column into its start year and end year with a similar slicing approach, with the end result looking as follows:
#To create 2 new columns that split the year into "start year" and "end year"
df['Start_Year'] = df['Contract_Year'].str[:4]
df['End_Year'] = df['Contract_Year'].str[-4:]
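A quick optional preview of the new columns confirms the split:
#Optional check: preview the split columns
df[['Team_Name', 'Contract_Year', 'Start_Year', 'End_Year']].head()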
7. Convert column type & Remove unwanted rows
Next, we want to convert the Start_Year and End_Year columns into int data type.
#To replace values that are just a blank space with an empty string
df['Start_Year'] = df['Start_Year'].replace(" ", "")
#To convert data type of column to 'int32'
df['Start_Year'] = df['Start_Year'].astype('int32')
However, an error is raised: the column cannot be converted because not all of the rows are numbers. Apparently, there is a string ' Bra' in the Start_Year column. Let's check which specific rows contain this string.
#To check rows that contain ' Bra'
df[df['Start_Year'] == ' Bra']
It seems there are 2 rows with the string ' Bra' instead of an actual year, so we need to remove these rows.
#To remove the rows containing the string ' Bra'
for x in df.index:
    if df.loc[x, "Start_Year"] == ' Bra':
        df.drop(x, inplace=True)
df[df['Start_Year'] == ' Bra'].count()
Nationality 0
Positions 0
Name 0
Age 0
......
BP 0
Joined 0
Value 0
Wage 0
dtype: int64
Alright, the invalid rows have been removed, and we can reconfirm by making sure every count is 0. Since the string has been removed, the Start_Year column should now be convertible to the int type.
df['Start_Year'] = df['Start_Year'].astype('int')
Ah, the same string-conversion error occurs again. This means there are more rows in the year columns that don't contain an actual year, but some other kind of string. Now let's check which rows contain these invalid strings.
#To check rows that do not start with '2'
df[~df['Start_Year'].str.startswith("2")]
From the output, we can identify that there are actually 1250 rows that don't have an actual year written in the year columns. We can efficiently remove these invalid rows by keeping only the valid ones.
#To save only the valid rows.
df = df[df['Start_Year'].str.startswith("2")]
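As a side note, an alternative sketch (not the approach used above) is pd.to_numeric with errors='coerce', which turns every non-numeric entry into NaN so the invalid rows can be counted and filtered in one pass:
#Alternative sketch: coerce non-numeric years to NaN, then keep only the parseable rows
start_year_num = pd.to_numeric(df['Start_Year'], errors='coerce')
print(start_year_num.isna().sum()) #How many rows still lack a valid start year
df = df[start_year_num.notna()]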
Once the columns contain only valid years, we can finally convert them to the int data type. We can verify the conversion by calling aggregation functions such as mean, sum, max, min, etc.
# To convert the year columns into 'int' data type
df['Start_Year'] = df['Start_Year'].astype('int')
df['End_Year'] = df['End_Year'].astype('int')
# To verify that the year columns have been successfully converted into 'int' data type
print(df['Start_Year'].mean())
print(df['End_Year'].max())
2018.3275793986575
2028
8. Convert the Value and Wage (M = million, K = thousand) columns into numeric values.
To proceed, we want to convert the Value and Wage columns from the €/M/K notation into actual numbers; our desired result is shown below:
# To convert the Value column: strip the '€' symbol first
df['Value_int'] = df['Value'].astype('str').str.replace('€', '')
# A helper that converts amounts written with an 'M' (million) or 'K' (thousand) suffix into plain numbers
def convert_amount(value):
    if 'M' in value:
        return float(value.replace('M', '')) * 1e6
    elif 'K' in value:
        return float(value.replace('K', '')) * 1e3
    else:
        return float(value)
df['Value_int'] = df['Value_int'].apply(convert_amount)
# To convert the Wage column with the same helper
df['Wage_int'] = df['Wage'].astype('str').str.replace('€', '')
df['Wage_int'] = df['Wage_int'].apply(convert_amount)
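A quick optional check that the conversion behaved as expected:
#Optional check: compare the original and converted columns side by side
df[['Value', 'Value_int', 'Wage', 'Wage_int']].head()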
9. Finalize and export the cleaned data set.
df.info() # To identify index of unwanted columns
df.drop(df.columns[[6,8,11,13,19,20]], inplace=True, axis=1) # To remove unwanted columns
Perfect, the dirty data set started out at 18979 rows × 77 columns, and the cleaned data set now stands at 17727 rows × 17 columns.
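We can double-check the new dimensions with .shape:
df.shape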
df.to_csv('fifa_cleaned.csv', index=False) # Export to .csv
df.to_json('fifa_cleaned.json') # Export to .json
The final step is exporting the cleaned data. We can export it into whichever file type we desire, either a .csv file or a .json file. We have now completed the data cleaning project! Thank you for reading. :)
Our dish will no longer be too salty, too overcooked or missing an ingredient! Bravo!