FIFA 21 DATA CLEANING CHALLENGE

Samueloyedele
8 min readMay 30, 2024

--

INTRODUCTION

This data challenge was created to allow Data Analysts to practice data cleaning using their preferred tools. It is a great way for Beginner, Intermediate and Advanced Data Analysts to sharpen their data cleaning skills and build portfolio-worthy projects.

DATASET

This a FIFA 21 dataset that consists of player’s attributes and stats like Name, Nationality, Position, Age, Value, Wage, BOV, Attacking, Crossing, Finishing, Skill, Dribbling, Ball Control, Power, Strength, Stamina etc. They are 18979 rows and 77 columns.

Excel is used to perform data cleaning on this dataset.

DATA DICTIONARY

Some of the columns:

· Photo_url: Link to the picture of the player’s picture.

· LongName: Players full name

· Player_url: Link to players site.

· Nationality: Player’s country of origin.

· Position: The wing or position the player has ever played.

· Name: Another column for the name of the player in short form.

· Age: Players age. Check the datatype of this column

· OVA: Players overall analysis or rate in percentage

· POT: Players potential in percentage.

· ID: Players unique Identity number.

· Height: Players height

· Weight: Players weight in lbs

· Foot: Players strong foot

· BOV: Best Overall: Ability of player to hold the comfortably to his feet by percentage rating.

· Joined: Date at which player started National team.

· Value: Players worth in Dollars.

· Wage: Weekly salary in Dollars

· Release clause: This is the amount in players contract required for the player to leave to another club, also in dollars.

· FK : Free Kick

· M: Skill Moves rating

· W/F: It refers to a player’s weak foot rating. It is a measure of how well a player can use their non-dominant foot to pass, shoot or dribble. The rating is given on a scale of 1 to 5.

· PAS: Pass accuracy given on a 100 rating

DATA EXPLORATION

In the section, the dataset was explored to identify any data quality issues like missing data, duplicate data, inaccurate, inconsistent data.

Some of the issues:

- Inconsistent data in some columns

- Inconsistent naming conventions

- Incorrect datatypes

- Unwanted special characters

- Irrelevant columns

- Multiple values in a column

Preview of the dataset

Data Cleaning

This is the process of removing missing data, duplicate data, inaccurate, inconsistent, incomplete, invalid data for better analysis and visualization.

Some of the data issues to be cleaned:

· Rename all columns for consistent naming convention.

· Extract players correct name from player url column

· Drop LongName, PlayerUrl, Name columns since they are not need any more for analysis.

· Extract the first value from the positions column since it represents the player best position.

· Convert OVA, POT, BOV columns from integer to percentage.

· Convert Height in ft / in to cm and Weight in Ibs to kg. Exclude the units.

· Convert Joined column to date datatype.

· Extract the numbers in Value, Wage and Release columns and convert the values in M to millions (1000000) and K to thousands (1000).

· Extract the numbers value in W/F (Weak foot) , SM (Skill Moves) and IR (Injury Rating) columns

· Check for duplicates, drop unnecessary columns and examine all columns datatype.

Cleaning Process:

A new worksheet was inserted for the data cleaning.

Functions used and their meaning:

SUBSTITUTE(text, old_text, new_text, [instance_num]): To replaces existing text with new text in a text string.

FIND(find_text, within_text, [start_num]): To returns the starting position of one text string with another text string.

MID(text, start_num, num_chars): To returns the characters from the middle of a text string, given a starting position and length.

LEN(text): Returns the number of characters in a text string.

PROPER(text): Convert a text string to proper case;

IFERROR(value, value_if_error): Return value_if_error if expression is an error and the value of the expression itself otherwise.

LEFT(text, num_chars): Returns the specified number of characters from the start of a text string.

ROUND(num, num_digits): Round a number to a specified number of digits.

· Rename all columns for consistent naming convention.

This was done manually.

· Extract players correct name from player url column

Formula: PROPER(SUBSTITUTE(MID(SUBSTITUTE(FIFA[playerUrl],”/”,” “,5), FIND(“ “,SUBSTITUTE(FIFA[playerUrl],”/”,” “,5))+1, LEN(FIFA[playerUrl])-8-FIND(“ “, SUBSTITUTE(FIFA[playerUrl],”/”,” “,5))), “-”, “ “))

Explanation:

SUBSTITUTE(FIFA[playerUrl],”/”,” “,5): It’s used to replace “/” with “ “ at the 5th instance in the unclean player url column.

FIND(“ “,SUBSTITUTE(FIFA[playerUrl],”/”,” “,5))+1: It’s used to get the position of the new substitute text “ “ and applied it to MID() as the start num plus 1.

LEN(FIFA[playerUrl])-8-FIND(“ “, SUBSTITUTE(FIFA[playerUrl],”/”,” “,5): It’s used to limit the character length in the MID() to the exact player name.

MID() : It’s used with find() and len() to extract the text starting after the substitute space character “ “ at a character length of the player name.

Outer SUBSTITUTE() is used to replace all “-“ in the extracted player’s name with space “ “ while PROPER() convert the text to proper case.

· Drop LongName, PlayerUrl, Name columns since they are not need any more for analysis.

The columns are deleted manually in the worksheet.

· Extract the first value from the positions column since it represents the player best position

Formula: = IFERROR (MID(FIFA[Positions], 1, FIND (“ “, FIFA[Positions])), FIFA[Positions])

Explanation:

FIND(“ “, FIFA([Positions])): It’s used to find position of space “ “ character in the unclean position column

MID(FIFA[Positions], 1, FIND (“ “, FIFA[Positions])): It’s used with FIND() to extract first position text starting from the first character till space “ “ character for multiple positions.

IFERROR(): In case of a single position, is used to the handle error to return only the single position.

· Convert Height in ft/in to cm and Weight in Ibs to kg. Exclude the units.

Height:

Formula: = ((LEFT(FIFA[Height],1) * 30.48) + (MID(FIFA[Height], FIND (“‘“, FIFA[Height]) +1,1) / 12) * 30.48)

Explanation

FIND (“‘“, FIFA[Height]) +1: It’s used to get the position of the apostrophe character “’“ in the unclean height column and applied it to MID() as the start num plus 1.

(MID(FIFA[Height], FIND (“‘“, FIFA[Height]) +1,1) / 12) * 30.48): It’s used with find() to extract the inch (‘) digit starting after the apostrophe character “’” at 1 character length. Then converted to cm.

(LEFT(FIFA[Height],1) * 30.48): It’s used to extract the first digit in feet (‘’) from the unclean height column, then converted to cm.

After that, both digits are added together to get the height value for each player.

Weight:

Formula: = ROUND(LEFT(FIFA[Weight],3) * 0.453592, 2)

LEFT(FIFA[Weight],3): It’s used to extract the first three digits in lbs from the unclean weight column.

ROUND(LEFT(FIFA[Weight],3) * 0.453592, 2): Convert the extracted weight digit to kg by multiplying it with 0.453592, then round the figure to two decimal digits.

· Extract the numbers in Value, Wage and Release columns and convert the values in M to millions (1000000) and K to thousands (1000).

Value:

Firstly, a new column created “value_unit” to get the digit values with unit from the unclean value column.

Formula: = MID(FIFA[@Value], FIND (“¬”, FIFA[@Value])+1, LEN(FIFA[@Value])-FIND(“¬”, FIFA[@Value])+1)

Convert “K” to 1000 and “M” to 1000000

Formula: =IF([@[value_unit]]=MID(FIFA[Value],FIND(“¬”,FIFA[Value])+1,LEN(FIFA[Value])-FIND(“¬”,FIFA[Value])-1)&”K”,MID(FIFA[Value],FIND(“¬”,FIFA[Value])+1,LEN(FIFA[Value]) FIND(“¬”,FIFA[Value])-1)*1000,IF([@[value_unit]]=MID(FIFA[Value],FIND(“¬”, FIFA[Value])+1,LEN(FIFA[Value])-FIND(“¬”,FIFA[Value])-1)&”M”,MID(FIFA[Value],FIND(“¬”,FIFA[Value])+1,LEN(FIFA[Value])-FIND(“¬”,FIFA[Value])-1)*1000000, 0))

Explanation:

FIND(“¬”, FIFA[Value])+1: It’s used to get the position of the “¬“ character in the unclean value column and applied it to MID() as the start num plus 1.

LEN(FIFA[Value]) FIND(“¬”, FIFA[Value])-1: It’s used to limit the character length in the MID() to the exact value digits excluding the unit.

MID(): It’s used with find() and len() to extract the digits starting after the character “ ¬“ at the number length of the value digit.

IF(): It’s used as a condition for the value digits. If the digits in the value_unit is in “K” then multiply the digit by 1000, If the digits in the value_unit is in “M” then multiply the digit by 1000000, else return 0.

Wage:

Release Clause

Note: The same process for Value column was repeated for Wage and Release Clause columns.

· Extract the numbers value in W/F, SM and IR columns

W/F

Formula: = LEFT(FIFA[W/F],1)

Explanation

LEFT(FIFA[W/F],1): It’s used to extract the first digit from the unclean W/F column.

S/M

Formula: =LEFT(FIFA[SM],1)

Explanation

LEFT(FIFA[SM], 1): It’s used to extract the first digit from the unclean SM column.

I/R

Formula: =LEFT(FIFA[IR],1)

Explanation

LEFT(FIFA[IR], 1): It’s used to extract the first digit from the unclean IR column.

· Check for duplicates, drop unnecessary columns and examine all columns datatype.

The cleaned dataset was checked for validation and accuracy.

CHALLENGE

It was a challenging project for me. Cleaning some columns like player name, height, weight, value and wage columns using TEXT functions. It may be due to the Excel version (2016) used for the project.

CONCLUSION

Data Cleaning is a crucial step in data analysis process. A bad data will lead to bad analysis which may have bad impacts on business decisions. A clean data will surely help data analysts to make data-driven informed decisions for stakeholders and businesses.

The cleaned dataset can be used for advanced analysis and visualizations.

💡 Pro Tip: It is very necessary to assess the quality of your data before analysis in other to make impactful data insights.

It was a project to practice data cleaning using Excel functions. I was able to master and sharpen my data cleaning skill using Excel. It also helped improve my communication and storytelling skills.

You can check out my other projects on my website portfolio and Github repository.

Thank you for taking your time to read it, kindly comment your suggestion on this mini project. I will also appreciate a like / follow from you.

--

--