Data cleaning? What is data cleaning? An Excel project.

Oburo Gbolahan
4 min readMar 28, 2023

--

Data cleaning is an important step in data analysis and it helps ensure that results are accurate and reliable.

Objective:

The aim of this project is to clean a messy FIFA dataset so as to prepare it for analysis.

About the dataset:

The dataset was gotten from a data analysis challenge group. It was presented in a csv(comma-separated-values) file format. The dataset contains 18980 rows and 77 columns. It’s content consists data about FIFA’S players unique id, long name, name, nationality, wages, valuation, release clause, ratings and so much more in 2021.

Fifa Raw data

Data cleaning process:

From the snippet of the raw data above, we can tell the dataset contained a lot of messy data. The steps I followed in cleaning it are highlighted below after loading the data into Excel’s power query.

  1. I first checked for duplicates and handled duplicates. I used the ID column as it is a column that uniquely identifies each players, no two number can be the same except it is a duplicate. I achieved this by right clicking on the Id column > Remove Duplicates.
Data on loading it into power query

2. I noticed a lot of whitespaces in the data loaded into the power query, so I went ahead to trim the text columns., after performing the trim functions I arrived at this. To trim a data, you right click on the text column > transform > Trim

Trimmed data

3. I noticed lots of special characters in the name, longname and club columns. I handled them by using the replace values in the columns tab. For each special character I noticed, I replaced them with the appropriate values.

4. I proceeded to the weight and height column, this column had different measurement scales. For the height, the heights of each players were taken both in cm and feet. To tackle this problem, I created a new conditional column that recorded the rows with cm to value of 1, and those recorded in feet to 30.48, I proceeded to replace the strings and convert the data types to decimal after which I multiplied both columns together. I was able to convert the feet to cm using this method. I applied the same method to the weight column.

Value, Wage and Release Clause snippet

5. Next was the Value, Wages and release clause column. These columns contained lots of messy data as seen above. The data in this columns were formatted as texts despite them representing numerical data. I replaced all the strings then created a conditional column like I did for the heights, I then formatted correctly and multiplied.

Creating conditional column for values.

6. For the loan end date column, I noticed that it contained null values and on further investigations, I discovered it was for players who were not on loan. So I changed the null to N/A. The contract column also used a special column to indicate year length of contract and I had to change it to ‘-’.

Cleaned fifa datset

After the data cleaning steps, I was able to arrive at the snippet above. Which brings us to a very important question.

Why data cleaning?

Data

The only reason I am able to achieve this simple visualization of Chelsea wages earners for 2021 is because of data cleaning. Without data cleaning, this would not have been possible. The importance of data cleaning in data analysis can never be overstated.

--

--

Oburo Gbolahan

I am love data. •Excel| • Python| • Sql| •PowerBi