GETTING STARTED | DATA PREPROCESSING | KNIME ANALYTICS PLATFORM

KNIME as a Data Cleaning Tool: Using the FIFA 21 Data as a Case Study

A codeless approach can take you a long way!

Asafa Abdullateef
Low Code for Data Science

--

Data Cleaning, or Preprocessing, is one of the most important steps Data Analysts and Data Scientists have to go through in order to get better insights and better predictions from their data: the cleaner the data, the better the insights drawn from it, and vice versa.

A few weeks ago, a data cleaning challenge was organized in the data tech space by Chinonso Promise and Victor Somadina. The aim of this challenge was to let its participants, data practitioners of all skill levels, work on a common dataset (in this case the FIFA 21 dataset) to build a portfolio project and showcase their data transformation and cleaning skills, regardless of the programming language or tool used, such as MS Excel, Power BI, R, Python, SQL or SPSS.

The objective of the challenge was to preprocess the given dataset (FIFA 21) using ETL methods.

The dataset was obtained from Kaggle and comprises information about 18,979 football players across 77 columns of player statistics and demographics for the year 2021.

For this challenge, I will show how I performed the data cleaning procedures using KNIME Analytics Platform as my preferred tool.

Data Preparation Approach

The first step was to import the dataset into KNIME, either by dragging and dropping the CSV Reader node from the Node Repository into the workflow editor and pointing it to the dataset's file location, or by simply dragging the dataset, a .csv file in this case, into the workflow, which gives the same result. This node provides a preview of what the data looks like: the number of rows and columns, the data type of each column, and the data values for the first few rows.

Within this node, I changed some of the column names using the Transformation tab in order to better understand what each column stands for.
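
For readers who would rather see this step expressed in code, here is a minimal pandas sketch of what the CSV Reader node and its Transformation tab do; the file name and the renamed columns are placeholders, not the exact ones from my workflow:

```python
import pandas as pd

# Read the raw FIFA 21 CSV (the file name here is only a placeholder)
fifa = pd.read_csv("fifa21_raw_data.csv")

# What the CSV Reader preview shows: row/column counts, data types and first rows
print(fifa.shape)
print(fifa.dtypes)
print(fifa.head())

# Transformation-tab equivalent: rename columns to clearer names (illustrative mapping)
fifa = fifa.rename(columns={"OVA": "OverallRating", "POT": "Potential"})
```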

After importing the dataset and changing the column names, I created a metanode called “Player Name” where I prepared the name columns. I started by removing unwanted columns like “Name”, “LongName” and “photoUrl”: the first two contain too many special characters, which would hurt the quality of the dataset, while the latter is just a link to each player’s picture and is not needed. The players’ names were then extracted from the playerUrl column by using the Cell Splitter node, with “/” as the delimiter, to separate the player names from the other characters in the column. After filtering out the unwanted split columns, the extracted values were capitalized using the String Manipulation node.

Nodes in the Player Name Metanode.
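
The extraction logic can be sketched in a few lines of pandas; the sample URL and the position of the name slug are assumptions made purely for illustration:

```python
import pandas as pd

# A hypothetical playerUrl value; the real column follows a similar pattern
urls = pd.Series(["http://sofifa.com/player/158023/lionel-messi/210006/"])

# Cell Splitter equivalent: split on "/" and keep the name slug
# (assumed here to be the second-to-last segment)
slugs = urls.str.rstrip("/").str.split("/").str[-2]

# String Manipulation equivalent: capitalize each word of the extracted name
player_names = slugs.str.replace("-", " ", regex=False).str.title()
print(player_names.tolist())   # ['Lionel Messi']
```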

The above method was also applied to the “Contract” column, which was converted to a number data type using the String to Number node. Afterwards, stray month abbreviations like Jan, Jul, Dec and Jun were converted to null values using the String Replacer node. This is shown below in the metanode named “Contract Null Remover”.

Nodes in the Contract Null Remover Metanode.
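
A rough pandas equivalent of this metanode is shown below; the sample Contract values are hypothetical, and the String Replacer and type-conversion steps are collapsed into a single to_numeric call:

```python
import pandas as pd

# Hypothetical Contract values in the spirit of the raw data
contract = pd.Series(["2004 ~ 2021", "Jun 30, 2021 On Loan", "Free"])

# Cell Splitter equivalent: split on "~" into contract start and end
parts = contract.str.split("~", expand=True)
parts.columns = ["ContractStart", "ContractEnd"]

# String Replacer + type-conversion equivalent, collapsed into one call:
# anything that is not a plain year (loan dates containing Jan/Jun/Jul/Dec,
# "Free", etc.) becomes a missing value
parts = parts.apply(pd.to_numeric, errors="coerce")
print(parts)
```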

After finishing the above process, the missing values and special symbols were replaced. The Missing Value node handled the missing values in the “Hits” column, while Cell Splitter nodes handled the missing values and special characters in the Contract columns. This is shown below in the metanode named “Special Character Remover”. The unwanted columns were also removed.

Nodes in the Special Character Remover Metanode.
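
As a rough illustration, the sketch below mimics this step in pandas; the sample values, the fill value of 0 and the star symbol are assumptions, since the exact strategies live inside the KNIME nodes:

```python
import pandas as pd

# Hypothetical values; filling with "0" is an assumption, as the Missing Value
# node offers several strategies (fixed value, most frequent, etc.)
hits = pd.Series(["271", None, "1.6K"])
hits = hits.fillna("0")

# Removing a special symbol, e.g. star characters attached to skill ratings
ratings = pd.Series(["4 ★", "3★", "5 ★"])
ratings = ratings.str.replace("★", "", regex=False).str.strip().astype(int)

print(hits.tolist())      # ['271', '0', '1.6K']
print(ratings.tolist())   # [4, 3, 5]
```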

The next step was to convert the weight values from pounds (lbs) to kilograms (kg) and the height values from feet and inches to centimeters (cm) using the Math Formula node. The “Weight Converter” and “Height Converter” metanodes show how this was implemented.

Weight Converter nodes.
Height Converter nodes.
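
The underlying formulas are simple, so here is a small pandas sketch of both conversions; the raw value formats shown are assumptions about how the column values typically look:

```python
import pandas as pd

# Hypothetical raw values such as "187lbs" and 6'2"
weight = pd.Series(["187lbs", "161lbs"])
height = pd.Series(["6'2\"", "5'10\""])

# Math Formula equivalent: pounds to kilograms (1 lb = 0.453592 kg)
weight_kg = weight.str.replace("lbs", "", regex=False).astype(float) * 0.453592

# Math Formula equivalent: feet and inches to centimeters (1 ft = 30.48 cm, 1 in = 2.54 cm)
ft_in = height.str.extract(r"(\d+)'(\d+)").astype(float)
height_cm = ft_in[0] * 30.48 + ft_in[1] * 2.54

print(weight_kg.round(1).tolist())   # [84.8, 73.0]
print(height_cm.round(1).tolist())   # [188.0, 177.8]
```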

The final step involved converting the players’ finances into number data types and representing the values as plain digits. The metanode “Player Finance” was used to perform this process.
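
A minimal pandas sketch of the idea is shown below; the “€103.5M” style of the raw strings is an assumption, and the helper function is purely illustrative:

```python
import pandas as pd

# Hypothetical raw strings in the "€103.5M" / "€560K" style
def to_number(amount: str) -> float:
    """Strip the currency symbol and expand the K/M suffix into plain digits."""
    amount = amount.replace("€", "").strip()
    multiplier = 1.0
    if amount.endswith("M"):
        multiplier, amount = 1_000_000.0, amount[:-1]
    elif amount.endswith("K"):
        multiplier, amount = 1_000.0, amount[:-1]
    return float(amount) * multiplier

finance = pd.Series(["€103.5M", "€560K", "€0"])
print(finance.apply(to_number).tolist())   # [103500000.0, 560000.0, 0.0]
```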

The final output of the whole data cleaning operation is a table with 18,979 rows and 74 columns, each with its appropriate data type.

Dataset Description after performing data preprocessing.

Attached is the link to the KNIME workflow described in the article. You can also follow me on LinkedIn and Twitter for more KNIME use cases.
