KNIME — Data Preprocessing: Dusting Off the Data for Modelling

Stack Errors
4 min read · Jul 26, 2022

No matter how good you are at building a skyscraper, if the foundation is weak it will fall one day. The same is true of modelling in data science. No matter which modelling technique you use, if you do not understand the data properly and do not put effort into cleaning it, you can never create a good model from raw data. That is why roughly 60–70% of a project's time is spent on preprocessing.

In this second episode of the KNIME series, we cover the essential nodes used during data preprocessing.

The legendary Titanic ML dataset from Kaggle is used here to walk readers through the preprocessing steps in KNIME. Click here for the problem details.

Let’s dive deep into it.

Reading the file

First, we need to read the data. KNIME's IO nodes support multiple formats such as xlsx, csv, and xls; for demonstration purposes, .csv is covered here.

Read the CSV file by double-clicking the CSV Reader node, available under IO > Read > CSV Reader.

The node can also be found directly by typing "Read" into the search box, as shown in the screenshot.

The file path can be set by clicking Configure and browsing to the required file.
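For readers who prefer code, the CSV Reader node corresponds roughly to pandas' `read_csv`. This is an illustrative sketch, not part of the KNIME workflow; the tiny inline sample below stands in for the real Kaggle file:

```python
import io
import pandas as pd

# Tiny sample in the Titanic format (illustrative; the real file comes from Kaggle)
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley",female,38
"""

# Equivalent of the CSV Reader node: parse the file into a table
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # 2 rows, 6 columns
```

In practice you would pass the browsed file path to `pd.read_csv` instead of the in-memory string.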

Changing the format

Data types can be changed at read time via the Transformation tab of the Configure dialog.

Irrelevant columns can also be unchecked in the same Transformation tab, removing them from the dataset before it enters the workflow.
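The Transformation tab's two actions, changing a column's type and unchecking a column, map onto familiar pandas operations. A minimal sketch (the sample values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Pclass": ["3", "1"],              # arrived as strings, should be integers
    "Ticket": ["A/5 21171", "PC 17599"],
})

# "Change type" in the Transformation tab: cast Pclass from string to int
df["Pclass"] = df["Pclass"].astype(int)

# "Uncheck column" in the Transformation tab: drop Ticket before further processing
df = df.drop(columns=["Ticket"])
```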

Missing Value Treatment

There are generic options available for imputing missing values.

Missing values can be replaced according to the requirement. As can be seen in the screenshot, missing values in string columns have been replaced with "Unknown", and missing values in double columns have been replaced with the median.

Apart from that, a column-specific treatment can be applied where needed.

Here, missing values in Age have been replaced with 999.
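The same imputation rules, generic per-type defaults plus one column-specific override, can be sketched in pandas (the sample rows are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Embarked": ["S", None, "Q"],   # string column with a gap
    "Fare": [7.25, None, 8.05],     # double column with a gap
    "Age": [22.0, None, 26.0],
})

# Generic rules, as in the Missing Value node: strings -> "Unknown", doubles -> median
df["Embarked"] = df["Embarked"].fillna("Unknown")
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Column-specific rule: missing Age -> fixed sentinel value 999
df["Age"] = df["Age"].fillna(999)
```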

Dropping irrelevant columns

Once the missing values are replaced, the next step is to remove unwanted columns.

For dropping those columns, one can use the Column Splitter node, which splits the dataset into two parts based on the column selection.

In this dataset, Name, Ticket and Cabin have been routed to the second output table, which effectively removes them from the working dataset.

Now that the unwanted columns have been removed, the dataset needs to be split into dependent and independent variables. This can also be done with the Column Splitter.

Here the first output contains the independent variables, and the second output contains the dependent variable, "Survived".
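Both uses of the Column Splitter amount to partitioning the columns of a table. A rough pandas equivalent (the small frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1], "Pclass": [3, 1],
    "Name": ["A", "B"], "Ticket": ["t1", "t2"], "Cabin": [None, "C85"],
    "Age": [22.0, 38.0],
})

# First Column Splitter: route unwanted columns to a second table
drop_cols = ["Name", "Ticket", "Cabin"]
kept = df.drop(columns=drop_cols)      # first output: working dataset
dropped = df[drop_cols]                # second output: discarded columns

# Second Column Splitter: independent variables vs. the target "Survived"
X = kept.drop(columns=["Survived"])
y = kept[["Survived"]]
```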

Binning

Some of the columns need special handling, such as creating bins.

This can be done with a binning node. Here, Age has been binned. Since the datatype is double, the CAIM Binner node is used. The datatype of the binned column is converted to string after the bins are created.
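CAIM itself is a supervised binning algorithm with no one-line pandas equivalent, but the general idea, turning a double column into labelled string bins, can be sketched with `pd.cut`. The cut points and labels below are assumptions for illustration, not CAIM's output:

```python
import pandas as pd

age = pd.Series([4.0, 22.0, 38.0, 71.0], name="Age")

# Bin the double column into labelled intervals, then convert labels to strings,
# mirroring how the binned column's datatype becomes string in KNIME
age_binned = pd.cut(age, bins=[0, 18, 60, 120],
                    labels=["child", "adult", "senior"]).astype(str)
```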

One Hot Encoding

One-hot encoding of categorical data can be done using the One to Many node, available under Manipulation.
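The One to Many node expands each categorical column into one 0/1 column per category, which is what pandas' `get_dummies` does. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One column per category, 1 where the row matches and 0 elsewhere
encoded = pd.get_dummies(df, columns=["Sex"], dtype=int)
print(list(encoded.columns))  # Sex_female, Sex_male
```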

This article has covered the basic level of preprocessing. KNIME is not limited to these nodes: an enormous number of nodes is available for every category, which is beyond the scope of this article. The KNIME team has done an amazing job providing nodes for each and every functionality.

Reach out to the KNIME Community for more information.

In the next article, we will introduce the visualization features that can be explored in KNIME.

StackErrors is managed by Ankita91 and Sreedev. Follow Stack Errors on Kaggle to explore our data science projects.
Let’s learn together. 💙


Stack Errors

Data Scientists pursuing AI and Data Science at Loyalist College, Toronto. Handled by Ankita and Sreedev, teamed up as Stack Errors.