TUTORIAL | PYTHON | KNIME ANALYTICS PLATFORM

Codeless Data Science with KNIME

Author: Fabio Rebecchi

Rosaria Silipo
Low Code for Data Science
14 min read · Jun 3, 2021

--

Introduction

In this article I create a step-by-step data science pipeline using a visual, codeless workflow in KNIME.

KNIME is an open-source environment where it is possible to gather and wrangle data, and to make sense of it through modeling and visualization techniques, in a drag-and-drop fashion. This lets the user forget about coding and focus on the data, the data science models and the concepts.

Furthermore, a KNIME visual workflow can be effective in increasing the pipeline's readability and maintainability over time.

Who Should Read This Article

Python is the most widely used programming language for data science, and many other languages are currently used in the field, such as R, Scala, Julia and even JavaScript and C#. But what if you have little or no programming skills? Can you still approach and use data science? It should be clear by now that the answer is yes.

This article is not only intended for data scientists who want to learn a different way to create and deliver data science products; it is also suitable for data analysts or SQL developers with little programming experience beyond good old SQL, and for stakeholders and business analysts who want to perform some self-service data science without having to learn how to code.

Dataset and Predictive Model

In this tutorial I use a Kaggle dataset, the IBM HR Analytics Employee Attrition & Performance (see Bibliography). This is a fictional dataset created by IBM data scientists, containing 50 columns describing 1470 employees, with data regarding age, distance from work, monthly income, total working years and so on. A key field is the one named Attrition, which contains "Yes" if the employee left the company and "No" otherwise. The goal of the tutorial is to load the dataset, pre-process it, explore it to gather useful insights, and build a predictive model able to predict whether a person is likely to leave the company.

KNIME Installation

To start our tutorial we need to have a copy of KNIME on our laptop. KNIME is free and can be downloaded and installed from https://www.knime.com/downloads.

Local Environment Setup

During installation, please take note of the location of the KNIME local workspace; for instance, on my MacBook it is under /Users/mac/knime-workspace.

Before launching KNIME, create a directory (for instance EmployeeChurn) under the local workspace and copy the files EmployeeAttritionLabels.csv and EmployeeData.csv there (for instance, the path of the first file in my case will be /Users/mac/knime-workspace/EmployeeChurn/EmployeeAttritionLabels.csv).

KNIME Extension Installation

Open KNIME and, under File->Install KNIME Extensions…, select the extensions highlighted in Figure 1 to install the ones we are going to use (extensions contain additional tools, nodes and features beyond the standard installation).

Figure 1 — Installing KNIME Extensions

Workflow Creation

Now it’s time to create our KNIME workflow, selecting File->New->New KNIME Workflow (Figure 2) and naming it KNIME-Employee-Churn.

Figure 2 — Workflow Creation

KNIME User Interface

Now that everything is set up, we can take a quick look at the KNIME user interface, as shown in Figure 3. There are seven main elements of the visual environment:

  1. Top horizontal menu, with some useful shortcuts to frequently used functions
  2. Explorer on the top left, where it’s possible to navigate pre-built examples or the files in the local workspace
  3. Node Repository, where nodes can be browsed by category or full text search and then dragged into the Workspace
  4. Central Workspace, where the visual workflows can be built and managed
  5. Outline, where a preview of the whole Workspace is represented
  6. Console, where error, warning or status messages are shown
  7. Node Monitor, where the status and information of the selected node are shown

Figure 3 — KNIME user interface

Creating File Reader Nodes

Inside the workflow we can create our first KNIME nodes by dragging the two files from the KNIME Explorer to the Workflow space in the center (Figure 4). KNIME automatically recognizes the file type and creates the proper Reader node, in this case a CSV Reader. Clicking on the "Node 1" and "Node 2" labels, we can rename the two nodes "Attrition Labels" and "Employees Data".

Figure 4 — File Reader Nodes

We could also have created the Reader nodes from the Node Repository in the left menu, selecting CSV Reader under IO->Read (Figure 5).

Figure 5 — The Node Explorer

Double-clicking on the Reader, we can explore the file characteristics, such as a preview of the first 50 rows. It is possible that the reader picked the wrong column delimiter, as in Figure 6; we can tell because the columns in the preview are not properly separated, but appear as one long string per row.

Figure 6 — File Reader Attributes, wrong delimiter selected

To solve this, we can click on "Autodetect format" or directly type ";" as the column delimiter, then click OK - Execute to correct the issue and see the columns properly separated, as in Figure 7.

Figure 7 — Fixing delimiter issue in preview
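
For readers coming from Python, a rough pandas sketch of what the two CSV Reader nodes do is shown below; the file paths follow the local workspace layout described earlier (adjust them to your machine), and this is only an illustrative equivalent, not what KNIME executes internally.

import pandas as pd

# Roughly what the CSV Reader nodes do: parse the two files with an explicit
# ";" delimiter instead of relying on autodetection.
labels = pd.read_csv(
    "/Users/mac/knime-workspace/EmployeeChurn/EmployeeAttritionLabels.csv",
    sep=";",
)
employees = pd.read_csv(
    "/Users/mac/knime-workspace/EmployeeChurn/EmployeeData.csv",
    sep=";",
)

print(labels.head())      # quick look at the first rows, like the node preview
print(employees.head())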

Creating Annotations

Furthermore, we can right-click (Ctrl+click on Mac) anywhere in the KNIME Workspace to create an annotation. Once the annotation is created, we can click on the pencil next to it to edit it, playing with the font and border/background style. Here I created a title at the top and a box under the file readers, to show that they belong to the "Load Data" step, as shown in Figure 8.

Figure 8 — CSV Reader nodes

Joining Data Sources

Once we have our data sources, we can perform an inner join on them using the Joiner node, which can be found in the Node Repository by typing "join" in the search box. The Joiner must be dragged onto the Workspace and dropped after the data sources, then linked to them manually by clicking on each small "output" triangle on the data sources and dragging the line to the Joiner's "input" triangle. The final situation will look like Figure 9, where a proper annotation has been added as well.

Figure 9 — Adding and linking the Joiner

Once the Joiner node is linked, we need to define the joining mode. Double-clicking on the Joiner, we choose the "Inner Join" mode and pick the columns we are joining on, which are named Id in both tables (Figure 10).

Figure 10 — Joiner Criteria

Furthermore, we can exclude some columns containing unnecessary information, such as "EmployeeCount", "EmployeeNumber", "Over18" and "StandardHours", by going to the "Column Selection" tab and using the arrows in the multiple-selection boxes to move them to the Exclude section, as shown in Figure 11.

Figure 11 — Excluding columns
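
For Python users, the inner join and the column exclusion together can be sketched with pandas roughly as follows; the column names are the ones mentioned above, and the snippet continues the reading sketch shown earlier.

# Inner join on the Id column, roughly equivalent to the Joiner node.
joined = employees.merge(labels, on="Id", how="inner")

# Drop the columns excluded in the "Column Selection" tab.
joined = joined.drop(
    columns=["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"]
)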

After clicking on Apply, we can execute the Joiner by right-clicking on it and selecting "Execute", and check that no error messages appear in the bottom-right console. We can then verify that the joined table has actually combined the employee "churn" information with the other data by clicking on "Joined Table", as in Figure 12. Please note that there is a "traffic light" below each node. The light is red when the node is not properly linked or configured. It becomes amber when the node is linked, configured and ready to be executed, and finally green when it has executed successfully.

Figure 12 — Inspecting Joiner results

Processing Fields

Inspecting the joined table, we see that many fields are categorical and that the BusinessTravel field is a three-valued string describing no travel, medium-frequency travel or high-frequency travel. To make the data manageable by our predictive model, we first need to convert BusinessTravel to a number (0/1/2) and then one-hot encode the categorical columns such as "MaritalStatus", "EducationField", "Department" and so on (check the Bibliography for more details about one-hot encoding). To do that, we can apply the Rule Engine and One to Many nodes in series. As before, we use the Node Repository search to find the Rule Engine node and drag it into the Workspace, connecting it to the Joiner, as in Figure 13.

Figure 13 — Adding the Rule Engine

Double-clicking on the Rule Engine node, we can tell it to replace the column "BusinessTravel" (at the bottom of the window, setting the radio button to Replace and selecting "BusinessTravel" from the drop-down menu) with the integer 0 if the field contains the string "Non-Travel", 1 if it contains "Travel_Rarely" and 2 if it contains "Travel_Frequently", as expressed by the simple rules we added in the Expression box (Figure 14). Remember, as usual, to apply, execute and then inspect the results.

Figure 14 — Transforming the BusinessTravel field
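
The three rules in the Expression box boil down to a simple string-to-integer mapping. A minimal Python sketch of the same transformation, assuming the three string values listed above, could be:

# Replace the BusinessTravel strings with 0/1/2, mirroring the Rule Engine rules.
travel_map = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
joined["BusinessTravel"] = joined["BusinessTravel"].map(travel_map)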

To perform one-hot encoding, we add and connect the One to Many node and configure it to replace (select the checkbox "Remove included columns from output") all the string columns with one-hot encodings, apart from "Attrition", using the arrows in the multiple-selection dialog box to move "Attrition" to the left, as shown in Figure 10. "Attrition" does not need to be transformed because it is not an independent variable for our problem; it is actually the dependent variable to be predicted.

Here we get a warning (yellow exclamation mark on the node) in the Console below, saying "Duplicate possible values found. Original column name will be appended". This is correct, since some of the selected columns have identical field values (for instance "Human Resources" is present both in the JobRole and in the EducationField columns), so to avoid confusion the new columns will all have names made up of the field value concatenated with the original column name, e.g. "Human_Resources_Job_Role".

Again, we execute and inspect the result to check that it is correct (see the "One Hot Encoding" entry in the Bibliography if you need a refresher on the subject).

Figure 15 — Adding and configuring the one-hot encoding
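
The one-hot encoding reference in the Bibliography uses pandas get_dummies, which also gives a quick way to see what the One to Many node produces. A rough sketch is shown below; note that get_dummies prefixes each new column with the original column name, which sidesteps the duplicate-value issue discussed above.

# One-hot encode every remaining string column except the target "Attrition".
categorical_cols = [
    c for c in joined.select_dtypes(include="object").columns if c != "Attrition"
]
encoded = pd.get_dummies(joined, columns=categorical_cols)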

The last action we perform on the fields is to color the rows red or blue depending on the churn field value, something that will come in handy in the visualization part, for instance when plotting or exploring the decision tree results. We can do that by adding and linking the Color Manager node, as shown in Figure 16.

Figure 16 — Adding the Color Manager node

Inside the Color Manager, we can associate red with the Attrition value "Yes" and blue with the Attrition value "No" by selecting the values and clicking on the desired color in the palette, as shown in Figure 17.

Figure 17 — Selecting colors in the Color Manager
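
In code, the same idea is usually just a mapping from class value to color kept alongside the data for later plots; a tiny illustrative sketch:

# Red for churners, blue for everyone else, mirroring the Color Manager settings.
color_map = {"Yes": "red", "No": "blue"}
row_colors = encoded["Attrition"].map(color_map)  # usable later, e.g. as scatter-plot colors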

Preparation: Partitioning and Sampling

The last steps needed before we can run the training are the partitioning and sampling of our dataset. Partitioning is performed because we need to "put aside" part of our dataset for a later validation of the trained model. Of course, if the model were trained and tested on the same dataset, its accuracy would be misleadingly high, because it would be tested against samples it had already "seen" in the past.

Training the model on 80% of the samples and validating it on the remaining 20% of previously unseen ones is a common practice used to avoid this trap and measure a reliable accuracy value. To partition the current records into two random 80–20 groups, we can use the Partitioning node as shown in Figure 18.

Figure 18 — Adding a Partitioning Node

The random selection and the 80–20 ratio can be easily set in the Partitioning node's configuration window, as shown in Figure 19.

Figure 19 — Configuring the partitioning node
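
For comparison, the same random 80-20 split can be sketched with scikit-learn's train_test_split; the random_state value is arbitrary and only there to make the sketch reproducible.

from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for validation, drawn at random.
train_df, test_df = train_test_split(encoded, test_size=0.2, random_state=42)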

The last preparation step involves the 20% dataset to be used for validation. It needs to be further processed to ensure that the two classes, that is the churning and the non-churning employees, are equally represented. This is obtained by randomly removing rows belonging to the majority class until the two classes contain the same number of rows.

We need this step because if the classes were severely unbalanced, let's say 9 out of 10 employees not churning, a very simple (and dumb) model predicting "no churn" 100% of the time would nevertheless achieve 90% accuracy, which would again be misleading. To obtain equal representation we can use the Equal Size Sampling node, as shown in Figure 20.

Figure 20 — Adding the Equal Size Sampling node

The Equal Size Sampling node is configured by selecting the column representing our class of interest, in this case “Attrition”, and the “Use exact sampling” mode, as shown in Figure 21. As usual, don’t forget to execute the node.

Figure 21 — Configuring the Equal Size Sampling node
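
A rough pandas equivalent of the Equal Size Sampling node is to downsample each class of the validation set to the size of the smallest one; a sketch under that assumption (group-wise sampling requires a reasonably recent pandas version):

# Downsample each Attrition class to the size of the smallest one (exact sampling).
n_min = test_df["Attrition"].value_counts().min()
balanced_test = (
    test_df.groupby("Attrition", group_keys=False)
    .sample(n=n_min, random_state=42)
)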

Model Training

The moment has come to train our predictive model, which is going to be a Decision Tree. A Decision Tree is an algorithm that recursively splits the rows of a dataset into binary subgroups, each time basing the splitting decision on a threshold value for a single column, in a way that maximizes the homogeneity of the subgroups with respect to the class variable.

This means, for instance, that if I am training a Decision Tree to predict the quality of a red wine from a series of variables, and the algorithm is splitting on the sulfite content, the concentration threshold for the split will correspond to the value giving the highest possible ratio of good-quality wines in one subgroup and the lowest in the other.

The final goal of the algorithm is to create the smallest tree among the many possible ones, while at the same time keeping the tree's predictive efficacy on new samples high (see more on Decision Trees in the Model Evaluation paragraph and in the Bibliography).

To train a Decision Tree in KNIME I need to create a Decision Tree Learner node and connect it to the 80% output of the Partitioning node as shown in Figure 22.

Figure 22 — Creating a Decision Tree Learner node

The configuration window for the Decision Tree, where we leave all parameters at their default settings for this tutorial, is shown in Figure 23 (check that the class we are training the tree against is actually "Attrition").

Figure 23 — Configuring a Decision Tree Learner node
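
In scikit-learn terms, the Decision Tree Learner with default settings corresponds roughly to fitting a DecisionTreeClassifier on the 80% training partition; a minimal sketch follows, keeping in mind that KNIME's and scikit-learn's default parameters and split criteria are not identical.

from sklearn.tree import DecisionTreeClassifier

# Separate the independent variables from the class we want to predict.
X_train = train_df.drop(columns=["Attrition"])
y_train = train_df["Attrition"]

tree = DecisionTreeClassifier()  # default settings, as in the tutorial
tree.fit(X_train, y_train)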

Model Evaluation

Once the model is trained, the first thing we can do is explore the tree's learned insights by right-clicking on the Decision Tree Learner node and selecting the "View: Decision Tree View" option from the drop-down menu. As shown in Figure 24, the first split was done on the TotalWorkingYears column, with a threshold of 1.5 years being the most effective in obtaining subgroups with maximum segregation with respect to the Attrition class.

The figure also gives us a valuable insight: the probability of churning is much higher for people with fewer than 1.5 total working years, and once that threshold is crossed it drops considerably, so the company should take special care in retaining employees at the start of their working life.

Figure 24 — Decision Tree View for TotalWorkingYears

Exploring the tree using the navigator on the right of the Tree View and opening branches with the "+" symbol, other useful insights can be found, for instance those shown in Figure 25, which displays how people younger than 33.5 years or with a monthly income below $2470 are more likely to leave.

Figure 25 — Decision Tree View for MonthlyIncome and Age

Furthermore, we can check the model accuracy on the 20% dataset by feeding the model into a Decision Tree Predictor node, that is, the node able to load a trained decision tree model and predict class labels for an input dataset. The Decision Tree Predictor node is shown in Figure 26. Here as well we accept the default parameters and execute the node.

Figure 26 — Adding a Decision Tree Predictor
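
The prediction step corresponds roughly to calling predict on the balanced validation rows; a sketch continuing the previous snippets:

# Apply the trained tree to the balanced 20% validation set.
X_test = balanced_test.drop(columns=["Attrition"])
y_test = balanced_test["Attrition"]

predictions = tree.predict(X_test)  # analogous to the Prediction (Attrition) column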

The prediction results can be explored by right-clicking on the node and choosing "Classified Data" at the bottom of the drop-down menu. A new column, called Prediction (Attrition), is visible on the right, and the predicted results can be visually checked against the original labeling, which is given by the record color (red = "Yes", blue = "No"), as shown in Figure 27.

Figure 27 — Exploring prediction results

To measure how many guesses were right or wrong, we can add and execute the last node of this tutorial, the Scorer node, as shown in Figure 28.

Figure 28 — Adding the Scorer node

The Scorer node is configured as usual by double-clicking on it, paying attention that the accuracy is measured between "Attrition" and "Prediction (Attrition)", as shown in Figure 29.

Figure 29 — Configuring the Scorer node

At last we can check the model accuracy by right-clicking on the Scorer node and selecting the View: Confusion Matrix option, as shown in Figure 30. As shown in the figure, 48 out of 51 non-churners (the "No" row in the matrix) were correctly identified, while only 20 out of 51 churners were correctly identified.

This gives a total of 68 correctly classified samples against 34 misclassified ones, for an overall accuracy of 2/3 (66.667%). This is surely better than a "random" classifier, which would give 50% accuracy on an equally balanced dataset, and it is not a bad result for a model that used all default settings without any fine-tuning, but the pipeline clearly still has room for improvement (as a hint for further practice, the author obtained an accuracy of 85% using a Logistic Regression model instead of a Decision Tree).

Figure 30 — Model results
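
The Scorer's confusion matrix and accuracy can be reproduced with scikit-learn's confusion_matrix and accuracy_score, as sketched below; the exact counts will differ from Figure 30, since they depend on the random partitioning and sampling.

from sklearn.metrics import accuracy_score, confusion_matrix

print(confusion_matrix(y_test, predictions, labels=["No", "Yes"]))
print(accuracy_score(y_test, predictions))

# Sanity check on the figures reported above: 48 + 20 correct out of 102 rows.
print((48 + 20) / (48 + 3 + 31 + 20))  # ~0.667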

Conclusions

We have created a very simple data pipeline and predictive model using KNIME's visual environment. You can download the workflow used in the example from the link in the Bibliography. There is much more to learn in the KNIME self-paced courses and in the KNIME cheat sheets (see Bibliography).

I hope you found this article useful, and I encourage you to contact me via LinkedIn for feedback, suggestions or simply networking.

Bibliography

Decision Tree: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

KNIME cheat sheets: https://www.knime.com/cheat-sheets

KNIME self-paced course: https://www.knime.com/knime-self-paced-courses

IBM HR Analytics Employee Attrition & Performance: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

One Hot Encoding: https://towardsdatascience.com/what-is-one-hot-encoding-and-how-to-use-pandas-get-dummies-function-922eb9bd4970

Download the workflow shown in the example from the KNIME Hub: https://we.tl/t-bbHp182L0C

As previously published on LinkedIn Pulse https://www.linkedin.com/pulse/codeless-data-science-knime-fabio-rebecchi/


Rosaria Silipo
Low Code for Data Science

Rosaria has been mining data since her master's degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.