Want to do Data Analysis without coding? Use KNIME!

Rui Wang
SFU Professional Computer Science
13 min read · Feb 3, 2020

Nattapat Juthaprachakul, Rui Wang, Siyu Wu, Yihan Lan

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

KNIME (pronounced ‘nime’, just like ‘time’)

1) Introduction

To develop a Machine Learning model, one needs to understand Linear Algebra, Statistics, and other important mathematical concepts. Even if you are already comfortable with those subjects, you still need to learn not only how to code but also several important Computer Science concepts such as algorithms and databases. Additionally, in order to start a Machine Learning project, one needs to install and set up a whole coding environment and learn how to use the command line. These chores are tedious, and having to learn so many concepts simultaneously becomes one of the biggest challenges for newcomers trying to get into the Data Science, Data Analysis, and Machine Learning spaces. Unfortunately, some people give up and never look back at Machine Learning again.

However, there is some GOOD NEWS! With the great development of GUI-based applications, the introduction of KNIME is a major game changer for people who generally do not identify themselves as programmers. The major benefit of KNIME is that no programming knowledge is required: if you know how to use Microsoft Excel, KNIME will feel familiar. All you need to do is visit the KNIME website and download the application (https://www.knime.com/downloads/download-knime). You are then ready to get into Machine Learning and use the application without any further setup. To load data, just drag and drop your file in! To clean and pre-process data, just click the functions! To select, train, and test Machine Learning models, just drag and drop any model you want! To visualize your interesting findings, just drag and drop any kind of graph you want! This lets you focus your efforts on applying Machine Learning algorithms and techniques to the problems and subjects you are interested in from the first day of your work!

In summary, to use KNIME, all you need to do is define a Workflow between a variety of predefined components, called “Nodes”, which are already provided in its repository. This is very convenient, since KNIME ships nodes for numerous tasks such as reading data, cleaning data, applying ML algorithms, visualizing data in different formats, and analyzing results.

1.1) What is KNIME?

KNIME stands for “Konstanz Information Miner”; its development started at the University of Konstanz in Germany in January 2004. It is open-source software written in Java and built on the Eclipse platform. The KNIME platform relies on predefined components called ‘Nodes’ for building and executing ‘Workflows’. Its core functionality covers tasks such as machine learning, data mining, analysis, and manipulation, and extra features and functionality are available through various extensions contributed by community groups and vendors.

1.2) Why use KNIME?

KNIME is a GUI-driven analytics platform. This means that knowledge of coding is not a requirement (a small amount of code may be needed if you want to add more complexity to your workflow, but it is minimal). In addition, as stated before, KNIME is an open-source application, meaning it is free to use. It is also a powerful, fully functional GUI-based application that helps us understand the whole complex Machine Learning process from start to finish by creating, editing, annotating, visualizing, and sharing workflows. Furthermore, it allows us to integrate data from many sources (files, databases, web services) and to perform essential Machine Learning operations ranging from basic I/O to data manipulation, data transformation, and data mining. In summary, KNIME consolidates many different processes into one single, understandable Workflow.

1.3) What can we do with KNIME? Examples (available on the KNIME Hub)

I). Topic Detection Analysis on Movie Reviews

II). Model Classification

III). Churn Prediction

IV). Credit Scoring

1.4) KNIME Workbench

I). Workflow Project:

It consists of the LOCAL workspace, which contains all workflows you have created on your own machine; the KNIME Hub, where you can connect to the KNIME online server and community; and the EXAMPLES workspace, where you can find ready-to-use example projects created by the KNIME community.

II). Recommended Nodes or Workflow Coach:

It lists nodes recommended based on the workflows built by the wide community of KNIME users.

III). Main Tab:

We can also call it a ‘Toolbar’ tab. It provides various basic functions for operating KNIME, such as executing and cancelling selected nodes.

IV). Project Tabs:

They show our currently open projects; you can create and execute several projects at the same time.

V). Node Repository:

It contains all the nodes available in the core KNIME Analytics Platform and in any extensions you have installed. The nodes are organized into categories based on their function; under each main category, you can expand and select the specific node you need.

Tip: you can also use the search box at the top of the Node Repository to find specific nodes.

5.1). Nodes: A node can have 3 states.

5.1.1). Red: the “Not Ready/Idle” state, which means the node has not yet been configured and cannot be executed with its current settings.

5.1.2). Yellow: the “Ready/Configured” state, which means the node has been set up correctly and can be executed at any time.

5.1.3). Green: the “Executed” state, which means the node has been executed successfully and its results are available to the downstream nodes.

VI). Node Description:

It shows the description of the currently active workflow or a selected node in the Workflow Editor or Node Repository.

Tip: it is very useful in the early stages of learning, when you are new to KNIME, do not yet have much knowledge about Machine Learning, or forget the purpose of a node in the Workspace or the Node Repository.

VII). Outline:

It is an overview of the currently active workflow.

Tip: it is very useful when your workflow grows large; the outline works as a map or big picture of your workflow space.

VIII). Console:

It shows execution messages and status, which indicate what is going on in the current workflow, such as successful operations, file errors, and so on.

Tip: it is very useful for diagnosing problems in the workflow and examining the analysis results.

IX). Public Server:

This tab lets you connect to the public KNIME Server when you want to search for something on the online KNIME Hub.

2) Basic Process of Data Analysis with KNIME

I). Data Reading:

Usually, the first thing we do when analyzing data is to read it in. In the ‘Node Repository’, we can find all kinds of reader nodes, such as the CSV Reader, Excel Reader, and Table Reader nodes. All we need to do is drag and drop the node we want into the ‘Workflow Editor’.

By right-clicking the node, we can change its configuration; for example, we can select the path of the file we want to read from and then execute the node. If the node executes successfully, its status light turns from red to green. We can then have a look at the data loaded by the executed node.
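Although KNIME does this step without any code, it can help to see roughly what the CSV Reader node does under the hood. Here is a minimal Python (pandas) sketch; the file name "data.csv" is just a placeholder for illustration:

```python
import pandas as pd

# Rough equivalent of the CSV Reader node:
# read a CSV file from a path into a data table.
# "data.csv" is a placeholder file name for illustration.
df = pd.read_csv("data.csv")

# Inspect the first few rows, similar to viewing the node's output table.
print(df.head())
```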

II). Data Pre-processing:

2.1) Filtering:

Most of the time, we do not need all of the information in our dataset. The ‘Row Filter’ and ‘Column Filter’ nodes help us select the rows and columns we want to use; we simply set the node’s configuration to extract the specific rows and columns we intend to keep.
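For comparison, here is a rough pandas sketch of what the ‘Row Filter’ and ‘Column Filter’ nodes do; the column names and values are made up for illustration:

```python
import pandas as pd

# A tiny example table; column names and values are invented.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cho"],
                   "age": [34, 17, 25],
                   "city": ["Berlin", "Oslo", "Lyon"]})

# Column Filter equivalent: keep only the columns we need.
cols = df[["name", "age"]]

# Row Filter equivalent: keep only rows that match a condition.
adults = df[df["age"] >= 18]
```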

2.2) Obtaining Description:

After selecting the columns, we may want to see a description of the data; for example, the minimum, maximum, mean, and standard deviation of each numeric column. All we need to do is find the ‘Data Explorer’ node in the ‘Node Repository’ and drag it into the ‘Workflow Editor’. Then we connect the output port of the current node (‘Node 2’) to the input port of the newly added node (‘Node 3’) by drawing the connection arrow between them. After that, we can execute the new node by right-clicking ‘Node 3’ and choosing the ‘Execute and Open Views’ option.

Now we can see the description of our data.

Additionally, KNIME gives us even more information: we can also see the distribution of data in each column as a bar chart in this step.
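As a point of reference, a rough pandas equivalent of the ‘Data Explorer’ summary might look like the sketch below, using a small made-up table:

```python
import pandas as pd

# Example table with made-up values.
df = pd.DataFrame({"age": [34, 17, 25, 41],
                   "fare": [7.3, 21.0, 13.5, 8.1]})

# Data Explorer equivalent: per-column summary statistics
# (count, mean, standard deviation, min, max, quartiles).
print(df.describe())

# Per-column value counts, comparable to the node's distribution bar charts.
print(df["age"].value_counts())
```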

2.3) Combining or Joining:

Sometimes, we may need to combine datasets from various sources into one single dataset to get all of the information we want to use. Using the ‘Joiner’ node, we can join two datasets into one, with different join modes such as inner join, left outer join, or right outer join.
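For readers who are curious, a rough pandas sketch of the ‘Joiner’ node’s behaviour could look like this; the two tables and the key column "id" are invented for illustration:

```python
import pandas as pd

passengers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cho"]})
tickets = pd.DataFrame({"id": [1, 2, 4], "fare": [7.3, 21.0, 13.5]})

# Joiner node equivalent: combine two tables on a key column.
inner = passengers.merge(tickets, on="id", how="inner")  # inner join
left = passengers.merge(tickets, on="id", how="left")    # left outer join
```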

2.4) Removing the missing values:

The ‘Missing Value’ node helps handle missing values found in the cells of the input table. For example, we can replace missing values in a numeric column with the mean of that column; similarly, missing values in a string column can be replaced with the most frequent value occurring in that column.
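A rough code equivalent of this imputation strategy, sketched in pandas with a tiny made-up table:

```python
import pandas as pd

df = pd.DataFrame({"age": [34.0, None, 25.0],
                   "city": ["Berlin", None, "Berlin"]})

# Missing Value node equivalent:
# numeric column -> replace with the column mean,
# string column  -> replace with the most frequent value.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```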

2.5) Sorting:

The ‘Sorter’ node sorts rows according to user-defined criteria. In the dialog box, we can select the columns by which our data should be sorted, and whether each should be sorted in ascending or descending order.
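In code, the same sorting step might look roughly like this pandas sketch (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cho"], "age": [34, 17, 25]})

# Sorter node equivalent: sort rows by one or more columns,
# in ascending or descending order.
sorted_df = df.sort_values(by="age", ascending=False)
```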

III). Model Selection and Data Analysis:

KNIME offers many analytic methods. In this example, we apply a Machine Learning algorithm called Random Forest to perform our analysis. We just drag the ‘Random Forest Learner’ node from the ‘Node Repository’ and drop it into our ‘Workflow Editor’. We can then set the configuration of the model node, such as the number of trees, and execute it to train our model. After that, if we want to make a prediction, we drag the ‘Random Forest Predictor’ node from the ‘Node Repository’ into the ‘Workflow Editor’ and execute it. We can now see the prediction results.
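To make the comparison with code concrete, here is a rough scikit-learn sketch of the same Learner/Predictor pair; it uses the built-in Iris dataset purely as a stand-in, since any tabular data would do. The node’s configuration dialog corresponds roughly to the constructor parameters here:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset just to have something to train on.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest Learner equivalent: configure and train the model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Random Forest Predictor equivalent: apply the trained model to new data.
predictions = model.predict(X_test)
print(predictions[:5])
```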

IV). Visualization:

KNIME has many different kinds of plot nodes. For example, we can combine the ‘Color Manager’ node and the ‘Scatter Plot’ node to customize colors and draw a scatter plot showing the distribution of age. In the configuration dialog box, we can select the colors and choose which columns go on the x-axis and y-axis.
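A rough matplotlib sketch of the same idea, using a small made-up table in which points are colored by a ‘survived’ column:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"age": [34, 17, 25, 41, 29],
                   "fare": [7.3, 21.0, 13.5, 8.1, 30.0],
                   "survived": [0, 1, 1, 0, 1]})

# Color Manager + Scatter Plot equivalent: color points by a category
# and choose which columns go on the x- and y-axes.
colors = df["survived"].map({0: "red", 1: "green"})
plt.scatter(df["age"], df["fare"], c=colors)
plt.xlabel("age")
plt.ylabel("fare")
plt.show()
```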

3) A Sample: Titanic Survival Prediction with KNIME

With the information and descriptions above, we now know how basic KNIME operations work. To show how a complete KNIME workflow fits together, here is our sample workflow for Titanic survival prediction.

I). Data Reading:

The dataset we use here is the Kaggle Titanic dataset (https://www.kaggle.com/c/titanic/data), which contains passenger information such as name, age, gender, socio-economic class, and so on. The goal is to predict whether each passenger on board survived. As the dataset is in CSV format, we can use the ‘CSV Reader’ node.

II). Data Pre-processing:

In the input table, there are missing values in the ‘age’ and ‘cabin’ columns. We use the ‘Missing Value’ node to replace missing values of numeric type with the mean value and missing values of string type with the most frequently occurring value. We also need to change the data type of the ‘survived’ column from numeric to string using the ‘Number To String’ node. After that, we split the dataset into training and test sets with the ‘Partitioning’ node, which splits the input table into two partitions according to a specified ratio; we use 70% of the data for training and 30% for testing. Our dataset is now ready for the data analysis and visualization steps.
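For comparison, the same preprocessing and partitioning could be sketched in pandas and scikit-learn roughly as follows; "train.csv" refers to the Kaggle Titanic training file, and the column names follow the Kaggle schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the Kaggle Titanic training file (CSV Reader equivalent).
df = pd.read_csv("train.csv")

# Missing Value node equivalent: mean for numeric, most frequent for strings.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Cabin"] = df["Cabin"].fillna(df["Cabin"].mode()[0])

# Number To String equivalent: treat the target as a categorical label.
df["Survived"] = df["Survived"].astype(str)

# Partitioning node equivalent: 70% training data, 30% test data.
train_df, test_df = train_test_split(df, train_size=0.7, random_state=42)
```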

III). Model Training:

In this sample, we plan to use the Random Forest algorithm to train our model, as it is one of the simplest, most explainable, and most powerful Machine Learning algorithms and is easy to train. We select the ‘Random Forest Learner’ node from the ‘Node Repository’. In its configuration, we set the number of decision trees, set the target column to ‘Survived’ (because that is what we want to predict), include some columns as feature columns, and set the split criterion to ‘Gini index’. (Note that there are many other feature selection techniques and split criteria we could try.)

IV). Prediction:

The ‘Random Forest Predictor’ node makes predictions by aggregating the predictions of the individual trees in our Random Forest model. We use it to predict whether each passenger in the test dataset survived.
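Putting the Learner and Predictor steps together, a rough scikit-learn sketch might look like this; it assumes the `train_df` and `test_df` tables from the partitioning sketch above, and the feature list is only an example (in KNIME you would pick the feature columns in the node dialog):

```python
from sklearn.ensemble import RandomForestClassifier

# Example feature columns (numeric Kaggle columns, chosen for illustration).
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

# Random Forest Learner equivalent: 100 trees, Gini index as split criterion,
# 'Survived' as the target column.
clf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
clf.fit(train_df[features], train_df["Survived"])

# Random Forest Predictor equivalent: predict survival for the test partition.
predicted = clf.predict(test_df[features])
```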

V). Model Evaluation:

The ‘Scorer’ node reports a confusion matrix as well as a number of accuracy-related statistics, such as true positives, false positives, true negatives, false negatives, recall, precision, sensitivity, specificity, F-measure, overall accuracy, and Cohen’s kappa. Furthermore, we can use the ‘ROC Curve’ node to visualize the ROC curve of our model, which summarizes the evaluation results.
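A rough scikit-learn sketch of the same scoring step, assuming the `clf`, `features`, `test_df`, and `predicted` objects from the sketches above:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, roc_auc_score)

# Scorer node equivalent: confusion matrix and accuracy statistics.
y_true = test_df["Survived"]
print(confusion_matrix(y_true, predicted))             # TP/FP/TN/FN counts
print("accuracy:     ", accuracy_score(y_true, predicted))
print("F-measure:    ", f1_score(y_true, predicted, pos_label="1"))
print("Cohen's kappa:", cohen_kappa_score(y_true, predicted))

# ROC Curve node equivalent: needs class probabilities rather than hard labels.
probs = clf.predict_proba(test_df[features])[:, 1]     # probability of class "1"
print("ROC AUC:      ", roc_auc_score((y_true == "1").astype(int), probs))
```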

Conclusion

KNIME is a powerful platform that is very easy to learn and use. The life cycle of Data Science involves data collection, data cleaning, data integration, analysis/modeling, and visualization, and KNIME is useful and powerful because users can complete all of these steps on a single platform. Furthermore, it is easy to learn because users do not need any programming background. It makes data analysis accessible to everyone, especially to people who only need to analyze data occasionally. KNIME, however, still has some drawbacks. Compared to Python or other programming languages, it is not flexible enough for some specific tasks: Python lets you customize your own programming style and environment, while KNIME does not. In addition, the community of KNIME users is small compared to those of other analytics platforms, which can be a problem when you are looking for support. All in all, we believe that KNIME benefits the overall Data Science community, as it introduces a powerful analytics platform to newcomers and non-programmers.

