Going on an adventure? Don’t forget about the Swiss Army knife!

Armin Ghassemi Rudd
Act of Intelligence Accretion
8 min readMay 25, 2019

--

Are you going on the adventure of building a KNIME workflow? If so, then how about having a node similar to a Swiss Army knife in your repository?

To have the functions in the “ Math Formula” and the “ String Manipulation” nodes (and even more) in a single node plus having the ability to define different expressions for several columns at once, we can use “ Column Expressions “ node. This node allows us to use expressions with the JavaScript syntax. We can define temporary variables within our expressions. These variables can maintain some values for us so we can use them somewhere else in our current expression. The name of these variables must be different from the functions which are predefined for this node.

We can use different functions and calculations in several lines as a single expression which then will write results into a column that we have set in the configurations. This node is obviously one of the most powerful nodes in KNIME which can handle several tasks in one step.

Below we are going to show you an example in which some functionalities of the node are demonstrated.

Let’s suppose we have a table of some products and the monthly income by selling each of them. (The income is in 1000$)

Now we want to calculate the average of our income for each product and have the result in dollars (not 1000) with the dollar sign “$”. Furthermore, we want to check whether our income from each one of the products is increased, decreased or not changed after the time period. (The first three months of 2018)

So we need to append two new column: one as “salesAvg” for the average income and the other one as “salesDiff” for the income difference between the first and the last month. We can do this all by using a single “Column Expressions” node.

After creating our example table using a “ Table Creator” node, we drag and drop the “ Column Expressions “ node into our workflow editor panel and connect the output port of the “Table Creator” node the input port of our “Column Expressions” node. Now let’s configure the node and add the expressions we need.

First of all, as we need two output columns, so we add two expressions by clicking on the plus sign.

We can change the column names by editing the “Output Column” settings for each of the expressions we have added.

Now we can input the expressions. For the first column which is the “salesAvg” we input this expression:

mean = (round(average(column("201801"),column("201802"),column("201803")),2)) * 1000 join(mean, "$")

The “mean” is our temporary variable to keep the average rounded in two decimal points and multiplied by 1000 (to have income in dollars not thousand dollars). Then we have used a “join” function to join the number and the dollar sign.

Next, we input this expression for the “salesDiff” column:

diff = column("201803") - column("201801") if (diff > 0) "Increased" else if (diff < 0) "Decreased" else "Not changed"

Again we have a temporary variable named “diff” which keeps the difference between the income of the first month and the last month. Then we have an If-Else statement to label the ranges.

Now if we apply and save the configurations by clicking on OK button and execute the node, the output would be this:

So as you see we have done two tasks in a single and simple step by using the “Column Expressions” node. One who perhaps has read our previous post about flow variables may ask as we have variable variants of “Math Formula” and “String Manipulation” nodes, does “Column Expressions” have any variable variant? The answer is: Yes, of course! And the name is “ Variable Expressions “.

But you may have the feeling that this example was too straightforward. So let’s make it a bit more complicated and put both variants on the show.

For the new example, let’s suppose we have the same table but this time want to have “salesDiff” column in monthly periods not in total. So we need to calculate the income difference between each month and its previous month and then assign them the labels.

To do this, we use a “ Column List Loop Start “ node and in the configurations of the node we exclude the “product” column. Actually, as the first month has no previous column, one may suggest to exclude the “201801” column as well. But remember this exclusion doesn’t mean we omit the column from the process and indeed we make it included in all iterations. So let’s just exclude the “product” column for now and take care of the “201801” column later.

Now as we need the previous month income as well, let’s use a “ Variable Expressions “ node to create the name of the columns we need and then filter the initial dataset to bring in our desired column into the process in each loop iteration. So connect the output variable port of the “Column List Loop Start” node to the input port of the “Variable Expressions” node, add an expression and name the variable “preMonth”. Now we have to input the expression:

if (length(variable("currentIteration")) == 1) join("20180",variable("currentIteration")) else join("2018",variable("currentIteration"))

In this If-Else statement, first we check whether our current iteration number is one digit long or not. So if it’s a single digit (0 to 9) then we join “20180” and the iteration number, otherwise we join “2018” and the iteration number. Although it won’t make any difference in our case, we have to do this to create the correct column names for the last three months of the year.

After executing the “Variable Expressions” node, we have a new variable named “preMonth” which its value is similar to our income column names except that in each iteration we have the name of the previous month. So now we feed this variable to a “ Column Filter “ node which is connected to the “Table Creator” node we already have. In the configuration window of the “Column Filter” node we choose one column (it doesn’t matter which one) in the inclusion list and then go to the “Flow Variable” tab and below the “column-filter” -> “included_names” there is this option named “0” (which stands for the first included column). We assign the “preMonth” variable to this option, save and close the configuration window and execute the node. Of course in the first iteration of the loop we have an empty table, because there is no column named “201800”.

Then we use a “ Joiner “ node to join the two tables. One coming from the “Column Filter” node containing only one column (or none in the first iteration) and one table coming from the “Column List Loop Start” node containing two columns (the “product” and one of the income columns). The configurations for the “Joiner” node would be simple. We use the “Row ID” as the joining column for both tables and include all the columns.

Now we bring a “Column Expressions” node into the workflow editor and add this expression:

if (variable("preMonth") == 201800) null else { diff = column(variable("currentColumnName")) - column(variable("preMonth")) if (diff > 0) "Increased" else if (diff < 0) "Decreased" else "Not changed" }

So as you see, we have a condition in case of the first iteration which results in a null output. Then we have another If-Else statement which assigns the labels just like our first example. Just in case you are confused with the column(variable(“…”)) structure: for example, in the second iteration of the loop, the “preMonth” variable contains the value of “201801”. So when we say column(variable(“preMonth”)), it means column(“201801”) which then calls the values of this column.

But wait, how about the output column name? Since we are looping over several columns and our output belongs to each month in each loop iteration, we need to dedicate specific column names like “01 Status”, “02 Status” and so on. So we need to create the column names in each iteration. Thus we need some variable manipulation node to create column names using the current column name or the current iteration number. We don’t need to add another new node to our workflow to do the task. We can just get back to our “Variable Expressions” node and add a new expression and name the variable something like “newColumnName”:

join(substr(variable("currentColumnName"), 4, 2), " Status")

Here we have used two functions. The “join” function which joins the result of “substr” function and the “ Status” string. The “substr” function take the current column name and subtracts the last two characters of it (e.g. “01”, “02” …).

We have our column names now so let’s get back to the “Column Expressions” node and in the “Flow Variables” tab under “expressions” -> “element 0”, we can find the “outputName” option. Assign the “newColumnName” variable to this option and click OK.

Let’s execute the “Column Expressions” node and use a “ Loop End (Column Append) “ to close the loop. After executing the workflow we’ll see duplicate columns like “201801 (Iter #1)” or “production (Iter #1)”.

We can remove these duplicates using a “Column Filter” node using the “Wildcard/Regex Selection” mode and this pattern as the regular expression:

((?!Iter).)*

This pattern includes all the columns that their names don’t contain “Iter”.

Now the final output should look like this:

You can download the workflow here.

Originally published at https://blog.statinfer.com on May 25, 2019.

--

--