Which loop is the right one for me?

Armin Ghassemi Rudd
Act of Intelligence Accretion
13 min read · Jul 28, 2020

--

When I listen to the part of “Just One Last Dance” by Sarah Connor where she sings: “Just one last dance, before we say goodbye when we sway and turn round and round and round, it’s like the first time”, it sounds like a loop to me! But when Sarah was recording this song back in 2003, she probably did not know that in a few years there would be a data science platform named KNIME with various types of loops, and that for some of them an iteration is not like the first time.

Loops in KNIME are handy tools that help us handle different tasks. Repeating a process is the simplest and most common use of loops, but loops in KNIME can also handle more complicated tasks than just repeating the same process.

KNIME has several nodes for performing different types of loops. Here we introduce some of the most popular loop start nodes in KNIME.

  • Counting Loop Start
  • Chunk Loop Start
  • Column List Loop Start
  • Generic Loop Start
  • Table Row to Variable Loop Start
  • Group Loop Start
  • Interval Loop Start
  • Recursive Loop Start

First of all, we should know that the loop construct in KNIME consists of a loop start node, the node(s) performing some operation(s) and the loop end node.

There are different loop start and loop end nodes. Some loop start nodes have their own specific loop end nodes, and some use general loop end nodes. You can find most of these nodes in the “Node Repository” under the “Workflow Control” / “Loop Support” category.

Let’s check them one by one:

The Counting Loop Start node does a simple task: it takes an input table and repeats the execution of the nodes within the loop construct a defined number of times. This type of loop can be ended by a Loop End node (which concatenates the outputs), a Variable Condition Loop End node or a Loop End (Column Append) node (which joins the outputs). The last of these is used only in more specific use cases with this type of loop.

As an example of Counting Loop Start node, we can review one of our previous posts about getting the content of a webpage in KNIME. In this example, we have used a Counting Loop Start to create a unique URL in each iteration and extract the content we need. The Loop End node is used to collect the output of each iteration and close the loop.

The configuration of the Counting Loop Start node is straightforward: you only have to set the number of loops.
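For readers who think in code, a counting loop can be pictured roughly as the Python sketch below. It is only an analogy, not how KNIME is implemented: the URL pattern and the `fetch_page` stand-in are hypothetical, and the list that gathers the rows plays the role of the Loop End node.

```python
# Hypothetical URL pattern; the real URLs from the referenced post are not reproduced here.
BASE_URL = "https://example.com/articles?page={page}"


def fetch_page(url: str) -> list[dict]:
    """Stand-in for the nodes inside the loop body (no network access)."""
    return [{"url": url, "content": f"content of {url}"}]


def counting_loop(iterations: int) -> list[dict]:
    collected = []  # plays the role of the Loop End node
    for current_iteration in range(iterations):  # iteration counter starts at 0
        url = BASE_URL.format(page=current_iteration + 1)  # unique URL per iteration
        collected.extend(fetch_page(url))  # concatenate this iteration's rows
    return collected


rows = counting_loop(3)
for row in rows:
    print(row["url"])
```

The only setting is the number of iterations, which mirrors the single option of the node.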

The Chunk Loop Start node lets you loop over the table rows in chunks. In the first iteration, it picks the first n rows and operates on them, then the next n rows, and so on.

You can set

  1. The number of rows per chunk — then the number of chunks would be the total number of rows divided by the number you have set (remainder rows produce an additional chunk)
  2. The number of chunks — then the number of rows per chunk would be the total number of rows divided by the number you have set (rounded to the next integer).
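The arithmetic behind the two settings can be checked with a small Python sketch. The function names are made up for illustration, and the rounding follows the description above rather than the KNIME source code.

```python
import math


def number_of_chunks(total_rows: int, rows_per_chunk: int) -> int:
    # Option 1: fixed rows per chunk; leftover rows form one extra chunk.
    return math.ceil(total_rows / rows_per_chunk)


def rows_per_chunk(total_rows: int, chunk_count: int) -> int:
    # Option 2: fixed number of chunks; chunk size is rounded up.
    return math.ceil(total_rows / chunk_count)


def split_into_chunks(rows: list, size: int) -> list[list]:
    # Slice the table into consecutive chunks of at most `size` rows.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


table = list(range(10))  # ten rows
chunks = split_into_chunks(table, 3)
print(number_of_chunks(10, 3), [len(c) for c in chunks])
print(rows_per_chunk(10, 4))
```

With ten rows and three rows per chunk, the last chunk holds only the single leftover row.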

Like the counting loop, the chunk loop can be ended by using a

  1. Loop End node (very likely), or
  2. Variable Condition Loop End node, or
  3. Loop End (Column Append) node (the application of this one is rare here as well).

As an example, assume we have a table with more than 10 million rows, and we want to write it in Excel format. We know that Excel has a maximum number of rows per sheet, which is about 1 million (1,048,576). In that case, the Chunk Loop Start node is very useful: we can loop over our table in chunks of 1 million rows each and write each chunk to a separate Excel file.

The Column List Loop Start node takes a list of columns (the “Include” list), sets the number of iterations equal to the number of columns in the list, and passes these columns one by one to the iterations. The other columns in the table, which are not on the list, are passed to all iterations. To close this loop we often use the Loop End (Column Append) node, but the Loop End node can also be used under some conditions. Since the table structure differs in each iteration, closing a column list loop with the Loop End node requires that you either rename/resort columns to create the same structure in all iterations or check the “Allow changing table specifications” option in the configuration window of the Loop End node.
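As a rough mental model, the Python sketch below mimics how the included columns are handed out one per iteration while the remaining columns travel along every time. The table, the column names and the doubling step are invented for illustration; the dictionary that gathers results stands in for a Loop End (Column Append) node.

```python
def column_list_loop(table: dict, include: list[str]):
    # Columns outside the "Include" list are passed to every iteration.
    passthrough = {name: values for name, values in table.items() if name not in include}
    for current_column_name in include:
        # Each iteration sees the pass-through columns plus exactly one included column.
        yield current_column_name, {**passthrough, current_column_name: table[current_column_name]}


table = {"id": [1, 2, 3], "height": [1.0, 2.0, 3.0], "width": [4.0, 5.0, 6.0]}

seen = []      # column names visible in each iteration
appended = {}  # results joined column-wise, like Loop End (Column Append)
for name, iteration_table in column_list_loop(table, ["height", "width"]):
    seen.append(sorted(iteration_table))
    appended[name + " (x2)"] = [value * 2 for value in iteration_table[name]]

print(seen)
print(appended)
```

Because every iteration produces a differently named column, joining the results side by side is the natural way to close this kind of loop.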

To have an example for this type of loop, let’s refer to the KNIME forum and check this topic. In it, the user has provided a table which is his desired output:

He has the “Name”, “First purchase year” and “Purchase year” columns and wants to see if the customer is “New” or “Existing” based on the “Purchase year”. So, if the “Purchase year” is equal to the “First purchase year” then the label must be “New” in the column with the name which is the same as the year of the “Purchase year”, otherwise it should be labeled as “Existing”.

The solution is posted in the same topic, where you can download the workflow as well. Here we review the solution:

The first node is a Table Creator to produce our sample dataset:

Then the One to Many node produces the year columns based on the “purchase year”:

After that, we have a Column List Loop Start node in which we have included the year columns (2007, 2008, …) and inside the loop, we have a Column Expressions node with this expression:

if (column("first year") == column("purchase year") && column("purchase year") == variable("currentColumnName")) {
    "New"
} else if (column("first year") != column("purchase year") && column("purchase year") == variable("currentColumnName")) {
    "Existing"
}

This expression checks whether the “first year” and the “purchase year” columns are the same, and whether the “purchase year” is equal to the year of the current column from the list of included columns in our loop start node. After the loop finishes, we have to clean our table with a Column Filter node, since it contains duplicate columns (the columns which were not listed in the loop start node and were passed to all iterations). The final output is the table the user asked for.
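To make the logic of the expression concrete, here is a minimal Python sketch applied to a few made-up rows (not the forum data). The `label` function mirrors the two branches of the expression; when neither branch matches it returns `None`, which corresponds to a missing value in the KNIME table.

```python
def label(first_year: str, purchase_year: str, current_column_name: str):
    # Only the column that matches the purchase year receives a label.
    if purchase_year != current_column_name:
        return None  # neither branch of the expression applies
    return "New" if first_year == purchase_year else "Existing"


# Made-up sample rows using the column names from the expression.
rows = [
    {"name": "A", "first year": "2007", "purchase year": "2007"},
    {"name": "A", "first year": "2007", "purchase year": "2008"},
    {"name": "B", "first year": "2008", "purchase year": "2008"},
]

year_columns = sorted({row["purchase year"] for row in rows})  # like the One to Many columns

result = []
for row in rows:
    labelled = dict(row)
    for year in year_columns:  # one pass per included column
        labelled[year] = label(row["first year"], row["purchase year"], year)
    result.append(labelled)

for row in result:
    print(row)
```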

With the Generic Loop Start node, we can repeat the loop until a condition on one of our flow variables is met. The loop start node itself does not have any specific configuration; you define the condition on the desired flow variable in the configuration window of its dedicated loop end node, the Variable Condition Loop End.

We have an example for Generic Loop Start in our last post, where we used a Try/Catch Errors construct inside a loop to re-execute failed nodes automatically. In this example, the loop repeats an unlimited number of times until the problem that causes the workflow to stop is solved. Then the Variable Condition Loop End node collects the rows from the last iteration and passes the table to the next nodes.
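The retry idea can be sketched in Python as a loop that keeps going until a success flag is set, much like a condition on a flow variable. Everything here is illustrative: `flaky_step` is a hypothetical stand-in for the failing nodes, and the `max_attempts` guard is an addition of this sketch, since the workflow described above repeats without a limit.

```python
def flaky_step(attempt: int) -> list[dict]:
    """Hypothetical loop body that fails on the first two attempts."""
    if attempt < 3:
        raise RuntimeError(f"attempt {attempt} failed")
    return [{"attempt": attempt, "status": "ok"}]


def generic_loop(step, max_attempts: int = 10):
    attempt = 0
    while True:  # Generic Loop Start: no iteration count is configured
        attempt += 1
        try:  # Try/Catch construct around the loop body
            result = step(attempt)
            succeeded = True
        except RuntimeError:
            result = []
            succeeded = False
        # Variable Condition Loop End: stop once the condition variable is true.
        if succeeded or attempt >= max_attempts:
            return result, attempt


result, attempts = generic_loop(flaky_step)
print(attempts, result)
```

Only the rows of the final, successful iteration are returned, which matches collecting rows from the last iteration.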

The Table Row to Variable Loop Start node behaves like the chunk loop set to 1 row per chunk, except that it converts the table values to flow variables. It takes the table rows one by one, converts the values of each row to flow variables, and passes the variables to the nodes inside the loop. The loop start connects through a flow variable port to the first node inside the loop, which takes its data table from another node before the loop.

The loop end nodes can be the

  1. Loop End, or
  2. Variable Condition Loop End, or
  3. Loop End (Column Append)

We have two examples for this loop to give.

Example-1: The first one could be in one of our posts where we make use of Selenium nodes to rank KNIME forum users, and we have used a Chunk Loop Start node to pass different URLs one by one (Rows per chunk is set to 1) and then these values are converted to flow variables using a Table Row to Variable node. Alternatively, we could use a Table Row To Variable Loop Start node to replace the Chunk Loop Start node and the Table Row to Variable node.

Example-2: The second example is the workflow which has been suggested as the solution to an issue in the forum. In this topic, the user wants to write a table to multiple CSV files and name the files after the data groups. The solution is provided in the topic. Let’s review the workflow:

Step-1: The GroupBy node aggregates the table by the grouping column which in this case is a column named “BU Number”.

Step-2: Then using a Table Row To Variable Loop Start node, we loop over the unique “BU Number” values and make use of them in two other nodes: The Create File Name node and the Row Filter node.

Step-3: In the Row Filter node, we filter the table based on the current “BU Number” value, so we pass the rows in groups.

Step-4: In the Create File Name node, we use the current “BU Number” value to name the file in each iteration.

Step-5: At the end, we close the loop using a Variable Loop End node.

Oh wait, it seems we have not mentioned this loop end before. This loop end can be used to close the previously mentioned loops as well (except the generic loop), but it is used in some special cases like the one here. As you see, the last node inside our loop is the CSV Writer node which has no output data ports. We want to close the loop, and we have no specific condition, so we use this loop end. The Variable Loop End node also collects the flow variables in all iterations and converts them to a data table.
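The five steps can be mirrored in a short Python sketch. The sample rows and the `BU_<value>.csv` naming pattern are assumptions made for illustration, and the files are written to a temporary directory so the example cleans up after itself.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample data; only the "BU Number" column name comes from the example.
rows = [
    {"BU Number": "100", "amount": 5},
    {"BU Number": "200", "amount": 7},
    {"BU Number": "100", "amount": 9},
]


def write_groups(rows: list[dict], out_dir: Path) -> list[dict]:
    bu_numbers = sorted({row["BU Number"] for row in rows})  # Step 1: GroupBy
    summary = []
    for bu in bu_numbers:  # Step 2: one iteration per unique value
        group = [row for row in rows if row["BU Number"] == bu]  # Step 3: Row Filter
        path = out_dir / f"BU_{bu}.csv"  # Step 4: file name (hypothetical pattern)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        # Step 5: the loop end only collects variables, not data rows.
        summary.append({"BU Number": bu, "file": path.name, "rows": len(group)})
    return summary


with tempfile.TemporaryDirectory() as tmp:
    summary = write_groups(rows, Path(tmp))

print(summary)
```

The returned summary is the counterpart of the table of collected flow variables that the Variable Loop End node produces.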

With the Group Loop Start node, you can divide rows into groups based on one or more attributes and operate on each group separately. The configuration of the node is very straightforward: we only need to select the grouping columns. We can use all previously mentioned loop ends to close this loop, but the Loop End node is the most likely choice.

If we take a look at our previous example for the Table Row To Variable Loop Start node, we will notice that we can modify the workflow to use the Group Loop Start node instead:

Here we have replaced the GroupBy node, the loop start node and the Row Filter node with a single Group Loop Start node selecting the “BU Number” as the grouping column and added a Table Row to Variable node to convert the “BU Number” value to a flow variable and feed the Create File Name node.
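In code terms, the change is comparable to replacing "find the unique values, then filter" with a direct grouping step. The Python sketch below uses made-up rows and the same hypothetical file name pattern; sorting before grouping is a requirement of `itertools.groupby` and an assumption of this sketch.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"BU Number": "100", "amount": 5},
    {"BU Number": "200", "amount": 7},
    {"BU Number": "100", "amount": 9},
]


def group_loop(rows: list[dict], key: str):
    # Each iteration receives all rows of one group.
    ordered = sorted(rows, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        yield list(group)


file_names = []
group_sizes = []
for group in group_loop(rows, "BU Number"):
    bu = group[0]["BU Number"]  # like Table Row to Variable on the group's first row
    file_names.append(f"BU_{bu}.csv")  # hypothetical naming pattern
    group_sizes.append(len(group))

print(file_names, group_sizes)
```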

The Interval Loop Start node is almost the same as the counting loop, with the difference that instead of setting the number of loops, we enter a start point (“From”), an end point (“To”) and a step size (“Step”). So, for example, if we set 0 as the start point, 10 as the end point and 2.5 as the step, the loop value takes the values 0, 2.5, 5, 7.5 and 10, which gives 5 iterations since both end points are included. These numbers can be doubles or integers. We have access to the values of “From”, “To”, “Step” and the current “loop value” via flow variables inside the loop.
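A small Python generator can show how the loop value evolves. This sketch assumes that both the start and the end point are inclusive, which is how the node description reads to us; the settings 1, 2 and 0.25 are chosen only for illustration.

```python
def interval_loop(start: float, end: float, step: float):
    # Assumption: "From" and "To" are both inclusive.
    if step == 0:
        raise ValueError("step must not be zero")
    value = start
    while (step > 0 and value <= end) or (step < 0 and value >= end):
        yield value  # the current loop value of this iteration
        value += step


values = list(interval_loop(1, 2, 0.25))
print(len(values), values)
```

Under that assumption, the number of iterations is the number of steps that fit into the interval plus one.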

The Recursive Loop Start node behaves a bit differently from the other types. Working exclusively with its loop end, the Recursive Loop End node, this type of loop can send the output of each iteration back to the beginning as the input. The Recursive Loop End node has two input ports. The top port collects data for the final output, and the bottom port feeds the Recursive Loop Start node in the second and subsequent iterations. This loop also has a two-port variant (Recursive Loop Start (2 ports)), which makes the functionality of this loop even more interesting.

The Recursive Loop End has three options to finish the loop:

  • A minimal number of rows which makes the loop keep iterating while the output has at least this number of rows.
  • A maximal number of loops which ends the loop after this number of iterations.
  • The “End loop with variable” option, which makes it possible to finish the loop if the value of the selected flow variable is “true”.
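The feedback mechanism and the three stop options can be summarised in a Python sketch. It is a simplified model under our own assumptions: the parameter names are invented, and the loop body here simply collects the first row and feeds the rest back.

```python
def recursive_loop(table, body, min_rows=1, max_iterations=100, stop_flag=None):
    collected = []  # rows arriving at the top port of the loop end
    iteration = 0
    while True:
        to_collect, feedback = body(table)  # body returns (output rows, rows fed back)
        collected.extend(to_collect)
        iteration += 1
        if len(feedback) < min_rows:  # option 1: too few rows left to continue
            break
        if iteration >= max_iterations:  # option 2: iteration limit reached
            break
        if stop_flag is not None and stop_flag(feedback):  # option 3: variable is true
            break
        table = feedback  # bottom port: input of the next iteration
    return collected, iteration


def take_first(table):
    # Illustrative body: collect the first row, send the remaining rows back.
    return [table[0]], table[1:]


all_rows, iterations = recursive_loop([1, 2, 3, 4], take_first)
limited_rows, limited_iterations = recursive_loop([1, 2, 3, 4], take_first, max_iterations=2)
print(all_rows, iterations)
print(limited_rows, limited_iterations)
```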

Fortunately, we have an excellent example in which both variants of the recursive loop are used. In the KNIME forum (again!), there is a topic in which the user has asked how to group rows based on a specific condition. First, let’s take a look at the dataset:

The condition is: We select rows from top to bottom. When a row is selected, the next row whose “START TIME” is equal to or greater than the “END TIME” of the current row + 5 minutes is selected, and so on. The rows selected this way form one group. When there are no rows left to add to the current group, we go back to the top, select the first row with no group assigned to it, and do the same until there are no rows left without a group. In the table above, four groups are selected as an example. The rows with the same color are in the same group.
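Before looking at the workflow, it may help to see the grouping rule as plain code. The Python sketch below is our own reading of the condition, not the forum solution, and the five sample rows are invented.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)


def t(clock: str) -> datetime:
    # Helper for the made-up sample data: "HH:MM" on an arbitrary day.
    hour, minute = map(int, clock.split(":"))
    return datetime(2020, 1, 1, hour, minute)


def assign_groups(rows: list[dict]) -> list[int]:
    remaining = list(range(len(rows)))  # indices of rows without a group
    group_of = {}
    group_id = 0
    while remaining:  # outer loop: one pass per group
        current = remaining[0]  # first row that has no group yet
        members = [current]
        for idx in remaining[1:]:  # inner loop: walk down the table
            if rows[idx]["start"] >= rows[current]["end"] + GAP:
                members.append(idx)
                current = idx  # the newly selected row becomes the reference
        for idx in members:
            group_of[idx] = group_id
        remaining = [idx for idx in remaining if idx not in members]
        group_id += 1
    return [group_of[idx] for idx in range(len(rows))]


rows = [
    {"start": t("09:00"), "end": t("09:10")},
    {"start": t("09:05"), "end": t("09:20")},
    {"start": t("09:15"), "end": t("09:30")},
    {"start": t("09:25"), "end": t("09:40")},
    {"start": t("09:35"), "end": t("09:50")},
]

groups = assign_groups(rows)
print(groups)
```

The outer `while` loop corresponds to gathering the groups and the inner `for` loop to picking the members of one group, which is the same split into two nested loops that the workflow uses.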

There are two solutions provided for this issue in the forum. We are going to review one of them here:

First, by using a Date&Time Shift node, the end time values are shifted by 5 minutes and saved in a new column named “END TIME (shifted)”. After that come two recursive loop start nodes: first a single-port loop start node and then a 2-port loop start node. The first loop gathers all the groups, and the second one (nested inside the first) selects the members of each group.

Let’s see how the loops work:

First, the second loop (the two ports variant): In the first iteration the whole dataset goes to the top port, and then the first row is filtered and passed to the Table Row to Variable node.

Then the data from the second port that includes the complete data set in the first iteration goes to the Rule-based Row Filter node and based on the condition that we mentioned earlier, rows are filtered, and then the first one is selected by the Row Filter node.

The output of the second Row Filter node (Node 19 in the picture) goes to the second input port of the loop end (Node 21) which (for the next iterations) will feed the first output port of its loop start (Node 20) and the whole dataset goes directly to the third port which (for the next iterations) will feed the second port of the loop start (Node 20).

The output of the first Row Filter node (Node 16) goes to the first input port of the loop end, which is collected as the output.

Now, the first loop (the single-port variant): a Reference Row Filter node excludes the output of the inner loop construct from the initial dataset and sends the remaining rows back to its loop start node (Node 15). In the configuration window of the last loop end node (Node 22), we check the “Add iteration column” option, so this number will represent the group IDs.

This list ends here, but KNIME offers more loops. If you check your “Node Repository” under the “Other Data Types” / “Time Series” / “Transformation” category, you will find another loop start node named Window Loop Start. This loop lets us loop over our input in chunks defined by a window size and a step size, based either on rows or on time. For example, we can have five rows in each iteration (window size) and move the window by two rows in each iteration (step size), or we may choose to have all the instances within every 5 minutes and move the window by 2 minutes. This topic in the KNIME forum is an example of a use case of this type of loop.
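The row-based variant with a window size of five and a step size of two can be pictured with the Python sketch below. It only yields complete windows, which is a simplifying assumption of this sketch; how the node treats incomplete windows at the edges of the table depends on its configuration.

```python
def window_loop(rows: list, window_size: int, step_size: int):
    # Slide a fixed-size window over the rows, moving `step_size` rows each time.
    # Assumption: only complete windows are produced.
    for start in range(0, len(rows) - window_size + 1, step_size):
        yield rows[start:start + window_size]


rows = list(range(1, 10))  # nine rows, numbered 1 to 9
windows = list(window_loop(rows, window_size=5, step_size=2))
for window in windows:
    print(window)
```

Unlike the chunk loop, consecutive windows overlap whenever the step size is smaller than the window size.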

There are still more types of loop in KNIME which would be handy in specific use cases, and you can find them by searching your node repository, KNIME Hub or NodePit.

Originally published at https://blog.statinfer.com on July 28, 2020.
