GETTING STARTED | FILTERING | KNIME ANALYTICS PLATFORM

All Row Filters of KNIME

One concept, many useful nodes

Roberto Cadili
Low Code for Data Science
11 min read · Oct 11, 2021


Co-Author: Rosaria Silipo

“Come on … that’s easy! It’s just a Row Filter!”

Doesn't that sound familiar? Well, as often as it is said, it is not entirely true. Sure, the row filter concept is easy to grasp, but KNIME offers quite a few different ways to put it into practice. Let's explore them.

Figure 1. Overview of row filter nodes in KNIME.

Row Filters on Nominal Columns

Simple Pattern Matching

This is the most common of all filter modes. For example, let’s extract all people/records with the place of birth “United States” from a census data set. The easiest solution is a Row Filter node, with

  • “Include rows by attribute value” enabled on the left.
  • “Column to test” set to “place of birth”, as the column we want to work on.
  • “use pattern matching” enabled.
  • “United States” as the pattern to search for in the selected column. If the selected column does not contain too many distinct values, a menu with the possible patterns is generated and can be pulled down with the arrow on the right of the textbox containing the pattern; a value can then be selected directly from that list.

If, instead, we would like to extract all records/people born outside the United States, we just need to enable the option “Exclude rows by attribute value” on the left and leave all other settings unaltered.

However, sometimes we find “United States” and sometimes “United States of America” in the data. In this case, we can keep using the Row Filter node by enabling the “contains wild cards” option and searching for the pattern “United States*”, where * is the wild card. This configuration matches all values in the selected column starting with “United States”, including, for example, “United States of America”.

The same result can be obtained using a regular expression, like “(United.+)” and enabling the “regular expression” flag.
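If you prefer to read these patterns as code, here is a minimal plain-Java sketch of the two checks (the variable placeOfBirth is only for illustration and is not part of any node):

String placeOfBirth = "United States of America";                 // example cell value
// wild card "United States*": any value starting with "United States"
boolean wildcardMatch = placeOfBirth.startsWith("United States");
// regular expression "(United.+)": "United" followed by at least one more character
boolean regexMatch = placeOfBirth.matches("(United.+)");

Both checks evaluate to true for “United States of America”.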

By default the pattern search in the selected column is not case sensitive. However, if we want it to be, the “case sensitive match” flag is there for that exact purpose.

Figure 2. Configuration dialog of the Row Filter node to include all rows with the pattern “United States”.

Advanced Pattern Matching

Besides “United States” and “United States of America”, it is also very likely to find “USA” or “US” as birth place values, all indicating the same country. In this case, even a wildcard-based row filter might not be enough. Without resorting to regular expressions just yet, we can use a few more complex rule-based row filters.

The Nominal Value Row Filter node, for example, allows us to select a number of values for filtering, i.e., the equivalent of an OR condition. Whenever any of the values in the Included list is matched, the row is retained.

Note that at the very top of the node configuration window the column needs to be selected in the “Select column” menu; by default, no column is selected. The Excluded list is populated with the unique values found in the selected column. Patterns to be matched can be moved from the Excluded list to the Included list with the Add/Remove arrows or by double-clicking.
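Conceptually, the Included list behaves like a set-membership test: a row is kept if its value appears in the set of included values. A minimal plain-Java sketch of that idea (the variable names are purely illustrative):

java.util.Set<String> included = new java.util.HashSet<>(
        java.util.Arrays.asList("United States", "United States of America", "USA", "US"));
String placeOfBirth = "USA";                       // example cell value
boolean keepRow = included.contains(placeOfBirth); // true: the value is in the Included list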

Figure 3. Configuration window of the Nominal Value Row Filter node to include all patterns referring to the United States as birth place.

Similarly, the Reference Row Filter node includes all rows matching any of the values from a list; in this case, however, the list is provided as an input data table (a dictionary) at the lower input port. In the configuration window, we select the column to be checked for filtering in the data set at the upper input port and the column providing the matching patterns in the data set at the lower input port.

Figure 4. The Reference Row Filter node filters row values based on the entries in an input data table.

Another node that implements complex row filter rules is the Rule-based Row Filter node. With this node, you can implement all kinds of filtering rules, not only OR conditions.

The Rule-based Row Filter node implements rules using the same syntax as the Rule Engine node, that is:

<antecedent> => TRUE / FALSE

Double-clicking column names and operators automatically inserts them in the rule editor. The radio button below the rule editor allows you to keep or drop the rows matching the implemented rules.
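For instance, the OR condition on the different birth place spellings could be expressed as a single rule with the IN operator, sketched here assuming the column is named “place of birth” and the option to keep TRUE matches is selected:

$place of birth$ IN ("United States", "United States of America", "USA", "US") => TRUE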

In a similar fashion, the Rule-based Row Filter (Dictionary) node takes a list of user-defined rules (where each rule is represented by a row) and tries to match them to each row in the input table. If the first matching rule has a TRUE outcome, the row is selected for inclusion; otherwise (i.e., if the first matching rule yields FALSE), it is excluded. If no rule matches, the row is excluded.

Note that these nodes allow mixing nominal, numerical, and other column types in the condition antecedent; unlike the Nominal Value Row Filter node, they are not limited to nominal columns.

Figure 5. The Rule-based Row Filter node allows you to write customized filtering expressions mixing columns of different types.

Similarly, complex filtering rules can also be implemented with the Java Snippet Row Filter node, if you are a Java expert. Here, the filtering condition must return a Boolean value and rows are kept if the condition evaluates to TRUE. The node offers the same Java editor as the Java Snippet node. The following code might be one way to implement the rules from the Rule-based Row Filter node.

// keep the row only if the birth place matches one of the US spellings
boolean response = false;
if ($place of birth$.equals("United States") ||
    $place of birth$.equals("United States of America") ||
    $place of birth$.equals("USA") ||
    $place of birth$.equals("US"))
    response = true;
return response;

Notice that the Java Snippet Row Filter node, too, is not restricted to nominal columns, but can include all types of columns in the condition antecedent.

Numerical Columns

Range Checking

Let's suppose now that we prefer to filter our data set by age rather than by place of birth. For example, let's try to extract all records/people aged between 30 and 45 years old.

For this kind of filtering, we can reuse the simple Row Filter node seen before for pattern matching on nominal columns. Since we are now dealing with numerical values, we enable the “use range checking” option and specify the lower and upper bound of the numeric interval. Again, the selected rows can be included into or excluded from the final data set depending on whether the “Include rows by attribute value” or the “Exclude rows by attribute value” flag, respectively, is enabled. Remember that the lower and upper bound values are both included in the matching range.

Figure 6. Configuration window of the Row Filter node used to filter numerical columns based on value range.

The same row filtering for age values included in a given numerical range can be obtained using the Rule-based Row Filter node, which implements rules using the same syntax as the Rule Engine node. In our case the rule would look something like:

$age$ >= 30 AND $age$ <= 45 => TRUE

As before, double-clicking column names and operators automatically inserts them in the rule editor, and the radio button below the editor allows you to keep or drop the rows matching the implemented rules.

Note that this node allows mixing nominal, numerical, and other column types in the condition antecedent.

Similarly, numerical row filtering rules can be implemented using a Java Snippet Row Filter node. Here, the filtering condition must return a Boolean value and rows are kept if the condition matches TRUE.

// keep the row only if the age falls within the 30-45 range (bounds included)
boolean response = false;
if ($age$ >= 30 && $age$ <= 45)
    response = true;
return response;

Notice that the Java Snippet Row Filter node, too, allows including all types of columns in the condition antecedent.

Filtering Space and Time

Geo-Coordinate Row Filter

The Geo-Coordinate Row Filter is an interactive row filter. This means that in the node configuration window the user manually selects an area on a world map; data rows with latitude and longitude coordinates falling inside this area are then included into or excluded from the resulting data set.

Therefore the configuration window of the Geo-Coordinate Row Filter node requires:

  • A polygon drawn on the world map around the area to be included/excluded.
  • The option to include or exclude the rows with latitude and longitude coordinates falling in the selected area.
  • The columns containing latitude and longitude in the input data set.

Figure 7. Selected geo-coordinates in the configuration window of the Geo-Coordinate Row Filter node.

Note. Remember that the Geo-Coordinate Row Filter node is part of the KNIME Open Street Map Integration.

Extract Time Window: Row Filter on Date&Time Columns

Sometimes it is necessary to filter records based on dates. In our census data set, we have a data column for birth dates. Once the column has been imported as String and converted into a Date&Time format by means of a String to Date&Time node, a Date&Time-based Row Filter node can be applied to keep only the records of interest.

If, for example, the goal is to extract all records/people born in the 70s, the configuration of the Date&Time-based Row Filter node shown below can be adopted. Here, records are filtered based on the datetime values contained in the birth date column: more specifically, records are retained if the birth date falls between January 1st, 1970 (start point) and December 31st, 1979 (end point).

Notice that this node operates only on Date&Time columns and accepts start and end points expressed as a date, a time, or both date and time.
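For readers who prefer code to dialogs, the check performed on each row is roughly equivalent to the following plain-Java sketch, where the ISO-formatted string and the java.time.LocalDate type are assumptions made purely for illustration (note that both bounds are inclusive):

java.time.LocalDate birthDate = java.time.LocalDate.parse("1974-05-23");  // example birth date
boolean bornInTheSeventies =
        !birthDate.isBefore(java.time.LocalDate.of(1970, 1, 1)) &&
        !birthDate.isAfter(java.time.LocalDate.of(1979, 12, 31));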

Figure 8. Filter date records between January 1st, 1970 and December 31st, 1979.

Special Filtering Cases

By RowIDs

Another kind of row filtering might need to match RowID values. In this case, a Row Filter node with the “Include rows by row ID” option enabled on the left of the configuration window might just solve the problem. Here the matching pattern can be expressed as a regular expression, as a single RowID, or as the beginning of the RowID value. All of these options can be made case sensitive by enabling the corresponding flag.

In the figure below, all rows with a RowID starting with “Row1” are selected. Since we have 11 rows, only the two rows with RowIDs Row1 and Row10 are selected.
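In code terms, the “starts with” option corresponds to a simple prefix test, sketched here in plain Java with an illustrative variable:

String rowId = "Row10";                         // example RowID
boolean keepRow = rowId.startsWith("Row1");     // true for Row1 and Row10, false for the others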

Figure 9. Configuration window of the Row Filter node to filter by RowID.

By Row Number

If the goal is to keep only some of the rows based on their position in the data set, for example the first 10 rows, the Row Filter node can still help. The option “Include rows by number” on the left keeps the rows positioned between “First row number” and “last row number” in the resulting data set.

Symmetrically, the option “Exclude rows by number” eliminates the rows positioned between “First row number” and “last row number” from the input data set.

Figure 10. Configuration window of the Row Filter node to filter by row number.

Missing Values

Sometimes the filtering operation has the goal of identifying the records with missing values. Continuing our census analysis, the records in our data set might be identified by means of supposedly unique ID labels. A missing ID for a record might prevent the execution of subsequent operations, such as a join.

To identify rows with missing values in a given column, we can resort again to the Row Filter node and activate the last option in the Filter Criteria tab, named “only missing value matches”.

Again, selected rows can be included into or excluded from the final data set depending on whether the flag “Include rows by attribute value” or “Exclude rows by attribute value”, respectively, is enabled.

Exporting Hilited Rows only

This is a very special case of row filtering, associated with the interactive brushing property, called Hiliting in KNIME. When rows have been hilited in a view, for example a scatter plot, an interactive table, a decision tree, etc., the same hilited rows can be exported into a data table with the HiLite Row Filter node. No configuration settings are necessary, since the hiliting property has already been associated with the selected rows in the data set.

In-Database Row Filters

All the previous filters are executed within the KNIME installation. Of course, the same filter operations can also be executed as in-database operations: it is enough to write the appropriate SQL query to perform the filtering, however complex, directly on the selected database.

The DB Query node allows you to write a SQL query in the SQL editor it provides.

However, if the required SQL expertise is not available, another database node can perform row filtering through a graphical dialog, without writing a single SQL statement: the DB Row Filter node.

The configuration window of the DB Row Filter node lets you select the column to work on, the operator to use, and the value for pattern or range checking.

Notice that the same configuration window applies to numerical and nominal columns.

This node gets a SQL query as input and produces a SQL query as output. The output SQL query corresponds to the input SQL query augmented with the row filter SQL statement. In order to execute the SQL query and retrieve the data you need to attach a DB Reader node to the output of the DB Row Filter node.

Remember that the query triggered by the DB Reader node runs on the database and not on the machine hosting the KNIME installation; most of the execution time is spent transferring the data from the database into the KNIME workflow. The DB Row Filter node itself just builds the SQL query; it does not actually process any data.

Figure 11. In-database row filtering using the DB Row Filter node.

Big Data Row Filter

Along with filtering data rows on databases, KNIME offers the possibility to perform filtering operations on fully functional local big data environments such as Apache Spark. After creating a local Spark context using the Create Local Big Data Environment node and transforming the KNIME data table into a Spark DataFrame, we can use the Spark Row Filter node to filter rows directly on the input Spark DataFrame by adding and grouping conditions. Rows that match the conditions are included in the output DataFrame.

This node has a configuration window identical to that of the DB Row Filter node (Figure 11), and in a similar fashion it requires the use of another node, for example the Spark to Table node, to output the filtered rows as a KNIME data table.

Workflow Abstraction Row Filters

The last two row filter nodes can come in handy if we intend to abstract our workflows. The Nominal Row Filter Configuration node provides a value filter configuration option to an encapsulating component’s dialog. This node takes a data table and returns a filtered data table with only the selected values of a chosen column, as well as a flow variable containing those values.

To ease value selection, this node allows us to choose among several selection types, be it a list, a twinlist, check boxes (vertical), or check boxes (horizontal), and to limit the number of visible value options if the corresponding check box is enabled.

Figure 12. Configuration window of the Nominal Row Filter Configuration node to filter data rows that match the value United States.

Similarly, its twin, the Nominal Row Filter Widget node, can be configured in the same way, but it is more appropriate if we wish to create Data Apps and consume them on the KNIME WebPortal.

Are you still thinking “Come on…that’s easy! It’s just a Row Filter”? What is your row filtering use case?

Explore the row filtering nodes in action by downloading these workflows for free from the KNIME Hub:

Happy KNIMEing!
