GETTING STARTED | FILTERING AND SORTING | KNIME ANALYTICS PLATFORM

Pure Data Gold

Understanding Medal Distribution from The Tokyo 2020 Olympics Using the Top k Selector node in KNIME Analytics Platform

John Denham
Low Code for Data Science

--

Introduction: Where Are All My Medals At?

So you want to quickly identify the top and bottom 10 gold medal winning countries from the Tokyo 2020 Olympics. You want to do this efficiently and maybe with the fewest nodes possible.

With the Top k Selector node you can quickly and easily sort and filter your data based on one or even multiple columns. Whether it’s customer purchasing data, items produced or medals won, rather than use a sort node and then a filter node, the Top k Selector combines functionality into a single node that outputs a sorted, filtered result quickly and easily.

In this post, we will describe and apply this node to identify:

  • Top 10 Teams by gold medal wins in Tokyo Olympics 2020.
  • Bottom 10 Teams by gold medal wins in Tokyo Olympics 2020.
  • Unique values vs rows.

The dataset we are using is the Tokyo 2020 Medal Standings, scraped directly from the olympics.com website and included with the workflow. The reference workflow for this blog post, named Top_K_Selector, is available on the KNIME Hub and can be downloaded for free.

The Top K Selector Node

The dust has settled on the Tokyo 2020 Olympics and to prepare for the Winter Olympics in Beijing 2022 you want a ready-made workflow that will organize results. The dataset in front of you includes data on the number of wins by medal type per team, team rank overall in the Tokyo 2020 games and team rank overall by number of medals won.

You might be tempted to just scroll through the dataset and get an overall impression of the medal spread since it really isn’t all that large. Ideally you would really like to quickly identify various top 10 results and have something that you can reuse later with the Beijing 2022 data.

The node we are exploring today is the Top k Selector node (Figure 1). This is a KNIME Base node and should be easy to find after a quick search of the Node Repository.

Figure 1: Top k Selector node.

Specifically, the Top k Selector is organized under Row Transformation (Figure 2).

Figure 2: Top k Selector location in Node Repository.

This node,

“… behaves the same as a combination of the Sorter node followed by a Row Filter that only keeps the first k rows of the table except for the order of the rows which depends on the Output order settings.”

The node configuration allows us to select data by number of rows or unique values and then sort column values in either Ascending or Descending order. After the initial selection we can add an additional sorting criterion to sort a different column within the results after the initial sorting.

Within the Sorting criteria portion of the node configuration dialog the up and down arrows allow us to adjust which column is sorted first, second etc. While the trashcan icon will remove a selected column from sorting.

The Advanced settings tab provides controls around the order of our output data and missing data.

Step 1: Top 10 by Gold Medal Count

The first thing we want to do after reading the data into the workflow is to configure the node to display the top 10 countries by gold medal count. We only want 10 rows though. If there are some countries that share the same number of gold medals those teams will not show in these results.

Double click the node or press F6 to open the node configuration dialog. Since we just want the top ten teams by gold, we set the Number of to rows and the value to 10. Sort by the Gold column in Descending order (Figure 3). This selection means that the top 10 rows from Gold will start from the largest value through the 10th value.

Figure 3: Left: Top k Selector node settings configuration. Right: Top k Selector Advanced Settings.

Additionally, we want the output to display in Descending order, so under the Advanced Settings tab we select ‘Sort output’.

Execute the node by pressing F7 or right-click and select Execute. The output of our configuration is shown below in Figure 4:

Figure 4: Top 10 Output.

This output provides 10 rows of data (remember KNIME is a zero-based tool), sorted in descending order based on the values in the Gold column.

Step 2: Include All Rows Associated With The Top 10 Unique Values

There is a second option available with the Top k Selector node that will not limit the list to just 10 rows, but rather, all rows associated with the top 10 unique values.

What this means is that for the Olympic Medals data, we will see teams that have the same or equal gold medal count. Since this is looking at all rows associated with the top 10 unique gold medal count values, we should expect to see the following gold medal count values:

39, 38, 27, 22, 20, 17, 10, 7, 6, 4

The configuration is almost 100% the same as the previous example. Rather than Number of rows, we select Number of unique values. Everything else remains the same (Figure 5).

Figure 5: Top k Selector with unique values.

After executing the node, the output is what we expected (Figure 6).

Figure 6: Top k Selector output with top 20 unique values.

There are 21 rows of data, from 39 Gold at the largest down to 4 Gold at the smallest. Since we are capturing rows of unique values, we see multiple instances of the same gold medal count. This is a great way to quickly view an expanded set of data. In this case we can see the spread of the top 10 by gold medal count.

Step 3: Top 10 By Gold Medal Count Organized In Ascending Order By RankByTotal

This Olympic Medals dataset is interesting in that two rank columns are included. Rank provides the Team ranking by overall standing based on the kind of medals won, whereas the RankByTotal column is a rank based on the total medals won.

We want to output the top 10 by Number of rows sorted Descending by number of gold medals won, but we also want to organize the output in Ascending order based on the RankByTotal column. As a refresher, the original top 10 output looked like this (Figure 7).

Figure 7: Original Top 10 Output.

The only change we make to the configuration dialog is the addition of the RankByTotal column. Just click Add Sorting Criterion and select the column from the dropdown. Select Ascending, click Ok and execute the node (Figure 8).

Figure 8: Sorting by Two Columns.

This output is still 10 rows but is organized slightly differently (Figure 9).

Figure 9: Output With RankByTotal In Ascending Order.

The bottom 4 Teams shared the same gold medal count (10), so the RankByTotal order was adjusted to sort them in Ascending order after the Descending order of the Gold column was applied. This flexibility helps us understand and make sense of our data in just a few clicks.

Step 4: Bottom 10 Teams

Now say we want to see the bottom 10 number of rows. The configuration dialog looks very much the same. Rather than select Descending, we select Ascending for the Sort by (Figure 10).

Figure 10: Configure For The Bottom 10 Teams.

After executing the node, we see the following (Figure 11).

Figure 11: Bottom 10 Teams.

10 rows of data with the top 10 lowest gold medals earned.

We can also easily visualize the Rank and RankbyTotal for these bottom 10 teams (Figure 12).

Figure 12: Bar Chart Of Bottom 10 By Rank and RankbyTotal.

I touched on this briefly earlier in the article, but in the Advanced Settings of the node, we have a choice of output order. In the above examples I’ve shown the outputs as sorted so they make sense around what we would expect. Below is an example of the Step 1 output as No order and Retain order (Figure 13).

Figure 13: Left: Output No Order. Right: Output Retain Order.

If our data is already sorted how we want, selecting the Retain order would work well here. For the greatest efficiency in the algorithm runtime, you can select No Order and look to see if your data remained in the same order you input (though there is no guarantee this will be the case) and manage from there.

The Thin Red Line: Discussion Of Variable Control Of The Top k Selector

As with most of the nodes in KNIME, every configuration choice can be controlled by a flow variable originating somewhere else in the workflow. Based on Step 1 here is what each flow variable contains as its value. I provide this here as a handy reference to understand the format and names of the variables and what they control in the configuration dialog of the node (Figure 14).

Figure 14: Output Of Variables From Top k Selector Node.

Note specifically for the order Variable:

The Boolean false = descending and the Boolean true = ascending

Of additional note here — the columns and order variables are String and Boolean collection variables as denoted by the brackets[]. In some cases, you can get away with feeding a single String or Boolean variable into the node.

I tested this by using the new Variable Creator node to build these two variables and while they do not output as collection variables, they worked fine to control the Top k Selector with ONE sort column selected (Figure 15).

Figure 15: Creating a Single String Variable to pass into the node columns variable.

If you are controlling the Top k Selector node with variables and need to sort more than one column, you will need to create String and Boolean collection variables so the node will work through the columns and sorting options correctly. In the companion workflow I’ve provided an example of how to do this (Figure 16).

Figure 16: The variable control examples in the example workflow.

Getting Ready for Beijing 2022

Bringing this all together we can build a dynamic report with some up front options that will out top and bottom gold medal numbers by team (Figure 17).

Figure 17: Component GUI View.

As long as your data is formatted the same, you can connect the Beijing 2022 data to this component and get a dynamic report output (Figure 18).

Figure 18: Composite View From Top/Bottom With Bar Charts Component.

Conclusions: Final Notes

In this article, we explored sorting and filtering the Tokyo 2020 Medal Standings dataset with the Top k Selector node. We explored some ways we could understand the ranking spread through row and unique values selections in the node.

We ended with a workflow that provides a few different options to explore the data both visually and through the output tables.

The brilliance of KNIME is how easy it is to change parameters in your configuration and quickly view the new outputs to tinker and tweak until you arrive at the output you need for your project.

--

--

John Denham
Low Code for Data Science

I am a Data Scientist who is passionate about empowering people to make the most of their data. I run the website KNIME.tips.