Rule the web with Selenium nodes in KNIME

Armin Ghassemi Rudd
Act of Intelligence Accretion
8 min read · Mar 17, 2022


Take your web scraping to the next level with the help of Selenium nodes in KNIME

In the previous post, we saw an example of web scraping in KNIME. Now we will take web scraping to the next level by utilizing Selenium nodes in our KNIME workflow.

Selenium nodes are the tools for web scraping, task automation, and application testing in KNIME. In this article, we are going to demonstrate the power of the Selenium nodes and show how they let you rule the web from KNIME. So let’s get started with the first use case: ranking KNIME Forum users.

If you are a KNIME Forum user, you have certainly noticed how active this forum is. Users ask questions and often get answers within a few hours or even minutes. So how about ranking the users to find the most helpful members of the forum? By doing this, we can encourage all users to become even more active. To do so, we need user statistics like those already available on the users’ statistics page.

So how can we collect the data we need without using the forum’s API, a feature that not every website provides? This is one of the simpler problems you will face when trying to obtain information from websites that use modern technologies and services. By utilizing the Selenium nodes in KNIME, we can easily resolve this kind of issue. Let’s get back to our use case and start building a workflow to rank forum users:

The users’ statistics page provides data for six different periods: Today, Week, Month, Quarter, Year, and All Time. The URLs for these pages are:

https://forum.knime.com/u?period=daily
https://forum.knime.com/u?period=weekly
https://forum.knime.com/u?period=monthly
https://forum.knime.com/u?period=quarterly
https://forum.knime.com/u?period=yearly
https://forum.knime.com/u?period=all

First, we use a “Table Creator” node, input the URLs into a column named “url”, and enter the periods into another column named “period”.

Now we need a “Chunk Loop Start” node to loop over the URLs one at a time, so we set the “Rows per chunk” option to 1 and execute the node. Then we use a “Table Row to Variable” node to feed the URLs to our main flow, which contains the Selenium nodes.

In a new flow, we use a “WebDriver Factory” node, choose a WebDriver (e.g., Chrome), and then add a “Get Pooled WebDriver” node. Why “Get Pooled WebDriver”? Because we want to loop over the URLs, and using a pooled WebDriver improves the performance of our workflow by keeping the WebDriver alive across loop iterations.

Let’s feed the output (variable port) of our “Table Row to Variable” node to the “Get Pooled WebDriver” node and, in the “Flow Variables” tab of its configuration window, assign the “url” value to the “urlInput” option.
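For readers who know the Selenium Python bindings, the pooled-driver idea amounts to reusing a single browser for all of the loop’s URLs instead of starting a fresh one each iteration. This is only a sketch of that pattern (the function name and Chrome choice are ours; KNIME handles the pooling for you):

```python
def scrape_periods(urls):
    """Fetch several pages with one reused browser, roughly the pattern
    the "Get Pooled WebDriver" node implements inside a KNIME loop."""
    from selenium import webdriver  # imported lazily so the sketch loads without Selenium installed

    driver = webdriver.Chrome()  # one browser for the whole loop
    try:
        pages = []
        for url in urls:
            driver.get(url)                   # one navigation per loop iteration
            pages.append(driver.page_source)  # page HTML for downstream scraping
        return pages
    finally:
        driver.quit()  # closed once, after the loop finishes
```

Starting and tearing down a browser is by far the most expensive step, which is why the pooled variant pays off when looping.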

Now we are ready to start scraping! But first, let’s wait a few seconds using a “Wait” node to make sure the page has loaded completely. OK, let’s dive into the “Score” Metanode, which contains all the nodes that score and rank users. To calculate the score, we will use the number of likes received, the number of replies, and the likes-per-reply rate.

In the “Score” Metanode, we split the flow into four (and then three more) flows. Let’s start with the top flow. First, a “Find Elements” node finds the usernames. We select the element by its XPath (we have several options here, but in our case, XPath is a good choice). You can find the XPath just as we demonstrated in this article. After finding the XPath (we need to remove the counter on the “tr” element in the path to select all instances of the element in the list; check the image below) and inputting it into the “Find Elements” node (which selects an “a” element), we use an “Extract Attribute” node to extract the text attribute of the element we found. This text is the username we are looking for. After that, a “Column Filter” node excludes all columns other than the usernames.
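The counter-removal step can be illustrated with a short snippet. The XPath below is hypothetical, not the forum page’s actual markup:

```python
import re

# A hypothetical XPath as copied from the browser's developer tools;
# it points at the username link in the third row only.
xpath = "//table/tbody/tr[3]/td[1]/span/a"

# Dropping the positional counter on the "tr" element makes the
# expression match the corresponding cell in every row of the table.
generalized = re.sub(r"tr\[\d+\]", "tr", xpath)
print(generalized)  # //table/tbody/tr/td[1]/span/a
```

The same edit can of course be done by hand before pasting the XPath into the “Find Elements” node.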

The second flow from the top extracts the number of likes received. For this one, after selecting the “span” element (which contains the number) in the “Find Elements” node, we use an “Extract InnerHTML” node to extract the number. Again, a “Column Filter” node excludes the unnecessary columns.

And in the third flow, we do the same as in the previous one, but this time for the number of replies.

The fourth flow, beginning with a “Wait” node, first sorts the list by the number of replies and then does the same as the first three flows. This way we capture the top fifty users (the number of users listed per page) by both number of likes and number of replies. To sort the list by the number of replies, we use a “Find Elements” node to select the column header named “Replies” (an “a” element) and then a “Click” node to click this header and sort the list.

Then two “Joiner” nodes and a “Concatenate” node join the usernames with the numbers of likes and replies and then concatenate these two lists. The “GroupBy” node removes duplicate usernames. The “Column Rename” node renames the columns to “user”, “likes”, and “replies”. The “String Manipulation” nodes convert numbers like 1.9k to 1900 (this article covers using regular expressions in KNIME):

regexReplace(regexReplace($likes$, "\\.", ""), "k", "00")
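As a sanity check, the same two chained replacements can be reproduced in Python (the function name `parse_count` is ours, not part of the workflow):

```python
import re

def parse_count(s):
    # Same two chained replacements as the nested regexReplace calls:
    # first drop the decimal point, then expand the "k" suffix to "00".
    return int(re.sub("k", "00", re.sub(r"\.", "", s)))

print(parse_count("1.9k"))  # 1900
print(parse_count("842"))   # 842 (values without a suffix pass through)
```

Note that this works for “k” values with one decimal digit, like 1.9k; a plain 2k would need “000” instead.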

The “String to Number” node converts the likes and replies columns from string to number. The first “Math Formula” node calculates the likes-per-reply rate using this formula:

$likes$ / $replies$

The “Math Formula (Multi Column)” node calculates a factor for each field; this factor balances our parameters. The formula to calculate the factor (the output columns are named by appending “_m” to the original names):

100 / COL_MAX($$CURRENT_COLUMN$$)

The “Missing Value” node converts missing values to zero. When the number of replies is zero, the rate column has a missing value, so we convert it to zero so that we can still score based on the number of likes.

The last “Math Formula” calculates the score:

(($likes$ * $likes_m$) + ($replies$ * $replies_m$) + ($rate$ * $rate_m$)) / 3

Each parameter is multiplied by its factor, and the sum of these values is divided by 3, so the score is a number between 0 and 100.
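The whole scoring pipeline (rate, balancing factors, final score) can be sketched in a few lines of Python. The usernames and numbers below are made up for illustration:

```python
# Hypothetical sample data, not actual forum statistics.
users = [
    {"user": "alice", "likes": 1900, "replies": 500},
    {"user": "bob",   "likes": 400,  "replies": 100},
    {"user": "carol", "likes": 0,    "replies": 0},
]

# Likes-per-reply rate; a zero-reply user would divide by zero, so we
# emit 0 instead, mirroring the Missing Value node.
for u in users:
    u["rate"] = u["likes"] / u["replies"] if u["replies"] else 0.0

# Balancing factor per column: 100 / COL_MAX, as in Math Formula (Multi Column).
factors = {c: 100.0 / max(u[c] for u in users) for c in ("likes", "replies", "rate")}

# Final score: average of the three balanced parameters, always in [0, 100].
for u in users:
    u["score"] = sum(u[c] * factors[c] for c in ("likes", "replies", "rate")) / 3

top = max(users, key=lambda u: u["score"])
print(top["user"], round(top["score"], 1))  # alice 98.3
```

Because each column is scaled so its maximum contributes exactly 100, no single parameter can dominate the score just by having a larger range.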

And finally, the “Rank” node ranks users based on their scores in “Ordinal” mode, which generates unique rank numbers.

Well, our job is done! The rest of the workflow is straightforward: a “Row Filter” node to keep the top ten users, a “Constant Value Column” node to add the period using the flow variable we produced in the first steps, a bit of cleanup, and the “Loop End” node. Then we create a “User-Score” column with a “String Manipulation” node by applying this expression:

join($user$, " (", $Score$, ")")

And the “Pivoting” node groups on the “rank” column, pivots on the “period” column, and applies the “First” aggregation method to the “User-Score” column.

You can read about this workflow (and download it) in this forum topic.

But wait! We are not done yet. Everything we have discussed so far covers only half of the power of the Selenium nodes (and although our example was straightforward, you can use these powerful nodes in much more complicated cases). Reading data from the web, as we did, is just one side of the coin. How about writing data to the web?

There is a topic on the KNIME Forum in which a user asks how to upload files to a website using the Selenium nodes. Let’s see how:

In this example, we are going to upload a CSV file to a GitHub repository. Just like in the previous example, we start with a “WebDriver Factory” node, but this time we follow it with a “Start WebDriver” node, since we are going to do the whole process in a single run.

We set the URL to the GitHub login page in the configuration window of the “Start WebDriver” node and execute it. Then we select the “username” input box with a “Find Elements” node (this time we use the ID value to find the element) and send the value with a “Send Keys” node. We do the same for the “password” input box and finally select the “Sign in” button and click it, just as we did in the previous example.

After logging in, we navigate to the repository page with a “Navigate” node, then select the “Upload files” button and click it. Now we are on the page where we can either drag and drop our files or select them manually.

We find the element for the “choose your files” link, which is an “input” element of type “file”, select it with a “Find Elements” node, and use a “Send Keys” node to send the path of the file we want to upload. After waiting a few seconds (with a “Wait” node) for the upload to finish, we find the “Commit changes” button and click it, and the job is done! Finally, we close the browser with a “Quit WebDriver” node.
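For comparison, the same node sequence looks roughly like this in the Selenium Python bindings. The element IDs and selectors below are assumptions about GitHub’s markup (which changes over time), so treat this as a sketch rather than a working script:

```python
def upload_csv_to_github(repo_url, username, password, file_path):
    """Sketch of the node sequence above; selectors are illustrative only."""
    import time
    from selenium import webdriver  # imported lazily so the sketch loads without Selenium installed
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # WebDriver Factory + Start WebDriver
    try:
        driver.get("https://github.com/login")
        driver.find_element(By.ID, "login_field").send_keys(username)  # Send Keys
        driver.find_element(By.ID, "password").send_keys(password)     # Send Keys
        driver.find_element(By.NAME, "commit").click()                 # Click "Sign in"

        driver.get(repo_url + "/upload")                               # Navigate
        # The file <input> accepts a local path directly; no drag-and-drop needed.
        driver.find_element(By.CSS_SELECTOR, "input[type=file]").send_keys(file_path)
        time.sleep(5)                                                  # Wait for the upload
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()  # Commit changes
    finally:
        driver.quit()                                                  # Quit WebDriver
```

Sending a local file path to a file-type input with “Send Keys” is the standard Selenium trick for uploads, and it is exactly what the KNIME nodes do here.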

So easy, right? Of course, you can use this method in many different use cases (the forum user mentioned that they use the Selenium nodes to upload about 200 files to a website automatically every day).

You can find the workflow in this post.

This video thoroughly explains how to build the workflow:

Originally published at https://blog.statinfer.com.

