DATA STORIES | WEB SCRAPING | KNIME ANALYTICS PLATFORM

Master Web Scraping in KNIME: Extract Web Data like a Pro

Three web scraping scenarios explained

Laura Sandino Perdomo
Low Code for Data Science

--

Written by: Laura Sandino Perdomo, Edgar Diaz, Yanith Gomez
IQuartil SAS, Colombia

Did you know that it is possible to automatically collect specific information from a website? In other words, if you want to download a set of articles, posts, news items, indicators, or other relevant content from a given page, you can do it systematically through a process called Web Scraping.

Web Scraping is a set of techniques used for the automated extraction of data from websites. This article shows KNIME's capabilities as a tool for this purpose and the flexibility it offers when searching web pages for the information you want.

Web Scraping process.

KNIME: an open-source tool with a graphical, intuitive environment that supports users throughout the entire analytical process

KNIME (Konstanz Information Miner) is an open-source data analysis platform that allows the user, through workflows, to perform ETL tasks, design Machine Learning and Deep Learning models, and even build graphical reports or load results into other tools.

In KNIME, workflows are sequential pipelines composed of nodes, where each node performs a specific operation on the data. For example, there are nodes for reading files (one for each file type), nodes for training models, and nodes for writing results, among others. Moreover, KNIME can integrate programming languages such as Python, R, Java, and SQL.

KNIME can also be used for Web Scraping, a purpose fulfilled by the Selenium nodes. In Uniting Technology and Business Strategy: Obtain Valuable Data from the Web in a Simple Way, written by Ángel Molina Laguna, you can learn more about the nodes that compose the extension and its main features.

Example of workflow in KNIME for Web Scraping.

Dynamic Interaction in Web Scraping: Flexibility to Obtain and Structure Detailed Information

Standard Web Scraping

We will conduct a web scraping exercise in KNIME using a job page called Fake Python, a fake job board built for programming practice. The objective is not only to extract information from the cards on the main page, such as the position, company, location, and date, but also to gather the job description that appears only when clicking the Apply link for each job.

Web Scraping exercise for a job page.

In the following video, you will see the KNIME workflow that accomplishes this task for each of the jobs listed on the cards.

The workflow consists of three steps:

  1. It accesses the web page and extracts information from the main page using Selenium nodes.
  2. It iteratively enters each job description and extracts the relevant details.
  3. It compiles all the extracted information into a table, with each variable collected becoming a column.
Web Scraping Exercise 1 in KNIME.
Results are tabulated for each item of information.
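For readers who want to see the equivalent programmatic logic, the card extraction step can be sketched in plain Python with the standard library. The HTML snippet below is an assumption modeled on typical job-card markup (the real Fake Python page may differ), and in the KNIME workflow the Selenium nodes perform this parsing without any code.

```python
from html.parser import HTMLParser

# Sample HTML mimicking the structure of the Fake Python job cards
# (the exact markup is an assumption for illustration).
SAMPLE_HTML = """
<div class="card">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Payne and Davis</h3>
  <p class="location">Stewart, NJ</p>
  <time>2021-04-08</time>
</div>
<div class="card">
  <h2 class="title">Energy Engineer</h2>
  <h3 class="company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
  <time>2021-04-08</time>
</div>
"""

class JobCardParser(HTMLParser):
    """Collects position, company, location, and date from each job card."""
    # Map the assumed CSS class of each element to a column name.
    FIELDS = {"title": "position", "company": "company", "location": "location"}

    def __init__(self):
        super().__init__()
        self.rows = []          # one dict per card -> one row of the result table
        self._current = None    # the row currently being filled
        self._field = None      # column the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and "card" in cls:
            self._current = {}          # a new card starts a new row
            self.rows.append(self._current)
        elif self._current is not None:
            self._field = "date" if tag == "time" else self.FIELDS.get(cls)

    def handle_data(self, data):
        if self._current is not None and self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None

parser = JobCardParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:     # each collected variable becomes a column
    print(row)
```

This mirrors steps 1 and 3 of the workflow: locate each card, pull out its fields, and collect everything into a table with one column per variable.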

Web Scraping with a scheduled interaction

Sometimes, a higher level of user interaction is required to extract information. The following example simulates rolling a die by clicking the “Roll again” button. The objective is to collect the resulting value of each roll and repeat this process ten times.

Web Scraping exercise for dice rolling.

As in the previous exercise, the workflow consists of three steps:

  1. It accesses the web page through Selenium nodes.
  2. A clicker is used to activate the “Roll again” button and capture the resulting value.
  3. Each value obtained from the interaction with the clicker is collected and compiled into a table.
Web Scraping Exercise 2 in KNIME.
Results of the ten dice rolls.
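The click-and-capture loop above can be sketched in Python. The `DicePage` class below is a hypothetical stand-in for the web page (a random value replaces the real roll), since the point is the interaction pattern: press the button, read the displayed value, repeat ten times.

```python
import random

class DicePage:
    """Stand-in for the dice page: clicking 'Roll again' yields a new value.
    (In the KNIME workflow, a Selenium click performs this interaction.)"""
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.value = self._rng.randint(1, 6)  # value shown on first load

    def click_roll_again(self):
        self.value = self._rng.randint(1, 6)  # the page displays a new roll

page = DicePage(seed=42)
rolls = []
for _ in range(10):           # repeat the interaction ten times
    page.click_roll_again()   # step 2: press the button
    rolls.append(page.value)  # ...and capture the resulting value
print(rolls)                  # step 3: the collected column of ten values
```

In KNIME the same loop is expressed visually: a loop start node, a click node, an extractor node, and a loop end node that appends each captured value to the result table.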

Web Scraping with Login

Another type of interaction between the user and the web page that can increase the complexity of Web Scraping is when a login is required to access the information. In the following example, we must log in to an application to access information about three Greek gods.

Web Scraping exercise with login.

As in the previous exercises, the workflow consists of three steps:

  1. Unlike the initial steps of the previous exercises, we first log in using two Text Sender nodes (one for the username and one for the password), which grants us access to the information.
  2. We navigate through the application to extract the information from each link.
  3. The collected information is consolidated into a table.
Web Scraping Exercise 3 in KNIME.
Results: information about the Greek gods.
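The three steps above can also be sketched in Python. Everything in the `GodsApp` class below is a hypothetical stand-in for the real application (the credentials, link names, and descriptions are illustrative only); it exists to show the pattern of sending credentials before navigating the links and consolidating the results.

```python
class GodsApp:
    """Stand-in for the web application: a login gate in front of three links.
    (In KNIME, two Text Sender nodes fill in the credentials.)"""
    _DATA = {  # hypothetical content behind each link
        "Zeus": "God of the sky and thunder",
        "Poseidon": "God of the sea",
        "Hades": "God of the underworld",
    }

    def __init__(self):
        self._logged_in = False

    def login(self, username, password):
        # Illustrative credentials; a real app validates them server-side.
        self._logged_in = (username == "student" and password == "secret")
        return self._logged_in

    def links(self):
        if not self._logged_in:
            raise PermissionError("login required")
        return list(self._DATA)

    def open_link(self, name):
        if not self._logged_in:
            raise PermissionError("login required")
        return self._DATA[name]

app = GodsApp()
# Step 1: send the username and the password to gain access.
assert app.login("student", "secret")
# Steps 2-3: visit each link and consolidate the results into a table.
table = [{"god": g, "description": app.open_link(g)} for g in app.links()]
for row in table:
    print(row)
```

Note that both `links()` and `open_link()` refuse to answer before login, which is exactly why step 1 must precede the extraction steps in the workflow.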

Easy dynamic web scraping with KNIME

Throughout this article, we have demonstrated KNIME’s ability to perform web scraping, even on web pages with varying levels of complexity in accessing information.

The tools provided by KNIME for web scraping, along with their flexibility, allowed us to tackle each case and successfully extract the required information. Finally, this information was consolidated into tables for further analysis.
