Web Scraping 101
The quickest and cleanest way to grab information for your next project.
What is web scraping?
Nowadays, the world is driven by information in every aspect. Having accurate data at the right moment is a fundamental ingredient of success for any project. The most common way of fetching data is to surf the internet and extract it from web pages. Suppose you need to collect a huge amount of data that is readily available across a large number of web pages. If the pages had to be read manually and the data copied into a spreadsheet or document, processing all of them would take a considerable amount of time. Since we can automate processes with programming, this automation of data extraction from web pages is known as web scraping. It comes in handy when you want to quickly grab unstructured data from many different sources and compile it into a consolidated format from which you can draw insights.
When is web scraping required?
To programmatically collect data from the web, one must work carefully on the scraping script to make it a success. Since the web embraces change, the scraping process is inevitably fragile, and considerable effort must be invested in maintaining such a scraper. If you only need to collect a few paragraphs of data, you could easily copy them manually without a scraper. But if you have to go through thousands of pages, and the data must be collected repeatedly as it updates, then it is the right time to invest in a scraping project. You will save a lot of time on data collection, and the accuracy of the collected data will be higher: almost all the human errors that creep in when copying data from source to destination are eliminated.
Whatever you see on a webpage can ideally be extracted programmatically, and sometimes what you do not see as well.
How to identify the correct toolset?
Before diving into the actual steps of the process, the nature of the extraction task should be analyzed in order to choose the correct tools for the job.
The first thing to consider is whether the source web page requires any kind of user interaction to reveal the information being sought once the page is first visited. For instance, the user may need to sign in to a personal account to fetch the data; even after that, a series of searches, button clicks, and page scrolls may be required to arrive at the data of interest. Or perhaps none of those interactions is needed, and the data is available just by visiting a known URL.
- If any user interaction is required, browser automation-supported tools should be considered.
- If no interaction is required, less sophisticated tools can be considered.
If a browser automation tool is selected, all the components required to load HTML pages, interact with elements, and extract data from the DOM come bundled as a one-stop solution. Selenium, which offers bindings for several programming languages, and Cypress are two examples of browser automation tools.
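To make this concrete, here is a minimal Selenium sketch in Python of signing in and then reading data from the DOM. The URL, element locators, and credentials are hypothetical placeholders, not a real site.

```python
# A minimal Selenium sketch: sign in, then extract text from the DOM.
# The URL, locators, and credentials below are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")
    driver.find_element(By.NAME, "username").send_keys("my-user")
    driver.find_element(By.NAME, "password").send_keys("my-pass")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

    # After signing in, navigate to the page of interest and read it.
    driver.get("https://example.com/reports")
    for row in driver.find_elements(By.CSS_SELECTOR, "table.report tr"):
        print(row.text)
finally:
    driver.quit()
```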
If no browser automation tool is required, then separate solutions are needed for issuing network requests and for parsing the HTML DOM.
The tools for those requirements will vary with your choice of programming language; each ecosystem has its own libraries for both tasks.
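In Python, for example, the requests library can issue the network requests and Beautiful Soup can parse the returned HTML. A minimal sketch, with a hypothetical URL and selector:

```python
# Fetching and parsing without a browser: requests + Beautiful Soup.
# The URL and CSS selector are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.select("h2.article-title"):
    print(heading.get_text(strip=True))
```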
The Process
After analyzing the requirement, the scraping process can be organized as follows.
- Identify the entry point of the scraping job. Usually this is a URL to a web page where all the information exists, or a list of URLs to be crawled sequentially or in parallel.
- Access the web page at that URL using an HTTP client or an automated browser tool, and interact with the content if required.
- Once the content holding the desired information has loaded, read the HTML markup using an HTML parsing tool. Most of these tools provide mechanisms to target content within the page using CSS selectors or XPath expressions.
- Once the HTML elements are targeted, the text content and other data within those elements can be extracted, transformed as necessary, and persisted in a convenient data structure for further processing.
The above steps can be repeated as necessary to gather all the data from different source URLs.
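To tie the steps together, here is a minimal end-to-end sketch in Python using requests and Beautiful Soup; all URLs and selectors are hypothetical placeholders.

```python
# An end-to-end sketch of the four steps above: fetch each page,
# parse it, target elements with CSS selectors, and persist the data.
import csv
import requests
from bs4 import BeautifulSoup

urls = [  # 1. entry points
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
]

records = []
for url in urls:
    html = requests.get(url, timeout=10).text      # 2. access
    soup = BeautifulSoup(html, "html.parser")      # 3. parse
    for item in soup.select("div.product"):        # 3. target
        records.append({                           # 4. extract + persist
            "name": item.select_one("h3").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        })

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)
```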
Beware of Caveats
Even though the process seems straightforward, unintended events may occur while web scraping. Let's go through some of the events that can challenge the development of a successful scraping job.
Web pages are usually meant to be searched and read by human beings, not by programs (search engine crawlers aside). If a scraping job floods a server with too-frequent requests, the scraper will be blocked. To avoid this, always honor the robots.txt specification of a website, which explains how crawlers and scrapers should access the content and what the usage limits are. To scrape web content safely, introduce proper pauses in the scraping process so that you do not flood the destination servers with too many requests and so that you imitate a human interaction pattern. Spreading requests among web proxies is also good practice for avoiding blocks in the event that your IP address is flagged as a suspected scraper.
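As a sketch of both practices, the Python standard library ships a robots.txt parser, and randomized pauses are a one-liner. The URLs and user-agent name below are hypothetical.

```python
# Honor robots.txt and pace requests, using only the standard library.
# The site, paths, and user-agent name are hypothetical placeholders.
import time
import random
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not parser.can_fetch("my-scraper", url):
        continue  # the site asks crawlers to stay away from this path
    # ... fetch and process the page here ...
    time.sleep(random.uniform(2, 5))  # pause to mimic human pacing
```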
Another common and inherent challenge in a scraping job is selecting HTML elements from the loaded DOM. Defining element selectors is not hard in itself, but the uniqueness and consistency of an element's position within the HTML DOM may not be preserved even across the same kind of web pages. You may therefore have to think outside the box to construct robust element selectors that will not break abruptly in the middle of a scrape.
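One possible approach, sketched below, is to define a prioritized list of candidate selectors and fall back from the most specific to the loosest. The selectors themselves are made-up examples.

```python
# Robust element selection via fallback CSS selectors: try each
# candidate in order of preference until one matches.
from bs4 import BeautifulSoup

def select_first(soup, candidates):
    """Return the first element matched by any candidate selector."""
    for selector in candidates:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None

soup = BeautifulSoup("<div id='price'>42 USD</div>", "html.parser")
price = select_first(soup, [
    "#price",              # most specific: a stable id
    "div.price-box span",  # fallback: structural position
    "div",                 # last resort: very loose match
])
print(price.get_text(strip=True) if price else "not found")
```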
When it comes to interactive scraping, unintended pop-ups, banners, and other elements may appear, blocking the content you want to interact with. In such scenarios, exceptions are thrown reporting that the element is not interactable. These challenges are unavoidable because we cannot predict such appearances. The practical way to handle them is to observe and identify the blockers, and conditionally remove those elements from the loaded page when the exceptions are thrown.
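As an illustration, here is a Selenium sketch that catches the interception exception, removes a hypothetical overlay via JavaScript, and retries the click. The overlay selector and locator are assumptions, not a real site's markup.

```python
# Handle a pop-up that blocks a click: catch the exception, strip the
# overlay out of the DOM with JavaScript, then retry once.
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By

def click_with_overlay_removal(driver, locator, overlay_css=".modal-overlay"):
    try:
        driver.find_element(*locator).click()
    except ElementClickInterceptedException:
        # Remove the blocking element(s), then retry the click.
        driver.execute_script(
            "document.querySelectorAll(arguments[0]).forEach(e => e.remove());",
            overlay_css,
        )
        driver.find_element(*locator).click()

# Usage: click_with_overlay_removal(driver, (By.ID, "load-more"))
```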
A less invasive version of the previous challenge is that the target element has not yet loaded at the moment it is searched for. Since network requests are inherently asynchronous, not all the elements of a web page finish loading at the same time. Proper waiting mechanisms should be introduced so the scraper holds off until the target element appears in the DOM.
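With Selenium, for instance, explicit waits express exactly this. The URL and selector below are hypothetical placeholders.

```python
# Wait (up to 10 seconds) for an element to appear before reading it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/search?q=data")
    # Block until the results list is present in the DOM.
    results = WebDriverWait(driver, timeout=10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "ul.results"))
    )
    print(results.text)
finally:
    driver.quit()
```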
Even after you have implemented a successful scraping job, it may stop working properly after some time as a consequence of the evolution of web content. Content, styles, and HTML page structures undergo endless change, and scraping jobs must be reassessed and reimplemented as necessary to keep up. This is inevitable, so you have to live with the consequences of change.
Room for Improvement
Almost all scraping jobs involve issuing network requests. These requests are I/O operations, and most of the time spent on them is waiting for responses. If there are independent requests to be made, they can be issued in parallel so that the total waiting time is far smaller than with sequential requests. This is a major performance enhancement worth considering when you have to deal with a large number of requests. The right number of requests to process in parallel is a parameter that must be carefully chosen and tuned according to the allocated computing resources, the network bandwidth, and the destination server's limits.
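A minimal sketch of this idea in Python uses a bounded thread pool, where max_workers is the parallelism parameter to tune. The URLs are placeholders.

```python
# Issue independent requests in parallel with a bounded thread pool.
from concurrent.futures import ThreadPoolExecutor
import requests

urls = [f"https://example.com/page/{n}" for n in range(1, 51)]

def fetch(url):
    return requests.get(url, timeout=10).text

# Keep the pool small enough to respect the destination server.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))

print(f"fetched {len(pages)} pages")
```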
The same parallelization can be applied to parsing the HTML responses and extracting data (i.e., multiple responses can be processed in parallel if there are no dependencies between them). Even browser automation tools provide options to load web pages in parallel by opening multiple browser tabs at once. These options are worth considering when time is critical for your job.
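Here is a sketch of parallel parsing with a process pool (parsing is CPU-bound, so processes sidestep Python's global interpreter lock). The HTML strings stand in for previously downloaded responses.

```python
# Parse already-downloaded responses in parallel across processes.
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def extract_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("h2")]

if __name__ == "__main__":
    pages = ["<h2>First</h2>", "<h2>Second</h2>"]  # stand-in data
    with ProcessPoolExecutor() as pool:
        titles = list(pool.map(extract_titles, pages))
    print(titles)
```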
You will notice that the code examples in this chapter are only brief sketches; I want the chapter to stay concise and limited to the fundamentals of scraping. Stay tuned for another chapter in which I will take you through a simple but complete scraping job, explaining the implementation details as well.
If you have come this far, you must have enjoyed the content. Thanks for reading.
👨‍💻 Shane Wolff 🐺