The essentials that Data Scientists need to know about web scraping

Patrick Stewart
Published in Patrick’s notes
6 min read · May 8, 2022
Figure 1: Web scraping is a great skill to have in your arsenal as a Data Scientist

The internet is one of the most significant sources of data available to a practicing data scientist. The good news is that this data can be captured using a number of Python tools that can be learned rapidly. This tutorial provides explanations and demonstrations of how these tools can be used to extract information from websites.

A quick side note on the recent history and legality of web scraping

Before diving into the key components of the tutorial, it is important to note the state of play from a legal perspective when it comes to web scraping. The short answer is “yes, it is legal”, but this comes with some caveats depending on the jurisdiction that you fall under.

United States: web scraping in the United States is legal provided the data scraped is from publicly available sources and the scraping activity does no harm to the website being scraped. One notable exception is the BOTS Act of 2016, which prohibits using bots to purchase an excessive number of tickets at once, with the aim of preventing ticket black markets.

European Union and the UK: based on the recently passed Digital Services Act, which aims to bring all EU countries under the same digital sharing regulations, “reproduction of publicly available content” is not illegal. However, the Digital Services Act’s focus is intellectual property (IP), so other activities relevant to web scraping, such as the processing of personal data, remain governed by separate regulations such as GDPR.

China: among English-language sources, there is no direct regulation against web scraping in China. As in other countries, web scraping appears to be used for business purposes there too, but it is not legal to scrape and process personal data.

N.B: This is the state of play at the time of writing, with a number of legal cases ongoing that have the potential to change the current rules around web scraping. As a consequence, please do your own research and scrape at your own risk… Now let’s crack on with the tutorial.

Requests and BeautifulSoup

To make the explanations clearer as the tutorial progresses, I will include appropriate examples with code. For clarity, the tutorial only contains snippets; the full coded solution can be found at this GitHub link. In the first part of this tutorial, we will be scraping Trustpilot customer reviews for the company Access Software Group.

To scrape a website, we first need to obtain its HTML (i.e. the code that is used to structure a web page and its contents), which contains all the information about that specific page. The easiest way of doing this is with the Requests package. In short, the Requests module allows you to send HTTP requests to a website from your Python code. If the request is successful, it returns a response object containing, among other things, the status code and the page source. In this instance, we will be using the “.get” function, which returns the HTML of the web page.
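As a rough illustration, the snippet below fetches the review page used in this tutorial with Requests; depending on the site, you may also need to pass headers such as a User-Agent.

```python
import requests

# Fetch the HTML of the Trustpilot review page used in this tutorial.
url = "https://uk.trustpilot.com/review/theaccessgroup.com"
response = requests.get(url)

# A 200 status code means the request succeeded; response.text holds the raw HTML.
if response.status_code == 200:
    html = response.text
    print(html[:500])  # preview the first 500 characters
```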

Once we have obtained the HTML, we can parse the text and obtain the key information that we need using BeautifulSoup. It is important, when coming to write our code, that we understand the structure of the HTML so that we extract only the key information.

One method of understanding the HTML is by viewing the page source itself and searching for the key content that you are looking to parse. We can do this by first locating the exact web page, which in this case is https://uk.trustpilot.com/review/theaccessgroup.com, then right-clicking and selecting “View page source”. Please note, we are using Google Chrome to view the page source; the exact process will vary slightly depending on the browser that you are using.

The page source for the link is too large to scan manually, so instead we search for where the customer reviews are located within it. Unfortunately, the class names for the reviews are rather complex, but we can observe that each review block is contained in an article tag with class=’paper_paper__1PY90 paper_square__lJX8a card_card__lQWDv styles_reviewCard__hcAvl’, which wraps every individual review.

Figure 2: Search the HTML for where the reviews start and view the Class

Within the HTML of the review, the main text component sits in the class=’typography_typography__QgicV typography_body__9UBeQ typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3’. We can retrieve the star rating directly from the image content using the find() and get() methods of BeautifulSoup.
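A minimal sketch of this step is below. The class names are taken from the page source above, but the exact tags and the “alt” attribute used for the star rating are assumptions that may change as Trustpilot updates its markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Locate the first review block; matching on one of the article's classes is
# enough, since BeautifulSoup matches any tag whose class list contains it.
review = soup.find("article", class_="styles_reviewCard__hcAvl")

# The review text sits inside the typography_* classes identified above;
# here we assume it is wrapped in a <p> tag.
text_tag = review.find("p", class_="typography_body__9UBeQ")
review_text = text_tag.get_text(strip=True) if text_tag else None

# The star rating comes from the rating image; we assume it is exposed
# through the img tag's "alt" attribute (e.g. "Rated 4 out of 5 stars").
img_tag = review.find("img")
star_rating = img_tag.get("alt") if img_tag else None

print(star_rating, review_text)
```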

Figure 3: Searching for an actual review and identifying the HTML structure

Now that we have this information, we can build a Python function to iterate through the whole webpage, extract the key information and store the results as a dictionary. You can find the whole code base using this GitHub link, with the core code function sketched below.
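Here is a minimal sketch of what such a function might look like (the full version lives in the linked repository; the class names and the “alt” attribute are the same assumptions as above).

```python
import requests
from bs4 import BeautifulSoup

def scrape_reviews(url):
    """Scrape one Trustpilot page and return a list of review dictionaries."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    reviews = []
    for article in soup.find_all("article", class_="styles_reviewCard__hcAvl"):
        text_tag = article.find("p", class_="typography_body__9UBeQ")
        img_tag = article.find("img")
        reviews.append({
            "text": text_tag.get_text(strip=True) if text_tag else None,
            "rating": img_tag.get("alt") if img_tag else None,
        })
    return reviews

reviews = scrape_reviews("https://uk.trustpilot.com/review/theaccessgroup.com")
```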

What to do when interaction with the webpage is required

While BeautifulSoup and Requests are ideal when we simply need to parse the HTML source of an individual page, there is added complexity when interactions with the page are required before any parsing can take place. For this we can use the Selenium WebDriver, an API that lets Python code interact with a site’s web pages. Put simply, the Selenium WebDriver sends HTTP requests (via a RESTful API) to your browser driver and handles the responses. Its strength for web scraping derives from its ability to render web pages just like any browser, including running JavaScript.

Selenium’s functionality gives us many more options when extracting html from a web page. This tutorial won’t cover all the functionality or how it works in detail but I recommend this article for a detailed explanation.

Selenium requires three components:

· Web browser — supported browsers are Chrome, Edge, Firefox and Safari.

· Driver for the browser.

· The selenium package.

Firstly, the package and the driver for the browser need to be installed; this can be done by following the instructions in the link provided here — https://selenium-python.readthedocs.io/installation.html.

So what are some of the key functions needed for Selenium that can augment your BeautifulSoup parsing?

get()

The get command launches a new browser and opens the given URL in the WebDriver that you have installed. For example, if you wanted to find the Google Maps location for a Gail’s Bakery in St Albans, this could be achieved with the following.
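A minimal sketch, assuming a Chrome installation (the Maps search URL is purely illustrative):

```python
from selenium import webdriver

# Launch a new Chrome session; recent versions of Selenium (4.6+) download a
# matching driver automatically, otherwise install one via the link above.
driver = webdriver.Chrome()

# Open Google Maps and search for the bakery (illustrative URL).
driver.get("https://www.google.com/maps/search/Gail's+Bakery+St+Albans")
```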

find_element()

You can use find_element() to locate a specific element on the page, for example a click-through button. Copy the XPath for the specific element by inspecting the page source, finding the highlighted (blue) section and copying the full XPath. For example, if you wanted to find the maximum number of reviews for Gail’s Bakery, this could be achieved with the following.
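A sketch of the idea (the XPath below is a placeholder; the real one is whatever you copy from the inspector):

```python
from selenium.webdriver.common.by import By

# Locate the element holding the total review count via its full XPath
# (placeholder value - copy the real XPath from the browser inspector).
review_count = driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/span")
print(review_count.text)
```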

click()

This function takes the found element and performs a “click” on it. This can be done by finding the XPath for that element and then calling .click(), in similar fashion to find_element.
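For example, clicking a “next page” style button might look like this (again with a placeholder XPath):

```python
# Find the click-through button and perform the click.
next_button = driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/button")
next_button.click()
```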

Future developments for web scraping

While this tutorial serves as a good grounding in the essentials of web scraping, there are many more developments that I am looking to incorporate in this area.

1. Configurable execution code

Currently, the code provided is still fairly specific to individual websites. The next step is to develop a solution that is much more generic, with limited adaptation needed to crawl across multiple sites. Mykhailo Kushnir has written a number of very good articles on this, which I will be looking to implement myself.

2. Use of proxies with Selenium

Selenium allows you to pass proxy IP addresses. This makes it harder for the website being scraped to detect the bot, as requests appear to come from a different location each time a scrape is executed.
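A minimal sketch with Chrome options (the proxy address is a placeholder; in practice you would rotate through a pool of proxies):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Route browser traffic through a proxy server (placeholder address).
options = Options()
options.add_argument("--proxy-server=http://203.0.113.5:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://uk.trustpilot.com/review/theaccessgroup.com")
```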

Closing remarks

Despite the attractiveness of web scraping, it is important to be cautious about the rate at which you make requests. It is possible to make hundreds of requests a second, which at peak traffic could impact a website’s servers or, in extreme cases, be considered a denial-of-service attack, which has significant legal ramifications.
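A simple way to stay polite is to throttle your requests, for example as below (the paginated URL pattern is illustrative):

```python
import time
import requests

# Scrape several pages, pausing between requests to avoid overloading the site.
for page in range(1, 6):
    response = requests.get(f"https://uk.trustpilot.com/review/theaccessgroup.com?page={page}")
    # ... parse response.text here ...
    time.sleep(2)  # be polite: wait a couple of seconds between requests
```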

If you want to support the generation of content like this then please subscribe to me and Patrick’s notes.

References

https://realpython.com/beautiful-soup-web-scraper-python/

https://medium.com/@isguzarsezgin/scraping-google-reviews-with-selenium-python-23135ffcc331

https://medium.com/gitconnected/how-i-scrape-lots-of-sites-with-one-python-script-9fba09d5c9be
