What is web scraping?

Raffaello Ippolito
5 min read · Jul 21, 2023


When studying computer science, it is well known that one learns much more by doing than by studying theory, so students take on a wide variety of exercises and projects. When approaching data science, however, a problem often arises: where can I find datasets with thousands of rows to analyze?
We live in the age of the Internet, and there is a ton of data out there! Web scraping is the practice of retrieving this data from the web in an automated way.

Web scraping, however, is not just a student activity: it is used very often at the business level as well, so it is a practice worth getting familiar with if you plan to work in this field.

Since web scraping means working with websites, the more you know about how they work, the easier the job will be. That said, you do not need to be an expert to get started: a few notions are enough, and if you do not know them yet, you will pick them up here.

When you visit a website, all you do is make a call to a given URL, and in response to that call you receive a web page. Web pages are nothing more than files written in a language called HTML, which can optionally be enriched with CSS, which defines the graphical style of the page, and with scripts in JavaScript (lately the more recent PyScript is also starting to make its way, but as of today JavaScript is used in almost all cases).

How does it work?

The idea behind web scraping is to perform a call through code to the website of interest and extract the information we are interested in from the HTML (directly or indirectly).
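To make this concrete, here is a minimal sketch of such a call in Python using the Requests library (presented further down); example.com is just a placeholder URL.

```python
import requests

# Ask the server for the page, exactly as a browser would
response = requests.get("https://example.com")

print(response.status_code)              # 200 means the request succeeded
print(response.headers["Content-Type"])  # typically "text/html; charset=UTF-8"
print(response.text[:300])               # the first characters of the raw HTML
```

Everything that follows (parsing, extraction, storage) starts from this raw HTML string.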

The first step is then to visit the site of interest to understand how it is structured, identify where the information we care about lives, and work out the strategy by which to retrieve the data.

An HTML file is made up of tags; each tag has an opening and a closing marker and defines a section into which the page is organized. Tags can be nested inside one another, defining sections and subsections, and they come in various types for different kinds of content.
We may want to retrieve all the elements of a given type (all images, for example), or we may be after information written in a specific place and therefore reachable through a sequence of nested tags, as the sketch below shows.
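Here is a minimal sketch of both cases using BeautifulSoup (presented further down); the HTML snippet is made up for the example.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a real page's HTML
html = """
<html>
  <body>
    <div class="gallery">
      <img src="/cat.png"><img src="/dog.png">
    </div>
    <div class="article">
      <p><span class="price">19.99</span></p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# All elements of a given type: every <img> tag on the page
for img in soup.find_all("img"):
    print(img["src"])

# A specific piece of information reached through a sequence of tags
price = soup.select_one("div.article p span.price")
print(price.text)  # -> 19.99
```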

Once the way forward has been identified manually, we can move on to automating the process. Having extracted all the information, we have to organize it and possibly transform it so that it can be stored; how to do this can differ greatly depending on the needs.

Yes, but what about in practice?

Now that the concepts are clear, I’ll present a few tools to get you started.

For the initial inspection of the site, you can safely use your favorite browser, such as Chrome or Firefox. To view the HTML, simply press the key combination Ctrl + U, or type “view-source:” in the address bar before the URL. Doing this, however, will show the HTML as it was served; some information may be generated dynamically by running scripts. If you instead want the HTML you get after the scripts have executed, you can press the F12 key, or right-click and select “Inspect” (or “Inspect Element”). This also gives you access to other, more advanced tools, such as the network traffic log and a console.

As for the code itself, the most widely used language for this kind of work is Python, with its many libraries. Let’s look at some of them:

  • Requests: the most basic library on this list. It is very lightweight and is used to make HTTP requests; however, it cannot execute scripts, so it can only be used for scraping static pages or for side tasks. Precisely because it is so basic and lightweight, many more complex libraries build on top of it.
  • BeautifulSoup: a library used for parsing HTML; it makes searching for information much easier and faster.
  • Requests-HTML: a version of Requests that can execute scripts. It is not as lightweight as Requests, but it is still lighter than many alternatives. To render pages, it relies on Chromium, an open-source browser lighter than many of its counterparts.
  • Scrapy: a robust and comprehensive framework that lets you do a little bit of everything and also provides a built-in shell that is useful for debugging (see the spider sketch after this list).
  • Selenium: undoubtedly the heaviest library on this list, but in return it is the one that really lets you do anything. Selenium drives your favorite browser and lets you use all of its features.
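
As a taste of what a scraper looks like in practice, here is a minimal Scrapy spider, closely modeled on Scrapy’s own tutorial, that collects quotes and authors from quotes.toscrape.com, a site built precisely for scraping practice.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """A tiny spider for quotes.toscrape.com, a practice site for scraping."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits inside a <div class="quote"> block
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "Next" link, if present, to scrape the following page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saving this as quotes_spider.py and running `scrapy runspider quotes_spider.py -o quotes.csv` crawls the pages and exports the results directly to CSV, which ties in with the next point.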

Once we have retrieved the data, we will want to save it; how to organize it can change tremendously from case to case. As for the technology to use, there are several possibilities. The most common and basic approach is to export the data to a CSV file; some libraries, such as Scrapy, natively provide methods to export to this format. Alternatively, one can set up a MySQL database if more tables are needed; this approach is certainly more professional, but absolutely not necessary for the most basic applications. If you are dealing with Big Data, it may be convenient to try a NoSQL approach; if you want to learn more about what Big Data is and why a non-relational database may be worth using, I invite you to read my previous articles What is Big Data and Why nonrelational databases. If you go this route, several libraries, such as Scrapy, natively provide methods for exporting to JSON format.
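As a minimal sketch of the CSV and JSON options with the standard library, assuming the scraper has already collected its results as a list of dictionaries (the rows below are invented):

```python
import csv
import json

# Assume the scraper produced a list of dictionaries like this one
rows = [
    {"title": "First article", "url": "https://example.com/1"},
    {"title": "Second article", "url": "https://example.com/2"},
]

# CSV: one line per scraped record, with a header row of column names
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

# JSON: handy for nested or irregular data, and a natural fit for NoSQL stores
with open("scraped_data.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```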

Conclusions

To conclude, web scraping is a practice that is useful in many ways and definitely worth learning. Understanding how web pages work can help a lot: one can, for example, intercept the calls the data comes from and receive it directly in an aggregated and organized way as JSON. On the other hand, some sites do not want their data to be retrieved in an automated manner and therefore implement various anti-scraping strategies.
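For instance, here is a sketch of that interception idea: once the browser’s Network tab reveals the endpoint a page loads its data from (the URL and field names below are hypothetical), the scraper can call it directly with Requests and skip HTML parsing entirely.

```python
import requests

# Hypothetical endpoint spotted in the browser's Network tab while the page loads;
# many sites fetch their data from a URL like this and render it with JavaScript.
api_url = "https://example.com/api/products?page=1"

response = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
data = response.json()  # already structured data, no HTML parsing needed

for item in data.get("products", []):
    print(item.get("name"), item.get("price"))
```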


Raffaello Ippolito

Italian software developer and data analytics student. Graduated in Mathematics for Engineering with a thesis on Big Data and Image Processing.