Scraping from 0 to hero (Part 1/5)

Ferran Parareda
4 min read · Oct 28, 2021

Introduction

As Wikipedia puts it, web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites.

As simple as that: enter a website, get the information or data that matters to you, and use it however you wish. All of it automated, of course.

The idea behind this is pure process automation. When we repeatedly retrieve information from a particular website, we usually only need a few pieces of data, not the entire site. That’s why we scrape that specific information and skip the rest.

Let’s see an example that uses scrapers. You are setting up a startup. The model of this startup is to share your sofa in exchange for a few euros. After a few days, the startup still does not have enough flats to offer to its clients. One of the founders knows a possible way: scrape Craigslist to find people interested in offering their sofas, rooms or even their entire flat! Amazing, right? That startup is now called Airbnb.

In this article, I am going to explain, from top to bottom, everything needed to understand scraping and make it perform better. Before starting, keep one important thing in mind: scrape only when you cannot get the data any other way (for example, through a public API). It is not easy to:

  • Keep the scraper maintainable,
  • Deal with all the problems that can occur while you are scraping,
  • Minimize infrastructure costs.

So, if you still want to learn about it, let’s get started!

Assumptions

Before starting with the article, you should know a bit about:

  • HTML,
  • XPath or CSS selectors,
  • Coding (preferably Python; Java or Go are also fine).

Concepts

In this article, I am going to use several terms that are important to understand:

Crawling

When we need to navigate every single page of a website, we first need to know all of its links. Otherwise, how could we reach the pages we want to scrape? The process of discovering all those links is called crawling.

The algorithm is really simple:

  1. Define the scope (for instance, if you want to go over the whole Craigslist website and you find a link pointing outside Craigslist, you ignore it).
  2. Start with a small set of links to scrape. These links are stored in a list of links (LoL).
  3. While there are links in the LoL, take one and extract all the hyperlinks that appear in the HTML content of that page.
  4. Every new link found in the current HTML content is added to the LoL (given that it is inside the boundaries and was not visited previously; otherwise, ignore it).
  5. Repeat from step 3 until there are no remaining links in the LoL to consume. At that point, stop the process.

At the end of the process, you will have collected every link reachable within your scope.
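As an illustration, here is a minimal sketch of this crawling loop using only the Python standard library. The seed URL and domain are hypothetical, and a real crawler would also handle retries, robots.txt and rate limiting:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag found in an HTML page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seed_urls, allowed_domain):
        """Breadth-first crawl restricted to one domain (the scope)."""
        to_visit = deque(seed_urls)  # the LoL: links still to consume
        visited = set()              # links already processed

        while to_visit:
            url = to_visit.popleft()
            if url in visited:
                continue
            visited.add(url)

            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
            except Exception:
                continue  # skip pages that fail to load

            parser = LinkExtractor()
            parser.feed(html)

            for href in parser.links:
                absolute = urljoin(url, href)  # resolve relative links
                # Only keep new links that stay inside the defined scope.
                if urlparse(absolute).netloc == allowed_domain and absolute not in visited:
                    to_visit.append(absolute)

        return visited


    # Hypothetical usage: crawl everything reachable from the home page.
    # links = crawl(["https://example.org/"], "example.org")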

Take into consideration that the methodology for crawling is very similar to the one for scraping. You need to do it in a nice (or smart) way, to avoid overloading the website’s server or disturbing the other users of the site. Later on, I will explain how I recommend doing this for scraping; the same advice applies to crawling.

Legality

Scraping is one of the most controversial topics you can discuss with any webmaster or website owner. Why? Because (if done well) there is no way to tell whether it is a human or a bot (a scraper) navigating the website. Webmasters only ask you to be polite when using their website, and that is totally fair.

In my opinion, you can scrape every page if:

  • Their terms and conditions allow you to do so.
  • You are polite and do not generate something resembling a DDoS attack (a situation where so many requests are sent to the website that it ends up going down).
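To make the “be polite” point concrete, here is a small sketch (standard library only, hypothetical URLs) that checks robots.txt before fetching and pauses between requests:

    import time
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    # Read the site's crawling rules first (hypothetical domain).
    robots = RobotFileParser()
    robots.set_url("https://example.org/robots.txt")
    robots.read()

    urls = ["https://example.org/page1", "https://example.org/page2"]

    for url in urls:
        if not robots.can_fetch("*", url):  # respect what the site disallows
            continue
        html = urlopen(url, timeout=10).read()
        # ... extract only the data you need from `html` ...
        time.sleep(2)  # wait between requests so the server is not overloaded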

Frameworks

There are a lot of frameworks in different languages that can help you to start scraping. Each one has benefits and drawbacks.

Scrapy (Python)

Personally, I think it’s one of the best options out there. Period.

Scrapy is the easiest, most usable and most scalable framework around. Made by the company ScrapingHub, it allows you to scrape whatever you need. You can check their website or their documentation to know more.
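To give a feel for how little code is needed, here is a minimal sketch of a Scrapy spider; the site, CSS selectors and field names are illustrative assumptions, not taken from a real page:

    import scrapy


    class SofasSpider(scrapy.Spider):
        """Minimal spider: fetch a listing page and yield one item per entry."""

        name = "sofas"
        start_urls = ["https://example.org/listings"]  # hypothetical listing page

        def parse(self, response):
            # Extract every listing's title and link with CSS selectors.
            for entry in response.css("div.listing"):
                yield {
                    "title": entry.css("h2::text").get(),
                    "url": response.urljoin(entry.css("a::attr(href)").get()),
                }

            # Follow pagination, if the page has a "next" link.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

You could save this as sofas_spider.py and run it with scrapy runspider sofas_spider.py -o sofas.json, without even creating a full Scrapy project.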

Colly (GoLang)

This is a fairly new framework that lets you harness the power of Go (the language of the giant Google; by the way, do you know what Google does behind the scenes? Yes: scraping!).

Colly is fast, reliable and scalable as well. The main drawback is that you have to develop in Go, and the language’s complexity makes it more enemies than friends. If you are a good Go developer, this is your framework! You can check their website or their documentation for more information.

There are more frameworks in other languages, such as Ruby, Java or even PHP. However, in this article I am going to assume that we will only use Scrapy, for the sake of simplicity.

Development environment needed

The most basic development environment needed is:

  • Python 3.9 installed.
  • PyCharm to develop the code in Python. You can use the Community Edition; it’s free and offers everything we need.
  • Docker to run the database used to store the scraped information.
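The article does not say which database we will use, so purely as an example, this is how you could spin up a throwaway PostgreSQL instance with Docker (the container name, password and port are placeholders):

    docker run --name scraping-db \
      -e POSTGRES_PASSWORD=change-me \
      -p 5432:5432 \
      -d postgres:13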

To be continued…

This article is part of the list of articles about Scraping from 0 to hero:

If you have any questions, you can comment below or reach me on Twitter or LinkedIn.
