Web scraping

Architecture behind a Web Scraper that runs on the AWS Cloud

How does a web scraper that runs on the Cloud work?

Michelangiolo Mazzeschi
Plain Simple Software

--

Web scraping is an activity that is almost impossible to complete at a large scale with a single computer. Although a web scraper is one of the easiest pieces of software to create, it takes a long time to complete its task because of the limitations imposed by our internet connection. In this article, I am going to describe in detail the architecture behind a web scraper that runs on the AWS Cloud.

While Machine Learning software, in comparison, can be sped up with more computing power, there is no way to improve the speed of web scraping software running on a single machine. The only option we have is to let the program run until it is finished. If it is running on our local machine, we might have to put our work on hold until we have scraped all the data we are interested in. Conventionally, a web scraping algorithm is limited to roughly one page per second. However, what if you need to scrape hundreds of websites, each one made of tens of thousands of links?

A single computer does not have enough power for such an enterprise, but one remote computer (or several) running 24/7 on the Cloud might just do. The Cloud offers the possibility of opening virtual machines: computers that run 24/7 and have an hourly cost. AWS has a service called EC2 that specializes in running virtual machines of several kinds. Later on, we will see the specifics and the challenges of running a web scraping algorithm on the Cloud.

When the algorithm breaks

The main issue with web scraping algorithms is that they might encounter pages that are problematic for the software to process, potentially crashing the virtual machine.

A virtual machine can usually run into trouble when the scraper:

  • Attempts to download a PDF file
  • Attempts to download an image
  • Attempts to download a file in an unknown format
  • Attempts to download a page that is too heavy

Unfortunately, web scraping is very delicate, because the software does not discriminate between plain HTML pages and links that point to other kinds of files. There is, however, the possibility of making it skip those pages with a few lines of code in our main software.
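
As an illustration, a few defensive checks before downloading a link are usually enough to avoid these crashes. The sketch below is written in Python with the requests library; the extension list and the size threshold are arbitrary assumptions, not values prescribed by the architecture:

```python
import requests

# Extensions we never want to download (hypothetical list, extend as needed)
SKIP_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip")
MAX_BYTES = 5_000_000  # skip pages heavier than ~5 MB (arbitrary threshold)

def is_scrapable(url: str) -> bool:
    """Return True only for links that look like ordinary HTML pages."""
    if url.lower().endswith(SKIP_EXTENSIONS):
        return False
    try:
        # A HEAD request lets us inspect the headers without downloading the body
        head = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False  # unreachable or malformed link: skip it instead of crashing
    content_type = head.headers.get("Content-Type", "")
    content_length = int(head.headers.get("Content-Length", 0) or 0)
    return "text/html" in content_type and content_length < MAX_BYTES
```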

Software architecture

To build an entire web scraping system, a proper algorithm is not enough. We need to connect several services that allow us to:

  • Send the software its inputs (the websites we wish to scrape) and other parameters
  • Save the scraped data on a storage unit
  • Manage the EC2 instance, stopping it or activating it when we need to

Architecture of a Cloud Web Scraper

Managing the EC2 Instance

As mentioned, there are several EC2 instances we can choose from. The cheapest option (about 0.14 USD per day) is the t2.nano, which, with 1 CPU and 0.5 GB of RAM, may not be fit for web scraping. The next model is the t2.micro, which has 1 CPU and 1 GB of RAM: a bit more powerful and able to withstand more intensive workloads.
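
Since one of the requirements listed earlier was being able to stop or activate the EC2 instance when we need to, it is worth noting that this does not have to happen from the console. A minimal sketch with boto3 (the instance ID and region are placeholders) could look like this:

```python
import boto3

# Hypothetical instance ID and region; replace with the values shown in your EC2 console
INSTANCE_ID = "i-0123456789abcdef0"
ec2 = boto3.client("ec2", region_name="us-east-1")

def start_scraper_instance():
    """Activate the virtual machine so the scraper can resume its work."""
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

def stop_scraper_instance():
    """Stop the virtual machine (and its hourly cost) once the task is done."""
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
```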

Once the instance is activated, I mount my web scraping software on the virtual machine, so that it runs 24/7 until the task is done. In case we encounter a bad link and the web scraper breaks, we can set up a task on the virtual machine that makes the software run again. The software will be unkillable.
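
One way to obtain this "unkillable" behaviour is a small supervisor script that restarts the scraper whenever it exits with an error; this is only a sketch (a cron job or a systemd service would work just as well), and the path to the scraper is hypothetical:

```python
import subprocess
import time

# Hypothetical entry point of the scraper; adjust the path to your own script
SCRAPER_CMD = ["python3", "/home/ec2-user/scraper/main.py"]

while True:
    # Run the scraper; if it crashes on a bad link, the loop starts it again
    result = subprocess.run(SCRAPER_CMD)
    if result.returncode == 0:
        break  # the scraper finished its task normally, no need to restart
    print("Scraper crashed, restarting in 30 seconds...")
    time.sleep(30)
```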

Saving scraped files in storage

The web scraper will keep downloading files, but they should not be stored in the EC2 instance's memory. We could do that, but it would not make much sense, as we would need to connect to the EC2 instance each time to extract our data. There are several kinds of storage we can choose from; the two most popular formats in which we can store our data are JSON and SQL.

The issue when using Python is that it is not a language that goes along very well with relational databases. Also, because we need to store textual data, a tabular format like SQL is not the best choice. We need a data format that is flexible enough to hold different fields, if needed, and that can accept very long textual fields. In our case, JSON is perfect. It is also less expensive, as the NoSQL service on AWS has no initial costs, unlike an SQL database.

The service that AWS offers for storing NoSQL data is called DynamoDB; it allows us to store a virtually unlimited number of NoSQL tables and even offers a UI from which we can access and easily visualize our data. For every web page we scrape, all the data will be sent to and stored in our DynamoDB table.
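
As a rough sketch of what this looks like in Python with boto3 (the table name and the fields are assumptions made for illustration, since no schema is fixed here), each scraped page becomes one item:

```python
import boto3

# Hypothetical table; it must already exist in DynamoDB with "url" as partition key
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("scraped_pages")

def save_page(url: str, title: str, text: str) -> None:
    """Store one scraped page as a JSON-like item in DynamoDB."""
    table.put_item(
        Item={
            "url": url,
            "title": title,
            "text": text,  # long textual fields are fine while the item stays under 400 KB
        }
    )
```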

SQL vs. NoSQL, retrieved from https://www.youtube.com/watch?v=nigPkP6QeTk

Communicating with the Web Scraper

Now that we have established where we will get the data from and where we will store it, we need a way to communicate with the machine to give it the proper directions. There are many ways to interact with an active virtual machine. Because the objective of the web scraper is quite simple, my approach is to only send new links to a table in DynamoDB that is scanned periodically by the web scraper. Every hour, the web scraper will check whether there are new entries in that specific DynamoDB table, and if there are, it will start scraping all the links of that website.
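
A minimal sketch of that polling loop is shown below; the table name, the "scraped" flag, and the key schema are all assumptions made for illustration:

```python
import time
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
# Hypothetical table holding the websites submitted for scraping
links_table = dynamodb.Table("links_to_scrape")

def scrape_website(url: str) -> None:
    ...  # the actual scraping logic lives here

while True:
    # Read every entry that has not been processed yet
    response = links_table.scan(FilterExpression=Attr("scraped").eq(False))
    for item in response.get("Items", []):
        scrape_website(item["url"])
        # Mark the entry as done so the next pass ignores it
        links_table.update_item(
            Key={"url": item["url"]},
            UpdateExpression="SET scraped = :t",
            ExpressionAttributeValues={":t": True},
        )
    time.sleep(3600)  # check again in one hour
```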

To update the dataset and send small pieces of information to the DynamoDB service, I am using a service called Lambda. There are two main advantages of using Lambda functions:

  • The cost is limited to the running time of the function
  • We can send information from the browser

What I want to avoid is having to log in to my coding environment on my own computer (or even to the DynamoDB service page in the AWS console) to input new links every single time. What if my friends or some clients want to use it? To make it accessible, I will connect the Lambda functions that send new links to my DynamoDB table to API Gateway. Now, my functions will be accessible from any browser, which simplifies things.
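
To make the idea concrete, a Lambda handler of this kind might look like the sketch below. It assumes an API Gateway proxy integration with the link passed as a ?url= query parameter, and it writes to the same hypothetical table used above:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table that the scraper polls for new links
links_table = dynamodb.Table("links_to_scrape")

def lambda_handler(event, context):
    """Receive a link from API Gateway and queue it for the scraper."""
    url = (event.get("queryStringParameters") or {}).get("url")
    if not url:
        return {"statusCode": 400, "body": json.dumps("Missing 'url' parameter")}
    links_table.put_item(Item={"url": url, "scraped": False})
    return {"statusCode": 200, "body": json.dumps(f"Queued {url} for scraping")}
```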

In reality, once the API is created, applications can expose buttons instead of a raw browser URL, which is even simpler for the end user. Make sure to follow Plain Simple Software for more software articles!
