Automated Web-scraping with AWS Free Tier

Learn how to move your scraping job into the cloud.

Angela Niederberger
11 min read · Jul 2, 2022
Photo by Daniel Páscoa on Unsplash

Table of Contents

  1. Introduction
  2. Developing the Script
    Extracting Data from a Website
    Code Refactor
  3. Running the Scraping Job in the Cloud
    AWS Basics
    Saving to the S3 Bucket
    Setting up the EC2 Instance
    Automating the Script on the Instance
  4. Conclusion

Introduction

Have you ever faced a situation where you wanted to collect data from an online source at regular intervals? For me, creating my own datasets is one of the most fun, even artistic aspects of working with data. In this blog post I'll take you through the process of automating a scraping job on AWS, share the useful resources I collected on this topic and, finally, show you what my brand-new data set looks like!

To develop and test the scraping script locally, I used conda to create a virtual environment with Python 3.10 and a couple of open source packages in it. You can find a list with all the details in the requirements file.

Here’s what my setup on AWS looks like:

AWS Web-scraping Set-up (Image by the Author)

In short, I used an EC2 Instance (i.e. a virtual machine) to run my scraping script daily, with a cron job on the same instance as the trigger. In its last step, the script saves a new CSV file with the day's data to an S3 Bucket. That's all I needed to complete the task, but read on for a more detailed description of the different steps in this process.

Developing the Script

Usually when I develop a script for some task, I start with basic exploration in a Jupyter Notebook. Once I know exactly what I want to do, I refactor my code: I write some functions and then switch to PyCharm to put together a clean script that can be executed from the command line. In the following subsections I will elaborate on this process.

Extracting Data from a Website

I decided to scrape the website of the Swiss national TV and Radio broadcaster SRF daily to put together a data set of German news headlines. The website is static, so I worked with BeautifulSoup. For dynamic websites you would need to use Selenium.

In Firefox, I can inspect a website's HTML code using Ctrl+Shift+C. If you're using a different OS or browser, you can find the relevant shortcut here. In Inspect mode I can see that for each news article there's a 'kicker', a 'title' and a 'lead', as well as the publication time.

Using the HTML Inspector on the SRF News website to understand what elements to scrape (Image by the Author)

I want to extract all of these for each article. The first step is to request the HTML script from the website and to parse it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date

url = "https://www.srf.ch/news/das-neueste"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

As a next step, the code below creates a list for each of these elements from all the articles.

teaser_lists = soup.find_all("div", class_="js-teaser-data")
kickers = soup.find_all("span", class_="teaser__kicker-text")
titles = soup.find_all("span", class_="teaser__title")
leads = soup.find_all("p", class_="teaser__lead")

Then I can iterate through these lists and store the data in a dictionary, which can easily be turned into a pandas DataFrame.

news_snippet_dict = {
    "time_published": [],
    "kicker": [],
    "title": [],
    "lead": []
}

for teaser_list, kicker, title, lead in zip(teaser_lists, kickers, titles, leads):
    news_snippet_dict["time_published"].append(teaser_list.get("data-date-published"))
    news_snippet_dict["kicker"].append(kicker.get_text())
    news_snippet_dict["title"].append(title.get_text())
    news_snippet_dict["lead"].append(lead.get_text())

news_snippets_df = pd.DataFrame(news_snippet_dict)

As a final step, I convert the time_published variable to a timestamp and select only the news articles that were published today.

news_snippets_df["time_published"] = pd.to_datetime( 
news_snippets_df["time_published"])
news_snippets_df = news_snippets_df[news_snippets_df[
"time_published"].dt.date == date.today()]

This is it for the scraping. The code creates a DataFrame with all the available data for the day. If you want more information on how to use BeautifulSoup, there’s an excellent resource over at RealPython.

Code Refactor

In order to automate the code above, it needs to be adapted for a production environment, which generally means writing functions and integrating them into a single script that can be executed from the command line. Another common requirement is to document the process well, for example by including a docstring and comments in each function and by creating a requirements file that lists all the packages needed to run the script.

Here’s the function that I compiled from the previous code snippets:
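
A minimal sketch of what this function could look like, assembled from the snippets above (the function name and docstring are illustrative, not necessarily the ones used in the repository):

import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import date


def scrape_news_snippets(url="https://www.srf.ch/news/das-neueste"):
    """Scrape today's news teasers from the SRF website into a DataFrame."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Collect the relevant elements for each article teaser
    teaser_lists = soup.find_all("div", class_="js-teaser-data")
    kickers = soup.find_all("span", class_="teaser__kicker-text")
    titles = soup.find_all("span", class_="teaser__title")
    leads = soup.find_all("p", class_="teaser__lead")

    news_snippet_dict = {"time_published": [], "kicker": [], "title": [], "lead": []}
    for teaser_list, kicker, title, lead in zip(teaser_lists, kickers, titles, leads):
        news_snippet_dict["time_published"].append(teaser_list.get("data-date-published"))
        news_snippet_dict["kicker"].append(kicker.get_text())
        news_snippet_dict["title"].append(title.get_text())
        news_snippet_dict["lead"].append(lead.get_text())

    news_snippets_df = pd.DataFrame(news_snippet_dict)
    news_snippets_df["time_published"] = pd.to_datetime(news_snippets_df["time_published"])
    # Keep only the articles published today
    return news_snippets_df[news_snippets_df["time_published"].dt.date == date.today()]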

In addition to this, it is common to write a main function which brings all the different steps together (in this case scraping and saving the data). I also normally implement some basic logging, so that in case there is an error, I can see where it happened. Here’s what my main function looks like:
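
A rough sketch of such a main function, assuming it simply chains the scraping and saving steps and logs its progress (the file naming and the scrape_news_snippets helper come from the sketch above, not from the repository):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main(folder_path):
    """Scrape today's SRF news snippets and save them as a CSV file."""
    logger.info("Scraping SRF news snippets...")
    news_snippets_df = scrape_news_snippets()
    file_path = f"{folder_path}/{date.today()}_news_snippets.csv"
    logger.info("Saving %s snippets to %s", len(news_snippets_df), file_path)
    news_snippets_df.to_csv(file_path, index=False)
    logger.info("Done.")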

To complete the script, I added the following boilerplate code to define when the main function should be executed (i.e. whenever this script is run as the main script) and what should be done in this case:
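
In its simplest form, and assuming the folder path is the only setting passed to main, this boilerplate might look something like:

if __name__ == "__main__":
    # Local folder for now; later this becomes "s3://srf-news-snippets/data"
    folder_path = "data"
    main(folder_path)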

The full script can be found here.

After putting all of this together, I can run this file from the command line. I usually use a GNOME Terminal, since I'm working on Ubuntu, so the following commands are in Bash. If you're working on a Windows system, you can either install Bash or look up equivalent commands for PowerShell.

So in my terminal, I first need to activate the virtual environment where the required packages are installed and navigate into the project folder. Then I simply tell Python to run the script:

conda activate <environment_name>
cd <path_to_project_folder>
python src/main_local.py

By running this script from the terminal, I can scrape data from the SRF News website and save it to a local folder with just one command.

In addition to the script, at this point I also created a requirements file in which I listed the packages I’m using. It looks like this:
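
The exact file is in the repository; at a minimum, it would list the packages used in the snippets above (version pins omitted here; s3fs is what later allows pandas to write directly to S3):

requests
beautifulsoup4
pandas
s3fs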

The file makes it very easy to create a new virtual environment similar to the one I’m using. This concludes the first part of this tutorial.

Running the Scraping Job in the Cloud

With the code refactor all done, I can move on to the next step, which is to automate the running of this script somewhere other than my own computer. I decided to use AWS for the automation of my web-scraping task, since it is one of the biggest cloud service providers with a really wide range of options available. In addition, there’s a ‘Free Tier’ plan for the first year which covers everything I’ve tried out so far.

AWS Basics

It is best to complete a few basic steps on AWS before using any of the services, just to ensure the set-up follows their best practices, particularly in terms of security. To complete these steps correctly, I followed this walk-through provided by AWS. It is very detailed, so I'm not going to repeat everything here.

Just to give a rough overview, here’s what I had to do:

  • Sign up for an AWS account; this gives you your root user credentials. Credentials are best stored in the .aws folder in your home directory.
  • With this root user, create an IAM user; give this user admin access. To avoid confusion resulting from the different user credentials, it is best to use a config file in your .aws folder.
  • Create the S3 bucket with the IAM user and attach a permissions policy to this bucket, which gives the IAM user read and write access.

This is also a good time to set up billing alerts to make sure the services do not exceed a certain budget.

Saving to the S3 Bucket

Now I can test this set-up by trying to write my data file from my computer to the S3 bucket. If the permissions are all in place, changing the folder path in my boilerplate (just below if __name__ == '__main__':) to "s3://srf-news-snippets/data" should write the scraped data to the data folder in the bucket named srf-news-snippets instead of to a local folder.

I hope this convinces you that writing data files to an AWS S3 Bucket is just as easy as writing them to a local folder. All we need is the pandas .to_csv() method. The full script for scraping and writing to the bucket can be found here.
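
As a minimal illustration, assuming the s3fs package is installed so that pandas can handle s3:// paths, and reusing the bucket and folder names from above:

# Write today's snippets straight to the S3 bucket instead of a local folder
file_path = f"s3://srf-news-snippets/data/{date.today()}_news_snippets.csv"
news_snippets_df.to_csv(file_path, index=False)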

Setting up the EC2 Instance

In the previous step, I showed how I used an S3 bucket to store data files in the cloud. Doing this becomes a lot more useful when the script is also run in the cloud, so the entire process is detached from my own machine. One way to achieve this is to set up a virtual machine in the cloud. The AWS Free Tier offers this with EC2 Instances. I’ve broken down the process into the following three steps:

Give your User EC2 Access

Give the admin user that you created before access to EC2 instances by attaching the AmazonEC2FullAccess policy in the user's profile (more information here).

Launch the Instance

To do this, go to Services -> EC2, scroll down a bit and click ‘Launch instance’.

Launching an EC2 instance from the AWS Console (Image by the Author)

On the next screen, give your instance a name and choose your preferred options. I worked with an Ubuntu instance, 64-bit, of the type ‘t2.micro’, which is free tier eligible. Create a new key pair (default settings are fine here), so that you will be able to connect to the instance later on. Save this key file in your project folder. Then scroll down and click the button ‘Launch instance’.
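
As an alternative to clicking through the console, the same kind of instance could also be launched with boto3; here is a rough sketch where the region, AMI ID and key pair name are placeholders you would need to fill in for your own account:

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: an Ubuntu 64-bit AMI for your region
    InstanceType="t2.micro",          # free tier eligible
    KeyName="<your-key-pair-name>",   # the key pair created above
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])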

Give your Instance S3 Access

You will want to use this instance to write to your S3 bucket, so the instance will require the permissions to do so. Standard AWS instructions to complete this process can be found here. I tried to simplify it a bit below.

First, you need to create an IAM instance profile. Go to Services -> IAM, in the Menu on the left click ‘Roles’ and then ‘Create role’. Select ‘AWS service’ and then ‘EC2’; click ‘Next’.

In the search bar, type “AmazonS3” and hit Enter. This shows you all the available options. I used the AmazonS3FullAccess policy to make sure the instance gets all the required permissions. Select it and click ‘Next’.

Displaying all the permission policies related to “AmazonS3” on the AWS Console (Image by the Author)

Give this role a meaningful name and click ‘Create role’. Now you can assign this role to any of your instances, as needed.

As a second step, attach this newly created role to your instance. Go to Services -> EC2, click ‘Instances’ and select your instance. Click on the ‘Actions’ drop down, select ‘Security’ and ‘Modify IAM role’. In the drop down on the next screen, you can select the role you just created. Finally, click ‘Update IAM Role’ to give your instance access to all your S3 buckets. This concludes the set-up and you can now start working with your instance!

Automating the Script on the Instance

For this next section I am again switching to my GNOME Terminal. These commands are in Bash, so if you're working on a Windows system, you may need to find an alternative that works on your OS.

Connect to your Instance

Open up two different terminal windows. Use one of them to run commands on your local machine and the other to connect to your instance. In the first window:

  • Make sure you, and only you, have full permissions on the project directory:
chmod 700 <path_to_project_folder>
  • Navigate to your project folder with all your files:
cd <path_to_project_folder>
  • Restrict access to the EC2 key file (which you stored here earlier):
chmod 400 <keyfile.pem>
  • Then connect to the instance using your key file and the Public IPv4 DNS (you can find it in the AWS console by clicking on the Instance ID):
ssh -i "<keyfile.pem>" ubuntu@<Public IPv4 DNS>

Here’s what my first terminal now looks like, just to give an idea of what you would expect to see:

Terminal window connected to my EC2 instance (Image by the Author)

As you can see in the last line, I now have access to my instance via this terminal, so I can run commands in the cloud. If this did not work for you, check out these AWS guidelines.

Copy Files and install the Required Packages

Now that I have access to my instance, I want to copy over a few files and install the required packages, which will ultimately enable me to run my Python scraping script on the instance.

In your second terminal window, navigate to the project folder:

cd <path_to_project_folder>

Copy the required files (requirements.txt & main.py), again using the key file and the instance’s Public IPv4 DNS:

scp -i <keyfile.pem> requirements.txt ubuntu@<Public IPv4 DNS>:~/
scp -i <keyfile.pem> src/main.py ubuntu@<Public IPv4 DNS>:~/

Now switch back to your first terminal window, where you are still connected to your instance. You can check if the files are available there (use ls), navigate through folders (cd <folder_name>), create folders (mkdir <folder_name>) and move files (mv <file_name>).

You now need to install all your packages from the requirements file, but you will require pip to do this. So first, follow these instructions from AWS to install pip on your instance (you can skip the first part about installing Python). Once you have pip available, you can install your packages with the following command:

pip install -r requirements.txt

Once you have your files and packages ready on your instance, you can move on to the final step.

Automate the Running of your Scraping Script

Now you can finally run your script on the instance:

python3 main.py

This should save a data file to your S3 bucket, so even if you do not get any error message, go and check that the file is actually there.
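
One way to check this without leaving the terminal is a short boto3 snippet, assuming boto3 is available and reusing the bucket and folder names from above:

import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="srf-news-snippets", Prefix="data/")
for obj in response.get("Contents", []):
    # Print the key, size and upload time of each file in the data folder
    print(obj["Key"], obj["Size"], obj["LastModified"])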

To automate the running of the script, you can set up a simple cron job on the instance. First of all, check what timezone your instance is running in:

timedatectl

In my case this did not match my own timezone, so I used the following commands to list the available time zones and then set the instance to be in the same zone as I am.

timedatectl list-timezones
sudo timedatectl set-timezone Europe/Zurich

Then I set up a trigger to run the script daily at 11:50 pm. To do this, I opened the cron table with crontab -e and added the line 50 23 * * * python3 main.py, which runs main.py every day at 23:50. I saved the changes with Ctrl+O and exited the editor with Ctrl+X. If you're not sure how to translate a specific timing into the cron syntax, use this handy resource.

This concludes the second part of the automated scraping set-up. I hope this documentation of the process that I followed will help you along the way to automating your own tasks.

Conclusion

To sum up, in this article I first showed how to scrape a website and refactor the code into a clean script. I then ran this script locally and saved the resulting file both to a folder on my machine and to an S3 bucket in the cloud. Finally, I automated the running of the script using a cron job on an EC2 instance.

This simple setup has enabled me to scrape data from a website daily at the same time, without even thinking about it. In case you’re wondering what my data looks like, here’s an example data file. After a few months, I will have a unique, sizeable and interesting German text data set, on which I can try out different NLP tasks such as Named Entity Recognition, Topic Clustering and Sentiment Analysis.

I hope you found this tutorial useful! All the materials used in this article are available in this repository on my GitHub. I'll be happy to respond to your questions and comments, so feel free to reach out to me on LinkedIn!


Angela Niederberger

I’m a Data Scientist with a Digital Marketing background, writing about personal DS projects and interested in all creative things!