Set Up Cloud-based Data Scraping for Free Using AWS
For our Exploratory Data Analysis project at Columbia (as part of the M.S. in Data Science program), we chose to analyze viewership data from the game streaming website twitch.tv.
The data for this project came from the REST API provided by Twitch. Twitch gives clear instructions on how to structure GET requests, and it returns the data in JSON format. Since the project required analysis of live Twitch data, we wrote a Python script to collect it. Although Twitch provides this data via easily accessible APIs, we needed a headless machine to run our script every 10 minutes. In essence, we needed to set up a periodic data scraper.
Data Scraping on AWS
We needed a machine, preferably Linux-based, that could run Python scripts efficiently. We did not need much CPU horsepower or RAM. As is the case with most data scraping tasks, a small box would do the job just fine. This is where AWS came in.
There are multiple cloud-based compute/storage platforms available on the web, and we chose AWS for its ease of setup, the availability of educational credits for students, and the AWS Free Tier.
The instances we set up on AWS all ran on the free tier. So, after running the data scraping job for around one month, we were charged $0.
What were we trying to achieve?
We set up our Python script to run on an Amazon EC2 instance and scheduled it to execute every 10 minutes. Our data flow is illustrated in the image above.
1. The script makes API requests to the external API (Twitch in our case).
2. Once the external API responds, we process the data and keep the fields that are required.
3. We then push the data into Amazon RDS (Relational Database Service), which makes it easy to set up and manage a MySQL instance in the cloud.
At the end of this, when it was time to do some data analysis, we just needed to connect to the MySQL instance and pull the data! We even extracted the consolidated data as CSV files.
The cloud-based data scraping guide
This guide details our experiences in setting up data scraping in the cloud. The topics covered are:
- Setting up an EC2 instance
- Setting up RDS
- Configuring the machines
- Configuring data scraping
- Pushing data into RDS
- Scheduling jobs
To follow along, make sure you have a functioning AWS account. The signup process is straightforward, and educational discounts are available for university students. Additionally, you can obtain up to $150 in credit through the GitHub Education pack.
Note: This guide utilizes the AWS ‘free-tier’, and under this pricing, you will not be charged for the machines we will set up.
Setting up an EC2 instance
Once your account is created, click here to create and launch the AWS instance. Click ‘Launch Instance’ to set up the machine.
We’ll use the ‘Ubuntu Server’ machine image (AMI) for the web-scraping machine. We chose this because we were very familiar with Ubuntu’s package manager, and troubleshooting issues on Ubuntu is simpler, due to the huge community support available.
Since the machine will just run a lightweight Python script every 10 minutes or so, we don’t need much computing power. Choosing the ‘t2.micro’ instance type keeps us in the free tier.
We are now ready to launch the instance. No further customization is needed, but feel free to configure additional storage and security groups. Clicking ‘Review and Launch’ brings up the dialog below.
We need to set up our SSH key pair. If you’re unfamiliar with this, read more here. Give the key pair a name, and download the private key. We’ll need this to SSH into the machine.
Once this is done, click ‘Launch Instance’ to launch your EC2 instance! The machine will take a few minutes to spin up. Once it is live, your dashboard should look something like this:
You will use the public IP listed here to SSH into the machine.
Setting up an RDS instance
Setting up a database instance using RDS is just as straightforward. Launch an instance here.
For our purposes, we can just use a MySQL database instance. It is free tier eligible, and provides all the features we need. After this, set the database password. (We stuck with the 5GB default storage, since this was more than we needed.)
Once you launch your instance, you should see it running on your dashboard.
Configuring the machines & security groups
You can now access the EC2 instance via SSH (make sure you have the SSH private key from the first step). Right-click on the EC2 instance on AWS and click ‘connect’. Amazon provides clear details and instructions on how to connect to the machine.
When we began connecting to our machines, we were able to reach the EC2 machine via SSH without any trouble.
However, connections from EC2 to the RDS instance were blocked. We realized that we needed to open up the rules on the security group. Read more here and allow access. Make sure you allow inbound access from your local laptop/desktop, as well as from the EC2 instance.
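To tell a security group problem apart from a wrong password, it can help to test plain TCP reachability first. The sketch below is our own illustration, not part of the original project: it assumes the database listens on MySQL's default port 3306, and the helper name can_reach is hypothetical.

```python
import socket


def can_reach(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example call (the endpoint is a placeholder; use the one on your RDS dashboard):
#   can_reach("your-db-endpoint.example.com")
```

If this returns False from the EC2 machine but the RDS instance shows as running, the inbound rules on the security group are the first thing to check.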
Now, to initially configure the RDS instance and set up a database, tables, etc., you can either use MySQL Workbench or connect to the database from the command line.
You can also connect to the EC2 machine and use it as if it were a local machine: set up the libraries you need, install Python packages via pip, write code, etc.
Configuring the Scraping Script
We chose to write our data scraping code in Python.
Our script makes REST API calls and then processes the JSON responses.
We used a few Python packages to help us perform web requests and handle the JSON that is returned:
requests, json, urllib
We used the urllib and requests libraries to send GET requests to pre-defined URLs. The responses were then parsed with the json library, which turns the text into a Python dictionary whose sections you can reference by name.
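As a rough illustration of this step, here is a minimal sketch that uses only the standard library. It is not our actual script: the field names "streams", "game" and "viewers" are hypothetical stand-ins for whatever your API returns, and real APIs often require authentication headers, which are omitted here.

```python
import json
import urllib.request


def fetch_text(url, timeout=10.0):
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_rows(body):
    """Parse a JSON body and keep only the fields we want to store."""
    payload = json.loads(body)
    rows = []
    # "streams", "game" and "viewers" are hypothetical field names.
    for stream in payload.get("streams", []):
        rows.append({"game": stream.get("game"), "viewers": stream.get("viewers")})
    return rows
```

Keeping the parsing separate from the network call makes it easy to try extract_rows on a saved response before pointing the script at the live API.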
We suggest putting the output into a pandas DataFrame, which makes it convenient to write to the database, as described in the next section.
Pushing Data into RDS
If our data is in a pandas DataFrame, it is very easy to write to a MySQL instance from a script running on EC2.
We used the ‘mysql.connector’ and ‘sqlalchemy’ libraries in Python to push data into MySQL hosted as part of RDS. The API is quite straightforward. Once we have our connection object, we simply invoke the .to_sql() method of a pandas DataFrame to write directly to the database.
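A minimal sketch of what this can look like is shown below. It is an illustration under assumptions rather than our exact code: the helper names, the table name "snapshots", and the choice to append on every run are all hypothetical, and you would supply your own RDS endpoint and credentials.

```python
from urllib.parse import quote


def build_mysql_url(user, password, host, database, port=3306):
    """Build a SQLAlchemy URL that uses the mysql-connector-python driver."""
    # Percent-encode credentials so characters like "@" do not break the URL.
    return (
        f"mysql+mysqlconnector://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def push_rows(rows, url, table="snapshots"):
    """Append a list of row dictionaries to a table via DataFrame.to_sql()."""
    # Imported here so the URL helper above works without pandas installed.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    frame = pd.DataFrame(rows)
    # if_exists="append" adds rows instead of failing when the table exists.
    frame.to_sql(table, con=engine, if_exists="append", index=False)
    return len(frame)
```

Appending suits a scraper that adds a new snapshot on each run; passing if_exists="replace" instead would overwrite the history each time.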
Scheduling Jobs at fixed time intervals (CRON job)
The data we pulled comes from a REST API as point-in-time snapshots. So, to build a history over time, we needed to run our scraper at fixed time intervals to pull data from the API and then write to the database.
We set up a cron job to do this. A cron job let us execute a shell script at fixed time intervals, and we invoked our Python scraper from inside that shell script.
A helpful tutorial on cron jobs can be found here: https://www.taniarascia.com/setting-up-a-basic-cron-job-in-linux/
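To make the scheduling concrete, the sketch below shows an example crontab entry in a comment, plus a small Python helper that stamps every row from one run with the same 10-minute snapshot time. This is an assumption-laden illustration: the file paths, the "snapshot_at" column and the helper names are hypothetical, and stamping rows this way is one possible design rather than a description of our script.

```python
from datetime import datetime

# A crontab entry that runs a wrapper script every 10 minutes could look like:
#   */10 * * * * /home/ubuntu/run_scraper.sh >> /home/ubuntu/scraper.log 2>&1
# Both paths are hypothetical; the wrapper would call the Python scraper.


def snapshot_time(now, interval_minutes=10):
    """Round a timestamp down to the start of its scheduling interval."""
    floored_minute = now.minute - now.minute % interval_minutes
    return now.replace(minute=floored_minute, second=0, microsecond=0)


def tag_rows(rows, now):
    """Attach the same snapshot timestamp to every row from one run."""
    stamp = snapshot_time(now).isoformat(sep=" ")
    return [{**row, "snapshot_at": stamp} for row in rows]
```

Rounding down means rows from a run that starts a few seconds late still group under the same interval, which simplifies time-series queries later. The helper assumes an interval that divides 60 evenly.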
Once this was completed, we had a data scraping system running entirely in the cloud. The screenshots below show the monitoring stats from our instances. (Notice how usage spikes every 10 minutes.)
For reference, our code is available here and our final analysis is here. This blog post was very high level, and aimed at just setting up instances and getting your hands dirty. It should help you get started on setting up your own cloud-based data scraping service, all for free!
-Shashank Rao and Cyrus Lala.