Automate the data extraction using AWS: S3, Lambda, CloudWatch Events [Part 1]

Harish Gandhi
Nov 4, 2018

Space and Time efficient!

Why do you need it?

Initially I ran the data extraction pipeline all by myself: a cron job every day at 10 am, with the results stored on my local machine. It was a tedious process until I found out about these workers that can do it for me at ZERO COST (on a free tier account)! They save a lot of your productive time (and your hard-disk space).

This cloud automation saved me a lot of time in completing my project: NewsOptimism

BEWARE! THERE IS NO FREE LUNCH EVERY TIME!

What is this article all about?

This article will walk you through the process of running a script on the cloud. The script extracts data (from the web) and stores it in an S3 bucket. I will walk you through a sample project.

This article is more of a quick-start guide to help you take your first step towards using cloud applications! I am going to use Python 3.

Note: This article does not explain every AWS setting, but it covers enough to get you on board 😉

Steps:

  1. Part 1: Setup of S3, IAM, Lambda, and CloudWatch.
  2. Part 2: Creating a function/script.

The first step is the hardest!

Project Objective

I want to collect article links from the Hacker News website on a daily basis. However, I only want the articles that have a score of more than 100 points.

HackerNews
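The selection rule in the objective can be sketched in a few lines of Python. This is only an illustration of the "more than 100 points" filter, not the script from Part 2: the function name `select_popular` and the `url`/`score` field names are assumptions made up for this example.

```python
from typing import Dict, List

MIN_SCORE = 100  # threshold taken from the project objective


def select_popular(stories: List[Dict], min_score: int = MIN_SCORE) -> List[str]:
    """Return the links of stories scoring strictly more than min_score.

    Each story is assumed to be a dict with hypothetical "url" and "score" keys.
    Stories without a link are skipped.
    """
    return [
        story["url"]
        for story in stories
        if story.get("url") and story.get("score", 0) > min_score
    ]


# Made-up sample data, only to show the shape of the input.
sample = [
    {"url": "https://example.com/a", "score": 250},
    {"url": "https://example.com/b", "score": 100},  # exactly 100 is not enough
    {"url": "https://example.com/c", "score": 42},
]
popular = select_popular(sample)
```

With the sample above, only the first link passes the filter, because the rule is "more than 100", not "100 or more".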

SETUP PHASE

Before you start, you need a registered AWS account (a free tier account is enough).

S3 — SETUP

  1. Launch S3 from the AWS console.

  2. Create an S3 bucket (I already have 2 buckets).

  3. The bucket name must be globally unique (I am using hackernews-bucket).

  4. Done 👏
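If you prefer code over console clicks, the S3 steps above can be roughly sketched with boto3 (the AWS SDK for Python). Treat this as a sketch under assumptions: it assumes boto3 is installed and your AWS credentials are configured, the helper names are made up for this example, and the name check covers only the basic naming rules, not every rule AWS enforces.

```python
import re

# Simplified check: 3 to 63 characters, lowercase letters, digits, dots and
# hyphens, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified naming check."""
    return _BUCKET_NAME.fullmatch(name) is not None


def create_bucket(name: str, region: str = "us-east-1") -> None:
    """Create the bucket; fails if the name is already taken by anyone."""
    import boto3  # imported here so the name check works without boto3

    if not is_valid_bucket_name(name):
        raise ValueError("invalid bucket name: " + name)

    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        # us-east-1 is the default and takes no location constraint.
        s3.create_bucket(Bucket=name)
    else:
        s3.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
```

Calling `create_bucket("hackernews-bucket")` would most likely fail for you, since bucket names are shared across all AWS accounts; pick your own unique name.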

IAM — SETUP

  1. Launch IAM from your AWS Console.

  2. Click on Roles, then Create role.

  3. Choose Lambda as the service that will use the role.

  4. Search for and select the AmazonS3FullAccess policy, then press Next.

  5. Give the role a name (I am going with news_scraper_role) and click Create role.
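For reference, here is roughly what the IAM steps above amount to when done with boto3 instead of the console. This is a sketch, not what the console does internally: it assumes boto3 is installed and credentials are configured, and the function name `create_scraper_role` is made up for this example. Choosing Lambda in step 3 corresponds to the trust policy below, which lets the Lambda service assume the role.

```python
import json

# Trust policy: allows the Lambda service to assume this role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# ARN of the AWS managed policy selected in step 4.
S3_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AmazonS3FullAccess"


def create_scraper_role(role_name: str = "news_scraper_role") -> str:
    """Create the role, attach the S3 policy, and return the role ARN."""
    import boto3  # imported here so the constants above load without boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=S3_FULL_ACCESS_ARN)
    return role["Role"]["Arn"]
```

AmazonS3FullAccess is convenient for a first project, but it grants access to every bucket in the account; a policy limited to one bucket is a tighter choice once things work.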

LAMBDA — SETUP

  1. Click Create function (I have 4 functions; you might see none if you’re new).

  2. Give the function a name (I am using s3-news-trigger).

  3. Choose your runtime (Python 3.6) and select the custom role.

  4. Select the IAM role that you created (news_scraper_role) during the IAM setup stage.

  5. Your Lambda function has been created successfully. 👏
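Once the function exists, Lambda runs a Python function that takes an `event` and a `context` argument; with the console defaults this is `lambda_handler` in `lambda_function.py`. A minimal placeholder looks something like the sketch below. The body is an assumption for illustration only (the message text is made up), and the real scraping logic is the subject of Part 2.

```python
import json


def lambda_handler(event, context):
    """Entry point that Lambda calls on every trigger.

    event:   the payload from the trigger (a scheduled event later on)
    context: runtime information supplied by Lambda (unused here)
    """
    # Placeholder body: the scraping and the S3 upload would go here.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "function ran"}),
    }
```

You can paste this into the inline editor and press Test in the console to confirm that the function and its role are wired up before adding any real logic.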

Lambda Setup

CloudWatch Events-Setup

  1. Go to your Lambda function (s3-news-trigger) and select CloudWatch Events from the Add triggers section.

  2. Click on CloudWatch Events to configure it. Create a new rule, choose Schedule expression, and enter your schedule. I am triggering every day at 10 pm UTC (check out the cron syntax).

  3. Add it and save it! You’re done 👏
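The schedule expression is the part that trips people up, because the AWS cron format has six fields (minutes, hours, day of month, month, day of week, year) and requires a `?` in either the day-of-month or the day-of-week field. The small helper below builds a daily expression; the function name `daily_schedule` is made up for this example.

```python
def daily_schedule(hour_utc: int, minute: int = 0) -> str:
    """Build an AWS schedule expression that fires once a day (UTC).

    Field order: minutes hours day-of-month month day-of-week year
    """
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # "*" = every day of the month and every month, "?" = no specific weekday.
    return "cron({} {} * * ? *)".format(minute, hour_utc)


# Every day at 10 pm UTC, as used in this article.
expression = daily_schedule(22)
```

Here `expression` is `cron(0 22 * * ? *)`, which is what goes into the Schedule expression box. Schedules are always evaluated in UTC, so convert from your local time first.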
