Automate the data extraction using AWS — S3, Lambda, CloudWatch Events [Part 1]
Space- and time-efficient!
Why do you need it?
Initially I was doing the data extraction all by myself: running a cron job every day at 10 am and storing the results on my local machine. It was a tedious process until I found out that these cloud workers can do it for me at ZERO COST (on a free tier account)! It saves a lot of your productive time (and your hard-disk space).
This cloud automation helped me save a lot of time while completing my project: NewsOptimism
What is this article all about?
This article will walk you through the process of running a script on the cloud. The script extracts data from the web and stores it in an S3 bucket. I will walk you through a sample project.
This article is more of a quick-starter for you to take your first step towards using cloud applications! I am going to use Python 3.
Note: This article is not going to explain all the settings of AWS, but it covers enough to get you on board 😉
Steps:
- Part 1: Setup — S3, IAM, Lambda, CloudWatch.
- Part 2: Creating a function/script.
The first step is the hardest!
Project Objective
I want to collect the article links from the Hacker News website on a daily basis. However, I only want to collect those articles that have a score of more than 100 points.
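To make the objective concrete, here is a minimal sketch of the extraction logic using the public Hacker News Firebase API (`topstories.json` and `item/{id}.json` are the real endpoints; the helper names and the `limit` parameter are my own choices, and the actual script is built in Part 2):

```python
import json
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    """Fetch a URL and parse the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode())

def filter_high_score(items, min_score=100):
    """Keep only stories above min_score that link out to an article."""
    return [
        {"title": it["title"], "url": it["url"], "score": it["score"]}
        for it in items
        if it.get("score", 0) > min_score and "url" in it
    ]

def collect_links(limit=30, min_score=100):
    """Pull the top stories and keep those above the score threshold."""
    ids = fetch_json(f"{HN_API}/topstories.json")[:limit]
    items = [fetch_json(f"{HN_API}/item/{i}.json") for i in ids]
    return filter_high_score(items, min_score)
```

The score filter is kept as a pure function so it can be tested without hitting the network.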
SETUP PHASE
Before you start, you need a registered AWS account (a free tier account is enough).
S3 — SETUP
1. Launch S3 from the AWS console.
2. Create an S3 bucket (I already have 2 buckets).
3. The name of the bucket must be globally unique (hackernews-bucket).
4. Done 👏
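Once the bucket exists, the script can write to it through boto3's `put_object` call. A minimal sketch (the bucket name comes from the step above; the date-stamped key scheme and function names are my own convention, and boto3 needs valid AWS credentials, so it is imported lazily):

```python
from datetime import date

BUCKET = "hackernews-bucket"  # the bucket created above

def object_key_for(day):
    """Build a date-stamped key so each daily run gets its own object."""
    return f"hackernews/{day.isoformat()}.json"

def upload_to_s3(body, day=None):
    """Upload a JSON string to the bucket under a date-stamped key.
    Requires boto3 and AWS credentials, hence the lazy import."""
    import boto3
    s3 = boto3.client("s3")
    key = object_key_for(day or date.today())
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())
    return key
```

Date-stamped keys keep one object per daily run instead of overwriting a single file.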
IAM — SETUP
1. Launch IAM from your AWS Console.
2. Click on Roles and then Create role.
3. Choose Lambda.
4. Now search for and select the AmazonS3FullAccess policy, then press Next.
5. Give the role a name (I am going with news_scraper_role) and click Create role.
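For the curious: choosing Lambda in step 3 attaches the standard Lambda trust policy to the role, which is what lets the Lambda service assume it:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

You never have to write this by hand here; the console generates it for you.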
LAMBDA — SETUP
1. Launch Lambda from the AWS console and click Create function (I have 4 functions; you might see none if you're new).
2. Give your function a name (s3-news-trigger).
3. Choose your runtime (Python 3.6) and select the custom role option.
4. Select the IAM role that you created during the IAM setup stage (news_scraper_role).
5. You have successfully created your Lambda function. 👏
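At this point the function only contains the default stub. A minimal sketch of the handler shape AWS Lambda expects (the `lambda_handler(event, context)` signature is the real Python Lambda convention; the body here is a placeholder, and the real scraping logic is added in Part 2):

```python
def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes on each trigger.
    The real scraping/upload logic comes in Part 2; this stub
    just proves the wiring works."""
    print("Triggered with event:", event)
    return {"statusCode": 200, "body": "Hello from s3-news-trigger"}
```

You can hit the Test button in the console with an empty `{}` event to confirm the function runs.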
CloudWatch Events-Setup
1. Go to your Lambda function (s3-news-trigger) and add CloudWatch Events from the Add triggers section.
2. Click on the CloudWatch Events trigger to configure it. Create a new rule, choose Schedule expression, and set your schedule; I am triggering it every day at 10 pm UTC (check out the cron syntax).
3. Add it and save it! You're done 👏
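For reference, the schedule expression for every day at 10 pm UTC looks like this (AWS cron fields are minutes, hours, day-of-month, month, day-of-week, year; the `?` is required in either the day-of-month or day-of-week field):

```
cron(0 22 * * ? *)   # every day at 22:00 UTC
rate(1 day)          # alternative fixed-rate form
```

The `rate` form is simpler but fires relative to when the rule was created, while `cron` pins the run to an exact time of day.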
Once you’re done with the setup, move on to Part 2, Creating the Lambda function: https://medium.com/@hramachandran/automate-the-data-extraction-using-aws-s3-lambda-cloudwatch-events-part-2-254002056a97
References:
- Free usage: https://aws.amazon.com/free/
- API usage for Amazon S3: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
- AWS Lambda pricing: https://aws.amazon.com/lambda/pricing/
- Different ways for Job Scheduling in python: https://raybuhr.github.io/talks/python-job-scheduling/python-job-scheduling.html#/