Automate the data extraction using AWS — S3, Lambda, CloudWatch Events [Part 2]

Harish Gandhi
4 min read · Nov 4, 2018


Objective

I want to collect news from the Hacker News website on a daily basis. However, I only want to collect articles that have a score of more than 100 points.

Before you start, check out Part 1, the setup phase: https://medium.com/@hramachandran/automate-the-data-extraction-using-aws-s3-lambda-cloudwatch-events-part-1-563a00118394

Now let's create the Lambda function code.

Creating a Deployment Package

Step 1: Build a python Script

Before we start, we need to import these packages: boto3, BeautifulSoup, requests, time, and json.

import boto3
import json
import requests
from bs4 import BeautifulSoup
import time

I used three functions here:

  1. lambda_handler(event, context): This is the function that gets invoked when the service executes the code.

def lambda_handler(event, context):
    data = collect_data_from_website()
    for key, val in data.items():
        if int(val["points"].split(" ")[0]) > 100:
            file_name = time.strftime("%m%d") + "/" + str(val["title"])
            save_file_to_s3('hackernews-bucket', file_name, data[key])

  2. collect_data_from_website(): This scrapes the Hacker News front page and returns a dictionary of articles with their titles, links, and scores; the filtering by score (>100) happens in lambda_handler. The full implementation is in the complete listing below.

  3. save_file_to_s3(bucket, file_name, data): This serializes the article dictionary to JSON and writes it to the given S3 bucket.

def save_file_to_s3(bucket, file_name, data):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, file_name)
    obj.put(Body=json.dumps(data))

Since this project needs third-party libraries (requests and BeautifulSoup), we can't simply paste the code into the Lambda console's inline editor; we need to deploy it as a zip package along with the libraries.

Step 2: Create the Zip file

There are various ways of deploying the package. I am going to show how to do it from a Linux terminal using a virtual environment.

Steps:

  1. Create a virtual environment in your project directory (virtualenv myenv)
  2. Activate it (source myenv/bin/activate) and place your lambda_function.py script there
  3. Install the third-party libraries into the same directory (pip3 install requests beautifulsoup4 -t .)
  4. Zip the package contents along with your script file (lambda_function.py), as shown in the sketch below
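
If you prefer to stay in Python for step 4, here is a minimal sketch that builds the deployment zip; the output file name lambda_package.zip and the myenv exclusion are my own choices, not something from the original workflow.

import os
import zipfile

def build_package(src_dir=".", out_file="lambda_package.zip", skip_dirs=("myenv",)):
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            # don't bundle the virtualenv itself, only the libraries installed with -t .
            dirs[:] = [d for d in dirs if d not in skip_dirs]
            for name in files:
                if name == out_file:
                    continue  # skip the archive we are writing
                path = os.path.join(root, name)
                # store paths relative to src_dir so lambda_function.py
                # sits at the root of the archive, where Lambda looks for it
                zf.write(path, os.path.relpath(path, src_dir))

build_package()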

Step 3: Upload and test

  1. Before uploading the zip file, make sure the Handler field matches filename.handler_function (here, lambda_function.lambda_handler)
  2. Save and test your code
  3. A prompt window will ask you to name a test event; since the handler ignores the event payload, an empty JSON object ({}) works fine
  4. TEST it! 😏

Note: Increase the timeout in Basic settings (scroll down on your Lambda function page); the default of 3 seconds is too short, and the execution will time out.
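
If you'd rather set the timeout from code than from the console, a boto3 sketch like this works; the function name 'hackernews-scraper' is a placeholder for whatever you named your Lambda.

import boto3

client = boto3.client('lambda')
client.update_function_configuration(
    FunctionName='hackernews-scraper',  # placeholder: use your function's name
    Timeout=60  # seconds; the default of 3 is too short for scraping plus S3 writes
)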

Step 4: Output

  1. Check the S3 bucket (hackernews-bucket)
  2. You will see the articles with a score greater than 100 points
  3. Open any of the files and use the Select from tab to get a JSON preview
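
If you'd rather verify from code than click through the console, here is a quick boto3 sketch that lists today's objects and previews the first one as JSON; the bucket name comes from the article, and the prefix mirrors the MMDD/ folder the handler writes.

import json
import time
import boto3

s3 = boto3.client('s3')
prefix = time.strftime("%m%d") + "/"  # same date prefix the handler uses
resp = s3.list_objects_v2(Bucket='hackernews-bucket', Prefix=prefix)
for obj in resp.get('Contents', []):
    print(obj['Key'])

# preview the first object as JSON, if any were found
if resp.get('Contents'):
    key = resp['Contents'][0]['Key']
    body = s3.get_object(Bucket='hackernews-bucket', Key=key)['Body'].read()
    print(json.dumps(json.loads(body), indent=2))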

Full code:

import boto3
import json
import requests
from bs4 import BeautifulSoup
import time

def lambda_handler(event, context):
    data = collect_data_from_website()
    for key, val in data.items():
        # keep only articles scoring more than 100 points
        if int(val["points"].split(" ")[0]) > 100:
            # store under a MMDD/ prefix so each day gets its own folder
            file_name = time.strftime("%m%d") + "/" + str(val["title"])
            save_file_to_s3('hackernews-bucket', file_name, data[key])

def collect_data_from_website():
    data = {}
    page = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.select("td")
    key_title = ""
    for row in table:
        score = row.find("span", {"class": "score"})
        link = row.find("a", {"class": "storylink"})
        if link:
            key_title = link.get_text()
            data[key_title] = {
                "title": key_title,
                "link": link.get('href'),
                "points": "0 points"  # default until the score cell is found
            }
        if score:
            if key_title in data:
                data[key_title]["points"] = score.get_text(strip=True)

    return data

def save_file_to_s3(bucket, file_name, data):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, file_name)
    obj.put(Body=json.dumps(data))
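
Before packaging, you can also run a quick local smoke test, assuming the script is saved as lambda_function.py and your AWS credentials are configured; the handler ignores both of its arguments, so any payload works.

from lambda_function import lambda_handler

# invoke the handler the way Lambda would; event and context are unused
lambda_handler({}, None)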

References:

  1. Free AWS usage: https://aws.amazon.com/free/
  2. API usage for Amazon S3: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
  3. Pip install documentation: https://pip.pypa.io/en/stable/reference/pip_install/
  4. Different ways for job scheduling in Python: https://raybuhr.github.io/talks/python-job-scheduling/python-job-scheduling.html#/
  5. No module named lambda function: https://www.iheavy.com/2016/02/14/getting-errors-building-amazon-lambda-python-functions-help-howto/
  6. Resolve import errors during deployment: https://davidhamann.de/2017/01/27/import-issues-running-python-aws-lambda/

Thanks for reading!

If you liked this article, share it with your community and consider exploring what happens when you click the clap icon more than once 👏


Harish Gandhi

I am a CS graduate with expertise in data exploration through software and machine learning techniques.