Automate the data extraction using AWS — S3, Lambda, CloudWatch Events [Part 2]

Harish Gandhi
4 min read · Nov 4, 2018


Objective

I want to collect news from the Hacker News website on a daily basis. However, I only want to collect articles that have a score of more than 100 points.

Before you start, check out Part 1, the setup phase: https://medium.com/@hramachandran/automate-the-data-extraction-using-aws-s3-lambda-cloudwatch-events-part-1-563a00118394

Now let's create the Lambda function code.

Creating a Deployment Package

Step 1: Build a python Script

Before we start, we need to import these packages: boto3, BeautifulSoup, requests, time, and json.

import boto3
import json
import requests
from bs4 import BeautifulSoup
import time

I used three functions here:

  1. lambda_handler(event, context): This is the function that gets invoked when the service executes the code.

def lambda_handler(event, context):
    data = collect_data_from_website()
    for key, val in data.items():
        if int(val["points"].split(" ")[0]) > 100:
            file_name = time.strftime("%m%d") + "/" + str(val["title"])
            save_file_to_s3('hackernews-bucket', file_name, data[key])

  2. collect_data_from_website(): This scrapes the Hacker News front page and returns a dictionary of articles with their titles, links, and scores; the filtering by score (>100) happens in lambda_handler. The full implementation is in the complete listing below.

  3. save_file_to_s3(bucket, file_name, data): This serializes the article dictionary to JSON and writes it to the given S3 bucket.

def save_file_to_s3(bucket, file_name, data):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, file_name)
    obj.put(Body=json.dumps(data))

Since this project needs third-party libraries (requests and BeautifulSoup), we can't simply paste the code into the Lambda console's inline editor; we need to deploy it as a zip package along with the libraries.

Step 2: Create the Zip file

There are various ways of deploying the package. I am going to show how to do it from a Linux terminal using a virtual environment.

Steps:

  1. Create a virtual environment in your project directory (virtualenv myenv)
  2. Activate it (source myenv/bin/activate) and place your lambda_function.py script there
  3. Install the third-party libraries into the same directory (pip3 install requests beautifulsoup4 -t .)
  4. Zip the package contents along with your script file (lambda_function.py), as shown in the sketch below
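
If you prefer to stay in Python for step 4, here is a minimal sketch that builds the deployment zip; the output file name lambda_package.zip and the myenv exclusion are my own choices, not something from the original workflow.

import os
import zipfile

def build_package(src_dir=".", out_file="lambda_package.zip", skip_dirs=("myenv",)):
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            # don't bundle the virtualenv itself, only the libraries installed with -t .
            dirs[:] = [d for d in dirs if d not in skip_dirs]
            for name in files:
                if name == out_file:
                    continue  # skip the archive we are writing
                path = os.path.join(root, name)
                # store paths relative to src_dir so lambda_function.py
                # sits at the root of the archive, where Lambda looks for it
                zf.write(path, os.path.relpath(path, src_dir))

build_package()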

Step 3: Upload and test

  1. Before uploading the zip file, make sure the Handler field matches filename.handler_function (here, lambda_function.lambda_handler)
  2. Save and test your code
  3. A prompt window will ask you to name a test event; since the handler ignores the event payload, an empty JSON object ({}) works fine
  4. TEST it! 😏

Note: Increase the timeout in Basic settings (scroll down on your Lambda function page); the default of 3 seconds is too short, and the execution will time out.
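
If you'd rather set the timeout from code than from the console, a boto3 sketch like this works; the function name 'hackernews-scraper' is a placeholder for whatever you named your Lambda.

import boto3

client = boto3.client('lambda')
client.update_function_configuration(
    FunctionName='hackernews-scraper',  # placeholder: use your function's name
    Timeout=60  # seconds; the default of 3 is too short for scraping plus S3 writes
)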

Step 4: Output

  1. Check the S3 bucket (hackernews-bucket)
  2. You will see the articles with a score greater than 100 points
  3. Open any of the files and use the Select from tab to get a JSON preview
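
If you'd rather verify from code than click through the console, here is a quick boto3 sketch that lists today's objects and previews the first one as JSON; the bucket name comes from the article, and the prefix mirrors the MMDD/ folder the handler writes.

import json
import time
import boto3

s3 = boto3.client('s3')
prefix = time.strftime("%m%d") + "/"  # same date prefix the handler uses
resp = s3.list_objects_v2(Bucket='hackernews-bucket', Prefix=prefix)
for obj in resp.get('Contents', []):
    print(obj['Key'])

# preview the first object as JSON, if any were found
if resp.get('Contents'):
    key = resp['Contents'][0]['Key']
    body = s3.get_object(Bucket='hackernews-bucket', Key=key)['Body'].read()
    print(json.dumps(json.loads(body), indent=2))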

Full code:

import boto3
import json
import requests
from bs4 import BeautifulSoup
import time

def lambda_handler(event, context):
    data = collect_data_from_website()
    for key, val in data.items():
        # keep only articles scoring more than 100 points
        if int(val["points"].split(" ")[0]) > 100:
            # store under a MMDD/ prefix so each day gets its own folder
            file_name = time.strftime("%m%d") + "/" + str(val["title"])
            save_file_to_s3('hackernews-bucket', file_name, data[key])

def collect_data_from_website():
    data = {}
    page = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.select("td")
    key_title = ""
    for row in table:
        score = row.find("span", {"class": "score"})
        link = row.find("a", {"class": "storylink"})
        if link:
            key_title = link.get_text()
            data[key_title] = {
                "title": key_title,
                "link": link.get('href'),
                "points": "0 points"  # default until the score cell is found
            }
        if score:
            if key_title in data:
                data[key_title]["points"] = score.get_text(strip=True)

    return data

def save_file_to_s3(bucket, file_name, data):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, file_name)
    obj.put(Body=json.dumps(data))
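
Before packaging, you can also run a quick local smoke test, assuming the script is saved as lambda_function.py and your AWS credentials are configured; the handler ignores both of its arguments, so any payload works.

from lambda_function import lambda_handler

# invoke the handler the way Lambda would; event and context are unused
lambda_handler({}, None)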

References:

  1. Free AWS usage: https://aws.amazon.com/free/
  2. API usage for Amazon S3: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
  3. Pip install documentation: https://pip.pypa.io/en/stable/reference/pip_install/
  4. Different ways for job scheduling in Python: https://raybuhr.github.io/talks/python-job-scheduling/python-job-scheduling.html#/
  5. No module named lambda function: https://www.iheavy.com/2016/02/14/getting-errors-building-amazon-lambda-python-functions-help-howto/
  6. Resolve import errors during deployment: https://davidhamann.de/2017/01/27/import-issues-running-python-aws-lambda/

Thanks for reading!

If you liked this article, share it with your community and consider exploring what happens when you click the clap icon more than once 👏


Harish Gandhi

I am a CS graduate with expertise in data exploration through software and machine learning techniques.