Building a scraper Slack bot using Amazon EventBridge

Mart Noten
NBTL

--

I’m a big fan of vinyl records. I’m not sure where I got it from, but I’m picking up something new at least once a month. When buying these records, I prefer to stay away from big shops like Amazon and go for local dealers instead.

However, most of these local shops don’t offer mobile applications. They’ll most likely have a webshop with the latest offers, though. So, time to build a scraper that sends me these latest offers.

Photo by Mick Haupt on Unsplash

This post is an excerpt of the original on the NBTL blog. If you’re looking for more content and the full code examples: check out the original post.

Building blocks of the solution

AWS Lambda & Amazon DynamoDB are essential services for computation and database storage.

Amazon EventBridge is an architecture-changing tool, and it will transform the way you think about Serverless development. It helps you build event-driven applications at scale across AWS.

AWS CDK is here to stay. It allows you to define your cloud application resources using familiar programming languages. If AWS is a toolbox, CDK is the machinery to operate it. You can find our introductory post here if you haven’t used it before.

Building a Serverless scraper that sends the newest items to Slack

Using BeautifulSoup to fetch the latest offers

In our AWS Lambda function, which uses the Python runtime, we’ll scrape the content of the webshop page. To help us with that, we can use a library called BeautifulSoup. The following snippet allows us to grab all the artists, titles, prices and images from the webshop page.

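from urllib.request import urlopen
from bs4 import BeautifulSoup

page = 1  # assumption: the number of the listing page to scrape
url = f"https://www.xxxxxxx.nl/xxxx-xxxx?page={page}"
response = urlopen(url)  # separate name, so the page number is not overwritten
html = response.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
# [4:-5] drops the first four and last five <h4> elements, presumably not artist names
all_lp_artists = soup.find_all("h4")[4:-5]
all_lp_titles = soup.find_all("h5")
all_lp_prices = soup.find_all("div", {"class": "article__price"})
all_images = soup.find_all("div", {"class": "article__image"})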

We can then loop through the results and create items that we can store in DynamoDB.

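all_ddb_items = []
# enumerate() supplies the index that lines up artists, titles, prices and images
for index, artist in enumerate(all_lp_artists):
    # The cover URL is the text between ('...') in the image element's markup
    image = str(all_images[index])
    image_url = image.split("('", 1)[1].split("')")[0]
    ddb_item = {
        'recordId': str(index),  # the table's partition key is a string
        'album': all_lp_titles[index].text,
        'price': all_lp_prices[index].text.replace("\n", ""),
        'artist': artist.text,
        'cover': image_url
    }
    all_ddb_items.append(ddb_item)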
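
The loop above only builds the items. One way to store them, sketched here under the assumption that the function uses boto3's Table resource, is a batch writer, which buffers the puts and sends them in batched requests. The helper names `store_items` and `get_table` are hypothetical; the table is passed in so the function can be tried without AWS access.

```python
import os


def store_items(table, items):
    """Write each scraped item to a DynamoDB table and return how many were written.

    `table` is expected to behave like a boto3 Table resource.
    """
    # batch_writer() buffers the puts and flushes them when the block exits
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
    return len(items)


def get_table():
    """Look up the table named by the DYNAMODB_TABLE_NAME environment variable."""
    import boto3  # imported lazily so store_items can be used without boto3 installed

    return boto3.resource("dynamodb").Table(os.environ["DYNAMODB_TABLE_NAME"])
```

Inside the Lambda function this could be called as `store_items(get_table(), all_ddb_items)`. Because the table's partition key `recordId` is declared as a string in the CDK stack, each item's `recordId` has to be a string for the write to succeed.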

Sending the latest records to Slack

Incoming Webhooks are a simple way to post messages from apps into Slack. Creating an Incoming Webhook gives you a unique URL to which you send a JSON payload with the message text and some options. You can use all the usual formatting and layout blocks with Incoming Webhooks to make the messages stand out.

import json
import requests

def send_message_to_slack(message):
    # The webhook URL is unique to your Slack app; treat it as a secret
    url = "https://hooks.slack.com/services/xxx/xxx/xxxx"
    payload = json.dumps({
        "text": message
    })
    headers = {
        'Content-type': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload)
    return response
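
The function above sends plain text. To illustrate the layout blocks mentioned earlier, here is a minimal sketch that turns one scraped item into a Block Kit payload with the cover as an image accessory. The helper name `build_record_payload` and the example values are hypothetical; the item keys are the ones used for the DynamoDB items.

```python
import json


def build_record_payload(item):
    """Build an Incoming Webhook payload for one record (hypothetical helper)."""
    summary = f"{item['artist']} - {item['album']} ({item['price']})"
    return {
        # Plain-text fallback, used for example in notifications
        "text": summary,
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{item['artist']}*\n{item['album']}\n{item['price']}",
                },
                # Show the album cover next to the text
                "accessory": {
                    "type": "image",
                    "image_url": item["cover"],
                    "alt_text": summary,
                },
            }
        ],
    }


# Made-up example item, shaped like the items built by the scraper
example = {
    "artist": "Example Artist",
    "album": "Example Album",
    "price": "19,99",
    "cover": "https://example.com/cover.jpg",
}
payload = json.dumps(build_record_payload(example))
```

The resulting `payload` string can be posted to the webhook URL in place of the text-only payload.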

Schedule the Lambda using Amazon EventBridge and AWS CDK

We will use AWS Cloud Development Kit to deploy our solution. The following snippet deploys:

  • The DynamoDB table where we store results
  • The AWS Lambda function containing the code to be executed
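  • The Amazon EventBridge rule that triggers the Lambda every 600 minutes (10 hours)

// Imports, assuming CDK v1 package names; the rest belongs inside the stack's constructor
import * as dynamodb from '@aws-cdk/aws-dynamodb';
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
import * as lambda from '@aws-cdk/aws-lambda';
import { PythonFunction } from '@aws-cdk/aws-lambda-python';
import { Duration } from '@aws-cdk/core';

// Table that stores the scraped records
const table = new dynamodb.Table(this, 'VinylSalesDdb', {
  tableName: 'VinylSalesDb',
  partitionKey: { name: 'recordId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST
});
// Custom event bus the scraper can publish to
const bus = new events.EventBus(this, 'vinylSaleScraperBus', {
  eventBusName: 'VinylSaleScraperBus'
});
const scraperFunction = new PythonFunction(this, 'MyFunction', {
  entry: './lambda/scrapers/subfolder', // required
  index: 'handler.py', // optional, defaults to 'index.py'
  handler: 'handler', // optional, defaults to 'handler'
  runtime: lambda.Runtime.PYTHON_3_8, // optional, defaults to lambda.Runtime.PYTHON_3_7
  timeout: Duration.seconds(30),
  memorySize: 256,
  environment: {
    'STREAM_NAME': bus.eventBusName,
    'DYNAMODB_TABLE_NAME': table.tableName
  }
});
// Allow Plato Scraper to put events on the bus
bus.grantPutEventsTo(scraperFunction);
// Full access keeps the example short; grantReadWriteData would be narrower
table.grantFullAccess(scraperFunction);
// Trigger Plato every 600 minutes (10 hours)
new events.Rule(this, `PlatoScaperProcessor`, {
  schedule: events.Schedule.rate(Duration.minutes(600)),
  targets: [new targets.LambdaFunction(scraperFunction)]
});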
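
The stack passes the bus name to the function and grants it permission to put events, but this excerpt does not show the function using the bus. As a rough sketch of what that could look like with boto3's `put_events`, the helpers below build the request entries and split them into groups of ten, the maximum number of entries a single PutEvents call accepts. The `Source` and `DetailType` values and all helper names are assumptions, not taken from the original code.

```python
import json
import os

MAX_ENTRIES_PER_CALL = 10  # PutEvents accepts at most 10 entries per request


def build_entries(items, bus_name):
    """Turn scraped items into PutEvents entries for the given bus."""
    return [
        {
            "Source": "vinyl.scraper",       # hypothetical source name
            "DetailType": "NewRecordFound",  # hypothetical detail type
            "Detail": json.dumps(item),      # Detail must be a JSON string
            "EventBusName": bus_name,
        }
        for item in items
    ]


def chunk(entries, size=MAX_ENTRIES_PER_CALL):
    """Split a list into consecutive slices of at most `size` entries."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def publish(items, client=None):
    """Put one event per item on the bus named by the STREAM_NAME variable."""
    if client is None:
        import boto3  # imported lazily so the helpers work without boto3 installed

        client = boto3.client("events")
    entries = build_entries(items, os.environ["STREAM_NAME"])
    for batch in chunk(entries):
        client.put_events(Entries=batch)
    return len(entries)
```

Passing the client in as a parameter makes it possible to try `publish` locally with a stand-in object before deploying.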

Remember this

  • Website scraping is a grey area on the internet. Make sure that you don’t overstep in ways that could harm the business you are scraping.
  • Serverless means pay for use. The setup that we have built in this post runs one short Lambda invocation every 600 minutes, so costs should stay low.
  • With AWS CDK, you can easily move your solution to another AWS account. The only thing you have to do is swap the profile you deploy with.
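
To put a rough number on the pay-for-use point, the sketch below estimates the monthly Lambda usage from the schedule, timeout and memory size in the CDK snippet. It assumes a 30-day month and, as a worst case, that every invocation runs for the full 30-second timeout; DynamoDB, EventBridge and data transfer are left out.

```python
RATE_MINUTES = 600     # schedule rate from the EventBridge rule
TIMEOUT_SECONDS = 30   # Lambda timeout, used as a worst-case duration
MEMORY_MB = 256        # Lambda memory size
DAYS_PER_MONTH = 30    # assumption for the estimate


def monthly_usage(rate_minutes=RATE_MINUTES, timeout_seconds=TIMEOUT_SECONDS,
                  memory_mb=MEMORY_MB, days=DAYS_PER_MONTH):
    """Return (invocations, GB-seconds) per month as an upper bound."""
    invocations = days * 24 * 60 // rate_minutes
    gb_seconds = invocations * timeout_seconds * memory_mb / 1024
    return invocations, gb_seconds


invocations, gb_seconds = monthly_usage()
print(invocations, gb_seconds)  # prints: 72 540.0
```

That is 72 invocations and at most 540 GB-seconds per month, far below the 400,000 GB-seconds per month included in the Lambda free tier.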

What’s next?

We now have access to this information at any point using our tools, but no one else does. We could invite people to our Slack channel or build an interface over the database that we have created. If you want to find out what we’re planning on doing next, check out our official NBTL blog.

We might be starting a newsletter with fresh weekly AWS content delivered to your inbox. Interested? Sign up here.

AWS Architect from NBTL (https://nbtl.substack.com/), writing technical articles focusing on cloud technologies.