How I built a Scheduled Web Crawler using Serverless & Golang
This post references my experience building a Singapore lottery (TOTO) web app that periodically displays winning lottery numbers and provides an easy way to compute your winnings.
Disclaimer: I do not encourage gambling, especially when done excessively and irresponsibly. This app is designed with elderly folks in mind, especially those who like to purchase a lottery ticket from time to time for "just a little flutter or social recreation".
What I will cover
I’ll be sharing how this idea came about, how I set up the Serverless infrastructure and how it integrates with Golang code.
What I will not cover
- Golang code-level intricacies (will cover this in a separate article, watch this space!)
- Web frontend development
Some assumptions I have:
- You have some knowledge of database structure and design
- You have some knowledge of Amazon Web Services
Why did I choose to build this?
Before I dive right in, I wanted to share what inspired me to create this application. I noticed that many older folks populate the queues at Singapore Pools outlets all over the island. My paternal grandmother (or ahma) is no exception: to pass the time, she checks the results regularly even when she hasn't purchased any tickets. However, computing the Singapore lottery (or TOTO) prize groups is often a little too complicated. I have first-hand experience, since my ahma calls me to clarify far too often.
I felt that there were some possible improvements to the entire flow. In my opinion, there were two key desired outcomes:
- An accessible avenue for the elderly to see the latest TOTO results
- An easy way to tabulate TOTO prize groups, especially with its not-so-simple computational logic
There were no readily available solutions that met both of these outcomes. As such, I decided to build it myself.
However, when embarking on building the solution, I encountered a couple of roadblocks.
Problems I needed to solve!
1. Lack of publicly available API to fetch up-to-date TOTO results
Without any publicly available API that provides timely TOTO results, there was a need to develop a method of fetching them directly from the source.
2. TOTO draws happen twice a week (and sometimes on special occasions)
Since draws happen on Mondays and Thursdays weekly, the fetching of TOTO results should only be done on those specific days.
So, I needed to develop a scheduled web scraper: one that extracts TOTO results from the main webpage, but runs only periodically, triggered on a schedule to keep function invocations to a minimum.
Hello, Serverless (and specifically, AWS Lambda)!
I wanted to focus on building the web scraper logic and the frontend application, spending minimal time on managing the infrastructure.
With a single file (serverless.yml) describing your Functions (your Lambdas), Events (your function triggers) and Resources (your infrastructure, e.g. DynamoDB, S3), you can deploy your entire stack to AWS.
How does this work?
The Serverless Framework translates all syntax in serverless.yml to a single AWS CloudFormation template, which speeds up the entire cloud provisioning using Infrastructure as Code (IaC). I’ll share what resources were needed for my app.
Firstly, there’s a need to store all TOTO draw results (past and present) somewhere. I’m a huge fan of NoSQL databases, so DynamoDB was a no-brainer.
Each TOTO result has a corresponding date, so storing an entry is relatively simple:
- partition key (PK) — Universally Unique Identifier (uuid) of the TOTO entry, created at runtime
- sort key (SK) — creation timestamp (Unix/Epoch time) of TOTO entry, created at runtime, not to be confused with the date of the TOTO draw
I decided to have the partition keys as generic fields, to allow for other forms of data to be stored in the same database (I store all the feedback for the app here too, and I welcome more if you have any 😜)
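To make the design concrete, here's what a stored TOTO entry might look like. The field names beyond PK/SK (and all the values) are illustrative assumptions on my part, not the actual schema:

```json
{
  "PK": "c9b9f2a0-5b7e-4a54-9d13-3f2d8e6a1b42",
  "SK": 1652626200,
  "drawDate": "2022-05-16",
  "winningNumbers": [4, 12, 19, 28, 33, 45],
  "additionalNumber": 9
}
```

Because PK and SK are generic, a feedback entry can live in the same table with a different PK value, with no schema change needed.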
Global Secondary Index
There's a need for a Global Secondary Index (GSI) as well, which provides an additional index for a different access pattern. Since I need to fetch TOTO results by draw date, a GSI is added on that attribute.
According to the design mentioned above, here's how the Resources part of the serverless.yml would look:
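Here's a sketch of that Resources section. The table name, attribute names and index name are my assumptions for illustration, not the actual ones:

```yaml
resources:
  Resources:
    TotoTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: toto-results
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: PK          # uuid of the entry
            AttributeType: S
          - AttributeName: SK          # creation timestamp (epoch)
            AttributeType: N
          - AttributeName: drawDate    # indexed via the GSI below
            AttributeType: S
        KeySchema:
          - AttributeName: PK
            KeyType: HASH
          - AttributeName: SK
            KeyType: RANGE
        GlobalSecondaryIndexes:
          - IndexName: drawDate-index
            KeySchema:
              - AttributeName: drawDate
                KeyType: HASH
            Projection:
              ProjectionType: ALL
```

Note that only attributes used in a key schema (base table or GSI) need to appear under AttributeDefinitions; everything else is schemaless.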
Lambda Function (Scheduled Web Crawler)
Next, we can configure AWS Lambda functions in our serverless.yml.
As mentioned earlier, since TOTO draws happen weekly on Mondays and Thursdays, the web crawler has to run on those respective days. I noticed that results are released at either 6:30 pm or 9:30 pm. Hence, the use of cron jobs fits like a glove here.
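A sketch of the function definition with its schedule triggers (the function name and handler path are assumptions). One gotcha: AWS cron expressions run in UTC, and Singapore is UTC+8, so 6:30 pm and 9:30 pm SGT become 10:30 and 13:30 UTC:

```yaml
functions:
  scrapeLatestResult:
    handler: bin/scrapeLatestResult
    events:
      # cron(minute hour day-of-month month day-of-week year), in UTC
      - schedule: cron(30 10 ? * MON,THU *)   # 6:30 pm SGT
      - schedule: cron(30 13 ? * MON,THU *)   # 9:30 pm SGT
```

Both schedules fire on Mondays and Thursdays only, so the function is invoked just four times a week.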
As for the Golang function code, I used Colly, a popular, fast and clean scraping framework for Go (I won’t be covering too much about how the scraper works here).
We will need to point the function definition at the location of the web crawler's scrapeLatestResult binary file, which is generated after we build and compile the function code.
I’ve provided the skeleton of my function code below.
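The real implementation walks the live results page with Colly, which I won't reproduce here. As a self-contained stand-in, here is a stdlib-only sketch of the extraction step, run against a sample HTML fragment; the CSS classes and markup are my assumptions, not the actual Singapore Pools page structure:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	winRe = regexp.MustCompile(`<td class="win-num">(\d+)</td>`)
	addRe = regexp.MustCompile(`<td class="additional-num">(\d+)</td>`)
)

// scrapeLatestResult pulls the six winning numbers and the additional
// number out of a results-page HTML fragment.
func scrapeLatestResult(html string) (winning []string, additional string) {
	for _, m := range winRe.FindAllStringSubmatch(html, -1) {
		winning = append(winning, m[1])
	}
	if m := addRe.FindStringSubmatch(html); m != nil {
		additional = m[1]
	}
	return winning, additional
}

func main() {
	sample := `<tr>
		<td class="win-num">4</td><td class="win-num">12</td>
		<td class="win-num">19</td><td class="win-num">28</td>
		<td class="win-num">33</td><td class="win-num">45</td>
		<td class="additional-num">9</td>
	</tr>`
	winning, additional := scrapeLatestResult(sample)
	fmt.Println(winning, additional) // prints: [4 12 19 28 33 45] 9
	// In the Lambda, the result would then be written to DynamoDB and
	// the handler registered with the Lambda runtime on startup.
}
```

In the deployed function, the same extraction runs inside Colly's OnHTML callbacks over the fetched page, and the result is persisted to the DynamoDB table described earlier.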
The full serverless.yml file would look something like this.
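A condensed sketch of how the full file could fit together. The service name, region, handler path and resource names are all assumptions for illustration:

```yaml
service: toto-scraper

provider:
  name: aws
  runtime: go1.x
  region: ap-southeast-1

package:
  patterns:
    - '!./**'
    - './bin/**'   # ship only the compiled Go binaries

functions:
  scrapeLatestResult:
    handler: bin/scrapeLatestResult
    events:
      # AWS cron runs in UTC; 6:30 pm / 9:30 pm SGT
      - schedule: cron(30 10 ? * MON,THU *)
      - schedule: cron(30 13 ? * MON,THU *)

resources:
  Resources:
    TotoTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: toto-results
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: PK
            AttributeType: S
          - AttributeName: SK
            AttributeType: N
          - AttributeName: drawDate
            AttributeType: S
        KeySchema:
          - AttributeName: PK
            KeyType: HASH
          - AttributeName: SK
            KeyType: RANGE
        GlobalSecondaryIndexes:
          - IndexName: drawDate-index
            KeySchema:
              - AttributeName: drawDate
                KeyType: HASH
            Projection:
              ProjectionType: ALL
```

Running `serverless deploy` from this directory compiles everything into a single CloudFormation stack: the table, the function and its schedule rules.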
With the Serverless Framework, I was able to swiftly develop and deploy all my function code and required cloud infrastructure. It freed up time for me to focus on the TOTO calculator logic, the web crawler and the frontend, ensuring the app was mobile-responsive and accessible.
Once again, I wanted to share my excitement and learning points when building this project. I welcome all forms of feedback, either here or directly on this post. In time, I do plan to cover certain guiding principles I had on designing a web scraper, and how I structured my Golang code (for all the Gophers out there!).
Oh BTW, I am a Software Engineer in GovTech Singapore, Government Digital Services, working on SupplyAlly & GovWallet. Every day, I relish the opportunity to work with like-minded individuals on meaningful projects that will benefit Singapore citizens. If you are interested in our work, please visit https://hive.tech.gov.sg/ for more information, or connect with me on LinkedIn.
And hey! We’re hiring 😃