Building a Serverless Data Pipeline
How I designed my OpenWhisk-based app to work with data efficiently and cost-effectively
As well as writing simple tutorials for serverless technology, I’ve also been enjoying using it in the applications I build inside the team. Today I thought I’d show you around one of these projects. The code is open source, and it might help to give others some ideas on how they can use this technology themselves.
The project: A Stack Overflow dashboard
I work on a team of Developer Advocates; we try to make sure that developers using our tools have access to everything they need. This includes keeping an eye on Stack Overflow in case anyone has asked a question we could answer. We built a tool to track those questions, but also to track which of us was looking at a particular question — we’re all in different timezones so it’s easy to accidentally duplicate effort :)
To achieve this, we run some search queries against the Stack Overflow API every few minutes and store the results in our own database. If any new questions appear in the search results, a notification goes to a bot to share with a particular Slack channel.
Why choose serverless?
There’s nothing in this project that couldn’t have been achieved with a server-side script in more or less any other technology stack. One reason to choose serverless for a project like this is the billing model: with a serverless setup, charges are only incurred for the time the actions actually spend running. For this situation, where the code is idle for most of the time between queries, that makes it a pretty good use case.
The application itself is pretty simple, which fits well with serverless. It also doesn’t matter if we encounter “cold start” times in serverless. (Cold starts are where an action that hasn’t been used for some time runs more slowly than normal, usually because it has been removed from memory and now has to be reloaded.) That’s because this part of the application is simply moving data around, updating existing records, and creating new ones. It really isn’t critical if we sometimes get the database record or notification half a second later.
Designing the serverless application
Working with FaaS (that’s right — Functions-as-a-Service!) means working with fairly small components. The holy grail is to build a collection of reusable components, which isn’t always possible but is a great thing to aim for! The main points that I look out for when designing for serverless are:
- Modular, testable functions: Split the application down by drawing a flow diagram of the different steps. As a starting point, making each step into a serverless function can work well. By thinking about the boundaries between components, our application will also become easier to test.
- Single purpose components: Think of a Unix commandline program — it does one thing, and one thing only. It is probably brilliant at doing that thing, but if I want to format the output or write it to a file, then I need a different utility for that. The same principle applies here. Try to give each component a singular purpose.
- Data hygiene: Data hygiene is a bit like kitchen hygiene. Not every utensil in the kitchen should be used in every dish being prepared for the table. Equally, not every component in our application should be making calls to every datastore. Think about which components need which data, and how to achieve that data access with as few contact points between components and datastores as possible.
For the Stack Overflow project, I created four components, grouped into two sequences:
- The collector action makes an API call to Stack Overflow. It checks that we received sane data, and then returns it.
- The invoker loops over the data fed in from the collector and programmatically invokes a new sequence (qhandler) for each of the questions it finds.

The qhandler sequence operates on each of the question results retrieved:
- First, the storer determines whether we should insert this record into the CouchDB database or update an existing record. It also adds some metadata to the data before passing it along ...
- ... to the notifier, which looks at what has happened so far and, if it's a new question, sends the webhook to trigger the notification.
Stand by for code
Setting up triggers and rules
Serverless functions run in response to an event. In this case, the event is the equivalent of a cron job: the built-in alarm trigger will be configured to fire every five minutes, and to pass in the tags to be used in the API call to Stack Overflow.
In fact, we use a whole bunch of these types of triggers on the “real” version of this application, at different frequencies and with different tags. This helps to spread out our API calls and avoid the rate limits on an external API.
The trigger needs a rule to link it to the action or sequence that should be run:
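The original commands aren't shown here, but the setup might look something like this with the IBM Cloud Functions CLI (the trigger, rule, and sequence names are illustrative, not the real project's):

```shell
# Create a trigger that fires every five minutes using the built-in
# alarms package, passing the tag for the Stack Overflow API search
ibmcloud fn trigger create everyFiveMinutes \
  --feed /whisk.system/alarms/interval \
  --param minutes 5 \
  --param trigger_payload '{"tag": "ibm-cloud"}'

# Link the trigger to the sequence that starts the pipeline ...
ibmcloud fn rule create soPipelineRule everyFiveMinutes socollect

# ... and make sure the rule is active again after any changes
ibmcloud fn rule enable soPipelineRule
```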
When a rule has been changed it becomes disabled, so the code sample above includes the command to re-enable the rule.
Making a serverless API call
The output of this action is a data structure including a list of questions. This becomes the input to the next action.
Invoking a sequence from code
Sequences are a chain of actions, but in this case we need one invocation of the sequence per data item.
The openwhisk library is available by default in an IBM Cloud Functions context, and it’s this that makes the calls to invoke an action per question.
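A minimal sketch of the invoker, assuming the sequence is called `qhandler` (names and structure here are illustrative rather than the original source):

```javascript
// Sketch of the invoker action: fire one qhandler invocation per question.

// Pure helper: turn the question list into a list of invoke requests
function buildInvocations(questions) {
  return questions.map((question) => ({
    name: 'qhandler',   // hypothetical name of the per-question sequence
    blocking: false,    // fire-and-forget: we don't need the results here
    params: { question },
  }));
}

// OpenWhisk entry point: params.questions comes from the collector
function main(params) {
  // The openwhisk client library picks up credentials automatically
  // when running inside IBM Cloud Functions
  const ow = require('openwhisk')();
  const requests = buildInvocations(params.questions || []);
  return Promise.all(requests.map((req) => ow.actions.invoke(req))).then(
    (activations) => ({ invoked: activations.length })
  );
}

exports.main = main;
```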
Store data in the database
This is the only component that needs to hit the database. The data that was written is also passed along to the final piece in the puzzle: the notifier.
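A sketch of the storer's insert-or-update logic, using the `nano` CouchDB client purely as an illustration (the database name, metadata fields, and parameter names are assumptions, not the project's actual code):

```javascript
// Sketch of the storer action: upsert the question into CouchDB and
// stamp it with metadata for the notifier to inspect.

// Pure helper: merge the incoming question into any existing document.
// Keeping the existing _id/_rev turns the write into an update.
function prepareDoc(question, existingDoc) {
  const doc = Object.assign({}, existingDoc || {}, question, {
    last_seen: new Date().toISOString(), // metadata added before passing along
    is_new: !existingDoc,                // the notifier checks this flag
  });
  if (!existingDoc) {
    doc.first_seen = doc.last_seen;
  }
  return doc;
}

// OpenWhisk entry point: the question arrives from the invoker
function main(params) {
  // couchdb_url would typically be bound as a default package parameter
  const nano = require('nano')(params.couchdb_url);
  const db = nano.db.use('questions');
  const id = String(params.question.question_id);
  return db
    .get(id)
    .catch(() => null) // not found: we'll insert rather than update
    .then((existing) => {
      const doc = prepareDoc(params.question, existing);
      doc._id = id;
      return db.insert(doc).then(() => doc); // pass the doc to the notifier
    });
}

exports.main = main;
```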
Sending webhooks from a serverless action
A webhook is simply a POST request, and the first action had an API call in it, so this part probably isn’t a surprise! The request library is used here:
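A sketch of what the notifier might look like (the webhook URL parameter and payload shape are assumptions for illustration):

```javascript
// Sketch of the notifier action: if the storer marked this question as
// new, POST it to the bot's webhook; otherwise there is nothing to do.

function main(params) {
  if (!params.is_new) {
    return { notified: false }; // updated questions don't need announcing
  }
  // The request library is available in the Cloud Functions Node runtime
  const request = require('request');
  return new Promise((resolve, reject) => {
    request.post(
      {
        url: params.webhook_url, // e.g. bound as a default package parameter
        json: { title: params.title, link: params.link },
      },
      (err, response) => {
        if (err) return reject(err);
        resolve({ notified: true, status: response.statusCode });
      }
    );
  });
}

exports.main = main;
```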
There you have it: one working data pipeline, consisting of four moving parts.
Data and serverless
With this post I aimed to show off not only the detail of the code, but also give a sense of how manageable the serverless platforms are to develop for. Now I’m hoping that you’ll build something of your own, and let me know what you choose!