Building a Serverless Data Pipeline

How I designed my OpenWhisk-based app to work with data efficiently and cost-effectively

--

As well as writing simple tutorials for serverless technology, I’ve also been enjoying using it in the applications I build inside the team. Today I thought I’d show you around one of these projects. The code is open source, and it might give others some ideas about how they can use this technology themselves.

Hopefully you already know a little bit about serverless. If not, there are some great resources on http://openwhisk.org and you can also read some of our previous posts about serverless.

The project: A Stack Overflow dashboard

I work on a team of Developer Advocates; we try to make sure that developers using our tools have access to everything they need. This includes keeping an eye on Stack Overflow in case anyone has asked a question we could answer. We built a tool to track those questions, but also to track which of us was looking at a particular question — we’re all in different timezones so it’s easy to accidentally duplicate effort :)

The data pipeline we built with serverless functions feeds a web dashboard that our team uses to monitor Stack Overflow for tags we’re interested in answering.

To achieve this, we run some search queries against the Stack Overflow API every few minutes and store the results in our own database. If any new questions appear in the search results, a notification goes to a bot to share with a particular Slack channel.

Why choose serverless?

There’s nothing in this project that couldn’t have been achieved with a server-side script in more or less any technology stack. One reason to choose serverless for a project like this is the billing model: with a serverless setup, you are charged only for the time the actions actually spend running. For a workload like this one, where the code sits idle for most of the time between queries, that makes serverless a pretty good fit.

The application itself is pretty simple, which fits well with serverless. It also doesn’t matter if we encounter “cold start” times in serverless. (Cold starts are where an action that hasn’t been used for some time runs more slowly than normal, usually because it has been removed from memory and now has to be reloaded.) That’s because this part of the application is simply moving data around, updating existing records, and creating new ones. It really isn’t critical if we sometimes get the database record or notification half a second later.

Designing the serverless application

Working with FaaS (that’s right — Functions-as-a-Service!) means working with fairly small components. The holy grail is to build a collection of reusable components, which isn’t always possible but is a great thing to aim for! The main points that I look out for when designing for serverless are:

  • Modular, testable functions: Split the application down by drawing a flow diagram of the different steps. As a starting point, making each step into a serverless function can work well. By thinking about the boundaries between components, our application will also become easier to test.
  • Single purpose components: Think of a Unix command-line program — it does one thing, and one thing only. It is probably brilliant at doing that thing, but if I want to format the output or write it to a file, then I need a different utility for that. The same principle applies here. Try to give each component a singular purpose.
  • Data hygiene: Data hygiene is a bit like kitchen hygiene. Every utensil in the kitchen should not be used in every dish being prepared for the table. Equally, every component in our application should not be making calls to every datastore. Think about which components need which data, and how to achieve that data access with as few contact points between components and datastores as possible.

For the Stack Overflow project, I created four components, grouped into two sequences:

(Diagram: the serverless architecture of the soingest data ingest tool written for OpenWhisk. We’ll review each JS file in this article.)

The socron sequence:

  • The collector action makes an API call to Stack Overflow. It checks if we received sane data, and then returns it.
  • The invoker loops over the data fed in from the collector and programmatically invokes a new sequence (qhandler) for each of the questions it finds.

Next the qhandler sequence operates on each of the question results retrieved:

  • First the storer determines whether we should insert this record into the CouchDB database or update an existing record. It also adds some metadata to the data before passing it along ...
  • … to the notifier that looks at what has happened so far, and if it's a new question, sends the webhook to trigger the notification.
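Assuming action names matching the diagram (the file and package names here are my guesses, not necessarily those in the real repo), grouping the four actions into the two sequences might look something like this with the OpenWhisk CLI:

```shell
# Create the four actions from their JavaScript files
wsk action create collector collector.js
wsk action create invoker invoker.js
wsk action create storer storer.js
wsk action create notifier notifier.js

# Group them into the two sequences described above
wsk action create qhandler --sequence storer,notifier
wsk action create socron --sequence collector,invoker
```

Note that qhandler is created before socron, since the invoker action inside socron will be invoking qhandler at runtime.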

Stand by for code

The only surprising thing about the code for this project is how little there is of it. The whole project is on GitHub, but let’s walk through the main code ingredients to my Stack Overflow recipe in this post. All the code here is written in JavaScript and designed to be deployed to IBM Cloud Functions or any other Apache OpenWhisk platform.

Setting up triggers and rules

Serverless functions run in response to an event. In this case, the event is the equivalent of a cron job: the built-in alarm trigger is configured to fire every five minutes and to pass in the tags to be used in the API call to Stack Overflow.

In fact, we use a whole bunch of these triggers on the “real” version of this application, at different frequencies and with different tags. This helps to spread out our API calls and avoid hitting the rate limits of the external API.

The trigger needs a rule to link it to the action or sequence that should be run:
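A sketch of that setup, using trigger and rule names of my own invention:

```shell
# Fire every five minutes, passing the tags for the Stack Overflow search
wsk trigger create soquestions \
  --feed /whisk.system/alarms/alarm \
  --param cron "*/5 * * * *" \
  --param trigger_payload "{\"tags\":\"couchdb\"}"

# Link the trigger to the socron sequence
wsk rule create sorule soquestions socron

# Re-enable the rule in case a change left it disabled
wsk rule enable sorule
```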

When a rule has been changed it becomes disabled, so the code sample above includes the command to re-enable the rule.

Making a serverless API call

Since the API call is asynchronous, the request is wrapped in a JavaScript Promise object that the action returns. When the promise settles, it is either rejected with an error or (hopefully) resolved with the data it fetched.

The output of this action is a data structure including a list of questions. This becomes the input to the next action.

Invoking a sequence from code

Sequences are chains of actions, but here we need to invoke the qhandler sequence once per data item.

The openwhisk library is available by default in an IBM Cloud Functions context, and it’s this that makes the calls to invoke an action per question.
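The invoke loop might look like the sketch below; the sequence name and parameter shape are my assumptions, and the OpenWhisk client is passed into the helper so the loop itself can be tested with a stub:

```javascript
// invoker.js: sketch of the invoker action

// Invoke the qhandler sequence once per question
function invokeAll(questions, ow) {
  return Promise.all(questions.map((question) =>
    ow.actions.invoke({
      name: 'qhandler',
      params: { question: question },
    })
  ));
}

function main(params) {
  // The openwhisk library is preinstalled on IBM Cloud Functions
  const ow = require('openwhisk')();
  return invokeAll(params.questions || [], ow)
    .then((activations) => ({ invoked: activations.length }));
}

exports.main = main;
```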

Store data in the database

This project uses IBM Cloudant, a hosted version of the excellent open source document database Apache CouchDB. There are a few options for JavaScript libraries to use with Cloudant, but the cloudant Node.js library is great and available on the IBM Cloud Functions platform, so it’s used here.

This is the only component that needs to hit the database. The data that was written is also passed along to the final piece in the puzzle: the notifier.
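The insert-or-update logic could be sketched like this; the database name, the `is_new` flag, and the metadata fields are my own inventions for illustration:

```javascript
// storer.js: sketch of the storer action

// Pure helper: build the document to write, carrying over _rev when a record
// already exists so CouchDB treats this as an update rather than an insert
function prepareDoc(question, existing) {
  const doc = Object.assign({}, question, {
    _id: String(question.question_id),
    updated_at: new Date().toISOString(), // metadata added before passing along
    is_new: !existing,                    // the notifier relies on this flag
  });
  if (existing) {
    doc._rev = existing._rev;
  }
  return doc;
}

function main(params) {
  // The cloudant library is available on the IBM Cloud Functions platform
  const cloudant = require('@cloudant/cloudant')({ url: params.db_url });
  const db = cloudant.db.use('questions');
  const id = String(params.question.question_id);

  return db.get(id)
    .then((existing) => prepareDoc(params.question, existing))
    .catch(() => prepareDoc(params.question, null)) // not found: a new record
    .then((doc) => db.insert(doc).then(() => doc)); // pass along to notifier
}

exports.main = main;
```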

Sending webhooks from a serverless action

A webhook is simply a POST request, and the first action had an API call in it, so this part probably isn’t a surprise! The request library is used here:
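A sketch of the notifier, assuming the webhook URL arrives as a parameter and the payload is a simple text message; both are assumptions on my part:

```javascript
// notifier.js: sketch of the notifier action

// Pure helper: only newly inserted questions trigger a notification
function buildPayload(doc) {
  if (!doc.is_new) {
    return null;
  }
  return { text: 'New question: ' + doc.title + ' (' + doc.link + ')' };
}

function main(params) {
  const payload = buildPayload(params);
  if (!payload) {
    return { notified: false }; // updated records don't need a notification
  }
  // The request library is available on IBM Cloud Functions
  const request = require('request');
  return new Promise((resolve, reject) => {
    request.post(
      { url: params.webhook_url, json: true, body: payload },
      (err) => (err ? reject(err) : resolve({ notified: true }))
    );
  });
}

exports.main = main;
```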

There you have it: one working data pipeline, consisting of four moving parts.

Data and serverless

Working with data in this way is a great fit for serverless. The scalable nature of the compute fits well with the distributed approaches that are widely used with big data already (think of MapReduce, for instance). The cost model means that it’s viable to use a platform like this on an as-needed basis, for example when importing or cleaning up a large dataset. And crucially, the technical barriers to entry are low. My team is mostly JavaScript developers, but IBM Cloud Functions also supports Python, Java, and Swift as first-class languages, so developers of all stripes can be up and running very quickly on this platform.

With this post I aimed not only to show the details of the code, but also to give a sense of how manageable serverless platforms are to develop for. Now I’m hoping that you’ll build something of your own, and let me know what you choose!

--

Lorna Mitchell
Center for Open Source Data and AI Technologies

Polyglot programmer, technology addict, open source fanatic and incurable blogger (see http://lornajane.net)