Creating Our Web Push Service

A technical look at the building process behind the Guardian Mobile Lab’s web notifications.

Alastair Coote
The Guardian Mobile Innovation Lab
6 min read · Jul 1, 2016

--

Photo by dcJohn used under a Creative Commons license

The Mobile Innovation Lab isn’t just for editorial experiments; it’s for technological ones as well. So when we started planning the backend service that would send our push notifications, we took a look at Lambda, a relatively new service from Amazon that lets you execute and scale code very efficiently. It is often called a “serverless” service, which isn’t technically correct — your code still runs on a server provisioned by Amazon, but you don’t manage it, maintain it or scale it. Amazon does all of this automatically, so no matter how many people are using your site or service, it won’t get overwhelmed.

Amazon is not the only provider of a service like this: Google has Cloud Functions and Microsoft has Azure Functions. They’re in varying stages of development, but all share the ability to use the Node runtime to execute JavaScript code. With an eye to avoiding future vendor lock-in, we used a framework named Serverless to develop the backend — it currently only works with AWS Lambda, but plans to support Google and Azure by version 1.0. Actually writing a Lambda function is incredibly simple, and looks like so:
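
A minimal handler is just an exported function that receives the event that triggered it, a context object and a completion callback. This is a rough sketch using the Node 4.3 runtime’s handler signature rather than our production code:

    // handler.js: a minimal Lambda handler (Node 4.3 runtime).
    // The event argument carries whatever triggered the function:
    // an SNS message, an API Gateway request, and so on.
    exports.handler = function (event, context, callback) {
      console.log('Received event:', JSON.stringify(event));

      // Pass an Error as the first argument to signal failure.
      callback(null, { message: 'Hello from Lambda' });
    };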

Another advantage of AWS Lambda is that it integrates seamlessly with Amazon SNS, a push notification service that is already capable of sending iOS and Android push notifications. You can subscribe a Lambda to an SNS topic so that it executes whenever a message is published. With this in mind, our overall publishing flow works as follows:
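
The flow kicks off with a single publish to an SNS topic. As an illustration using the AWS SDK for Node (the region, topic ARN and payload below are placeholders, not our real values):

    // publish.js: push a notification payload onto the initial SNS topic.
    var AWS = require('aws-sdk');
    var sns = new AWS.SNS({ region: 'eu-west-1' });

    sns.publish({
      TopicArn: 'arn:aws:sns:eu-west-1:123456789012:notification-publish',
      Message: JSON.stringify({
        title: 'EU referendum',
        body: 'Results are starting to come in'
      })
    }, function (err, data) {
      if (err) { return console.error('Publish failed:', err); }
      console.log('Published message', data.MessageId);
    });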

The start of the process is a little circular — our original SNS publish is picked up by a listener Lambda, which in turn re-publishes the message to a separate topic a set number of times, depending on how many subscribers there are. The broadcast Lambda attached to that topic is then run multiple times in parallel, encrypting the payload and sending it to each individual Web Push client using the web-push npm library.
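
Inside the broadcast Lambda, each invocation works through its share of subscribers and hands the payload to web-push, which takes care of the encryption. The sketch below uses the library’s newer promise-based API; the shape of the SNS message (a batch of subscription objects plus the payload) and the GCM key are assumptions for illustration:

    // broadcast.js: send the payload to one batch of Web Push subscribers.
    var webPush = require('web-push');

    // Chrome pushes went through Google Cloud Messaging at the time,
    // so the library needs a GCM API key (placeholder here).
    webPush.setGCMAPIKey(process.env.GCM_API_KEY);

    exports.handler = function (event, context, callback) {
      var message = JSON.parse(event.Records[0].Sns.Message);

      // Each subscription looks like { endpoint, keys: { p256dh, auth } }.
      Promise.all(message.subscriptions.map(function (subscription) {
        return webPush.sendNotification(subscription, JSON.stringify(message.payload))
          .catch(function (err) {
            // "NotRegistered" and timeout errors from GCM land here.
            console.error('Push failed for', subscription.endpoint, err.statusCode);
          });
      }))
      .then(function () { callback(null, 'done'); })
      .catch(callback);
    };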

Of course, our Lambdas need somewhere to store this subscription data. Amazon offers an auto-scaling NoSQL database called DynamoDB, but it requires an entirely custom API that is a long way from common SQL-based solutions. Again wary of vendor lock-in, we decided (in the short term at least) to use Amazon’s ElastiCache service, which is entirely compatible with the Redis command set and offers replication — though scaling isn’t automatic, so it won’t grow with your traffic the way a Lambda will.
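
As a sketch of how bare-bones that storage is, the subscribe and broadcast paths only really need a Redis set of serialised subscription objects (the host, key name and client setup here are illustrative):

    // subscriptions.js: store and fetch Web Push subscriptions in Redis.
    var redis = require('redis');
    var client = redis.createClient({ host: process.env.REDIS_HOST });

    function saveSubscription(subscription, callback) {
      // Each subscription is a small JSON blob, so a set of serialised
      // objects is enough for a broadcast-only workload.
      client.sadd('push-subscriptions', JSON.stringify(subscription), callback);
    }

    function getAllSubscriptions(callback) {
      client.smembers('push-subscriptions', function (err, members) {
        if (err) { return callback(err); }
        callback(null, members.map(function (member) {
          return JSON.parse(member);
        }));
      });
    }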

So with that in mind, how did it go?

The Good

The most straightforward success metric is: it worked. The same infrastructure that we originally conceived for a few hundred subscribers to our first jobs report notification scaled up to 12,000 or so subscribers during the EU Referendum results. Parallelising our web push requests meant that we were able to send notifications to every subscriber within 30 seconds.

Serverless also proved to be a very capable tool for deploying code — it has built-in support for different stages (e.g. staging, production) and a CLI for pushing your code out. It’s also recommended to bundle up your JavaScript code (much like you do with client-side JS) before uploading it as a Lambda, and the Serverless plugin system allows you to integrate Browserify or Webpack easily (even if it pains you to do so).
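
The bundling step itself is nothing exotic. As an illustration (file names are placeholders, and in practice a Serverless plugin wires this into the deploy rather than you running it by hand), Browserify can roll a handler and its dependencies into a single file before upload:

    // bundle.js: roll handler.js and its dependencies into one file
    // before it is uploaded as a Lambda (illustrative only).
    var browserify = require('browserify');
    var fs = require('fs');

    browserify('./handler.js', { standalone: 'handler' })
      .bundle()
      .pipe(fs.createWriteStream('./dist/handler.js'));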

The Bad

With the benefit of Lambdas comes uncertainty. Lambdas take X seconds to start up. Unless you’ve run one in the last Y seconds, in which case it is able to reuse an existing context, Z number of times.

Confused? Us too. We’re still not totally sure what any of those variables are, so when we say it takes 30 seconds to send 12,000 notifications, we’re not entirely sure where the time is going. Is it in spinning up the Lambdas? The HTTP requests themselves? We don’t know yet. Either way, the startup time for our subscribe/unsubscribe endpoints is a real problem: if a user arrives on our signup page without a Lambda already “warmed up”, it can take up to five seconds before they are able to sign up, which is prohibitively slow. The best way to get to the bottom of all of this is to improve our logging.

…speaking of which: logging. Lambdas automatically log their output to Cloudwatch, which is fine, but the asynchronous nature of Lambdas means that it’s very difficult to pin down an exact execution and follow it all the way through. It’s also very difficult to keep track of the execution flow in and out of SNS queues. For example, we had to hack together a Redis hash for individual Lambdas to track their execution time in order to find out which one finishes last and should report the overall duration. This was problematic at the start of the EU Referendum results, as we had gained so many new subscribers that our Lambdas were hitting their (default, 6 second) execution limit before completing. The fix was very quick (changing the timeout in the Serverless config JSON file) but finding the original problem was not.
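
For what it’s worth, the timing hack had roughly the following shape: each broadcast Lambda writes how long it took into a hash keyed by the notification, and whichever finishes last can read back the overall duration. Key and field names here are illustrative:

    // timing.js: the rough shape of our Redis timing hack. Each broadcast
    // Lambda records its own duration; the slowest one is the overall time.
    function recordDuration(client, notificationId, lambdaIndex, startedAt, callback) {
      client.hset(
        'notification-timings:' + notificationId,
        'lambda-' + lambdaIndex,
        Date.now() - startedAt,
        callback
      );
    }

    function overallDuration(client, notificationId, callback) {
      client.hvals('notification-timings:' + notificationId, function (err, durations) {
        if (err) { return callback(err); }
        callback(null, Math.max.apply(null, durations.map(Number)));
      });
    }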

Another problem with AWS Lambdas is that in order to make them accessible over HTTP you need to use the AWS API Gateway. While Serverless handles all of this mapping for you, API Gateway itself is very frustrating to use. For instance, it is not possible to specify the HTTP status code of a response from within your Lambda. Instead, you must set up a series of templates in API Gateway that are based around regexes of the response body.
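
The common workaround of the era was to smuggle the status code into the error message and have API Gateway match on it. A sketch (lookupSubscription is a hypothetical helper, and the matching selection pattern, something like ^\[404\].*, has to be configured on the API Gateway side):

    // A sketch of the status-code workaround: the Lambda fails with a
    // message that starts with the code it wants, and an API Gateway
    // integration response with a matching regex turns it into a real 404.
    exports.handler = function (event, context, callback) {
      // lookupSubscription is a hypothetical helper, not part of our code.
      lookupSubscription(event.subscriptionId, function (err, subscription) {
        if (err || !subscription) {
          return callback(new Error('[404] Subscription not found'));
        }
        callback(null, subscription);
      });
    };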

Completely aside from our tech stack, during particularly busy nights (such as the EU Referendum) we started to receive timeout errors from Google Cloud Messaging (the service Chrome Web Push payloads are sent through), along with a number of “NotRegistered” errors, presumably from clients that had manually unsubscribed from our updates. Our Redis-based, bare-bones data storage means it’s very difficult to store these individual results for further analysis.

What we’d do differently

With the benefit of hindsight, we’d make a few changes to the code we have today (which is available on GitHub):

  • Only use Lambdas for sending notifications
    The startup delay is an annoyance when sending notifications, but it’s a legitimate problem for our signup pages. No user should have to sit on the page for four seconds waiting to sign up. Instead, subscribe/unsubscribe requests could easily be handled by a web server living inside Amazon Elastic Beanstalk.
  • Use HTTP requests instead of SNS for broadcasts
    The asynchronous nature of SNS makes it very difficult to keep track of the progress of each broadcast call. Instead of using SNS, we could set up an HTTP endpoint — our broadcast code could then call that endpoint X number of times (rather than broadcast X number of messages) and collect the responses to those calls in the same place (see the sketch after this list).
  • Use a relational database instead of Redis
    Although Redis has proved very performant in storing subscription data, we’d like to be collecting a lot more data about our individual push requests, durations and so on. While it’s possible to store anything in Redis, a relational database would make it a lot easier to store and query this data.
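
A sketch of that HTTP fan-out (the endpoint URL and batch count are placeholders): the coordinator calls the broadcast endpoint once per batch and gathers every response in one place with Promise.all.

    // fanout.js: call a broadcast endpoint once per batch instead of
    // publishing to SNS, so every response comes back to the caller.
    var https = require('https');

    function callBroadcast(batchIndex) {
      return new Promise(function (resolve, reject) {
        https.get('https://example.com/broadcast?batch=' + batchIndex, function (res) {
          var body = '';
          res.on('data', function (chunk) { body += chunk; });
          res.on('end', function () {
            resolve({ batch: batchIndex, status: res.statusCode, body: body });
          });
        }).on('error', reject);
      });
    }

    Promise.all([0, 1, 2, 3].map(callBroadcast)).then(function (results) {
      // Every batch's outcome arrives here, in order, in one place.
      console.log(results);
    });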

In Conclusion

AWS Lambda (and alternatives like it) has a lot of very interesting implications for scaling ambitious projects — particularly in news, given the huge peaks and troughs of traffic the news cycle brings. But it’s still very rough around the edges and frustrating to work with.

That said, the ecosystem is moving fast — just in the time since we wrote Pushy, Serverless has announced plans for a 1.0 release with a number of improvements and rethinks of the way projects are structured. Pushy is still a very early-stage project, and as we build it out (including support for native notifications) we’ll hopefully be able to integrate newer versions of Serverless, along with some of the improvements mentioned above. In the meantime, we’d love to hear from anyone else using the Web Push API about how they’ve gone about solving these problems. We’re on Twitter at @gdnmobilelab and on email at innovationlab@theguardian.com.

The Guardian Mobile Innovation Lab operates with the generous support of the John S. and James L. Knight Foundation.
