A journey into serverless, part 2

A look at our serverless toolbox

The first part of this series described how we gradually built up the CI/CD chain for our serverless APIs and what we learned along the way. However, we left out a lot of details about how the APIs are actually designed, and how we manage common operational topics such as testing, logging, etc. In this part of the series, we’ll be focusing on the tools we use to manage our serverless architecture. So, without further ado, here’s part ✌

API architecture

In part 1, I briefly mentioned that our APIs are built out of stock AWS components, mainly API gateway and Lambda functions. Now’s the time for a closer look.

Here is what a typical API stack looks like:

The components of a typical API stack

The API gateway and Lambda functions, as well as the supporting IAM roles, are bundled in a CloudFormation stack which is deployed from scratch on each change (again, refer to part 1 for a complete explanation of the how and why of the immutable approach).

The public URL of the API points to the currently active stack through a custom domain name mapping; in other words, moving to a new version or rolling back is simply changing a pointer to a different stack. Custom domain names can manage several path mappings, allowing us to expose different versions of an API simultaneously (e.g. /v1 and /v2 can point to 2 different stacks).
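To give an idea of what the switch looks like, here is a minimal sketch of repointing a path mapping with boto3 (the domain name, base path and API IDs are placeholders, and our actual deployment scripts may do this differently):

```python
import boto3

apigw = boto3.client("apigateway")

# Repoint the /v2 path mapping of the public domain to the API Gateway
# deployed by the new stack; rolling back is the same call with the old ID.
apigw.update_base_path_mapping(
    domainName="api.example.com",   # hypothetical custom domain
    basePath="v2",                  # version exposed on the public URL
    patchOperations=[
        {"op": "replace", "path": "/restapiId", "value": "abc123defg"},  # new stack's API ID
        {"op": "replace", "path": "/stage", "value": "prod"},
    ],
)
```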

In addition to this stack, DynamoDB tables are used to store persistent data, and we make heavy use of the SSM parameter store both for sensitive information (e.g. API keys) and for shared information (for instance, the ID of the currently active stack). Lambda functions call each other directly or use SNS to publish notifications.

Testing

This is what it feels like to run integration tests on micro-services ( Photo by Nicolas Thomas on Unsplash)

Testing Lambda functions is complicated. Even more so when they act as glue between services, some of which are provided by third parties. How do you test locally? How do you test once you’ve deployed? Should you mock the services your Lambdas are calling, or call the real thing? What about writing to Dynamo, or posting to SNS?

On this topic, we followed the “different paths, same test” strategy described in this post. During tests, we call our Lambda functions in one of two ways: either locally, by calling the code directly, or remotely, by calling the Lambda function through the API gateway (or through boto for Lambda functions which do not have a public endpoint). This switch is managed by an environment variable and some helper code that abstracts the different ways to call a Lambda function; the test code, however, is exactly the same for local and remote tests.

This allows us to write all tests as unit tests; they are run locally before deployment, and remotely, as integration tests, after deployment (but before the stack becomes active).
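A minimal sketch of what such a helper can look like (the environment variable, module layout and endpoint names are illustrative, not our actual test code):

```python
import importlib
import json
import os

import boto3
import requests


def call_lambda(function_name, payload, module=None, handler="handler", api_path=None):
    """Call a Lambda function locally or remotely, depending on TEST_TARGET."""
    if os.environ.get("TEST_TARGET", "local") == "local":
        # Local path: import the handler module and call the code directly.
        mod = importlib.import_module(module or function_name)
        return getattr(mod, handler)({"body": json.dumps(payload)}, None)

    if api_path:
        # Remote path, public endpoint: go through the API gateway.
        response = requests.post(os.environ["API_BASE_URL"] + api_path, json=payload)
        return {"statusCode": response.status_code, "body": response.text}

    # Remote path, no public endpoint: invoke the function through boto.
    client = boto3.client("lambda")
    result = client.invoke(FunctionName=function_name, Payload=json.dumps(payload).encode())
    return json.loads(result["Payload"].read())
```

The test functions only ever call `call_lambda`, so the same test file doubles as a unit test before deployment and an integration test after it.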

To mock, or not to mock

We call third-party services rather than mocking them, since a good deal of our code is about integrating with those services — so we want to catch integration issues or changes in the interface as early as possible. This approach works well in development and staging environments. However, calling third-party services in production creates “real” objects and transactions. We roll these transactions back at the end of the tests and/or isolate them on specific test accounts. But we’re still looking for a better way to manage the trade-off between the safety of systematic integration tests and the risk of creating fake transactions in production.

Another ongoing area of improvement is how to stub synchronous calls to other internal services. For integration tests this isn’t really an issue, since we want to make sure the platform as a whole works as expected. However, this makes local/unit tests more complex than they should be whenever a Lambda function calls another API: the called service has to be deployed and running on the developer’s account, and there’s some environment fiddling to get the same behavior as what happens in staging or production. And sometimes tests fail due to incorrect behavior in the called service, which is not always easy to track down in the logs. To resolve these issues, we’ve started to mock internal services so that we can have more isolated unit tests.
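For illustration, such a unit test might patch the client used to call the other internal service; the module and function names below are hypothetical:

```python
from unittest import mock

from subscription_api import handler as subscription_handler  # hypothetical module layout


@mock.patch("subscription_api.orders_client.get_order")  # hypothetical internal-service client
def test_subscription_includes_order_details(mocked_get_order):
    # The other internal service is neither deployed nor called: the mock stands in for it.
    mocked_get_order.return_value = {"order_id": "42", "status": "confirmed"}

    response = subscription_handler.handle({"pathParameters": {"order_id": "42"}}, None)

    mocked_get_order.assert_called_once_with("42")
    assert response["statusCode"] == 200
```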

Our testing methodology is still a work in progress. The set-up outlined above worked really well when we had small APIs and they were mostly independent. It still holds up well for isolated or loosely coupled services; but as our platform grows and our services become more interconnected, we’re facing new challenges and reviewing how to best handle our unit and integration tests.

Logging

This is what searching through CloudTrail logs feels like (Photo by Glen Noble on Unsplash)

Logging is like chocolate: when it is good, it is very, very good; and when it is bad, it is better than nothing. Yes, this is a shameless reboot of a saying originally about documentation and, well… not chocolate.

Logging is very straightforward in Lambda. You can use the standard logging library, or even just print to stdout, and the logs will end up in AWS’ logging solution, Cloudwatch. Unfortunately, Cloudwatch falls in the “better than nothing” category, and comes reeeeally close to the “better off without it” tier. I won’t explain why Cloudwatch is not ideal for exploring logs, especially when you have a lot of Lambda functions. If you’ve used it, you know why. And if you haven’t used it… Lucky you, I guess.
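To be fair, the “straightforward” part really is straightforward; a plain Python handler is enough to get logs flowing into Cloudwatch:

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    # Anything logged (or printed) here ends up in the function's Cloudwatch log group.
    logger.info("Received event: %s", event)
    return {"statusCode": 200, "body": "ok"}
```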

Our first approach to more robust log exploration was to deploy an ELK stack. This is very simple to do with the ELK service provided by Amazon. However, this solution had a few issues for us: it’s not fully managed; you have to add and manage your own authentication if you don’t want the world to read your logs; and it’s ridiculously easy to over-provision — when we tried it, the default template spun up 4 extra-large VMs (in each of our environments), leading to an invoice where our biggest spend was on *logging* 🤦‍♂️ (leading, in turn, to adding billing alarms to our account).

Enter Datadog

We took a look at different logging/monitoring services and settled on Datadog. Datadog provides monitoring and alerting features which are well integrated with AWS, and added log management to its feature set in early 2018. This allows us to have a comprehensive monitoring dashboard and access to all of our logs in the same place.

We export our logs to Datadog using the standard log forwarding mechanism (a special Lambda function triggered by all Cloudwatch events). Our Lambda functions follow the naming convention <stage>-<API name>-<Lambda function name>-<stack ID>: adding a simple pattern matching rule to the ingestion pipeline automatically gives us filters on all these attributes. With the full-text search and temporal filters provided out of the box by Datadog, we can explore our logs to our heart’s content.
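To illustrate the idea (this is not Datadog’s actual pipeline syntax, just the pattern the ingestion rule expresses), the function name decomposes into attributes like this:

```python
import re

# Illustrative only: the real parsing happens in Datadog's ingestion pipeline,
# and real API or function names may themselves contain hyphens.
FUNCTION_NAME_PATTERN = re.compile(
    r"^(?P<stage>[^-]+)-(?P<api>[^-]+)-(?P<function>[^-]+)-(?P<stack_id>.+)$"
)

match = FUNCTION_NAME_PATTERN.match("prod-subscription-charge-a1b2c3d")  # hypothetical name
print(match.groupdict())
# {'stage': 'prod', 'api': 'subscription', 'function': 'charge', 'stack_id': 'a1b2c3d'}
```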

Correlation helps root causation

Even with all your logs in the same place and an easy way to sort and search them, it can still be a challenge to debug an issue, especially with a micro-service architecture where requests are passed from Lambda to Lambda. We solved this issue with correlation IDs, which are automatically inserted by AWS in the context of a new request and are passed along from service to service. Grabbing the correlation ID from the context and adding it to the logs makes it trivial to follow an interaction during its entire life cycle, even if it happens to go through several Lambda functions.

Following a request as it goes through different Lambda functions (each color represents a different Lambda)
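A minimal sketch of the pattern (the exact plumbing, in particular how the ID is read from the context and propagated, is simplified here):

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    # Reuse the ID passed along by the caller if present, otherwise fall back
    # to the request ID that Lambda exposes on the context object.
    correlation_id = event.get("correlation_id", context.aws_request_id)

    # Prefixing every log line with the ID makes the whole interaction searchable.
    logger.info("[%s] processing claim", correlation_id)

    # Pass the ID along when calling the next service so it survives the hop.
    payload = {"correlation_id": correlation_id, "claim_id": event.get("claim_id")}
    # ... invoke the next Lambda function / publish to SNS with this payload ...
    return {"statusCode": 200}
```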

Managing secrets

How not to manage secrets (image stolen from this blog)

Environment variables are a breeze to use with Lambda functions, and all serverless frameworks will allow you to configure and populate environment variables out of the box.

However, environment variables should not be used for everything, and especially not to hold sensitive information! They are usually handled in clear text, stored in configuration files and committed along with the code: there are a lot of ways for these values to leak, and you don’t want the world or even “just” everyone with read access to your Git repo to know the API keys for your payment provider…

AWS offers the SSM Parameter Store as a simple solution to hold key-value pairs securely. Values can be stored encrypted, with very fine-grained read and write access, making it ideal to manage sensitive information.

We use Parameter Store with a “namespace” strategy to manage variable scope: each variable name is prefixed with <API name>_<function name>, <API name>_common or global, depending on who can read its content (only a given function, all functions in the API, or everyone). This gives us tiered access rights without having to explicitly configure read rights for each variable.
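Reading a namespaced value is a single boto3 call; the parameter names below are hypothetical but follow the convention described above:

```python
import boto3

ssm = boto3.client("ssm")


def get_config(name, with_decryption=True):
    """Fetch a (possibly encrypted) value from the Parameter Store."""
    response = ssm.get_parameter(Name=name, WithDecryption=with_decryption)
    return response["Parameter"]["Value"]


# Hypothetical parameter names, one per scope of the namespace strategy:
payment_api_key = get_config("subscription_charge_PAYMENT_API_KEY")  # one function only
sns_topic = get_config("subscription_common_NOTIFICATION_TOPIC")     # whole API
active_stack = get_config("global_ACTIVE_STACK_subscription")        # everyone
```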

Using Parameter Store rather than environment variables does have a performance cost, as the variable must be fetched and decrypted each time. To mitigate this, we use the ssm_cache Python library to cache values in memory, with some custom code to trigger cache invalidation as needed.
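A minimal sketch of what this looks like with ssm_cache (the parameter name and cache duration are illustrative, and the library’s API may have evolved since we adopted it):

```python
from ssm_cache import SSMParameter

# Fetched and decrypted on first use, then served from memory for up to 5 minutes.
payment_api_key = SSMParameter("subscription_charge_PAYMENT_API_KEY", max_age=300)


def handler(event, context):
    key = payment_api_key.value
    # ... call the payment provider with the key ...


# When we know the value has changed, the cached entry can be refreshed explicitly.
payment_api_key.refresh()
```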

AWS has recently launched the Secrets Manager service, which looks like Parameter Store on steroids — we may move to this new service once we’ve had a chance to test it.

Service discovery

Wait, which version of the service should I call? (Photo by N. on Unsplash)

In the first part of this series, I mentioned that we build and deploy a new stack every time we make a change to an API. Each stack is given a unique name through the use of a commit tag and timestamp.

Only one stack is active at a time, but we may have several stacks of the same API deployed concurrently. In production, we will usually keep the previous version of each API in addition to the current one, in case we need to roll back. In staging, it’s common to have several versions around, as each pull request generates a new stack. This raises the question of how to keep track of the active stack.

The most common way to interact with a stack is through its public service endpoint, which, as described above, points to the active stack via a custom domain name: in this case, there’s no ambiguity about which stack is called. However, it’s also common to call Lambda functions directly. Lambdas can also be triggered by SNS notifications or DynamoDB streams. In these cases, how do we know which stack to use?

Once again, SSM Parameter Store to the rescue! We store the active stack for each API in a global parameter, and update this parameter when we change the active stack. This effectively turns the active stack ID into a shared, centralized environment variable.

This information is used either by the calling Lambda function (to know which function to target), or by the called Lambda function (to determine if it’s active and therefore should react to SNS notifications and such).
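Here is a sketch of both sides, with hypothetical parameter, stack and function names following the conventions described earlier:

```python
import json
import os

import boto3

ssm = boto3.client("ssm")
lambda_client = boto3.client("lambda")

# Hypothetical parameter name in the "global" namespace.
ACTIVE_STACK_PARAM = "global_ACTIVE_STACK_subscription"


def call_active(function_name, payload, stage="prod", api="subscription"):
    """Caller side: resolve the active stack ID, then invoke the matching function."""
    stack_id = ssm.get_parameter(Name=ACTIVE_STACK_PARAM)["Parameter"]["Value"]
    full_name = f"{stage}-{api}-{function_name}-{stack_id}"  # <stage>-<API>-<function>-<stack ID>
    result = lambda_client.invoke(FunctionName=full_name, Payload=json.dumps(payload).encode())
    return json.loads(result["Payload"].read())


def handler(event, context):
    """Called side: react to an SNS notification only if this stack is the active one."""
    active_stack = ssm.get_parameter(Name=ACTIVE_STACK_PARAM)["Parameter"]["Value"]
    if os.environ.get("STACK_ID") != active_stack:
        return  # not the active stack: ignore the notification
    # ... process the notification ...
```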

More serverless goodness

Photo by Matt Schwartz on Unsplash

Although the focus of this article is on how we build and host our APIs, this isn’t the only use case for which we’ve found serverless to be a good fit.

Our websites are hosted on S3 buckets + CloudFront. This set-up is a good fit for static websites, but also (with a little tweaking) for single page applications. In particular, we’ve built a customer portal on top of our subscription API, using vue.js for the client-side logic. This design pattern (a single page application backed by an API) works well when your website is basically a GUI to consume existing APIs, but can introduce needless complexity in other scenarios: for instance, when you don’t have an API to start with (and it wouldn’t make sense to build one), or when you don’t need a lot of client-side logic and interaction.

(Note also that although S3 + Cloudfront is a great way to quickly and cheaply host a website, you need to jump through some hoops to have more control over how your content is served — e.g. you have to set up a Lambda@Edge function to add security headers)
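For instance, a minimal Lambda@Edge function attached to the distribution’s response event can add the missing security headers (the header list is shortened for the example):

```python
def handler(event, context):
    # Runs on the CloudFront response event and decorates the response
    # before it reaches the client.
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    headers["strict-transport-security"] = [
        {"key": "Strict-Transport-Security", "value": "max-age=63072000; includeSubDomains"}
    ]
    headers["x-content-type-options"] = [{"key": "X-Content-Type-Options", "value": "nosniff"}]
    headers["x-frame-options"] = [{"key": "X-Frame-Options", "value": "DENY"}]

    return response
```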

So when we needed a simple back-office website to review complex claims, instead of going the SPA + API route we built a simple Flask app that we bundled in a Lambda function. With a few tweaks to our CI scripts, we were able to include this new application in our existing API deployment process, and voilà, a simple serverless back-office in a few days of work.
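The bundling itself mostly comes down to translating API gateway events into WSGI requests. A sketch using the aws-wsgi adapter (one option among several, and an assumption here, not necessarily the one our scripts use):

```python
import awsgi  # aws-wsgi adapter: assumed choice, other WSGI adapters work the same way
from flask import Flask

app = Flask(__name__)


@app.route("/claims/<claim_id>")
def claim_detail(claim_id):
    # Hypothetical back-office page for reviewing a claim.
    return f"Review page for claim {claim_id}"


def handler(event, context):
    # Translate the API gateway event into a WSGI request, and the Flask
    # response back into the format API gateway expects.
    return awsgi.response(app, event, context)
```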

On the back-end, we’ve tested the new WebSocket API feature from API Gateway, which we use to communicate with our claim-processing bot. It’s still early days for this feature (for instance, there was no initial CloudFormation support for it) but the ability to push notifications from the server to the client really simplifies workflows with long-running server processing, which don’t lend themselves well to traditional HTTP requests.
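The server-to-client push goes through the API gateway management API; a minimal sketch, where the endpoint URL and connection handling are placeholders:

```python
import json

import boto3

# The WebSocket API's callback endpoint (the wss:// URL with https:// instead); placeholder here.
management = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.eu-west-1.amazonaws.com/prod",
)


def notify_client(connection_id, message):
    """Push a message to a connected client once a long-running task completes."""
    management.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(message).encode(),
    )
```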

Until next time…

Well, this (temporarily) concludes the story of our journey into serverless. Our architecture is still very much a work in progress and we expect our tools and processes to keep evolving; for instance, we’ve started using Terraform for new services, and it will probably replace our CloudFormation / Ansible scripts at some point. We will also experiment with new serverless use cases as we develop new services and grow our IT platform.

So stay tuned for further updates, and in the meantime if you’d like more details about specific topics drop us a note in the comments section 👇 !

The Moonshot-Internet IT team