Our Road to Serverless

Over half a year ago, the Soluto Nashville team at Asurion decided to tackle a new initiative. We wanted to build a platform to connect our customers with experts who could help solve their technology problems. The catch? We also wanted the experts to benefit from the experience, so we built a platform to allow them to work whenever and wherever they wanted to.

Our small team dedicated to this project set out to make an initial impact without being constrained by existing processes or technology. We’d heard about serverless before: no infrastructure to manage and no scaling to worry about sounded perfect to us! We decided to give it a shot. Nine months later, serverless has proven to be a key enabler, and we’ve learned a few things along the way.

What is serverless anyway?

When people talk about serverless, they’re generally talking about a concept called FaaS (functions as a service). FaaS is an event-driven service provided by a third party (Amazon Web Services, Azure, Google Cloud) that takes management of infrastructure and scaling out of the equation, allowing you to simply write code and upload it. You’ll never have to worry about whether your servers need to be patched, whether your clusters are scaling properly, or how to spin up additional infrastructure for a new service you want to create. Pretty nice, right?

As great as that is, the real power of serverless comes from being able to connect this compute infrastructure to the rest of your existing cloud resources. Want a completely serverless REST API? No problem: in AWS you can connect your serverless functions to a service called API Gateway, which will create a defined REST API that sends HTTP requests to your functions as events. How about processing messages on a pub/sub topic? Easy: a Lambda can be triggered by an SNS topic. What about something a bit more esoteric and specific to your architecture? Most FaaS providers let functions be triggered by things like database table updates, S3 bucket changes, log events, and event streams.
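To make the "HTTP requests as events" idea concrete, here is a minimal sketch of what a function behind API Gateway might look like. The handler name `hello` and the greeting logic are made up for illustration, and the event and response types are trimmed-down assumptions that declare only the fields used here; the real proxy event carries many more.

```typescript
// Simplified shapes of an API Gateway proxy event and response.
// Assumption: only the fields this sketch reads are declared.
interface ApiGatewayEvent {
  httpMethod: string;
  path: string;
  queryStringParameters: Record<string, string> | null;
}

interface ApiGatewayResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

const JSON_HEADERS = { "Content-Type": "application/json" };

// Hypothetical handler. In a real deployment this function would be
// exported from the module so the Lambda runtime can find it.
async function hello(event: ApiGatewayEvent): Promise<ApiGatewayResponse> {
  // The HTTP request arrives as a plain object, not as a socket or stream.
  if (event.httpMethod !== "GET") {
    return {
      statusCode: 405,
      headers: JSON_HEADERS,
      body: JSON.stringify({ error: "Method not allowed" }),
    };
  }

  // Query string parameters are null when the request has none.
  const name = event.queryStringParameters?.name ?? "world";

  // The returned object is translated back into an HTTP response.
  return {
    statusCode: 200,
    headers: JSON_HEADERS,
    body: JSON.stringify({ message: `Hello, ${name}!` }),
  };
}
```

Notice there is no server, port, or router in the function itself. Mapping a route to the function is configuration, not code.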

If you’re as invested in cloud infrastructure as we are at Asurion, then serverless is an attractive option.

Why we chose serverless

Specifically, we chose the Serverless Framework with Node.js on AWS. Before we dove into serverless, our technology stack for APIs was Java/Spring Framework services running in Docker containers on an autoscaling ECS cluster. We’ve had many headaches over the past couple of years troubleshooting the vagaries of this infrastructure: issues with autoscaling, Docker container resources, and the list goes on. When we began working on a new project a few months ago, the prospect of duplicating the effort involved in standing all of that infrastructure up and integrating it with our existing automation was daunting, to say the least. Our team was lean and wanted to move quickly, so spending a bunch of time on infrastructure automation, scaling, and management was not really an option.

The most compelling thing about serverless architecture, besides being the new hotness, was the ability to move quickly and only have to think about writing code and drinking coffee. Another added benefit was cost: we would only be paying for the traffic and resources we were using, rather than paying for a bunch of servers to sit mostly idle.

So what do we think?

Overall, I believe that adopting serverless architecture has been a success. Despite it being quite different from our existing tech stack and having to start completely from scratch, we were able to get something into production in about a day or so. This was a huge win for a team that was trying to create value quickly. Another benefit: the time from starting work on a new feature to deploying it dropped drastically, though some of this is due to having less process around releases.

We also rarely have to think about the underlying infrastructure. This has probably been the biggest advantage so far. In the nine months we’ve been running a serverless stack in production, we’ve never had a single issue concerning scaling or resource allocation. In fact, we’ve hardly had production issues at all, which is remarkable considering this was an unfamiliar stack to us. The few problems we’ve had were simple configuration issues.

[Image caption: Highly scientific.]

I also took a quick poll of our engineering team on Slack and the feedback was mostly positive.

It hasn’t been all roses though. While the overall experience has been pretty good, we’ve had our fair share of pain points. Here are our top three criticisms.

The pain points

Maturity

This one is obvious, but it bears mentioning: serverless is really new. Certain things that would be trivial in a mature framework are difficult or simply don’t exist yet for serverless. Here are a few things we would really like to have that aren’t quite there yet.

Unified logging — The nature of serverless demands a modular, distributed, microservice architecture rather than a monolithic architecture. That’s a great thing! However, one side effect is that there are many different resources generating their own logs. Even for a simple serverless API in AWS, at a minimum, you’ll have logs for the API Gateway and each individual lambda function connected to the API. There is no convenient way in AWS to bring all of these logs together in a single place so that you can get a clear picture of everything that happened for a given HTTP request.

And as your application grows in complexity, this problem will only compound. For example, if you want to secure an endpoint in your API with a custom authorizer, you now have three different log locations for a given request. What if you also need to interact with another AWS resource, maybe a Kinesis stream? Now there are four different logs for one request. It would be really nice to have all of these logs tied together.

We’ve tried to mitigate this problem by writing a custom logger for our lambdas that pushes logs into Kibana, but we still can’t group them with API Gateway logs or other resources’ logs to get a full picture of a single HTTP request.
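For a flavor of what such a logger can look like, here is a rough sketch. It is not our actual logger: `createLogger`, the field names, and the trimmed-down event and context types are all made up for illustration. The idea is to stamp every log line with the request ID that API Gateway puts on the event (`requestContext.requestId`) and the invocation ID from the Lambda context (`awsRequestId`), so that lines from one request can at least be searched for together. Shipping the lines to a log store is left out; the sketch just writes JSON to a sink that defaults to `console.log`.

```typescript
// Minimal structural types. Assumption: only the fields read below are declared.
interface RequestEvent {
  requestContext?: { requestId?: string };
}

interface InvocationContext {
  awsRequestId: string;
  functionName: string;
}

type Level = "info" | "warn" | "error";
type Fields = Record<string, unknown>;
type Sink = (line: string) => void;

// Hypothetical factory: builds a logger bound to a single invocation.
function createLogger(
  event: RequestEvent,
  context: InvocationContext,
  sink: Sink = (line) => console.log(line)
) {
  // Correlation fields attached to every line for this invocation.
  const base = {
    apiRequestId: event.requestContext?.requestId ?? "unknown",
    lambdaRequestId: context.awsRequestId,
    functionName: context.functionName,
  };

  const write = (level: Level, message: string, extra: Fields = {}): void => {
    // Spread order matters: correlation fields override caller-supplied extras.
    sink(
      JSON.stringify({
        ...extra,
        ...base,
        level,
        message,
        timestamp: new Date().toISOString(),
      })
    );
  };

  return {
    info: (message: string, extra?: Fields) => write("info", message, extra),
    warn: (message: string, extra?: Fields) => write("warn", message, extra),
    error: (message: string, extra?: Fields) => write("error", message, extra),
  };
}
```

A handler would call `createLogger(event, context)` once at the top and use the returned logger from there. This only ties together the lines your own code writes; it does nothing for the logs API Gateway itself produces, which is exactly the gap described above.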

Blue/green deployments — This is a concept that’s been around for some time. While it is certainly possible to accomplish this with serverless architecture in general, the serverless framework does not make this easily achievable. AWS is making some great additions to their services to support this concept (canaries for API Gateway and traffic shifting for AWS Lambda), so we are confident that this is something we will be able to implement in the near future.

Auto-generated clients/documentation — AWS already supports exporting documentation and clients from an API Gateway, but it’s difficult to maintain. The API documentation has to be hand-crafted and added to your API Gateway. It is possible to export Swagger-formatted documentation of your API, even if you have not added any documentation (which we are doing), but it will be incomplete and missing important information like the request/response formats and required headers for authorization.

It would also be great if the documentation could be generated using annotations in code, similar to Swagger Codegen. For now, we have opted to document our API using a Postman collection.

Latency

By its nature, serverless architecture will sometimes be slower than a continuously running cluster of services. There will be times that a compute resource will need to be instantiated to handle the incoming request, which adds overhead to the response time. This is known as a “cold start.” It’s best practice to avoid cold starts whenever possible and to minimize the cold start penalty when it’s unavoidable.

Sometimes, a cold start can be quite painful. For example, in AWS it is possible to have your lambda function exist within a specific VPC, or virtual private cloud. This may be necessary if your lambda has to interact with another resource that is only accessible within its VPC. One pretty common example of this is if you need to connect to a database, which most REST services would need to do. Unless you are using DynamoDB, your database instances will need to be in a VPC. Cold starts for a lambda function within a VPC can take much longer than they would otherwise; in this case, the worst-case cold start penalty can be as much as 20–25 seconds! That is staggering.

[Image caption: Be careful around those spikes!]

It’s worth noting that our average latency is still really good, generally under 200ms. There are other factors to consider regarding latency, such as deployed code size, function resource configuration (memory and CPU), and data caching. Sometimes, it’s just completely out of your control. There isn’t anything that can be done about the details of Amazon’s Lambda infrastructure. Hopefully this is something they can fix in the future, since needing to connect to a database… is pretty common.

One way that you can combat the cold start penalty is to keep your lambda functions warm. And, if you’d like to avoid the extra long VPC cold start penalty, use DynamoDB for persistence so your lambdas won’t have to be in VPCs.
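Keeping a function warm usually means invoking it on a schedule so its container doesn't get recycled. Here is a minimal sketch of the function side, under a few assumptions: the warm-up ping comes from a scheduled CloudWatch Events rule (those events carry `source: "aws.events"`), the handler name and response bodies are made up, and the schedule itself is configuration that isn't shown.

```typescript
// Assumption: we only look at `source` to tell a scheduled ping from real traffic.
interface IncomingEvent {
  source?: string;
  [key: string]: unknown;
}

interface HandlerResult {
  statusCode: number;
  body: string;
}

// Module scope survives between invocations of the same warm container,
// so this flag is true only for the first invocation after a cold start.
let coldStart = true;

// Hypothetical handler. In a real deployment it would be exported.
async function handler(event: IncomingEvent): Promise<HandlerResult> {
  const wasCold = coldStart;
  coldStart = false;

  if (event.source === "aws.events") {
    // Warm-up ping: return right away and skip the real work.
    return { statusCode: 200, body: JSON.stringify({ warmed: true, wasCold }) };
  }

  // Real request: do the actual work here.
  return {
    statusCode: 200,
    body: JSON.stringify({ message: "handled real request", wasCold }),
  };
}
```

One caveat: a single scheduled ping keeps only one container warm. If several requests arrive at the same time, the extra ones can still land on cold containers, so this softens the problem rather than removing it.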

Visibility

As with every framework designed for ease of use, the complexity of the implementation is intentionally abstracted away from the developer. This is usually a great thing; you’re now free to write your code and have one less thing to consider. But, what if there’s a problem?

Debugging a serverless application can be difficult, in part because there are few ways to simulate locally the exact environment your application runs in. Tools like SAM Local exist, but SAM Local is really built for use with AWS SAM (Amazon’s own serverless framework) and it’s still in beta.

There’s also a ton of moving parts. Even a simple serverless API in AWS will likely have:

  • An API Gateway with a myriad of configuration options
  • A Custom Domain for that API, which is technically a CloudFront distribution that maps to an API Gateway and stage (what?)
  • At least one authorizer to secure your API (Cognito User Pool or a lambda function)
  • Lambda functions to handle the incoming request

Each of these has its own configuration options and can be a point of failure. And if you happen to be using the Serverless Framework, like we are, then all of these resources are generated using CloudFormation behind the scenes, which can be a headache to troubleshoot.


TL;DR

Serverless is cool. You can move quickly and not get bogged down by managing your infrastructure. But, there are tradeoffs to consider: for increased speed and reduced overhead you will incur latency penalties, reduced visibility, and increased complexity of your architecture. It’s up to you to decide whether the benefits outweigh the drawbacks.


If you’re interested in joining our team, feel free to check out the job openings at Soluto Nashville and send me a note!