This year I was lucky enough to be able to present at the Container Camp conference in Sydney, Australia. The talk was titled “Keeping an Eye on your Serverless Containers”, and I was doubly lucky in that I got to present together with my friend Prateek Nayak from Innablr.
As I don’t believe I can replicate the back and forth that (hopefully) made the talk itself entertaining, the structure of this writeup will be slightly different from that of the talk. If you are interested in the slide deck, however, you can find it below, and once the video recording goes up, I will add that as well.
A billion-dollar idea
One day while Prateek and I were catching up, we spoke about the popularity of serverless. It seemed to us that everything is headed in that direction, but that nobody seems to care about those poor servers that will suddenly be without a purpose. But then, while we had a beer in memory of those servers, inspiration struck! We could help with this, and in doing so make a lot of money!
And so was born the idea for Cuddle Kube, the retirement home for your servers. The premise is straightforward; instead of storing your retired servers in the corner of a dusty old attic (or throwing them away), you can bring that server to us where it can live out the rest of its unnatural life in comfort.
Sounds like a pretty good idea, doesn’t it? While I’m sure that some of you might be interested in the logistics of building the actual home, I will now only focus on the infrastructure for the website.
Unless you skipped over the title of this talk, it won’t come as a surprise that we decided to use serverless containers for our architecture. And as we chose to run this on AWS, that means Fargate. I’ve written and presented about Fargate and its advantages elsewhere, so I’ll keep it brief here.
When talking about serverless containers, we mean long-running containers that are managed in a serverless manner, not a FaaS product like Lambda. Serverless containers give you most of the same benefits as containers in general but take away the need to maintain the underlying infrastructure. In other words, instead of having to deploy, patch, scale, and otherwise manage the instances your containers run on, you only need to focus on your containers. AWS handles everything else. I like to visualise this as having these containers float in space.
As the initial version of the website discussed here is a proof of concept, we only included some of the APIs we considered in the design. That said, we’re going for a fully API-driven microservices architecture to ensure we win buzzword bingo, I mean, can easily extend it later. Using this in combination with Fargate, we came up with the following architecture.
An Application Load Balancer provides the entry point for the public website service, which in turn calls all of the other services as they are needed, and for our datastore we use DynamoDB. While an ALB is not technically a serverless service, it is one where we don’t need to worry about the underlying infrastructure. However, let’s be clear that both the ALB and Fargate are running inside a VPC, with the ALB being the only publicly available endpoint.
The above is a working architecture, and for the small environment we have here it might be enough. However, the plan is to expand on this, and we want to make sure that the environment stays up and running. To do so, we will want to have some observability tools included so we can find out what is going on with our application. Instead of focusing on a specific tool, we first decided to define what observability meant for us.
There are three main pillars that we wanted to cover for Cuddle Kube:

- Logging
- Monitoring
- Distributed Tracing
Let’s go over these one by one and then discuss how we can handle them in our infrastructure.
We defined logging as a series of discrete, timestamped events. In our case, this should consist of the logs from both our containers and the services running in them. We can then use these logs as the forensic information to determine what exactly failed.
For example, if our application throws an error because it can’t reach DynamoDB, that is something that should show up in the logs.
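As a sketch of what such a discrete, timestamped event might look like, here is a minimal structured-logging helper; the field names and values are illustrative, not taken from the actual Cuddle Kube code:

```python
import json
import time

def log_event(level, message, **fields):
    """Emit a single discrete, timestamped log event as one JSON line."""
    event = {
        "timestamp": time.time(),  # epoch seconds; CloudWatch records ingestion time separately
        "level": level,
        "message": message,
        **fields,
    }
    # On Fargate, anything written to stdout is what the log driver ships
    # to CloudWatch Logs, so one JSON object per line keeps events queryable.
    print(json.dumps(event))
    return event

# Example: the DynamoDB connection failure mentioned above
log_event("ERROR", "failed to reach DynamoDB",
          service="register-api", table="cuddle-kube")
```

Keeping each event on a single JSON line also makes it trivial to filter and aggregate later, whether in CloudWatch Logs Insights or Athena.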
For monitoring, we want to focus on the user experience and pay attention to the metrics that are related to the working of the application. In most cases, we don’t care if a container is using 80 or 90 percent of its allocated CPU. As long as there is no impact to the user, that is perfectly fine. However, if the response time of a page suddenly spikes because those containers can’t handle the incoming requests, it becomes something we should be aware of.
This means that our monitoring is focused on known failure modes: when we set it up, we concentrate on the things we already know can impact our systems.
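As a rough sketch of what monitoring a user-facing metric rather than container CPU could look like, here is a CloudWatch alarm definition on the ALB’s response time. The names, thresholds, and the load balancer dimension value are all made up for illustration; the dict would be passed to boto3’s `put_metric_alarm`:

```python
# Alarm on user-facing latency: fire when the ALB's average TargetResponseTime
# stays above 1 second for three consecutive minutes. Apply it with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
alarm_params = {
    "AlarmName": "cuddle-kube-slow-responses",
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "TargetResponseTime",
    "Dimensions": [
        # Illustrative value; the real one comes from your ALB's ARN suffix.
        {"Name": "LoadBalancer", "Value": "app/cuddle-kube/1234567890abcdef"},
    ],
    "Statistic": "Average",
    "Period": 60,             # one-minute data points
    "EvaluationPeriods": 3,   # three consecutive breaches before alarming
    "Threshold": 1.0,         # seconds
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": [],       # e.g. an SNS topic ARN for notifications
}
```

Note that nothing here looks at CPU or memory; the alarm only fires when users would actually notice slow pages.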
Tracing means that we can follow a single request from start to finish, thereby finding out what services it goes through and how these services perform. This, in turn, means that it gives us a way to pinpoint where failures occur or where poor performance comes from.
In many cases, tracing will enable us to find out things we didn’t know were going wrong and help us get to the root cause of them.
We now know what data we wish to collect, but that still leaves us with some potential questions about how we collect this. And not only do we need to collect it, but we’ll need to aggregate and visualise it as well to make it truly useful.
In more traditional environments, a lot of observability is handled at a server level, or through tooling built into the container orchestration or service meshes. However, as we don’t have access to the underlying servers, we have to find other means.
How to do this serverless?
We can gain some of these insights with tooling provided by AWS. CloudWatch integrates with ECS, which Fargate runs on, and allows us to automatically send all of the logs generated by our containers to CloudWatch Logs. This way, we have the logging part handled. We can then parse these logs using CloudWatch Logs Insights or pass them to S3 for parsing with Athena.
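The wiring for this lives in the ECS task definition: each container gets a log configuration using the `awslogs` driver. A fragment along these lines (the log group name, region, and prefix are illustrative) is all that’s needed for stdout/stderr to land in CloudWatch Logs:

```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/cuddle-kube",
      "awslogs-region": "ap-southeast-2",
      "awslogs-stream-prefix": "register-api"
    }
  }
}
```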
For the metrics, we can get quite a bit of insight from CloudWatch as well, and specifically Container Insights. This allows us to get detailed metrics at a container level without needing to do any work for it, and we can query and visualise this data as we require. We can hook this up to CloudWatch Alarms as well to trigger any alerts.
Both of these together will already give us a fair bit of the information we’re after, but it doesn’t provide us with everything. Most importantly, we are still missing the tracing capabilities. AWS has a service for that, X-Ray, but that would mean we need to integrate that into our applications and do a fair amount of work that way. Something that would make this easier is a service mesh. And as it happens, AWS offers a service mesh as a service: App Mesh.
App Mesh integration
When App Mesh first became available, I wrote about it in comparison with Istio for a Kubernetes setup. There I showed that while App Mesh was more limited in its capabilities (although a number of those gaps have since been closed), it was still capable enough for many scenarios. In addition, one of the great strengths of App Mesh is that it works just as well with a Fargate infrastructure as a Kubernetes one.
In this case, that means we can tie our microservices together with App Mesh. Like many other service meshes, App Mesh allows for a centralised configuration of your routing. These routing rules are then sent to the Envoy proxies running as sidecars in your Fargate tasks (an ECS or Fargate task is a collection of containers, similar to a Kubernetes pod).
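As a sketch of what one of those centralised routing rules looks like, here is an App Mesh HTTP route spec expressed as the dict you would hand to boto3’s `create_route` call; the virtual node name is illustrative:

```python
# A weighted HTTP route: send all traffic matching "/" to the register-api
# virtual node. Splitting weight across two nodes is how you would do a
# canary release. Pass this as the "spec" argument to
# boto3.client("appmesh").create_route(...).
route_spec = {
    "httpRoute": {
        "match": {"prefix": "/"},
        "action": {
            "weightedTargets": [
                {"virtualNode": "register-api-vn", "weight": 100},
            ]
        },
    }
}
```

App Mesh then pushes this rule out to the Envoy sidecars, so no individual task needs to know about the routing.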
Additionally, App Mesh supports native integration with X-Ray, allowing you to configure this more easily. As this is HTTP traffic, you will still need to do some instrumentation in your services to ensure the trace headers don’t get lost between service calls, but the required work for that is fairly minimal.
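A minimal sketch of that instrumentation: copy the X-Ray trace header from the incoming request onto any outgoing service calls, so Envoy can stitch the hops into one trace. This only shows the header handling; the actual services would do this inside whatever HTTP framework they use, and the trace ID below is a made-up example:

```python
TRACE_HEADER = "X-Amzn-Trace-Id"

def propagate_trace(incoming_headers, outgoing_headers=None):
    """Copy the X-Ray trace header from an incoming request to an outgoing one.

    Without this, each hop shows up in X-Ray as a brand-new trace instead of
    one request flowing through the mesh.
    """
    outgoing = dict(outgoing_headers or {})
    # HTTP header names are case-insensitive, so match on the lowered key.
    for key, value in incoming_headers.items():
        if key.lower() == TRACE_HEADER.lower():
            outgoing[TRACE_HEADER] = value
    return outgoing

# Example: forward the header when the website service calls the register API
incoming = {"x-amzn-trace-id": "Root=1-5759e988-bd862e3fe1be46a994272793"}
headers = propagate_trace(incoming, {"Content-Type": "application/json"})
```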
This then leads to the final version of our architecture. The diagram has been updated to show that traffic now goes through the Envoy proxy that is in front of every service, while at a high level also showing that CloudWatch and X-Ray are there for the observability.
Using this architecture, we built the application and had it up and running. Unfortunately, it seems that every time we try to demo it, we run into a little snag that requires us to do some debugging to fix it.
Luckily, thanks to our observability setup, we’ve always been able to find the issues quickly. As a small demonstration of how that works, the below screenshot of the X-Ray overview helps us to pinpoint that there is an issue with the register API.
In particular, we can see here that the issue is with the connection to the DynamoDB table. Without showing screenshots of every step, X-Ray gave us a direct entry to the logs generated by the application that showed the actual error. In this case, it was caused by missing SSL certificates when trying to connect to the DynamoDB API.
The good and bad
As expected, the theory was wonderful, but while building out the actual implementation, we ran across a couple of things that were less than perfect. In general, App Mesh is an excellent service, but it is a bit finicky and requires that everything is configured properly and in the right order. To be fair, it’s far from alone in that, and that by itself isn’t a significant concern. However, that is made worse by the lack of insight into your mesh network while you’re building it.
Until you’ve got it up and running, App Mesh doesn’t show you any information on what is happening or why it might be failing. This doesn’t improve after it’s running, but at least by then you can try things out and see their effect. On top of that, we had some issues with the integration with Cloud Map, which we used for service discovery. It’s certainly possible that much of this was our own fault, but the lack of information while building the mesh definitely caused some frustration. Because of that, I highly recommend that if you wish to use a setup like this, you start with a small portion, make that work, and build out from there.
The other thing to point out is something already mentioned in passing. Initially, we thought that the X-Ray integration with App Mesh would give us the complete picture, but that was unfortunately not the case. While it showed all traffic coming in through the Envoy proxies, it didn’t show the connections between the services. Or rather, it showed these all as new calls. This means we had to add a couple of lines of code to the application to handle that.
All that said, we were quite happy with the end result. We found that AWS provided us with a lot of things out of the box, that this all integrated nicely with each other, and it wasn’t all that much work to finalise the X-Ray integration. What this meant is that we found we could get all of the observability we were hoping for, despite not having any underlying infrastructure to run our observability toolkits on.
As for Cuddle Kube itself, unfortunately, we didn’t find any willing investors, and we have shut down the project for now. However, as we are still hopeful someone will step up to take care of those poor servers, we have made all of the source code available. You will just have to build out the physical facilities, but how much work can that be?