Monitoring Serverless Applications

Make sure your code works, even in production

Diego A. Rojas
WAES
6 min read · Jan 18, 2022


Since AWS released Lambda back in 2015, every cloud provider has started creating its own implementation of FaaS — Function as a Service. We can find serverless implementations even in Kubernetes clusters, through open-source projects such as Kubeless, and there is also the Serverless Framework, a provider-agnostic tool for writing and deploying serverless functions.

This architecture style has a lot of benefits but should not be considered a silver bullet. In most cases, a bad serverless implementation is caused by a misunderstanding of the problems this paradigm solves. For example:

Let’s say you create a CRUD application for products and decide to use serverless to “save some money” by not having a service running 24/7. Then you remember that each serverless function works as an isolated unit of work, so each one may have its own runtime and dependencies.

Once you have coded the functions, you need to deploy them and make them available. Normally, in an API model, it would look like this:

  • POST /product
  • GET /product
  • GET /product/{id}
  • PUT /product
  • DELETE /product

You can solve it by configuring your API Gateway to target each serverless function, but now your complexity has increased: if you change the domain object, you will need to update (and deploy) every one of those separate functions. That extra complexity, in turn, increases the total cost of the product.

It is important to remember the rule of thumb that the total cost (measured in effort, and therefore money) of a software product is roughly 30% creation and 70% maintenance over the product life cycle.

Representation of what a bad serverless implementation looks like

Let’s talk now about where serverless architecture is more beneficial. Since every function is isolated, everyday operations such as synchronization, object transport, validation, and transformation are good use cases for this paradigm. Always remember: simple units for processing data.
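To make "simple units for processing data" concrete, here is a minimal sketch of such a function in Python: a Lambda-style handler that validates and transforms a single incoming record. The event shape and the field names (`sku`, `price`) are hypothetical, chosen only for illustration.

```python
import json

def handler(event, context=None):
    """Validate and transform one incoming record: a simple unit of work."""
    # Accept either a raw dict or an API-Gateway-style event with a JSON body.
    record = json.loads(event["body"]) if isinstance(event.get("body"), str) else event

    # Validation: reject records missing required fields.
    if "sku" not in record or "price" not in record:
        return {"statusCode": 400,
                "body": json.dumps({"error": "sku and price are required"})}

    # Transformation: normalize the record before handing it downstream.
    normalized = {"sku": str(record["sku"]).upper(),
                  "price": round(float(record["price"]), 2)}
    return {"statusCode": 200, "body": json.dumps(normalized)}
```

Each invocation does one small, stateless job, which is exactly the shape of workload this paradigm rewards.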

Validate your code works as expected

If we are talking about monitoring, we first need to start with the code itself. You need to make sure every function you deliver complies with good practices, is scalable and secure, and, even more importantly, does what it is meant to do. You have a lot of tools to improve your code quality, such as:

  • Linters
  • Tests (unit, component, mutation)
  • Code scanners (GitHub actions, for example)

Those tools generate reports, and that is your first line of monitoring. You need to be aware of the state of the product you deliver, because the source code is, in many (almost all) cases, the explanation for weird behavior in production. Once your code is ready for deployment, we can start looking at what needs to be monitored.
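As a sketch of the testing layer, here is what a small unit test for a serverless handler might look like. The `create_product` handler is hypothetical; in a real project it would live in its own module and be imported by the test file, typically run by a tool such as pytest.

```python
def create_product(event):
    """Hypothetical handler standing in for one of your functions."""
    if not event.get("name"):
        return {"statusCode": 400}
    return {"statusCode": 201, "product": {"id": 1, "name": event["name"]}}

def test_create_product_rejects_empty_name():
    assert create_product({"name": ""})["statusCode"] == 400

def test_create_product_returns_created():
    response = create_product({"name": "keyboard"})
    assert response["statusCode"] == 201
    assert response["product"]["name"] == "keyboard"

# With pytest installed, `pytest -q` would collect and run these;
# here we simply call them directly.
test_create_product_rejects_empty_name()
test_create_product_returns_created()
```

The test report these produce is exactly the kind of artifact that forms your first line of monitoring, before anything reaches production.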

What should be monitored?

Now that your code is up and running, you should start paying attention to the entire lifecycle of the function.

Cold starts

Every function runs inside a container (on some platforms, a lightweight micro-VM). Depending on the load, the service can scale up by starting more instances. The first invocation of a new instance may be delayed because the container starts from zero; this is the so-called “cold start”.

Depending on the implementation, lifecycle work at the function start event, such as retrieving data from a source or performing a scan, may take more time than initially expected. This delay won't be present in the already-warm containers, at least not until the service scales back down and new cold starts occur.
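A common mitigation is to move heavy initialization out of the handler and into module scope, so it runs once per container rather than on every invocation. The sketch below illustrates the pattern; the config value is a stand-in for whatever expensive work (opening connections, fetching secrets) your function does at startup.

```python
import time

# Module scope runs once per container: this is the cold-start cost.
# Anything initialized here is reused for every warm invocation.
_CONTAINER_STARTED = time.monotonic()
_CONFIG = {"db_host": "example.internal"}  # hypothetical: e.g. loaded from a config store

def handler(event, context=None):
    # Per-invocation work only; the config above is reused while warm.
    return {
        "warm_for_seconds": time.monotonic() - _CONTAINER_STARTED,
        "db_host": _CONFIG["db_host"],
    }
```

On the second and later invocations of the same container, `warm_for_seconds` grows while the initialization cost is never paid again.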

Errors

You need to identify your errors and classify them so you can find the cause fast. Once your function is deployed, you must be prepared to receive any error it raises. Whether it is produced by the code implementation, data inconsistency, or platform problems, you must be two steps ahead and ready to deal with it.

It is pretty common to use alarms to identify when an error occurs more often than is acceptable; for that, you will need to define your threshold beforehand. It is also common to get notified of errors through third-party services such as Slack, PagerDuty, Opsgenie, etc.
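The core of such an alarm is a simple threshold check over a measurement window. The sketch below shows the logic most monitoring tools implement for you; the 5% default is an arbitrary example of a threshold you would define beforehand.

```python
def should_alert(error_count, total_count, threshold=0.05):
    """Return True when the error rate exceeds the acceptable threshold."""
    if total_count == 0:
        return False  # no traffic, nothing to alarm on
    return (error_count / total_count) > threshold
```

In practice a monitoring service evaluates this over a sliding window and, when it fires, forwards the notification to Slack, PagerDuty, Opsgenie, or whichever channel you configured.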

Success Request Ratio

You can use the proportion of successfully executed requests as your service level indicator (SLI). With this value, you can then define your service level objective (SLO) and your service level agreement (SLA).

Typically, the SLA is the number specified in the contract, with penalties for the company if the service is not delivered as promised. The SLO, on the other hand, is the objective to be met by the SLI. It is usually set above the SLA and reflects how satisfied your customers are with the service you are providing.

Let’s suppose your SLO is to have 95% of requests succeed. Then you identify strange behavior: in the first hour, your success ratio falls to 80%, then to 60%. If you have alarms configured, you will be notified of the triggered errors. Still, a scenario like this generally indicates you are potentially under attack, and then the approach is entirely different.
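The SLI/SLO relationship above can be sketched in a few lines: the SLI is the measured success ratio over a window, and the SLO is the target it must meet (95% here, matching the example).

```python
def success_ratio(successes, total):
    """SLI: fraction of successful requests over a measurement window."""
    return successes / total if total else 1.0

def slo_met(successes, total, slo=0.95):
    """SLO check: does the measured SLI meet the 95% objective?"""
    return success_ratio(successes, total) >= slo
```

With 80 successes out of 100 requests, `slo_met(80, 100)` is false, which is exactly the condition that should page someone.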

Latency

You can use latency as an indicator (SLI) for your SLA/SLO. For example, suppose your client requires that every request take no more than 1 second; if one does not, it should be considered an issue.

There are many possible causes of a latency issue: a bad network, errors in the code, poor performance. In the end, this is also something you need to keep an eye on.
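Latency is usually tracked as a percentile rather than an average, since a few slow requests can hide behind a healthy mean. A minimal sketch, assuming the 1-second requirement from the example and the nearest-rank percentile method:

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of latency samples (in seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

def latency_slo_met(samples, limit_seconds=1.0):
    """True when the p95 latency stays under the agreed limit."""
    return p95(samples) <= limit_seconds
```

One outlier in twenty samples still passes a p95 check, but two do not, which matches the intuition that occasional slowness is tolerable while a pattern is an issue.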

Cost

This is a MUST-HAVE metric. Remember, with serverless you pay for the computing power you use, whether the invocation succeeds or fails. That’s why you need to pay attention to the successful requests/total requests metric.

Even if you have a problem with latency or application performance, you will still pay for the computing power you consume. So start by defining limits for memory usage and the maximum execution time of a function, and, most importantly, for the external resources used: databases, queues, notifications…
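A back-of-the-envelope cost model makes the memory/duration trade-off visible. The sketch below uses AWS Lambda's published list prices at the time of writing as defaults (an assumption; check your provider and region), and deliberately ignores free tiers and external resources, which often dominate the real bill.

```python
def estimated_cost_usd(invocations, avg_duration_s, memory_mb,
                       price_per_gb_second=0.0000166667,  # assumed list price
                       price_per_request=0.0000002):      # assumed list price
    """Rough compute-cost estimate for a pay-per-use function.

    Cost = GB-seconds consumed * price per GB-second
         + invocations * price per request.
    """
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second + invocations * price_per_request
```

For example, a million invocations at 200 ms and 512 MB comes out under two dollars of compute, which is why the database, queue, and notification charges around the function deserve at least as much attention.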

Recommendations when monitoring your functions

There are a lot of tools to monitor serverless functions. Some of them are services offered by the cloud provider. No matter which tool you choose, they all need one resource to work: data.

Meaningful logging

The base unit for monitoring your service is a proper logging strategy. Your logs should provide enough information about what is happening at some point in the source code. It is crucial to define them during the development phase. Your code should not go to production without proper logging.

It is important to mention that your logs must be readable and appropriately formatted so that you can query them fast. Essential pieces of information that should be in your log messages include error codes, service information, the user id, and whatever else you find necessary.
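One common way to make logs queryable is to emit one JSON object per line with those fields attached. Here is a minimal sketch using Python's standard `logging` module; the service name and field names are illustrative, not prescribed.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line, so a log
    aggregator can filter on fields instead of grepping free text."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "product-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Structured fields attached by callers via `extra=`.
        for field in ("error_code", "user_id"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

logger = logging.getLogger("product-service")
stream = logging.StreamHandler()
stream.setFormatter(JsonFormatter())
logger.addHandler(stream)
logger.setLevel(logging.INFO)

logger.info("product created", extra={"user_id": "u-42"})
logger.error("lookup failed", extra={"error_code": "DB_TIMEOUT", "user_id": "u-42"})
```

With this format, "all DB_TIMEOUT errors for user u-42 in the last hour" becomes a field query instead of a regex hunt.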

Tracing

Logs help you know what happened inside the function itself, but in typical scenarios your function is just one part of a bigger picture. Tracing allows you to track the entire request lifecycle and measure the performance of each service independently. For example, if your service is too slow, you can build the whole lifecycle graph and identify where the problem lies: code, database, network, etc.
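A hand-rolled sketch of the idea: share one trace id across the steps of a request and time each hop as a span, so the slow step stands out when the spans are assembled. Real tracing tools (OpenTelemetry, AWS X-Ray, and friends) do this across process and service boundaries for you; the helper and step names here are purely illustrative.

```python
import time
import uuid

def traced_call(trace, name, func, *args):
    """Run one step of a request and record a span (name, duration) for it."""
    start = time.monotonic()
    try:
        return func(*args)
    finally:
        trace["spans"].append({
            "name": name,
            "duration_s": time.monotonic() - start,
        })

# One trace id for the whole request; every span hangs off it.
trace = {"trace_id": str(uuid.uuid4()), "spans": []}
traced_call(trace, "validate", lambda p: p, {"sku": "AB-1"})
traced_call(trace, "save-to-db", lambda p: p, {"sku": "AB-1"})
# The slowest entry in trace["spans"] points at the bottleneck.
```

Propagating `trace_id` in request headers between services is what lets a tracing backend stitch these spans into the lifecycle graph described above.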

Tools

As before, the collected data is of no use if you don’t have anything to collect and process it into insight. The monitoring service must be fast, easy to adopt, and provide insightful metrics.

One of the tools that has gained relevance in the serverless monitoring world is Dashbird. It offers a free trial and is a good starting point for your benchmark of observability tools.

There are more providers, such as Instana, Honeycomb, CloudWatch, Grafana, and the list goes on. It is up to you to define what you need, build your benchmark, and implement.

Prevention is better than reaction

A good observability strategy is a competitive advantage over similar products. You need to be informed about what is happening and make adjustments where the system needs them, guided by the error budget and the SLO you defined. Don’t wait until your client raises a ticket; instead, stay informed and make proper decisions based on data and client satisfaction.

Join WAES

Are you thinking about moving abroad to take your career to the next level?

WAES has several excellent opportunities to work at top-notch tech companies in the Netherlands. The only thing you have to do is pack your bags; we take care of everything else.

Take a look at our jobs on our website.

Follow WAES: LinkedIn · Instagram · Facebook · Twitter · YouTube
