
PAY per USE can derail your Serverless (dream) Budget

Dhaval Nagar · Published in AppGambit · Dec 18, 2020 · 7 min read

This is the 2nd post in the Serverless communications series.

re:Invent 2020 UPDATE: AWS Lambda has changed the billing duration from 100ms to 1ms 🙌 . Check the full update here: https://aws.amazon.com/about-aws/whats-new/2020/12/aws-lambda-changes-duration-billing-granularity-from-100ms-to-1ms/

[POST #1] Select the right Event-Routing service from Amazon EventBridge, Amazon SNS, and Amazon SQS

[POST #2] PAY per USE can derail your Serverless (dream) budget

Monitoring tells you whether the system works. Observability lets you ask why it’s not working.

https://dashbird.io/blog/observablity-serverless

Serverless and Event-Driven Services

Serverless applications are all about EVENTS. Events drive the whole system. One of the biggest advantages of serverless architecture is that we can extend the whole system by introducing new subsystems (microservices) connected via communication services.

Queue, Pub/Sub, and Streaming Data services can help extend the system easily with patterns like Fan-Out and Service Chaining.

For example, we can add new subscribers to SNS or new targets to EventBridge rules without any modifications to the existing pipeline.
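
As a minimal sketch (the rule name, event bus, and function ARN here are made up), attaching a new consumer to an existing EventBridge rule is a single API call; nothing in the existing pipeline changes:

```python
import boto3

# Attach one more consumer to an existing EventBridge rule.
# Rule name, bus name, and function ARN below are hypothetical.
events = boto3.client("events")

events.put_targets(
    Rule="order-events-rule",        # existing rule stays as-is
    EventBusName="default",
    Targets=[
        {
            "Id": "new-analytics-consumer",
            "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:analytics-writer",
        }
    ],
)
# Note: the new target function also needs a resource-based permission
# (lambda add-permission) allowing events.amazonaws.com to invoke it.
```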

Pay for what you (actually) CONSUME

Pay for what you consume is the fundamental billing model on which the whole Serverless ecosystem works.

One bad (or overlooked) design decision can derail the whole consumption cycle that might have been working just fine.

For example, let’s take this use case.

Note: This flow example is used to highlight some design problems.

The pipeline interacts with SQS, SNS, and EventBridge to fan-out the event and further writes the data into the DynamoDB table.
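
A rough sketch of what such a fan-out step can look like in code, assuming hypothetical topic, bus, and table names (the real pipeline's resources and payload shape will differ):

```python
import json
import os
import boto3

# Hypothetical resource names, injected via environment variables.
sns = boto3.client("sns")
events = boto3.client("events")
table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])

def handler(event, context):
    # Triggered by SQS: each record carries one event payload.
    for record in event["Records"]:
        payload = json.loads(record["body"])

        # Fan out to SNS subscribers...
        sns.publish(TopicArn=os.environ["TOPIC_ARN"], Message=json.dumps(payload))

        # ...and to EventBridge rule targets.
        events.put_events(
            Entries=[{
                "Source": "app.pipeline",
                "DetailType": "EventReceived",
                "Detail": json.dumps(payload),
            }]
        )

        # Persist the event for downstream reads.
        table.put_item(Item=payload)
```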

This pipeline is hosted in a non-US region. All the major AWS services have parity across regions, but new services are sometimes available in only a few regions initially. I will show how this difference can lead to failures and eventually to different design decisions to optimize resources.

“orange” stretch is the Lambda function execution time

Looking at the cost, it would come down to this for 100,000 events.

0.00019838 * 100,000 = $19.84

Note: This is an estimated cost, used just for example purposes. It does not consider the API Gateway and SQS cost for sourcing the events, and it assumes all the Lambda functions finish within the first 100ms billing block.
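
As a rough back-of-the-envelope sketch of how the Lambda-duration part of such a figure can be estimated (the prices below are the public us-east-1 Lambda prices at the time, and the per-event figure above also includes the other services in the pipeline and however the tracing tool estimates cost):

```python
# Rough Lambda-only estimate; the rest of the per-event cost comes from the
# other services in the pipeline. Prices are the public us-east-1 figures
# at the time of writing and may differ in your region.
PER_REQUEST = 0.20 / 1_000_000     # USD per invocation
PER_GB_SECOND = 0.0000166667       # USD per GB-second

def lambda_cost(invocations, memory_mb, billed_ms):
    gb_seconds = invocations * (memory_mb / 1024) * (billed_ms / 1000)
    return invocations * PER_REQUEST + gb_seconds * PER_GB_SECOND

# e.g. 100,000 events through 4 functions, 128 MB each, one 100 ms block each
print(round(lambda_cost(100_000 * 4, 128, 100), 2))   # -> 0.16
```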

The above screenshot is from warm Lambda functions. If we take cold starts into account, the cost will be much higher, but let’s assume that we have a steady stream of data and cold starts are not that frequent.

Same workflow with cold starts for all the functions

Also, let’s check the service-level latency and message delivery latency for each of the services.

Now, let’s extend our Serverless application

AWS recently announced the general availability of Amazon Timestream. It’s a great database, plus it’s fully SERVERLESS, and we pay for what we store and query, similar to how Amazon DynamoDB works.

Amazon Timestream is a fast, scalable, and serverless time series database service that collects, stores, and queries time-series data for IoT and operational applications. With Amazon Timestream, you pay only for what you use.

That’s great, now let’s add that into our pipeline and see it running.

The below screenshot is after integrating the Amazon Timestream database service. The service is working as expected and I can see the data in the Timestream table just fine.

However, if you look at the Lambda run-time analytics, the functions are taking longer to finish. That is BAD NEWS. The Timestream API calls are taking longer, and that wait time is increasing the Lambda execution time.

Looking at the cost now, it would come down to this for 100,000 events.

0.0043275 * 100,000 = $432.75

So What Happened!!

Amazon Timestream is available in limited regions, and this particular application is running in a different region. New services may take some time to become available in all regions. We used simple SDK calls to connect to the Timestream table and write the data.
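
The write itself is a plain SDK call. A simplified sketch of the cross-region version (the database, table, and measure names are made up); note that the whole round trip to US-EAST-1 happens inside the function’s billed duration:

```python
import time
import boto3

# The client is pointed at us-east-1 because Timestream is not available in
# the pipeline's own region; every WriteRecords call now crosses regions.
ts_write = boto3.client("timestream-write", region_name="us-east-1")

def write_metric(service, value):
    ts_write.write_records(
        DatabaseName="app-metrics",     # hypothetical database/table
        TableName="events",
        Records=[{
            "Dimensions": [{"Name": "service", "Value": service}],
            "MeasureName": "processing_time",
            "MeasureValue": str(value),
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),   # milliseconds by default
        }],
    )
```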

The “orange” stretch shows the Lambda run-time execution. The example uses an Amazon Timestream database hosted in the US-EAST-1 region.

The additional (outrageous) cost is mainly due to the round-trip latency from the source region to US-EAST-1.

Amazon Timestream is just one example; it can very well be replaced with any downstream service, 3rd-party API, or other process that may experience delays.

Service Latency table from Lumigo Dashboard

Possible Solutions!

If you compare the two, you will see that Amazon EventBridge also has significantly higher latency. So why is that not affecting the cost?

Amazon EventBridge latency is a service-level latency: it takes time to actually trigger the Lambda function, but that time does not add to the consumption part. It simply means that events will arrive late at the Lambda function. This latency is really important when choosing the correct service for your use case. Refer to Post #1 for more detail on this.
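
One way to see this difference yourself: the delivery delay can be observed by comparing the event’s timestamp with the invocation time, but none of it shows up in the function’s billed duration. A small sketch, assuming the function is triggered directly by an EventBridge rule:

```python
from datetime import datetime, timezone

def handler(event, context):
    # EventBridge events carry an ISO-8601 "time" field set when the event
    # was put on the bus. Any delivery delay happened before this code runs,
    # so it is visible as event age but not as billed duration.
    sent = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
    age_ms = (datetime.now(timezone.utc) - sent).total_seconds() * 1000
    print(f"Event age at invocation: {age_ms:.0f} ms")
```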

The first solution that comes to mind is to use the VPC PrivateLink (Endpoint) to communicate using the AWS private network.

Amazon Timestream does not support private endpoints as of this writing.

The next feasible solution is to consume the service in the same region instead of making the cross-region API calls.

Using the service in the same region

Note: In practice, though, this would not be the best solution.

If you run the whole pipeline in the same region, the latency is reduced drastically. The Lambda function now only waits for the Timestream Write API to complete.
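
In code, the only change is where the client points; a sketch assuming the Timestream database has been created in the function’s own region:

```python
import os
import boto3

# Use the function's own region (set automatically by the Lambda runtime)
# instead of hard-coding us-east-1; the WriteRecords round trip now stays local.
ts_write = boto3.client(
    "timestream-write",
    region_name=os.environ["AWS_REGION"],
)
```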

Looking at the cost now, it would come down to this for 100,000 events.

0.000545109 * 100,000 = $54.51

Note: This does not include the Amazon Timestream pricing, as it depends on a number of factors like storage, retention, and data scanned per query.

Other options would be to use low-latency services like API Gateway and SQS to do the cross-region event sourcing.

API Gateway —> async invoke —> Lambda function —> Timestream API

VPC Endpoint —> SQS Queue —> Lambda function —> Timestream API

This would introduce some overhead to maintain pipelines in different regions. Still, it is not a bad design, as we are doing it to optimize the overall pipeline cost.
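
A sketch of the second option, with a hypothetical queue URL: the source region only pays for a quick SendMessage call (over a VPC endpoint if needed), and a function living next to Timestream in US-EAST-1 does the actual write:

```python
import json
import boto3

# Queue URL is hypothetical. The queue lives in us-east-1, next to Timestream;
# a Lambda function in that region consumes it and calls WriteRecords locally.
sqs = boto3.client("sqs", region_name="us-east-1")

def forward_event(payload):
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/timestream-ingest",
        MessageBody=json.dumps(payload),
    )
```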

What about 3rd party API Calls

Oftentimes, long-running API calls do not need to be part of a synchronous call chain where the caller waits until the response comes back. That waiting is pure wastage of resources in Serverless applications.

Instead, the API or integration call should be made independently, and the response should be chained either in a workflow using Step Functions or with simple orchestration like Lambda Destinations.

This way the API call can run independently, fail independently and may not need the same resources that the other functions need.
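
As a sketch of the Destinations option (the function name and queue ARNs are made up), the routing is configured once on the function’s async invoke config rather than coded into a caller that sits and waits:

```python
import boto3

lambda_client = boto3.client("lambda")

# Route async results of the (hypothetical) third-party API caller function.
lambda_client.put_function_event_invoke_config(
    FunctionName="third-party-api-caller",
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnSuccess": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:api-success"},
        "OnFailure": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:api-failure"},
    },
)
```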

Keep in mind that Lambda Destinations do not replace Dead-letter queues.

If both Destinations and DLQ are used for Failure notifications, function invoke errors are sent to both DLQ and Destinations targets.

Monitor and Measure what you CONSUME

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

The overall example illustrates the significance of Observing the Serverless applications so that key metrics can be used to optimize the application across performance, availability, and cost parameters.

For this example, I am using the Lumigo.io timeline feature that makes it extremely simple and easy to extract meaningful insights.

Make sure you have the right tools and metrics in place to monitor and measure WHAT YOU CONSUME.

Monitoring Serverless Applications using Lumigo

Lumigo adds an important consolidated layer of Monitoring and Observability to your serverless application.

They have a Free tier with up to 1 Million invocations and 150K traced invocations.

Sign up to get more details: https://lumigo.io/pricing/

Next post…

In the next post, I will extend this example to further illustrate how to capture failures and retries in the serverless pipeline and configure the services to save the failed messages for future processing.

Dhaval Nagar · AppGambit

AWS Hero, 12x AWS Certified, AWS SME, Founder @ AppGambit