Real-Time Lambda Cost Analysis Using Honeycomb
All of the services that support our web and mobile applications at Fender Digital are built using AWS Lambda. With Lambda’s cost-per-use billing model we have cut the cost of hosting our services by approximately 90%. Lambda functions are billed on the number of requests and the GB-seconds of compute consumed. GB-seconds are calculated from the memory the function is configured for and the billed invocation duration, which is rounded up to the next 100 ms increment. For example, a function configured for 512 MB that runs for 924 ms is billed at the 1000 ms increment and consumes 0.5 GB-seconds of compute. While we have been diligently tagging our resources to calculate the cost of each service, AWS Cost Explorer does not let us compare a function’s configured memory against the resources it actually consumes, or the billed duration against the actual invocation time.
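To make the billing model concrete, here is a minimal sketch of that calculation in Go; the `billedGBSeconds` helper is an illustrative name, not part of our production code:

```go
package main

import (
	"fmt"
	"math"
)

// billedGBSeconds rounds the invocation duration up to the next 100 ms billing
// increment and converts it to GB-seconds of compute for the configured memory:
// (memory in GB) x (billed duration in seconds).
func billedGBSeconds(memoryMB int64, durationMS float64) float64 {
	billedMS := math.Ceil(durationMS/100.0) * 100.0
	return (float64(memoryMB) / 1024.0) * (billedMS / 1000.0)
}

func main() {
	// The example from above: 512 MB configured, 924 ms of run time is billed
	// at the 1000 ms increment, which works out to 0.5 GB-seconds of compute.
	fmt.Println(billedGBSeconds(512, 924)) // 0.5
}
```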
Log aggregation to the rescue! We aggregate all of the CloudWatch log groups for our Lambda functions into a single Kinesis stream, from which we parse out the structured JSON logs and publish them to honeycomb.io. I cannot recommend that tool highly enough for analyzing log data. It is a great product from a great group of people. AWS adds its own log lines as well, including when the function starts an invocation, when the invocation ends, and a report line that can be used to calculate the cost of the invocation. Here is an example of a report line:
REPORT RequestId: 4af5ccd9-fef7-11e8-8de1-bfd74775c3f5 Duration: 38.07 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 32 MB
There is a lot of good information in there; we just need to tease it out with a regular expression. The one we use in our Go consumer is:
`REPORT RequestId: ([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})\s+Duration: (\d+\.\d+) ms\s+Billed Duration: (\d+) ms\s+Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB`
This extracts the following from the report line (a sketch of the parsing code follows the list):
- Request ID
- Invocation duration in ms
- Billed duration in ms
- Memory size configured in MB
- Maximum memory used in MB
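Here is a minimal sketch, in Go, of how that extraction might look; the `report` struct and `parseReport` helper are illustrative names rather than our production code:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// reportRegexp matches the REPORT line that Lambda writes to CloudWatch at the
// end of every invocation.
var reportRegexp = regexp.MustCompile(`REPORT RequestId: ([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})\s+Duration: (\d+\.\d+) ms\s+Billed Duration: (\d+) ms\s+Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB`)

// report holds the values extracted from a single REPORT line.
type report struct {
	RequestID        string
	DurationMS       float64
	BilledDurationMS int64
	MemorySizeMB     int64
	MaxMemoryUsedMB  int64
}

// parseReport pulls the five captured fields out of a REPORT line.
func parseReport(line string) (*report, error) {
	m := reportRegexp.FindStringSubmatch(line)
	if m == nil {
		return nil, fmt.Errorf("not a REPORT line: %q", line)
	}
	duration, err := strconv.ParseFloat(m[2], 64)
	if err != nil {
		return nil, err
	}
	billed, _ := strconv.ParseInt(m[3], 10, 64)
	memSize, _ := strconv.ParseInt(m[4], 10, 64)
	memUsed, _ := strconv.ParseInt(m[5], 10, 64)
	return &report{
		RequestID:        m[1],
		DurationMS:       duration,
		BilledDurationMS: billed,
		MemorySizeMB:     memSize,
		MaxMemoryUsedMB:  memUsed,
	}, nil
}

func main() {
	line := "REPORT RequestId: 4af5ccd9-fef7-11e8-8de1-bfd74775c3f5 Duration: 38.07 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 32 MB"
	r, err := parseReport(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", r)
}
```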
The CloudWatch log entry contains the account ID, function name, and invocation time.
Using the billed duration and the configured memory size, we can calculate the GB-seconds consumed by the invocation. We then package up an event to publish to Honeycomb.
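Building on the `parseReport` sketch above, here is roughly what that step could look like using Honeycomb’s open-source libhoney-go client; the field names and the `publishReport` helper are illustrative, and `libhoney.Init` must be called elsewhere with your own write key and dataset:

```go
import (
	"strconv"

	libhoney "github.com/honeycombio/libhoney-go"
)

// gbSeconds converts configured memory and billed duration into GB-seconds:
// (memory in GB) x (billed duration in seconds).
func gbSeconds(memorySizeMB, billedDurationMS int64) float64 {
	return (float64(memorySizeMB) / 1024.0) * (float64(billedDurationMS) / 1000.0)
}

// publishReport packages a parsed REPORT line as a Honeycomb event. It assumes
// libhoney.Init(libhoney.Config{WriteKey: "...", Dataset: "..."}) was called at
// startup and libhoney.Close() at shutdown.
func publishReport(functionName string, r *report) error {
	ev := libhoney.NewEvent()
	ev.AddField("function_name", functionName)
	ev.AddField("request_id", r.RequestID)
	// duration_ms is sent as a string; the Honeycomb schema treats it as a float.
	ev.AddField("duration_ms", strconv.FormatFloat(r.DurationMS, 'f', 2, 64))
	ev.AddField("billed_duration_ms", r.BilledDurationMS)
	ev.AddField("memory_size_mb", r.MemorySizeMB)
	ev.AddField("max_memory_used_mb", r.MaxMemoryUsedMB)
	ev.AddField("gb_seconds", gbSeconds(r.MemorySizeMB, r.BilledDurationMS))
	return ev.Send()
}
```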
You’ll notice that we send the duration_ms field as a string. Honeycomb allows us to update the schema to use that value as a float when performing calculations. We’ve also defined the billed_duration_ms field in the Honeycomb schema as an integer so we can order by it.
Now that we have data in Honeycomb, let’s query it to see which functions consumed the most compute power over a three-day span:
The function that cost us the most during this time was use1-prod-subscriptions_lambdas-appstore_poller. That function is triggered by TTL record expirations on a DynamoDB stream, which signal it to check the current state of an in-app purchase receipt with Apple or Google. We need to poll for the receipt state to determine which users have disabled auto-renew on their subscriptions, as that event is not sent to us via the server-to-server app store notifications. This allows us to target win-back campaigns to students who do not want to continue their subscription.
Let’s dig deeper into that function to see the average actual elapsed time per 100 ms billing increment…
Now we can see that roughly two-thirds of our cost comes from invocations with a billed duration of 100–300 ms. Let’s check out the heatmap of the actual invocation durations for those billing increments.
The heatmap covering the full range of 100–300 ms billed increments shows that the vast majority of invocations complete in under 10 ms. Let’s break down each of those billed durations to see the distribution within each increment.
The function is currently configured for 512 MB, and since most of its time is spent in network I/O calling app store APIs, we can reduce the configured memory. If we were to reduce it to 384 MB we would see an ~25% reduction in compute cost. The average invocation time may be slightly higher, but as this function is invoked off of a DynamoDB stream it has no direct impact on user experience.

What are the actual costs incurred? Assuming we’ve already consumed the free tier for Lambda, during the 3-day period we are looking at there were 1,405,640 requests x $0.0000002 per request = $0.28 and 140,057.8 GB-seconds of compute time x $0.00001667 = $2.33. For those three days, this function cost $2.61. Since the per-request charge is unaffected by memory size, dropping to 384 MB would save roughly 25% of the $2.33 compute portion, a potential savings of about $0.58, assuming billed durations hold steady. That’s not a lot, but as our use of Lambda steadily grows we can ensure we are not over-provisioned.
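Here is that arithmetic as a small sketch, using the public Lambda rates quoted above:

```go
package main

import "fmt"

const (
	perRequest  = 0.0000002  // $ per request
	perGBSecond = 0.00001667 // $ per GB-second of compute
)

func main() {
	requests := 1405640.0
	gbSec := 140057.8

	requestCost := requests * perRequest // ≈ $0.28
	computeCost := gbSec * perGBSecond   // ≈ $2.33
	fmt.Printf("current cost: $%.2f\n", requestCost+computeCost) // ≈ $2.61

	// Dropping the configured memory from 512 MB to 384 MB scales the
	// GB-seconds by 384/512 = 0.75, assuming billed durations stay the same;
	// the per-request charge does not change with memory size.
	reducedCompute := computeCost * 384.0 / 512.0
	fmt.Printf("potential savings: $%.2f\n", computeCost-reducedCompute) // ≈ $0.58
}
```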
What’s Next?
We do have an enhancement to make to the report log analysis. Currently we are not processing the log line when a function times out:
2019-04-18T18:50:54.807Z ddca0e63-91ce-4fa1-a61f-a73f96925b3b Task timed out after 6.01 seconds
We have middleware on our functions that captures a timeout before AWS kills the invocation, but to be complete it would be best to also record the timeout from this log line in case the middleware itself fails.
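A minimal sketch of what that might look like in Go; the `timeoutRegexp` name and capture groups are illustrative, not code we have shipped:

```go
package main

import (
	"fmt"
	"regexp"
)

// timeoutRegexp matches the line Lambda writes when an invocation times out,
// capturing the timestamp, the request ID, and the timeout in seconds.
var timeoutRegexp = regexp.MustCompile(`([\d-]+T[\d:.]+Z)\s+([0-9a-fA-F-]{36})\s+Task timed out after ([\d.]+) seconds`)

func main() {
	line := "2019-04-18T18:50:54.807Z ddca0e63-91ce-4fa1-a61f-a73f96925b3b Task timed out after 6.01 seconds"
	if m := timeoutRegexp.FindStringSubmatch(line); m != nil {
		fmt.Printf("request_id=%s timed_out_after=%s seconds\n", m[2], m[3])
	}
}
```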