Revisiting the Serverless email delivery service

Joakim Wånggren
Schibsted engineering
8 min read · May 17, 2023

This article is a follow-up to Building a serverless email delivery service on AWS, which was published about 2.5 years ago. Not much has happened during this time, but some observations and improvements have been made.

Photo by Mehmet Ali Peker on Unsplash

Speeding up the build

AWS SAM CLI added some cool features to speed up the build process. When working with SAM it is recommended to have a samconfig.toml file to store SAM configuration. It can contain default deployment parameters like stack name and region, along with settings such as whether to prompt before executing a changeset or how to act when a deployment contains no actual changes. More information on how to use samconfig.toml is available in the AWS SAM CLI Configuration documentation.

Two parameters were added to speed up the build process: cached and parallel.

[default.build]
[default.build.parameters]
cached = true
parallel = true

By default, all functions are built sequentially; by adding parallel = true to the samconfig.toml, the functions and layers in the template file are built in parallel. Similarly, the cached directive enables reuse of build artifacts that haven't changed since previous builds, avoiding unnecessary building altogether. Speedy build = Happy dev!
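The same settings can also be passed as flags directly on the command line for a one-off build:

$ sam build --cached --parallel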

Optimizing the email building lambda

Emails are asynchronous by nature and 100 ms of extra lead time won't make a difference for the user. However, those extra 100 ms cost extra money, and if you have a job pushing 20,000 emails to the queue at once, you want to be able to process them as quickly as possible, since regular emails triggered by user actions are stuck behind them in the queue. Also, it's never fun to waste CPU cycles.

The lambda taking up the most time was the email building and sending lambda, the one that read the MJML file from disk, compiled it to HTML, made variable substitutions and sent the HTML to AWS SES. Let’s set that aflame with 0x and see what optimizations can be made.

A small test script was added that loaded all dependencies, slept for 1 second and then proceeded to build and send an email.

$ 0x -o --collect-delay 1000 lib/clisend.js

A delay was added to remove the time it took to load all files, something that typically only happens on cold start in AWS Lambda.

The flame graph showed that 46% of the time was spent compiling MJML to HTML. That compilation was not dependent on any input and compiling it every time for the same email template was wasted CPU cycles.

One clear optimization is to pre-build the MJML into HTML during the build process and then use the HTML in the Lambda function directly. This can be achieved by first splitting out the compilation into its own function:

// builder.js — the requires below assume the mjml and html-minifier packages
const assert = require('assert');
const { readFile } = require('fs').promises;
const mjml = require('mjml');
const { minify } = require('html-minifier');

const mjmlCompile = async (template) => {
  const buf = await readFile(`${process.cwd()}/templates/${template}`);
  const content = buf.toString();
  const { html, errors } = mjml(content, {
    filePath: `${process.cwd()}/templates`,
    keepComments: false,
  });

  // Fail hard if the template does not compile cleanly
  assert.ok(errors.length === 0, JSON.stringify(errors));
  return minify(html, {
    minifyCSS: true,
    collapseWhitespace: true,
    conservativeCollapse: true,
  });
};

module.exports = { mjmlCompile };

And then adding a script to perform that compilation:

const { readdir, writeFile } = require('fs').promises;
const { mjmlCompile } = require('./builder');

readdir(`${process.cwd()}/templates/`).then((files) => files
  .filter((f) => f.endsWith('.mjml'))
  .forEach(async (template) => {
    try {
      const html = await mjmlCompile(template);
      const filename = template.replace('.mjml', '.html');
      console.log(`Compiled ${template} to ${filename}`);
      await writeFile(`${process.cwd()}/templates/cache/${filename}`, html);
    } catch (e) {
      console.error(`Failed to compile template: ${template}`, e);
      process.exit(1);
    }
  }));

The script is executed by the npm run build script in the package.json.
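As a rough sketch (the script path and names here are placeholders, not the actual ones), the scripts section of package.json could look something like this:

"scripts": {
  "compile-templates": "node scripts/compile-templates.js",
  "build": "npm run compile-templates && sam build"
}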

Once the HTML files are built and available, an environment variable can be used as a flag to decide whether to use them or not.

let html;
if (process.env.USE_TEMPLATE_CACHE === 'true') {
  const buf = await readFile(`${process.cwd()}/templates/cache/${data.template}.html`);
  html = buf.toString();
} else {
  html = await mjmlCompile(`${data.template}.mjml`);
}

The flame graph now shows that entire block is gone.

But how did it affect the actual execution of the Lambda function in the production environment?

Average execution time went from approx 850 ms to 220 ms, almost 4 times as fast.

Using SQS the right way

Queues are great! We established that in the previous post. By having the Lambda triggered by SQS, any Lambda failure will leave the message in the queue to be retried as many times as specified in the configuration.

Concurrent executions and rate limiting

Now that our function was almost 4 times as fast and we had a cron job queuing 10k-20k emails every morning, we needed to rate limit the Lambda function to avoid hitting the SES rate limit. How do you calculate and set a suitable limit? Let's say the SES soft limit is 100 emails/second; what should the Lambda concurrency limit be? Our average execution time was 220 ms, which means one function could send roughly 4.5 emails per second. That gives us a maximum of 22 concurrent executions.

Concurrency limit = downstream rate limit (requests/second) × average execution time (seconds)
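As a minimal sketch of that calculation (the helper name is ours, the numbers are from the example above):

// Derive a Lambda concurrency limit from a downstream rate limit
// and the average execution time of the function
const concurrencyLimit = (rateLimitPerSecond, avgExecutionMs) =>
  Math.floor((rateLimitPerSecond * avgExecutionMs) / 1000);

concurrencyLimit(100, 220); // => 22 concurrent executions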

There are currently two methods of limiting Lambda concurrent executions with SQS: using the Reserved Concurrency of the Lambda function, or using the Maximum Concurrency of the SQS event source. In the first iteration, only the former was available, but it has its downsides when used together with SQS.

If you set the Reserved Concurrency of the Lambda function to a low value, it sets a maximum number of simultaneously running instances of the function. The Lambda service that polls SQS behind the scenes will keep polling and forwarding messages to your function, and if there are no available instances, the invocation is throttled and the message goes back to the queue with an increased receive count. If that same message is throttled again and again until it hits the maximum receive count, it will eventually end up in the Dead Letter Queue instead of being sent. This is generally bad.

By requesting an increase in the SES rate limit and tweaking the Lambda Reserved concurrency, we managed to get rid of all those throttles, but the risk of throttled messages hitting the DLQ was still there.

In early 2023, AWS announced Maximum Concurrency for SQS event sources, where messages are held back in the queue until there is an available Lambda function. With AWS SAM this is configured with the MaximumConcurrency property in the ScalingConfig of the event configuration. This setting works better with SQS than the Lambda Reserved Concurrency and is something we switched to recently.
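In the SAM template that could look roughly like this (the resource names here are placeholders):

Events:
  EmailQueueEvent:
    Type: SQS
    Properties:
      Queue: !GetAtt EmailQueue.Arn
      ScalingConfig:
        MaximumConcurrency: 22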

Batch size

With SQS it is possible to configure how many messages are forwarded to the Lambda function at a time using the Batch size setting. However, the default behavior of Lambda is that when one of those messages fails, the whole batch fails and is put back in the queue.

For our use case, that could lead to unwanted situations. Let's say the Lambda function processes a batch of 10 emails; 9 are sent successfully but the 10th fails to send for some reason. The entire batch fails and the 9 successfully sent emails are added back to the queue, processed and sent again, giving the user another copy of the exact same email in their inbox. Because of this, the service ran with a batch size of 1 to avoid resending already sent emails.

In November 2021, AWS announced a great feature that allows better handling of failed messages within a batch: Partial Batch Failure. With it, the Lambda function can report back to SQS which messages failed, so that only those messages in the batch are retried. AWS has a guide on how to enable Partial Batch Failure, and one needs to be careful about the success and failure criteria when the ReportBatchItemFailures feature is enabled. To avoid confusion, always return a JSON object with the property batchItemFailures set to an empty array when all messages were successful, or to an array of objects with the property itemIdentifier set to the SQS messageId of the failed message.
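Enabling the feature in the SAM template is a matter of adding FunctionResponseTypes to the SQS event configuration (again, resource names are placeholders):

Events:
  EmailQueueEvent:
    Type: SQS
    Properties:
      Queue: !GetAtt EmailQueue.Arn
      FunctionResponseTypes:
        - ReportBatchItemFailures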

Using this functionality with our handler would look something like this

const builder = require('./lib/builder');
const sender = require('./lib/sender');

exports.handler = async (event) => {
  // Process every message in the batch, returning a failure marker for each one that throws
  const promises = event.Records.map(async (record) => {
    try {
      const body = JSON.parse(record.body);
      const mail = await builder.render(body);
      await sender.send(mail, body);
      return null;
    } catch (err) {
      return { itemIdentifier: record.messageId };
    }
  });
  // Report only the failed messages back to SQS; an empty array means the whole batch succeeded
  const batchItemFailures = (await Promise.all(promises)).filter((e) => !!e);
  return { batchItemFailures };
};

Hitting the budget

In terms of cost, the service is hitting the budget and SES is the main cost driver. All resources that the service uses are tagged with a service tag that is also enabled as a cost allocation tag. You can find and edit cost allocation tags under Billing in the AWS console.

SES costs can sadly not be tagged and won't show up in AWS Cost Explorer when querying the cost allocation tag. The service sends approximately 4.2 to 4.5 million emails per month at a total cost of about $55 for the service and $420 to $450 for SES, which is in line with the estimated cost of approximately $110 per million emails.

The service has been running without hiccups for a couple of years now, much thanks to the serverless architecture on AWS with Lambda and SQS. At Schibsted we always strive to run our services cost-efficiently, without waste, and in a way that is easy to maintain. Flame graphing and profiling heavy functions to find optimizations, and cost tagging all resources to track what each service actually costs, is a winning concept.

PS: We’re hiring and have exciting positions in all our locations across the Nordics and Poland. Check out our open positions at https://schibsted.com/career/.
