Ensuring SaaS reliability with synthetic traffic — ‘SynthMail’

Sam Gibson
Glasswall Engineering
May 22, 2020

As a member of Glasswall’s SRE team, my number one concern is the reliability of our products. Reliability of our Rebuild for Email product is especially important, as it’s responsible for ensuring our customers never receive unsafe or untrusted email attachments. Our system works as a mail transfer agent that scans and remediates email attachments with Glasswall’s d-FIRST system, a form of content disarm and reconstruction technology.

A graph of our email throughput for a cluster. See how its peaks and troughs align with workdays and weekends.

Ensuring the reliability of such a system is no small feat. We have set up a full monitoring suite, integrated with Datadog and PagerDuty, so that we can detect abnormalities within minutes of them occurring, but this monitoring depends on email actually flowing through the system. When a problem does occur, customers may observe delays in mail delivery, which is bad for business when prompt communication is everything in today’s fast-paced environment. So, how can we detect problems and delays in the system before they affect any customer?

In this post I will describe a system we have designed and implemented that sends synthetic emails through the system at a high rate, triggering our monitoring as early as possible so that we can remediate problems before they start affecting our customers.

Background

We’re an SRE team, so we try to follow Google’s ‘scripture’ as often as we can. In a nutshell, this means we define ‘service level objectives’ (SLOs) for the products we manage. These SLOs are based on specific metrics for a system, and are backed by hard data. Our SLOs for Rebuild for Email can be found on this page. In this post we’re focusing on the mail latency metrics.

If mail starts getting delayed, then we start eating into our ‘error budget’ for a particular SLO. If we use up more than 100% of the error budget, then we have broken the SLO. Ideally we don’t want this to happen, so we should be alerted that we’re burning the error budget very soon after it starts happening.
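To make the arithmetic concrete, here is a small sketch of how error-budget burn could be computed. The function and the numbers are illustrative, not our real SLO targets:

```python
# Illustrative error-budget arithmetic for a latency SLO.
# Function name and figures are hypothetical, not Glasswall's real targets.

def error_budget_burned(total_mails: int, slow_mails: int, slo_target: float) -> float:
    """Return the fraction of the error budget consumed.

    An slo_target of 0.99 means 99% of mails must meet the latency bound,
    so the budget is the 1% of mails that are allowed to be slow.
    """
    budget = (1.0 - slo_target) * total_mails  # mails allowed to miss the bound
    return slow_mails / budget if budget else float("inf")

# 100,000 mails with a 99% SLO gives a budget of 1,000 slow mails;
# 750 slow mails burns roughly three quarters of the budget, so we want
# an alert to fire well before the ratio reaches 1.0 (SLO broken).
burn = error_budget_burned(100_000, 750, 0.99)
```

An alert on the burn *rate* (how fast this ratio is climbing) pages us long before the budget is exhausted.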

Google’s SRE workbook states that artificial traffic can be useful for detecting issues before they affect your end users:

A system can synthesize user activity to check for potential errors and high-latency requests. In the absence of real users, your monitoring system can detect synthetic errors and requests, so your on-call engineers can respond to issues before they impact too many actual users.

That sounds like exactly what we need, so ‘SynthMail’ was born.

Design

A high-level overview of the Synthetic Mail system.

This diagram shows an overview of the proposed solution. It will be deployed as a set of microservices into our ‘worldwide’ Kubernetes cluster, where it will run continuously and send/receive email to and from our SaaS clusters, essentially acting as a real customer’s mail server. It consists of three components:

  • Test sender — creates emails based on config and sends them on a cron schedule to a list of SMTP endpoints. Informs test receiver every time an email is sent.
  • Postfix relay — receives emails from the SaaS cluster and relays them to test receiver. Postfix was chosen for reliability, because it will queue mail if for some reason it can’t be sent, and it can also be tuned to induce extra SMTP faults to further test the system.
  • Test receiver — waits to receive emails after being informed about them by test sender. When an email is received, computes flight time metrics and sends them to Datadog.
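For illustration, the relay’s main.cf could be as small as the fragment below. The hostname and port are placeholders, and the settings shown are examples of the queueing behaviour Postfix offers rather than our exact configuration:

```ini
# main.cf (sketch): forward everything to the test receiver.
# Hostname and port are placeholders, not real endpoints.
relayhost = [test-receiver.synthmail.internal]:2525

# Keep retrying queued mail instead of bouncing it, so a transient
# outage in the SaaS cluster shows up as latency rather than lost tests.
maximal_queue_lifetime = 1d
soft_bounce = yes
```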

It’s just a huge loop of email, with some extra metadata being sent through via MIME headers. The metrics are sent to Datadog as a histogram, so we can easily query for percentiles and compute our SLOs.
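As an illustration of that metadata loop, the sketch below stamps a send timestamp into a custom MIME header and reads it back on receipt to compute flight time. The header name and the code are assumptions for illustration, using Python’s stdlib email package, not our production implementation:

```python
# Sketch: stamp a send timestamp into a MIME header on the way out,
# then read it back on receipt to compute flight time.
# The header name "X-SynthMail-Sent" is an illustrative assumption.
import email
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime, parsedate_to_datetime

def build_test_mail(body: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "test@synthetic.mail"
    msg["To"] = "test@synthetic.mail"
    msg["X-SynthMail-Sent"] = format_datetime(datetime.now(timezone.utc))
    msg.set_content(body)
    return msg

def flight_time_seconds(raw: bytes, received_at: datetime) -> float:
    msg = email.message_from_bytes(raw)
    sent_at = parsedate_to_datetime(msg["X-SynthMail-Sent"])
    return (received_at - sent_at).total_seconds()

mail = build_test_mail("synthetic test")
# In reality the bytes travel through the SaaS cluster first.
delay = flight_time_seconds(mail.as_bytes(), datetime.now(timezone.utc))
```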

Operation

The tests are defined in a YAML config file loaded as a ConfigMap by the test sender microservice. Here is an example that sends a plain email without an attachment, as well as an email with a PNG attached:

endpoints:
  ukprod: smtp.uk.ourendpoint.io
tests:
  - name: png_attached
    cron: "0 * * * * * *"
    spec: ./specs/attach_png.yml
    endpoints: [ukprod]
  - name: no_attachments
    cron: "0 * * * * * *"
    spec: ./specs/no_attachments.yml
    endpoints: [ukprod]

As you can see, SMTP endpoints are defined at the top, and then an array of tests is given. Each test is given a name (which appears in metrics and logs), a cron schedule defining how often it should be sent (both here are every minute), a path to a spec file (which describes how to generate the email), and a list of endpoints to send the test to.

A spec file is essentially a YAML description that can be converted directly into a MIME email:

headers:
  To: [ test@synthetic.mail ]
  Date: "{{ now() }}"
  From: [ test@synthetic.mail ]
  X-FileTrust-Tenant: <removed-for-security>
body: This contains a PNG
attach:
  - png.png

It’s a really flexible way of generating emails, and can be hooked up to a cloud storage solution to access a shared library of attachments. This is very helpful for ensuring our system is correctly rejecting/sanitising bad files. We are hoping to open source this mail generation code soon.
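To sketch how a spec like the one above could be rendered into a MIME message: the field names below mirror the spec, but the rendering code itself is an assumption, not the actual SynthMail implementation.

```python
# Sketch: render a SynthMail-style spec dict into a MIME message.
# The spec layout mirrors the YAML above; this rendering code is an
# illustrative assumption, not Glasswall's implementation.
from email.message import EmailMessage

def render_spec(spec: dict, attachments: dict[str, bytes]) -> EmailMessage:
    msg = EmailMessage()
    for name, value in spec.get("headers", {}).items():
        # List-valued headers (To, From) become comma-separated strings.
        msg[name] = ", ".join(value) if isinstance(value, list) else value
    msg.set_content(spec.get("body", ""))
    for filename in spec.get("attach", []):
        msg.add_attachment(attachments[filename],
                           maintype="application", subtype="octet-stream",
                           filename=filename)
    return msg

spec = {
    "headers": {"To": ["test@synthetic.mail"], "From": ["test@synthetic.mail"]},
    "body": "This contains a PNG",
    "attach": ["png.png"],
}
msg = render_spec(spec, {"png.png": b"\x89PNG..."})
```

Attaching anything promotes the message to multipart/mixed, which is exactly the shape a real customer’s mail would have.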

The end result of the SynthMail system — the metrics! This is how they look in Datadog.

When the test mails are received at the end of the system, the metrics are computed and submitted to Datadog as a histogram. This lets us directly see how many mails are meeting our SLOs and how many are not.

We can then use Datadog’s SLO monitor system to track our SLOs and error budget!

Pictured: a stable system, meeting its targets!

Conclusion

The synthetic mail system has been a huge success for us. We can now be confident of detecting delays in mail flow before they begin affecting our customers, and of mitigating issues before they burn large amounts of our error budget.

It’s also had a few fringe benefits:

  • We can now accurately measure delays correlated with external events such as cloud service failovers and deployments.
  • We can get a real-time view of system performance against our baseline of expected mail throughput. Previously this was difficult with the aforementioned ebb and flow of traffic.
  • When we do have incidents, we can use the synthetic mail system to quickly assess the degree of mail-flow impact. This greatly reduces operational complexity when it comes to call-outs.

If you are running a SaaS platform, I would encourage using synthetic data to assist your monitoring wherever possible. It’s well worth the initial time investment of developing a system to generate the data.
