How to solve problems with transactional email API deliverability
There are good reasons to believe that email is the most effective online marketing medium. Even more than Facebook and Google advertising.
Unless you live in a country like Brazil or China where WhatsApp and WeChat are the default mediums, chances are you are going to start most online conversations over email. Even Facebook and LinkedIn make excessive use of email to keep their users “engaged” because it makes a huge, measurable, difference to their bottom line.
The effectiveness of email campaigns is also much easier to measure than other marketing channels. Click-fraud and robo-views are infamous problems with online advertising, but email open-rates are generally great signals.
Now the mom-and-pop pizzeria owners may have never heard of an Email API, but they likely use one without knowing it. One does not simply press a button to send an email.
There are decades of tack-on solutions to problems that no one thought of when Simple Mail Transfer Protocol (SMTP) was first standardized in 1982. Later additions such as Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM) complicate email delivery significantly.
When the pizzeria owners compose an email through an email database or marketing or ecommerce platform, it is handled by an Email Service Provider (ESP) who specializes in building the delivery infrastructure.
If you’re a software engineer, your first instinct might have been to use an SMTP library with Gmail credentials. After all, that’s how you send a personal email. But Gmail wasn’t designed for this. There are limits on how many emails you can send — 500/day last time I checked. It simply wasn’t designed for programmatic, auto-generated communication. In fact, Google actively tries to prevent it being used that way. So at some point, you or your marketing tool needs to use an email API (or spend millions buying one out).
All the leading traditional email APIs emphasize “reliability.” ESPs tend to throw this word around a lot in their marketing copy. But the word “reliable” is ambiguous, and often used in misleading ways, when speaking about email delivery.
If you’ve ever had to oversee production systems sending large volumes of email over the years, you know how difficult it is to do reliable email delivery. And even if you’ve never managed production systems, just pay attention to your inbox.
You’ve probably seen all sorts of strange problems, like late arrivals and mistaken spam filters, duplicate notifications, and botched ‘unsubscribe’ flows which somehow never stop the marketing emails you don’t want, while affecting the essential notifications you do want.
All of these cause terrible customer experiences. It doesn’t matter whether you are a ski resort or a bank. It’s not difficult to find reports of businesses that suffered direct revenue loss because of email deliverability issues; see, for example: 1, 2, 3.
Email APIs have been around for almost a decade, created to solve this very problem. But in this article, I list five major issues that affect email reliability when using traditional email APIs and I propose a new kind of solution.
Full disclosure: I wrote this article before building my company’s flagship product, Flute Mail, the better email API for small to medium-size senders.
Problem 0: Defining “reliability”: How to judge the quality of a provider
The first problem is in defining the problem. Every email company claims to be “reliable”, but what does that mean? It’s difficult to judge the quality of an Email API provider, and so instead they try to gain your trust by showing a carousel of greyed-out Fortune 500 logos on their website. SendGrid claims to handle emails for Uber and Yelp. If it’s good enough for them, it’s good enough for you, right?
Wrong. Large companies may have 24/7 phone lines or large customer support departments to deal with miscommunications that happen over email, and they have engineering departments which can build alternative features to handle the odd case when email fails. But if you’re a ski resort, for example, you can ruin your customer’s experience by failing to deliver the onboarding email, which explains what they need to pack, to your students before they start their journey. You simply don’t have the resources to follow up on every email to make sure everyone got it, or to build alternative customer education platforms.
Hence email reliability is even more important for medium-size businesses. What’s good enough for Uber and Yelp may not be good enough for your business: you need to take a close look at how many critical customer interactions happen over email.
When email service providers speak about “reliability”, they could mean one of many things. In this article, I focus on three technical aspects of reliability which may be used to judge the quality of a provider:
- Spam reputation management
- Uptime and redundancy
- Paper trails
And there are other problems with traditional Email APIs which are not directly related to outbound sending, such as:
- Vendor lock-in
- Lack of post-delivery guarantees and follow-ups
- High costs
It turns out that abstracting away the underlying provider can help solve all of these problems, giving you much better reliability.
Problem 1: Spam reputations (and useless, expensive, “dedicated” plans)
If you send fewer than 20,000 emails per month, you will qualify for the free-tier shared plans that most ESPs offer these days. But “free” comes at a price: you would be paying an unknown amount in support costs due to unreliable delivery. One example of a problem I’ve frequently seen: if the IP address you happen to be assigned today was used by a spammer previously, your emails are likely to land in the junk folder. As a free-tier customer, you wouldn’t have access to industry-standard delivery analysis data from 250ok, or deliverability experts who can help you take action on that data. And you get little support from your ESP, who has bigger fish to fry.
Worst of all, you’ll be silently losing business, because most of the time spam bounces go unreported. It’s not in your provider’s best interest to notify you that it is failing at its job; I found this out the hard way. And you may be left guessing as to why your marketing campaigns are failing to get clicks.
One thing I appreciated about Mailgun is that they would notify me about spam bounces, at least more often than other providers. I learned that many privately managed email servers from Ivy League colleges (where email was first implemented) had poorly configured spam filters which would reject a full 100% of our messages, much to our despair. But at least Mailgun would helpfully notify us of bounces on an individual basis, so that we could investigate, and even contact IT departments at those institutions. (We would contact them, and the missed students, through our Gmail accounts, which were pretty reliable.)
But I was annoyed by the pervasive spam bounces reported by Mailgun, and I made the switch to SendGrid, hoping they would offer better deliverability. For the first month, I was happy to see a smiley “100% delivered” graph on their shiny dashboard. But time passed, and customer complaints about missing emails came in, unabated. I dug deeper into SendGrid’s technical logs and found hundreds of unreported spam bounces and other errors. Their shiny analytics dashboard was entirely useless, dare I say downright dishonest. Worse yet, there was no traceability or unique searchable ID assigned to individual emails, so I could not easily follow up on particular customer support cases to figure out what went wrong.
Lesson learned: a provider which reports more spam failures is not necessarily worse at deliverability. And a “delivered” sign on a dashboard doesn’t mean that it reached the customer’s inbox.
Some recipient services like Gmail never report spam classification back to the sender, so it’s impossible for the sender to know if the email reached the inbox (unless you get a read notification).
Dedicated plans are *not* the solution. Many providers want you to throw money at the problem by buying dedicated IPs. In my opinion, there is no strong technical argument for why this will actually result in better delivery rates, except perhaps for the largest senders. Others agree with me, suggesting that ‘dedicated’ plans are a cash grab.
In fact, I think that dedicated IPs can actually increase the risk of deliverability problems. It takes time to build reputation for your dedicated IPs, and that reputation is precarious because risk is not spread across multiple large senders. A cranky ISP’s automated filters might choose to block a single IP for some silly reason, but it can’t do that for shared IPs which send millions of legitimate emails, at least not without noticeable consequences. Shared IPs give you a kind of safety in numbers. Furthermore, you can completely ruin your dedicated reputation if your overzealous-but-well-intentioned marketing department accidentally sends out a single email campaign that gets mistakenly interpreted as spam. Dedicated plans are expensive features which are out of the reach of the majority of small to medium-sized businesses, and are easily ruined by a single misstep. And it is not clear how they help anyone but the largest senders.
There is a better way to protect your sending reputation.
Solution 1: Avoid sending critical and non-critical email from the same environment
(EDIT: 4 days after I wrote this article, Postmark affirmed and clarified my advice in a blog post here).
Managing delivery reputation is a huge problem. Much ink has been spilled on this subject. But in my experience, assuming you are following the industry’s well known best practices, the single best thing you can do for managing your reputation is to send your critical messages through a different environment (provider & domain). Use a vendor that specializes in transactional email only (e.g., Postmark) for your critical email. Ask yourself: if this email is not delivered, is it likely to result in a customer support case, or a lost sale? Plenty of email does not fall into this category, but if it does, you need to take proactive measures to ensure it is delivered. Sending it through an environment designed ONLY for your critical emails ensures that it maintains a stellar open rate and reputation.
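To make the idea of separate environments concrete, here is a minimal sketch of how an app might pick a provider and sending domain based on the kind of email. Everything in it is an assumption for illustration: the `Kind` categories, the `Environment` fields, the provider labels, and the reserved example domains are mine, not any provider's API.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CRITICAL = "critical"    # e.g. onboarding info, receipts, password resets
    MARKETING = "marketing"  # newsletters, promotions
    TEST = "test"            # anything sent from staging or QA


@dataclass(frozen=True)
class Environment:
    provider: str        # hypothetical label for the ESP that handles the send
    sending_domain: str  # domain used in the From address


# Hypothetical setup: each kind of email gets its own provider AND its own
# domain, so a bad campaign or a stray test email cannot hurt critical mail.
ENVIRONMENTS = {
    Kind.CRITICAL: Environment("postmark", "example.com"),
    Kind.MARKETING: Environment("amazon-ses", "example.net"),
    Kind.TEST: Environment("sandbox", "example.org"),
}


def pick_environment(kind: Kind) -> Environment:
    """Return the environment an email of this kind must be sent through."""
    return ENVIRONMENTS[kind]


def from_address(kind: Kind, local_part: str = "hello") -> str:
    """Build the From address on the domain reserved for this kind of email."""
    return f"{local_part}@{pick_environment(kind).sending_domain}"


print(from_address(Kind.CRITICAL))   # hello@example.com
print(from_address(Kind.MARKETING))  # hello@example.net
```

The point of the design is that the decision is made once, in one place, by asking the question above: would a missed delivery cause a support case or a lost sale?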
Remember: Any email sent from your domain can affect your reputation, even testing emails, and even if it is sent from a different service provider. That’s why dedicated IPs are precarious.
Since transactional-only services like Postmark don’t allow marketing on their service, and they have a vetting process for every customer, their shared pool of servers can maintain an excellent reputation. Again, I am in no way affiliated with Postmark, just a happy customer.
Some providers like SendGrid try to position themselves as the go-to provider for both critical notifications and marketing campaigns. Understandably, they want to capture both sides of the market. But in doing so, they silently sacrifice reliability, because marketing will never have the same reputation as critical notifications. Reputation loss is inevitable on their shared IPs. They don’t publicly report their delivery data across their entire service, but my experience with them suggests that their spam reputation is sub-optimal because they handle a lot of marketing.
One of the key features of Flute Mail is that it allows you to set up different environments (provider & domain) for different types of email, even testing environments. This is the best way to guard your domain’s reputation and get dramatic improvements in reliability, while managing all of your logs and analytics under one umbrella.
Problem 2: Downtime and silent outages
Since SMTP is such a complicated and ancient protocol, email providers suffer from downtime and service delays surprisingly often. Of the seven email API providers I studied, all but Amazon SES have reported significant service incidents within the past 45 days at the time of this writing (20 Sep 2017). Amazon SES is probably more reliable because their service is so barebones (see problem #3), but they don’t even have a percentage uptime number to compare, and deliverability metrics are very difficult to glean from them. Mandrill has a reported SMTP uptime of 99.7% in their US East regions, which means roughly 2 hours of downtime monthly, or about 26 hours every year. That much email downtime is entirely unacceptable for most businesses. If you happened to send an email during a downtime period, it’s likely gone down the drain.
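If you want to translate any provider's advertised uptime into hours, the arithmetic is short. This is a minimal sketch, assuming a 30-day month and a 365-day year; the function name is mine.

```python
def downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * period_hours


HOURS_PER_MONTH = 30 * 24   # assumes a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumes a non-leap year

monthly = downtime_hours(99.7, HOURS_PER_MONTH)
yearly = downtime_hours(99.7, HOURS_PER_YEAR)

# prints: 2.2 hours per month, 26.3 hours per year
print(f"{monthly:.1f} hours per month, {yearly:.1f} hours per year")
```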
Postmark has the most advanced status reporting page by far, but they also have the greatest number of incident reports in the past 45 days, compared to all six of the other providers I studied. Either they suffer technical issues more often than other providers, or they have much more precise monitoring equipment. I suspect it’s the latter — just compare their advanced, custom-built, status page with the generic, mostly blank, status pages of most other providers. I take that as a signal of competence and commitment to reliability.
Email delivery is a mission-critical service for most businesses. A few minutes of downtime or sending delays could mean an unknown number of frustrating customer support cases. One single missed email could cost you a 30-minute call with customer support, or worse, direct loss in revenue. And because service incidents happen almost every month with most email API providers, you should not completely rely on any single vendor.
Solution 2: Redundancy and automatic failover
All technical services go down sometimes. You need to be able to take action and quickly switch to another provider when your primary provider goes down. One easy way for engineers to do this is to use configurable functions in your app (such as serverless-mailer) which are not hardcoded to a single provider. You can then quickly switch to another provider by changing a config value in your app, for example.
Ideally, you should do this automatically, so that not a single email is missed. Then, even if your provider fails to report the incident immediately, your systems will detect delivery problems and seamlessly fail over to another provider.
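The simplest form of automatic failover can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product does it: each "provider" is a plain function that sends one email and raises an exception on failure, and the names `send_with_failover`, `primary_down`, and `backup_ok` are hypothetical stand-ins for wrapped provider SDK calls.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical shape: a provider is a function (to, subject, body) that
# returns nothing on success and raises an exception on failure.
Send = Callable[[str, str, str], None]


class AllProvidersFailed(Exception):
    """Raised when no configured provider accepted the message."""


def send_with_failover(providers: Sequence[Tuple[str, Send]],
                       to: str, subject: str, body: str) -> str:
    """Try each provider in order; return the name of the one that accepted."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(to, subject, body)
            return name
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for demonstration only.
def primary_down(to: str, subject: str, body: str) -> None:
    raise ConnectionError("primary provider is down")


sent_log: List[str] = []


def backup_ok(to: str, subject: str, body: str) -> None:
    sent_log.append(to)


used = send_with_failover(
    [("primary", primary_down), ("backup", backup_ok)],
    "student@example.com", "Packing list", "What to bring...",
)
print(used)  # backup
```

Note that this only covers failures the provider reports at send time. Silent outages, where the provider accepts the message and then loses it, need monitoring of delivery events on top of this.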
Fortunately, that’s possible to build with a bit of clever monitoring. That’s what our product, Flute Mail, does. You can also set custom Return-Path headers to handle all of your domain’s bounces, and take action on spam bounces, not just hard bounces. Most providers also offer a way to identify individual emails and retrieve post-delivery information about them, which enables you to respond to individual failures automatically.
And you might ask: what if Flute Mail goes down? That is a legitimate concern: however, from a reliability engineering perspective, the fewer moving parts in a system, the less likely it is to fail. Since Flute Mail is a much simpler product than an ESP, has fewer dependencies, and does not execute the difficult business of actually sending email, it is more likely to be reliable than any ESP can ever be.
Problem 3: Lack of a paper trail
An important signal of a provider’s commitment to reliability is whether they implement mechanisms for identifying and logging emails, so that you can quickly dig up a paper trail later in case of an issue. If an ESP doesn’t offer a way to identify and retrieve logs related to a particular email, that is a huge red flag. One should ask: how do they monitor their own reliability? If there’s no paper trail, how can anyone figure out what went wrong on a particular customer support case? Some providers claim to offer detailed analytics, but if you dig deeper, you find the information is mostly useless (e.g., SendGrid). One of my major frustrations with SendGrid is that they don’t return a unique ID when sending an email, and there is no way to query their system about a particular email. All they have is a firehose of unreliable “webhook events” which are very difficult to parse to create precise paper trails, not to mention that they disappear after 24 hours. As an engineer, I can understand why they did this: to improve throughput and scalability, and reduce storage costs. This seems to be an intentional design decision of theirs, according to an early slide deck from them. But this “spray and pray” attitude sacrifices reliability. Lack of unique email identification and query mechanisms makes one-off failures harder to detect and correct, and queue duplicates much more likely. And independent reliability monitoring becomes nigh impossible.
Amazon SES has one of the lowest prices per-email out there, but they don’t have detailed logging, full body content retrieval, or individual message information. You get what you pay for. You have no way of measuring their deliverability, or measuring the effectiveness of their spam reputation management.
Solution 3: Independent logging and tracking
Fortunately, it is technically possible to implement your own logging, open tracking, and delivery stats without relying on a particular provider. Again, this can be done by abstracting away the actual email delivery of your app from the logging features, and capturing your provider’s notifications (usually by webhooks). You can also configure your domain’s Return-Path header to collect bounces. There are companies that specialize in delivery data alone: see 250ok.com.
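Here is a minimal sketch of what a provider-independent paper trail could look like: you assign your own unique ID at send time and file every later notification under it. The class and method names are mine, the store is in memory only, and the webhook payload keys (`message_id`, `event`) are hypothetical. How your ID gets attached to the outgoing message and echoed back differs by provider, so treat that part as an assumption.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone


class PaperTrail:
    """In-memory log of everything that happened to each email we sent."""

    def __init__(self) -> None:
        self._events = defaultdict(list)

    def record_send(self, to: str, provider: str) -> str:
        """Assign our own unique ID at send time and log the attempt."""
        message_id = uuid.uuid4().hex
        self._log(message_id, "sent", {"to": to, "provider": provider})
        return message_id

    def record_webhook(self, payload: dict) -> None:
        """Log a provider notification. Assumes the payload echoes our ID
        under the hypothetical key 'message_id' and names the event
        under the hypothetical key 'event'."""
        self._log(payload["message_id"], payload["event"], payload)

    def history(self, message_id: str) -> list:
        """Return the event names for one email, oldest first."""
        return [e["event"] for e in self._events.get(message_id, [])]

    def _log(self, message_id: str, event: str, detail: dict) -> None:
        self._events[message_id].append(
            {"event": event, "at": datetime.now(timezone.utc), "detail": detail}
        )


trail = PaperTrail()
mid = trail.record_send("student@example.com", "amazon-ses")
trail.record_webhook({"message_id": mid, "event": "delivered"})
trail.record_webhook({"message_id": mid, "event": "opened"})
print(trail.history(mid))  # ['sent', 'delivered', 'opened']
```

A production version would persist events to a database instead of memory, but the shape stays the same: one searchable ID per email, owned by you rather than by the provider.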
One of the benefits of Flute Mail’s provider-agnostic abstraction is that you can switch to a low-cost provider like Amazon, which is feature-poor, while maintaining your own paper trails, read tracking, and other logging features.
Maintaining your own paper trails is important for audit, compliance, and customer support issues. It is also a good signal of your provider’s commitment to reliability.
Problem 4: Vendor lock-in and high costs
If you’ve read this far, you can guess that I take issue with all email providers out there. That’s because I’ve been a customer of almost all of them. There is no single provider I can recommend which solves all of the problems discussed so far. Furthermore, email delivery is a highly fragmented market. Uber and Yelp use SendGrid. Asana and 1Password use Postmark. GitHub and Shopify use Mailgun. Comcast and Oracle use SparkPost. Siemens and HBO (the TV channel) use Amazon SES. What on earth should you use? The cheapest ones lack important features. The ones with the most advanced reliability mechanisms (i.e., Postmark) are restricted to transactional email only. And if you send millions of emails per month, all of this gets to be quite expensive. Fortunately, delivery service is commodified enough that you don’t have to be locked into a single provider.
Solution 4: Different providers for different types of email
All email was not created equal. The easiest way to save on email costs, while dramatically improving reliability, is to use multiple providers. Critical email should go through a highly reliable, expensive, and dedicated provider like Postmark. You should fail over to another provider if Postmark goes down. Not-so-important emails can go through a low-cost provider like Amazon SES or Pepipost. You should have dedicated testing environments, and multiple marketing environments, so that you don’t accidentally damage your reputation. And you shouldn’t be locked into a single provider: make them compete for your business.
Doing all of that is not trivial. That’s why we built Flute Mail.
Problem 5: Post delivery follow-up
After all is said and done, sometimes your customer doesn’t read an email, even if it was properly delivered. This can happen for simple reasons like typos in addresses. For certain critical emails, you need to be notified about this so that your sales/support team can follow up with the person, and ask them what’s up. Whether you run a ski resort or an educational institute, you need to know if one of your students is out of the loop, so that you can make sure they receive the information they need, or unenroll them from the class if they’ve decided to drop out, so that your numbers are not skewed.
A post-delivery follow-up process is the only way to guarantee that your customers are still active, and furthermore that all of your delivery infrastructure is working properly.
Solution 5: Get notified when critical emails remain unread
Building post-delivery follow-up mechanisms is technically doable, but very difficult to build reliably. Most providers rely on webhooks to notify you about the status of a delivered email. Building infrastructure that can reliably capture, record, and notify you about the status of an email, in a provider-agnostic and seamless manner, is our primary expertise at Flute. The feature can be as simple as this: “if the recipient does not open this critical email within 3 days, notify our support team”. Of course this feature will only apply to a small subset of the emails you send out.
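The three-day rule quoted above can be sketched as a simple filter over your own delivery records. This is an illustration only: the `SentEmail` record and `needs_follow_up` function are hypothetical names, and it assumes you already capture delivery and open events per email. The hard part in practice is capturing those events reliably, not this check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class SentEmail:
    message_id: str
    to: str
    critical: bool
    delivered_at: datetime
    opened_at: Optional[datetime] = None  # None means no open event recorded


def needs_follow_up(emails: Iterable[SentEmail], now: datetime,
                    max_unread: timedelta = timedelta(days=3)) -> List[SentEmail]:
    """Return critical emails still unopened after the allowed window."""
    return [
        e for e in emails
        if e.critical
        and e.opened_at is None
        and now - e.delivered_at >= max_unread
    ]


now = datetime(2017, 9, 20, 12, 0)
emails = [
    # critical, delivered 4 days ago, never opened: needs follow-up
    SentEmail("a1", "alice@example.com", True, now - timedelta(days=4)),
    # critical, but opened: fine
    SentEmail("b2", "bob@example.com", True, now - timedelta(days=4),
              opened_at=now - timedelta(days=3)),
    # critical, unopened, but only 1 day old: still within the window
    SentEmail("c3", "carol@example.com", True, now - timedelta(days=1)),
    # not critical: ignored regardless of age
    SentEmail("d4", "dave@example.com", False, now - timedelta(days=10)),
]

overdue = needs_follow_up(emails, now)
print([e.to for e in overdue])  # ['alice@example.com']
```

Keep in mind that open tracking usually depends on the recipient's mail client loading a tracking image, so a missing open event is a prompt for a human to follow up, not proof that the email went unread.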
Boomerang-style follow-ups have long been a part of the sales/support process in CRM software and other types of personal email, but I believe we are the first company to implement follow-up mechanisms for automated, transactional email (such as onboarding info — after a customer has gone through your sales funnel).
Sending email at scale is a nuanced problem. Traditional email APIs try hard to win your trust with a lot of talk about “reliability”, because they know how important reliable email is for any modern business. Herein I describe three concrete metrics by which to judge the reliability of a provider: 1. spam reputation management, 2. uptime and 3. paper trails. Furthermore, all providers have the problem of vendor lock-in, high costs, and lack of post-delivery follow-up. All of these problems can be addressed by abstracting away the underlying providers, and using different environments for different types of email.
If you’d like a plug-and-play solution instead of building all of the above yourself, have a go at the product I’ve been working on over the past year at FluteMail.com. Let us know what you think!