3 Principles for Re-architecting Subsystems for Reliability and Growth

Simplify and Scale Out. Test Aggressively. Measure Everything.

Grafana dashboard showing MTA server performance over time (Y-axis labels redacted)

The Team and The Project

The Site Reliability Engineering (SRE) team at Sailthru is focused on three main goals:

1. Increase system stability
2. Reduce incidents and alerts
3. Improve monitoring

To achieve these goals, we spend much of our time proactively re-architecting subsystems before they hit their limits. By monitoring traffic patterns and subsystem performance, we identify systems that need to be rebuilt to accommodate our accelerated growth and scale. The team then hits the whiteboards and redesigns the system from scratch so that it can handle the load we expect for the foreseeable future.

RE-ARCHITECTING THE MAILING SYSTEM

The SRE team identified the legacy email system as this year's top priority, with the goal of finishing before the holiday season rush. Our customers use this product and its underlying subsystems to create and send billions of personalized messages per month.

Year-over-year comparison of our mailing volume growth, almost doubling since last year — illustrating the need to re-architect the mailing subsystem. (Y-axis labels redacted)

As we monitored the aggressive growth in message volume shown above, we anticipated that our legacy, sequential processing system would soon hit its performance limits. We needed to build a faster, parallelized system that would let our customers respond more quickly to their changing needs while sending more messages.

We successfully completed this year-long project on time (before the massive holiday rush) by keeping these 3 guiding principles top of mind:

1. Simplify and scale out
2. Test aggressively
3. Measure everything

Rebuilding the Mailing System

SIMPLIFY AND SCALE OUT

There are two major processes for personalizing email content:

  1. Define the audience: the dynamic list of people who will receive the email. The list is generated by a series of complex queries run across each client’s entire customer base. For example, a marketer looking to send out a new coupon for 20% off a $100 purchase at an LA store may want to find all customers that: 1) have opened an email in the last 30 days, 2) perform the majority of their opens in Los Angeles, 3) have not purchased anything recently, and 4) are statistically likely to spend $100 or more. (A rough sketch of such a query appears after this list.)
  2. Construct and send a personalized message: the right content to the right person at the right time. Messages are composed in a Turing-complete templating language, so the possibilities for granular control are almost limitless. For example, with a few {if}…{else} statements, a campaign sent to 1M people can produce 1M completely different messages, each one personalized with the most relevant content for that specific customer’s interests and click history.
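
As an illustration of the first step, here is a rough sketch of the coupon audience above expressed with the MongoDB Java driver’s filter builders. The field names (last_open, top_open_city, purchase_count_90d, predicted_spend) are invented for this example rather than Sailthru’s actual schema, and in reality some of these criteria, such as the predicted spend, cannot be expressed in the query at all and are matched at the application level instead.

    import com.mongodb.client.model.Filters;
    import org.bson.conversions.Bson;

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.Date;

    public class AudienceQuery {

        // Hypothetical filter for the coupon example: opened in the last 30 days,
        // opens mostly in Los Angeles, no recent purchases, likely to spend $100+.
        // Field names are illustrative, not Sailthru's actual schema.
        public static Bson couponAudience() {
            Date thirtyDaysAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));
            return Filters.and(
                    Filters.gte("last_open", thirtyDaysAgo),
                    Filters.eq("top_open_city", "Los Angeles"),
                    Filters.lt("purchase_count_90d", 1),
                    Filters.gte("predicted_spend", 100)
            );
        }
    }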

Legacy systems at Sailthru tend to revolve around a single MongoDB cursor, where a single PHP process loops through the results of a query and performs business logic in each iteration. This serial processing has a low ceiling for scaling, and also carries with it some serious resiliency problems: a DB failover can cause long-running processes to restart from the beginning.

In the legacy mailing system, the user-defined audience for a campaign translated into a complex MongoDB query. Each profile returned had to undergo application-level matching (since not all of the query criteria could be expressed in the MongoDB query language) and was then written to disk in preparation for mailing. For large, multi-million-user campaigns, this serial process began to take many hours, and it required a stateful handoff that blocked until the cursor was fully iterated.

The pattern that we have found to be very successful in addressing this type of legacy architecture is two-pronged: Simplify and Scale Out. To simplify, the MongoDB query itself is reduced to something that executes quickly against a known DB index and extracts only _ids from the database. To scale out, these _ids are put on a distributed queue (we use ActiveMQ) and consumed by stateless workers. Each worker performs a findOne to fetch the entire database document; with the full document in hand, it can run any business logic for matching and sending the message.
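
The sketch below shows the shape of this pattern, assuming the MongoDB Java driver, the JMS API, and an ActiveMQ broker; the connection strings, collection, queue, and field names are illustrative assumptions rather than Sailthru’s actual setup.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import org.apache.activemq.ActiveMQConnectionFactory;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    import javax.jms.*;

    public class AudienceFanout {

        private final MongoCollection<Document> profiles;
        private final Session session;
        private final Queue queue;

        AudienceFanout(MongoCollection<Document> profiles, Session session) throws JMSException {
            this.profiles = profiles;
            this.session = session;
            this.queue = session.createQueue("campaign.audience");
        }

        // Producer: a simplified query that runs fast against a known index and
        // extracts only _ids, each of which is published to the distributed queue.
        void enqueueAudience(String listName) throws JMSException {
            MessageProducer producer = session.createProducer(queue);
            for (Document doc : profiles.find(Filters.eq("lists", listName))
                                        .projection(Projections.include("_id"))) {
                producer.send(session.createTextMessage(doc.getObjectId("_id").toHexString()));
            }
        }

        // Stateless worker (run as many copies as needed): fetch the full profile
        // with a findOne, then apply the remaining application-level matching and
        // construct the personalized message.
        void workerLoop() throws JMSException {
            MessageConsumer consumer = session.createConsumer(queue);
            while (true) {
                TextMessage msg = (TextMessage) consumer.receive();
                Document profile = profiles.find(Filters.eq("_id", new ObjectId(msg.getText()))).first();
                // ... matching and sending logic for this profile goes here ...
            }
        }

        public static void main(String[] args) throws JMSException {
            MongoCollection<Document> profiles = MongoClients.create("mongodb://localhost")
                    .getDatabase("mail").getCollection("profiles");
            Connection conn = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
            conn.start();
            AudienceFanout fanout =
                    new AudienceFanout(profiles, conn.createSession(false, Session.AUTO_ACKNOWLEDGE));
            fanout.enqueueAudience("spring-coupon");   // producer and workers run as separate processes in practice
            fanout.workerLoop();
        }
    }

Because each worker holds no state beyond the single _id it is processing, workers can be added or restarted freely, and a database failover no longer forces a long-running job back to the beginning.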

These changes gave us a “stateless” process that was no longer sequential, so sending could begin as soon as the first _ids were extracted from the database. We’ll review the performance gains in the concluding section.

TEST AGGRESSIVELY

Like many startups, Sailthru started with a dynamically typed scripting language (PHP) and is now moving new services to a more scalable language (Java). Java provides a plethora of concurrency paradigms that help to flexibly address all sorts of problems, and its ability to tightly control pooled connections to downstream resources is a huge win for our database team.

Porting code from dynamically typed single-threaded languages to statically typed multi-threaded languages comes with incredible challenges. That == (double-equals comparison operator) that used to “just work” in PHP gets expanded into a dozen lines of Java code. The cursory call to strtotime becomes a dizzying combination of regex and Joda-Time. PHP tends to ignore nullity and fail silently, while Java tends to be highly sensitive, throwing NullPointerException any time something is a little bit off. These issues are exacerbated by schema-less MongoDB documents that can have all sorts of data types mixed into the same fields.
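
To give a flavor of that expansion, here is a rough sketch (not Sailthru’s actual helper) of the kind of utility a port needs in order to approximate PHP’s loose == semantics for values pulled from schema-less documents, where the same field might hold an int, a double, a numeric string, or null:

    // Partial, illustrative approximation of PHP's loose "==" for a few common cases.
    // (Real PHP coercion rules cover many more combinations, e.g. null == 0 is true.)
    public final class LooseEquals {

        public static boolean looseEquals(Object a, Object b) {
            if (a == null || b == null) {
                return a == b;   // only null == null is treated as equal in this sketch
            }
            if (a instanceof Number && b instanceof Number) {
                return ((Number) a).doubleValue() == ((Number) b).doubleValue();
            }
            // PHP coerces a numeric string when it is compared against a number
            if (a instanceof Number && b instanceof String) {
                return numberEqualsString((Number) a, (String) b);
            }
            if (a instanceof String && b instanceof Number) {
                return numberEqualsString((Number) b, (String) a);
            }
            return a.equals(b);
        }

        private static boolean numberEqualsString(Number n, String s) {
            try {
                return n.doubleValue() == Double.parseDouble(s.trim());
            } catch (NumberFormatException e) {
                return false;   // non-numeric string: treat as not equal
            }
        }
    }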

The mailing re-architecture required porting an extensive library of code from PHP to Java, and we ran into countless edge cases. For this reason, we developed an aggressive testing strategy.

  1. The developers wrote extensive unit tests, using 95% code line coverage as an initial metric to know when to stop.
  2. Then, our QA team wrote an automated testing suite of 1,800 test cases that compared the PHP and Java outputs and flagged any discrepancies.
  3. Finally, we conducted read-only testing in production using real client data.

MEASURE EVERYTHING

The SRE team tries to measure as much as we can, but in some legacy systems it isn’t always easy. One major advantage of rebuilding a system is that measurement can be baked in from the beginning.

For Java services like the rewritten mailing system, we like to use the Dropwizard Metrics framework. It pumps metrics directly into Graphite, and we use the Grafana dashboard tool as the frontend for monitoring what’s happening in real time. We measure all the external dependencies, such as the different databases, the cache, and many others, so we can anticipate potential bottlenecks in the system and then address them.
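
A minimal sketch of that wiring, assuming a Graphite host reachable at graphite.internal and an illustrative metric name, looks roughly like this:

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    public class MailingMetrics {

        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();

            // Push all registered metrics to Graphite every 10 seconds.
            Graphite graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003));
            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    .prefixedWith("mailing")
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite);
            reporter.start(10, TimeUnit.SECONDS);

            // Time every call to an external dependency so bottlenecks show up in Grafana.
            Timer mongoTimer = registry.timer("dependencies.mongodb.findOne");
            Timer.Context ctx = mongoTimer.time();
            try {
                // ... call the database here ...
            } finally {
                ctx.stop();   // records the elapsed time into the timer's histogram
            }
        }
    }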

Partial view of the mailing dashboard (Y-axis labels and legends redacted)

A common question is how to alert based on Graphite metrics. We use a simple solution: PHP scripts that read the Graphite JSON web API, iterate through the results, and apply some simple statistics to detect whether a threshold has been exceeded. When one is, an automated process triggers an alert via the PagerDuty API.
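
Our scripts are written in PHP, but the idea is simple enough to sketch; the following Java version (using the JDK’s HttpClient and Jackson for JSON parsing, with an illustrative Graphite host, target, and threshold) shows the general shape of the check:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class GraphiteThresholdCheck {

        public static void main(String[] args) throws Exception {
            // Fetch the last 10 minutes of one series from Graphite's render API as JSON.
            String url = "http://graphite.internal/render"
                    + "?target=mailing.send.latency.mean&from=-10min&format=json";
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(URI.create(url)).build(),
                          HttpResponse.BodyHandlers.ofString());

            // The response is a list of series; each datapoint is a [value, timestamp] pair.
            JsonNode datapoints = new ObjectMapper().readTree(resp.body()).get(0).get("datapoints");
            double sum = 0;
            int count = 0;
            for (JsonNode point : datapoints) {
                if (!point.get(0).isNull()) {
                    sum += point.get(0).asDouble();
                    count++;
                }
            }

            double average = count > 0 ? sum / count : 0;
            if (average > 500) {   // threshold (ms) is illustrative only
                // Trigger an incident here, e.g. via the PagerDuty API.
                System.out.println("ALERT: mean send latency " + average + " ms over the last 10 minutes");
            }
        }
    }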

Measuring performance in aggregate is very useful, but sometimes we want to dig into exactly which code is running slowly. To tackle this, we use Flame Graphs. This visual profiling tool helps identify, at a glance, which methods the CPU is spending its time in, and it has helped us find quite a few bottlenecks in the Java standard library as well as in our own code.

These kinds of changes are definitely micro-optimizations, but they tend to aggregate up the stack, so at the system level we are being as smart as possible with our resources and code execution. After making a change, we revisit the flame graph, then measure and compare the performance or efficiency gain.

Conclusion

As a result of re-architecting and parallelizing the mailing system, we were able to scale horizontally and achieve an order-of-magnitude improvement in overall system performance, cutting hardware use in half while increasing throughput.

Comparing the legacy mailing system with the “stateless”, re-architected system

The above example shows the performance difference for a medium-sized email campaign sent to ~500k people: on the legacy system, it took about 24 minutes to generate the audience and finish sending; on the re-architected system, the entire campaign finishes within 2 minutes. As noted earlier, the big difference is that the new system does not block during the query (aka “generate”) step, so sending happens in parallel with the _ids being extracted from the database.

Comparison to a competitor system, on a very large and complex campaign

Most other competing email systems we’ve seen require a lengthy pre-generation step before sending, just as our legacy system once did. If a marketing team wants a campaign for 2M customers to start sending by 10am, they will often have to get everything set up and prepped by 8am or earlier. This severely limits the flexibility of the business, since publishers cannot react to breaking news and e-commerce companies cannot adapt to rapid changes in their product inventory.

In contrast, the new Sailthru mailing system has two big advantages:

  1. Marketers have more flexibility to make changes without incurring any re-generation delays
  2. End customers get their highly personalized messages in their inboxes much faster, lifting engagement metrics like click-through and purchases

The same guiding principles we used to re-architect the mailing system have since been applied to speed up a nightly aggregation subsystem by 10x. Moving forward, the team plans to apply the same 3 principles to the other subsystems most in need of performance and scalability boosts.


Rob Williams is a Senior SRE Engineer at Sailthru. His main responsibilities include making performance/reliability improvements to existing legacy code, implementing new high performance distributed subsystems that replace old ones in order to enable greater system scalability, responding to incidents while on-duty, and investigating root causes of any production problems across the full stack.