Moving large scale task processing from Google App Engine to Kubernetes

Marcel Schoffelmeer
Bluecore Engineering
7 min read · Sep 1, 2020

This year Bluecore migrated more than 90% of our personalized-message processing from Google App Engine (GAE) to Google Kubernetes Engine (GKE) to save on infrastructure cost and gain finer control over scaling. This post discusses why we made this change, how we did it, and what the results were.

Google App Engine to Google Kubernetes Engine

Why did we do this?

Bluecore reached a scale at which it became more cost-effective to move CPU-heavy loads from GAE to GKE; the operational overhead of moving from a managed service to self-operated Kubernetes clusters was worth it. Moreover, upgrading the Python 2.7 codebase to Python 3 would have required significant engineering effort even if we had stayed on GAE, which made the decision to switch to GKE simpler.

How did we do this?

To make sure the project was feasible, we first spent a few weeks on a proof of concept on GKE, building a scaled-down but representative message pipeline to confirm both the cost savings and that a rebuilt version could provide the required scale. The proof of concept was successful on both counts, so we started on a long journey.

We thoroughly reviewed the GAE email pipeline code base, architecture, and functionality. While sending emails seems simple on the surface, there are many aspects to it beyond the send itself.

Once we had a good overview of all the parts that needed migrating (and which parts could be left behind), we started drawing up a new architecture and a coarse work breakdown.

Architecture

Since the bulk of the email-related Google Cloud CPU spend occurs after audience generation, we decided to keep audience generation on GAE for now and hand control over to the new email pipeline on GKE from that point. The result of audience generation is a BigQuery table, which provides a clean boundary between the old and new platforms. In other words, the pipeline on GKE only has to read the audience from the previously generated BigQuery table, removing the need for large data transfers between the two platforms.

Google App Engine to Google BigQuery to Google Kubernetes Engine
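As a rough illustration of that boundary, a GKE worker only needs a BigQuery client to stream through the audience table; the table name and the downstream step in the sketch below are made up for the example.

```python
from google.cloud import bigquery

def read_audience(table_id: str, batch_size: int = 10_000):
    """Stream recipient rows from the audience table produced on GAE.

    `table_id` is a fully qualified table such as
    "my-project.email.audience_20200901" (name is hypothetical).
    """
    client = bigquery.Client()
    # list_rows pages through the table directly, without running a query job.
    for row in client.list_rows(table_id, page_size=batch_size):
        yield dict(row)

# Example usage: fan the audience out into per-recipient tasks downstream.
# for recipient in read_audience("my-project.email.audience_20200901"):
#     enqueue_send_task(recipient)   # hypothetical downstream step
```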

PubSub

Most of the Python code could be ported in a straightforward manner, including switching from the GAE APIs to Google Cloud client library calls. The biggest architectural change, by contrast, was the way scaling was implemented.

Sending millions of emails serially would take a very long time, and processing an individual email is “embarrassingly parallel,” so on GAE we used Task Queues to scale sends. Switching to GKE required an alternative, and we chose Google Cloud PubSub to accomplish similar parallel task processing: scaling worker pods up and down relative to the PubSub subscription backlog size provided similar behavior. This took more engineering work to implement than GAE Task Queues, so while moving to GKE saves CPU cost, there is “no free lunch.”
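A minimal sketch of that scaling decision, assuming the backlog number comes from Cloud Monitoring’s num_undelivered_messages metric and that the workers run as a Deployment named send-worker; the names and tuning constants are illustrative, not our production autoscaler.

```python
import math
from kubernetes import client, config

MSGS_PER_WORKER = 500          # target backlog each worker should absorb (illustrative)
MIN_WORKERS, MAX_WORKERS = 1, 200

def desired_replicas(backlog: int) -> int:
    """Translate the PubSub subscription backlog into a worker pod count."""
    return max(MIN_WORKERS, min(MAX_WORKERS, math.ceil(backlog / MSGS_PER_WORKER)))

def scale_workers(backlog: int, namespace: str = "email", deployment: str = "send-worker") -> None:
    """Resize the worker Deployment to match the current backlog."""
    config.load_incluster_config()                 # assumes this runs inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": desired_replicas(backlog)}},
    )
```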

Worker pods pull PubSub messages with tasks to process and may publish other PubSub messages as a result, for example when reading the audience from the BigQuery table.

Google PubSub and Google Kubernetes Engine interaction
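A stripped-down sketch of such a worker, using the google-cloud-pubsub streaming pull API; the project, topic, and subscription names and the process_task stub are placeholders rather than our actual code.

```python
import json
from google.cloud import pubsub_v1

PROJECT = "my-project"                                               # placeholder
subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()
subscription = subscriber.subscription_path(PROJECT, "email-tasks")  # hypothetical name

def process_task(task: dict) -> list:
    """Placeholder for the real per-email work (render, personalize, send)."""
    return []

def callback(message) -> None:
    task = json.loads(message.data)
    # A task may spawn follow-up tasks, e.g. one per audience row read from BigQuery.
    for follow_up in process_task(task):
        publisher.publish(
            publisher.topic_path(PROJECT, "email-tasks"),
            json.dumps(follow_up).encode("utf-8"),
        )
    message.ack()

# Streaming pull keeps a steady flow of tasks coming to this worker pod.
future = subscriber.subscribe(subscription, callback=callback)
future.result()   # block until the pod is shut down
```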

We made use of the larger allowed payload size of PubSub messages compared to Task Queue tasks to convey campaign-wide settings and other input information. This shared information is passed along in every PubSub message to and from worker pods, so they do not have to look up the same information for each individual task. On GAE, these repeated lookups caused slowdowns at best and strained external caching services at worst. Preventing duplicated work on a per-email-task basis became one of the main design principles of the project and has been a boon to our ability to scale. In fact, the GKE email pipeline does not use external caching at all.

External caching may cause harm
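A small sketch of the idea, with an invented payload shape: the campaign-wide settings ride along in every task message, so a worker never has to fetch them again.

```python
import json

def make_task_message(campaign_settings: dict, task: dict) -> bytes:
    """Bundle campaign-wide settings with the per-email task.

    Every message carries the shared settings, so a worker can process the
    task without looking the campaign up again and without an external cache.
    The field names here are illustrative, not Bluecore's actual schema.
    """
    return json.dumps({"campaign": campaign_settings, "task": task}).encode("utf-8")

def handle(message_data: bytes) -> None:
    payload = json.loads(message_data)
    settings = payload["campaign"]   # everything needed already rides in the message
    task = payload["task"]
    # ... render and send the email using `settings` and `task` ...
```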

Microservices

Given the scale of the change and the critical nature of email processing, the move from GAE to GKE had to be gradual rather than a “big-bang” changeover. To share services between both worlds, and to allow use of functionality not yet migrated, we introduced microservices running on both GAE and GKE, implemented in Go and Python. The shared services also ensure consistent behavior regardless of where emails are processed. For more information on microservice call retry challenges, read this blog.

Development

To simplify moving Datastore access code to GKE, we decided to port the GAE Python DB Client Library to GKE by replacing the Datastore API calls with Google Cloud client library calls. Since some Datastore entities would still be shared between GAE and GKE, reusing the same db.Model definitions ensured consistency across the two platforms.
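A hedged before-and-after sketch of what that porting looks like; the Campaign kind and its properties are invented for illustration.

```python
# On GAE (Python 2.7), an entity was typically read through the db API:
#
#   from google.appengine.ext import db
#
#   class Campaign(db.Model):
#       name = db.StringProperty()
#       subject = db.StringProperty()
#
#   campaign = Campaign.get_by_key_name(campaign_id)

# On GKE, the same entity can be read with the Google Cloud client library
# (kind and property names here are illustrative):
from google.cloud import datastore

def get_campaign(campaign_id: str) -> dict:
    client = datastore.Client()
    entity = client.get(client.key("Campaign", campaign_id))
    return dict(entity) if entity else {}
```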

Code that could be shared between GAE and GKE was moved into shared Python libraries to prevent copies from drifting out of sync over time. Bigger chunks of functionality were pulled into shared microservices, some written in Go instead of Python depending on performance needs.

Having ordered the email pipeline feature set by priority, engineers worked through each feature by analyzing its existing behavior, porting the relevant GAE code to GKE, adding unit tests, and finally validating the behavior in QA.

Shadow Testing

We realized that unit and integration tests alone would not suffice to make sure email campaigns migrated from GAE to GKE behaved exactly the same (HTML content, links, headers, etc.). The reason is the sheer number of possible settings and customizations; email templates, for example, can contain Python code snippets. The number of manually created tests required would be near infinite.

To decrease the likelihood of unforeseen regressions during migration, we followed a process called “shadow testing.” During shadow testing, a given email campaign would still send on GAE, but the same campaign would also execute on GKE up to the point of sending; instead of sending, the GKE side would write the emails and associated information to Google Cloud Storage. Once both platforms completed processing, a comparison ran to make sure the results matched, logging any differences for further investigation.
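A minimal sketch of what that comparison step can look like, assuming both platforms wrote one JSON blob per recipient to Cloud Storage under parallel gae/ and gke/ prefixes; the bucket layout and field names are invented.

```python
import json
from google.cloud import storage

def compare_shadow_results(bucket_name: str, campaign_id: str) -> list:
    """Compare GAE and GKE outputs for one campaign and return the differences.

    Assumes each platform wrote one JSON blob per recipient under
    gs://<bucket>/<platform>/<campaign_id>/<recipient>.json (layout is illustrative).
    """
    bucket = storage.Client().bucket(bucket_name)
    diffs = []
    for blob in bucket.list_blobs(prefix=f"gae/{campaign_id}/"):
        recipient = blob.name.rsplit("/", 1)[-1]
        gae_email = json.loads(blob.download_as_text())
        gke_blob = bucket.blob(f"gke/{campaign_id}/{recipient}")
        if not gke_blob.exists():
            diffs.append((recipient, "missing on GKE"))
            continue
        gke_email = json.loads(gke_blob.download_as_text())
        # Compare the fields that must match exactly (content, links, headers, ...).
        for field in ("html", "links", "headers"):
            if gae_email.get(field) != gke_email.get(field):
                diffs.append((recipient, field))
    return diffs
```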

Successful shadow testing was the gating factor before migration could take place. Since shadow testing was such a critical part of the migration effort and has many interesting facets, watch for an upcoming blog post with more detail.

Rollout

We split the rollout of the migration into two large parts:

  • Sending of “static” emails
  • Sending of “personalized” emails

Static, non-personalized emails (think “20% off everything!”) are relatively plain and simple to produce, requiring the least amount of feature-porting work once the initial architecture for the new email pipeline was in place. At campaign send time, depending on migration status and campaign type, sends would be taken over from GAE by GKE. During the 2019 Black Friday/Cyber Monday (BFCM) period, the new pipeline provided a great way to offload spikes in sends from GAE. At the same time, this production load proved out the new architecture and allowed us to work out any kinks before moving personalized campaigns.

While rolling out static email sends on GKE, we continued porting the personalized email processing. Here we kept some of the complexity of generating recommendations for a given email recipient on GAE by having the GKE pipeline call back into GAE microservices; examples were dependencies on Memcache, Google Search, and other parts not easily ported without a significant redesign (and risk). This approach provided early cost gains on personalized sends without requiring a relatively large engineering effort to port this part.
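As an illustration, that callback can be as simple as an HTTP request from a GKE worker to a GAE-hosted service; the endpoint, payload, and response shape below are hypothetical.

```python
import requests

# Hypothetical GAE service endpoint; the real service name and contract differ.
RECOMMENDATIONS_URL = "https://recs-dot-my-project.appspot.com/v1/recommendations"

def fetch_recommendations(recipient_id: str, campaign_id: str, timeout: float = 5.0) -> list:
    """Ask the GAE-hosted microservice for product recommendations.

    Keeping this logic on GAE avoided porting Memcache- and Search-dependent
    code, at the cost of a network hop per recipient.
    """
    resp = requests.post(
        RECOMMENDATIONS_URL,
        json={"recipient": recipient_id, "campaign": campaign_id},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json().get("products", [])
```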

The results

Bluecore reduced its Google Cloud spend on sending emails by a factor of 10. While we calculated the cost savings by combining Google billing information and code instrumentation, last July we obtained more direct proof from an (unplanned) event during which a large percentage of email processing moved back from GKE to GAE. The increase in App Engine billing during those three days showed beyond a reasonable doubt the savings to the company (note that Bluecore’s usage of GAE goes far beyond email):

The caveat here is that the savings are for this particular use case; your mileage may vary. By rethinking the email pipeline architecture we can now better meet our scaling challenges, and we are running on a less restricted platform. While switching from GAE to GKE comes with more operational responsibility, Bluecore has grown to the point where the gains offset the cost of this additional overhead by a wide margin.

The decision and effort to move platforms are both difficult. GAE is a great platform for operationally hands-off development, as the Bluecore founders experienced in the early years. The choice to move to GKE depends on many factors, and we recommend doing careful due diligence before starting a project like this.

We would like to thank everyone who participated in one of the biggest projects in Bluecore’s history.
