Hi Ho, Hi Ho, It’s Off to Spanner We Go

Honey · The PayPal Technology Blog
Oct 3, 2018

by Sam Aronoff

As an engineer at Honey, I’ve done lots of cool stuff that I’m proud of, much of it straight off my professional bucket list! Since elementary school, I’ve been consumed with solving digital problems, starting with video games (I’d like to take this opportunity to officially tell my parents, who doubted the time I spent playing games would be helpful later in life, “I told you so…”).

All kidding aside, slaying Armos in Zelda is nothing compared to solving problems in real life. And there are no Nintendo Power magazines to assist you. At Honey, there was no playbook for building the most impactful and scalable e-commerce ecosystem. For us, that meant building Honey 100% cloud native from the start, exclusively on Google Cloud. We deploy all of our services using Kubernetes, and we develop the majority of our software using Node.js. This gives us the ability to deliver a minimum viable product to users quickly, while continuing to iterate based on user and usage feedback.

Honey has over ten million users worldwide across our five supported platforms: Chrome, Safari, Firefox, Opera, and Edge. It took us nearly four-and-a-half years to get to our first five million and less than ten months to get the next five million. The rapid pace of user growth, while great for our business, exposed cracks in our architecture, most significantly in how we use CloudSQL, Google’s hosted MySQL offering.

The way we integrated CloudSQL into our Kubernetes-based services stopped working well as our infrastructure grew in size. Our MySQL databases were seeing sustained large numbers of connections and higher-than-desired load surges. This forced us to maintain large numbers of read replicas that started experiencing sporadic issues during sharp upticks in traffic. These databases have fulfilled their role in supporting Honey’s systems, development, and rapid growth for a few years now, but to support our next tens of millions of users, we needed to make fundamental changes to our infrastructure; we needed to take our core databases to the next level.

There is no shortage of horizontally scalable database solutions, and our primary requirements remained the same: a reliable platform that can handle our current and future traffic and is flexible enough to accommodate new products and functionality. While we were exploring our technical options, Google introduced its new Spanner product. Spanner met our technical requirements (a highly available, ACID-compliant relational database that relieved the scaling glitches we were experiencing) while also being a Google Cloud-based managed database infrastructure, which was critical for a company of our size to scale and grow.

One question we had when migrating to Spanner was: do we throw out the old for the new? Could we simply replace our current library calls with Spanner calls? The short answer is no: we did not want to take any shortcuts, and we knew we had tech debt we wanted to pay down or eliminate in the process.

This was no easy task and took the entire engineering team’s buy-in. The process led us to split our architecture into microservices even further, build brand-new RESTful APIs to abstract our subsystems, and add a GraphQL API gateway as a central aggregation point. While we hope to avoid having to replace our database backend again, we designed this version of our system with the future in mind, ensuring the persistence layers are abstracted behind APIs and thus maintaining as much architectural independence as reasonably possible. In doing so, we minimized the places we’ll need to change for any future migration.
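To make the gateway idea concrete, here is a minimal sketch, assuming Apollo Server and a hypothetical internal stores REST API; the URL, type, and field names are illustrative only and not Honey’s actual schema.

```javascript
// Minimal GraphQL gateway sketch: one type backed by a hypothetical REST subsystem.
const { ApolloServer, gql } = require('apollo-server');
const fetch = require('node-fetch');

const typeDefs = gql`
  type Store {
    id: ID!
    name: String
    cashbackRate: Float
  }
  type Query {
    store(id: ID!): Store
  }
`;

const resolvers = {
  Query: {
    // The gateway aggregates subsystems by delegating to their RESTful APIs.
    store: async (_, { id }) => {
      const res = await fetch(`http://stores.internal/v1/stores/${id}`);
      return res.json();
    },
  },
};

new ApolloServer({ typeDefs, resolvers })
  .listen({ port: 4000 })
  .then(({ url }) => console.log(`Gateway ready at ${url}`));
```

The point of the sketch is the shape: each resolver delegates to a subsystem API, so clients only ever talk to the gateway.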

After we fully understood our technical requirements, we had to figure out how Spanner structures data. To begin with, Spanner organizes its data in splits across multiple servers. It does this by ordering your rows lexicographically by primary key and making horizontal cuts in the data. This arrangement requires primary keys to be evenly distributed (effectively random) to avoid hot spotting. If, for example, all of your new data is added at the end of the last split, that split will receive all of the writes and will cause slowdowns. Given that Spanner was our solution for alleviating glitches and slowdowns, we needed to figure out how to work within this new paradigm. Our solution was introducing an extra column to use in concert with the primary key we used in Cloud SQL. More information on splits can be found in the Spanner documentation; the concept isn’t unique to Spanner, but it’s an interesting read.
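As a rough sketch of that approach (the table, column names, and shard count below are hypothetical, not our production schema), a computed shard column can lead the primary key so that rows with consecutive Cloud SQL IDs scatter across splits:

```javascript
// Hypothetical schema:
//   CREATE TABLE Users (
//     shard_id INT64 NOT NULL,
//     user_id  INT64 NOT NULL,
//     email    STRING(MAX),
//   ) PRIMARY KEY (shard_id, user_id);
const crypto = require('crypto');
const { Spanner } = require('@google-cloud/spanner');

const spanner = new Spanner({ projectId: 'my-project' });
const database = spanner.instance('my-instance').database('my-db');

const NUM_SHARDS = 64;

// Derive a stable shard from the sequential ID carried over from Cloud SQL,
// so the same row always maps to the same shard.
function shardFor(userId) {
  const hash = crypto.createHash('md5').update(String(userId)).digest();
  return hash.readUInt32BE(0) % NUM_SHARDS;
}

// New writes now land on different splits instead of piling up at the end of the table.
async function insertUser(userId, email) {
  await database.table('Users').insert({
    shard_id: shardFor(userId),
    user_id: userId,
    email,
  });
}
```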

The next adjustment in thinking was around how Spanner conceptualizes foreign key relationships between tables, known in Spanner parlance as “interleaving”. Interleaving maps to an actual hierarchical relationship at the storage level and requires that all of the parent table’s primary key columns be present in the child table, along with at least one extra column, to form the child table’s primary key. This is more or less the same as Cloud SQL, with the extra requirements around primary key choices.
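A hypothetical parent/child pair shows the shape of that requirement; the DDL and the Node.js client call below are a sketch, not our production tables.

```javascript
// The child's primary key repeats every parent key column, plus at least one more,
// and the INTERLEAVE clause stores each order physically under its user.
const ddl = [
  `CREATE TABLE Users (
     shard_id INT64 NOT NULL,
     user_id  INT64 NOT NULL,
     email    STRING(MAX),
   ) PRIMARY KEY (shard_id, user_id)`,
  `CREATE TABLE Orders (
     shard_id INT64 NOT NULL,
     user_id  INT64 NOT NULL,
     order_id INT64 NOT NULL,
     total    FLOAT64,
   ) PRIMARY KEY (shard_id, user_id, order_id),
   INTERLEAVE IN PARENT Users ON DELETE CASCADE`,
];

// Schema changes are long-running operations in Spanner.
async function createTables(database) {
  const [operation] = await database.updateSchema(ddl);
  await operation.promise();
}
```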

The last mental adjustment required during our migration was how we thought about secondary indexes. Each Spanner secondary index is, in most ways, a table unto itself. Unlike Cloud SQL, when you want a query to use a secondary index, the index must be referenced in the query with a special syntax to ensure it is used. A secondary index can also store additional columns beyond the indexed ones, colocating that data with the index on the filesystem. I think we’ll find ways to make better use of this feature, but we haven’t yet. Still, pretty nifty!
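A small hypothetical example of both ideas, an index that stores an extra column and a query that names that index explicitly, might look like this (assuming a Users table that also has a created_at column):

```javascript
// Hypothetical index with a stored (colocated) column:
//   CREATE INDEX UsersByEmail ON Users(email) STORING (created_at);
async function findUserByEmail(database, email) {
  // The FORCE_INDEX table hint names the secondary index to read from.
  const [rows] = await database.run({
    sql: `SELECT user_id, email, created_at
          FROM Users@{FORCE_INDEX=UsersByEmail}
          WHERE email = @email`,
    params: { email },
  });
  return rows.map((row) => row.toJSON());
}
```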

The “great Spanner migration” started at the beginning of 2018, and we are on track to complete the process by November 2018. First and foremost, this process reminded us that migrating live data is hard, really hard! Lifting up your entire organization’s live data and moving it from one place to another exposes undocumented assumptions. It exposes patches you’ve made along the way that no one remembers. It exposes all of your skeletons. I have done these kinds of migrations before, and yet I always forget. I always think this time it will be different.

Second, not all of our systems were well documented or understood. For instance, when we set out to build the API for our store subsystem, the business logic that drives it was scattered across different tasks, jobs, APIs, and ad-hoc scripts, an unfortunate byproduct of a fast-growing company and dozens of iterations over products and technology. We thought we had successfully migrated this system far sooner than was actually the case.

Last, we should have deployed smaller pieces, and deployed them sooner. Shipping our migrated backend systems earlier would have exposed issues earlier, when we could have addressed them in a smaller, less complicated code base. We could even have refactored or changed our architecture based on the observed load.

As we wind down the migration, there are a few things I have come away with, looking back on the overall process. In terms of Spanner itself, we paid an early-adopter tax. The client library for Node.js was not in a great place when we started, and we did not realize that until we were well down the road on the project. That said, Google’s willingness to work with us to fix the issues we encountered, and the amount of access we had to the internal teams responsible for each of their cloud products (including Spanner), has been fantastic, and this has validated our choice of Google Cloud Platform as Honey’s cloud platform.

GraphQL was also a learning experience for us. In the early stages it seemed straightforward, but once we got into the minutiae it was far more complicated. There are definitely ways we could be using the technology better. Even so, GraphQL has provided exactly the function we hoped for, enabling us to make sparse requests for data and to bring together information from multiple subsystems in one place.

While the process has undoubtedly been bumpy, the end result is a clean, modern, scalable, virtually tech-debt-free, maintainable, and future-proof core system. As a systems architect, those are the adjectives I want attached to the core platform that powers most of Honey’s products. The whole endeavour was a success!

Looking ahead, we will be perfecting our use of GraphQL. We will also be looking at Golang as an alternative implementation language for our current APIs, as well as at more efficient alternatives to RESTful HTTP for microservices. The journey to building a stable system with 99.9% uptime is an ongoing, ever-improving learning process, and it will never be complete. But that’s the fun part!
