Adding SQL to Spinnaker

Rob Zienert
Aug 15, 2018 · 13 min read


In Q2 and Q3, the Netflix Spinnaker team worked on developing and releasing a new SQL storage backend for Orca, our orchestration service.

Strap in, this may be my longest post yet. My goal in this post is to outline end-to-end how the sausage is made on some larger efforts within Spinnaker, as well as sort of advertise how extensible it can be. In this post, I’ll be going over why we did this as well as how and why we did it the way we did, then put a bow on it with some retrospection.

First, a big thanks to Asher and Chris for their help!

Background

To date, Spinnaker has used Redis for its primary datastore across all microservices. Admittedly, it’s an odd choice, so why did we do it that way? To keep complexity down: Spinnaker is a large enough beast as it is, and requiring a handful of databases to get started would be detrimental to evaluation and onboarding. That said, I don’t think any core committer believes that Redis is the right tool for the job in a lot of cases. Redis’ strengths and shortcomings, specifically around resilience, are well understood.

Spinnaker stores all active and historical execution (pipeline and task) data within Orca; losing Orca’s Redis causes running executions to become orphaned (and thereby fail) and erases all history of what has happened in the past. It’s highly disruptive to customers and can potentially cause production incidents. Having Redis as a backing datastore for Orca is a ticking time bomb that lies in wait to ruin your day when you just want to have a productive one. We’ve gotten quite good at operationalizing our own Redis, but our implementation isn’t very transferable to the rest of the open source community, nor does Redis lend itself well to our regionally active-active efforts. On top of all this, even though we are good at Redis operations at Netflix, that doesn’t change the fact that Redis simply isn’t the right tool for this job in a production environment.

Goals

Execution state in Orca needs to be consistent and persistent, but we can deal with some availability loss: we’re okay with a little bit of degraded service, but not okay with data loss.

The database we chose also needed to be managed. We run our own Redis instances because we found doing so far more reliable than ElastiCache; unfortunately, that higher reliability comes at the cost of more operational burden. Our new solution must be a managed service, which narrowed things down to basically RDS, DynamoDB, and Cassandra (Netflix’s Cloud Database team manages Cassandra as a service).

Performance must be as good as or better than Redis. Personally, I set a goal of a 10x read latency improvement. For larger customers, who create tens of thousands of executions a day, getting execution history via Redis was painful.

Lastly, whatever we do should be beneficial to the open source community. We open sourced Spinnaker to benefit from innovations from the community, but we must also continue to give back!

Why SQL?

When I first used Spinnaker, right after it was open sourced (I’m a community hire), I had a lot of questions, many of them shared by newcomers today, such as, “Why Redis?” I’ll always answer, more or less, like I did above. But why SQL? A common suggestion from people in Slack is, “What if we used Redis for running executions, and then shoveled executions to S3 for cold storage?” A well-intentioned idea, but S3 is prohibitively slow.

The goal of being beneficial to the OSS community at large immediately ruled out something like DynamoDB. Something like Cassandra could still very well meet our goals, but it is non-trivial to manage (coming from someone who made their paycheck managing Cassandra for a couple of years).

SQL knowledge is fairly ubiquitous, it checks our core requirements, and it has the added benefit of query flexibility, which Cassandra doesn’t really offer. Query flexibility isn’t a hard requirement, but it does give us and the community some room to develop internal-only schema extensions fairly easily. On top of this, offerings like RDS or Aurora check the reliability box for us.

All this said, I personally would not reject a well-considered Cassandra implementation. I think Cassandra could definitely have a place in Spinnaker’s ecosystem long-term, but it can be a burdensome leap.

Design & Implementation

Taking lessons learned from my work on making clouddriver Dynomite-capable, I elected to do all of the SQL development behind closed doors. Doing so let me tighten the development loop and keep broken code away from the community while we iterated, and it allowed us to deliver a hardened implementation to GitHub. It also had the added benefit of forcing me to put stronger consideration into making the persistence backend truly pluggable: with this work, you can now provide your own storage backend, with TCKs to validate your logic against, if Redis or SQL don’t match your needs¹.

The first few days of development were spent removing Orca’s core dependency on Redis. I also took this as an opportunity to change the ID generator from UUIDv4 to ULID, which gives us the option of sorting executions by their ID rather than by a separate column. With this work out of the way, we were in a position to start developing and deploying SQL without cluttering OSS releases.
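For the curious, the property we care about is that a ULID encodes its creation timestamp in its most significant bits, so a plain lexicographic sort of IDs is also a chronological sort. Here’s a minimal sketch of that property, assuming the sulky-ulid library (an assumption about the dependency, not necessarily what Orca uses):

```kotlin
// Sketch only: demonstrates the ULID property described above, namely that
// lexicographic order of IDs follows creation order. The sulky-ulid dependency
// (de.huxhorn.sulky:sulky-ulid) is an assumption, not necessarily Orca's choice.
import de.huxhorn.sulky.ulid.ULID

fun main() {
  val ulid = ULID()

  // Generate a few IDs with a small delay so their embedded timestamps differ.
  val ids = (1..3).map {
    Thread.sleep(5)
    ulid.nextULID()
  }

  // Because the timestamp occupies the most significant bits, a plain string
  // sort returns the IDs in creation order: no separate "created_at" column
  // is needed to order executions.
  check(ids == ids.sorted())
  println(ids)
}
```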

We elected to use MySQL, specifically Aurora MySQL, as our initial target. Why MySQL and not PostgreSQL, which is “clearly” better? Well, for Aurora, Postgres really isn’t better: the MariaDB JDBC driver is the only Aurora “smart client,” and it takes care of the legwork around clustering and master failovers, meaning we didn’t need to write and maintain that logic. That alone wins it for me. Aurora is not multi-region multi-master (MRMM) like Spanner, but we’ll be able to fail over masters in the same region and perform our own regional failovers if need be, which is a good step in the right direction².
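Wiring up the driver is mostly a matter of the JDBC URL: pointing the MariaDB Connector/J “aurora” failover mode at the cluster endpoint lets it discover instances and follow master failovers without application-level logic. A rough sketch; the endpoint, database name, and credentials are placeholders, and the exact options vary by driver version:

```kotlin
// Sketch: connecting through the MariaDB Connector/J "aurora" failover mode.
// Hostname, database name, and credentials are placeholders; consult the
// Connector/J documentation for the options matching your driver version.
import java.sql.DriverManager

fun main() {
  // The cluster endpoint lets the driver discover instances and follow
  // master failovers on its own.
  val url = "jdbc:mariadb:aurora://my-cluster.cluster-abc123.us-west-2.rds.amazonaws.com/orca"

  DriverManager.getConnection(url, "orca_service", System.getenv("DB_PASSWORD")).use { conn ->
    conn.createStatement().use { stmt ->
      stmt.executeQuery("SELECT 1").use { rs ->
        rs.next()
        println("connected: ${rs.getInt(1) == 1}")
      }
    }
  }
}
```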

The SQL schema will raise SQL purists’ eyebrows, so I’ll get it out of the way now: we serialize objects as JSON³ and store them as blobs! Executions are stored as a pair of tables: one for the root execution object and another for that execution’s stages. For both tables, the only additional columns exist for indexing and performant lookups.

This design was chosen to keep most of the schema in the application layer, where we already have well-proven patterns for performing live data migrations and support for proprietary data extensions. Furthermore, the data stored within an execution and its stages varies wildly from one to another (and by cloud provider), and we didn’t want to go down the rabbit hole of heavy data modeling and, as a result, more complex queries.
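To make that concrete, here is a hypothetical shape for the pair of tables. It’s illustrative only, not the exact Orca schema, but it shows the idea: a handful of indexed columns for lookups alongside a JSON body blob carrying the full serialized object.

```kotlin
// Hypothetical shape of the execution tables: illustrative only, not Orca's
// exact schema. A few indexed columns for lookups, plus a JSON blob ("body")
// holding the full serialized object.
import java.sql.DriverManager

fun createTables(jdbcUrl: String, user: String, password: String) {
  DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
    conn.createStatement().use { stmt ->
      stmt.execute(
        """
        CREATE TABLE IF NOT EXISTS pipelines (
          id          CHAR(26) PRIMARY KEY,  -- ULID, lexicographically time-ordered
          application VARCHAR(255) NOT NULL,
          status      VARCHAR(32) NOT NULL,
          updated_at  BIGINT NOT NULL,
          body        LONGTEXT NOT NULL,     -- full execution serialized as JSON
          INDEX idx_pipelines_application (application, id)
        )
        """.trimIndent()
      )
      stmt.execute(
        """
        CREATE TABLE IF NOT EXISTS pipeline_stages (
          id           CHAR(26) PRIMARY KEY,
          execution_id CHAR(26) NOT NULL,
          status       VARCHAR(32) NOT NULL,
          updated_at   BIGINT NOT NULL,
          body         LONGTEXT NOT NULL,    -- stage serialized as JSON
          INDEX idx_stages_execution (execution_id)
        )
        """.trimIndent()
      )
    }
  }
}
```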

The costs of serializing objects and storing denormalized data are worthwhile to us. Due to Orca’s data access patterns, we also don’t need to worry about write conflicts or data corruption from two processes writing to the same rows concurrently. Orca is not currently written to support partial object loading, so we’ll be hydrating an entire execution just like we do today in Redis-land. We’ll probably need to add support for this in the future, however, so we can make UI loading more efficient, but we don’t see the current schema strategy blocking that effort.

From a programming perspective, I have a very strong distaste for Hibernate and anything that tries to build queries for the developer. I chose jOOQ, as it provides a typesafe query-building API. I steered away from jOOQ’s optional code generation, which I felt wouldn’t provide much benefit considering our odd schema. JDBIv3 was also briefly considered, but not having an abstraction over the actual SQL query generation would’ve made adding SQL vendors besides MySQL more difficult. Database migrations are handled via Liquibase. The default connection pool is HikariCP, but it’s optional; we don’t use it at Netflix because it doesn’t play well with the MariaDB driver’s Aurora functionality, so we use MariaDB’s built-in connection pool instead. We won’t be shipping MariaDB with Orca when we open source the SQL module due to conflicting licenses, but I’ll write documentation on how we wire up MariaDB, or make a separate GitHub repo showing our exact code.
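For a flavor of what jOOQ looks like without code generation, here’s a sketch of fetching a serialized execution body using the string-based DSL. The table and column names follow the hypothetical schema sketched earlier, not necessarily Orca’s real one.

```kotlin
// Sketch of a jOOQ lookup using the string-based DSL (no code generation).
// Table and column names match the hypothetical schema above.
import org.jooq.SQLDialect
import org.jooq.impl.DSL
import org.jooq.impl.DSL.field
import org.jooq.impl.DSL.table
import java.sql.DriverManager

fun fetchExecutionBody(jdbcUrl: String, user: String, password: String, executionId: String): String? =
  DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
    DSL.using(conn, SQLDialect.MYSQL)
      .select(field("body", String::class.java))
      .from(table("pipelines"))
      .where(field("id", String::class.java).eq(executionId))
      .fetchOne()
      ?.value1()
  }
```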

SQL and the Work Queue

The work queue (Keiko) will still be backed by Redis, and I have no desire to back the work queue with SQL. This means that even with SQL, you’ll still have a Redis instance. From an operations perspective, the work queue holds transient data, so we don’t need to lose sleep over losing the Redis, but I’d still like to shed operational burden wherever possible, so I’m working on an SQS backend for Keiko.

This topology does raise a couple new failure scenarios, however:

1. Connectivity to SQL fails, but not Redis.
2. Connectivity to Redis fails, but not SQL.

In the event that SQL connectivity is interrupted, Orca will automatically stop processing work off the queue on a per-process basis, and it will start processing again automatically once SQL connectivity is re-established. This is done via a write health check (so it will also detect master failovers), similar to how a load balancer works. Of course, some messages will fail in the window while Orca is discovering its failure condition, but the queue already handles message failure & redelivery well.
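The mechanism is roughly “a scheduled write probe flips a flag, and the queue worker only polls while the flag is set.” Here’s a minimal sketch of that idea, not Orca’s actual implementation; the healthcheck table is hypothetical:

```kotlin
// Minimal sketch (not Orca's actual implementation): a scheduled write health
// check flips an "enabled" flag, and the queue worker only polls for messages
// while that flag is true. Assumes a small, hypothetical `healthcheck` table.
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean
import javax.sql.DataSource

class SqlHealthGate(private val dataSource: DataSource) {
  private val healthy = AtomicBoolean(false)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  fun start() {
    scheduler.scheduleWithFixedDelay({
      healthy.set(
        runCatching {
          dataSource.connection.use { conn ->
            // A write (not just a read) so that losing the master after a
            // failover is detected even while replicas still answer reads.
            conn.createStatement().use { it.execute("UPDATE healthcheck SET heartbeat = NOW()") }
          }
        }.isSuccess
      )
    }, 0, 5, TimeUnit.SECONDS)
  }

  fun isHealthy(): Boolean = healthy.get()
}

// A queue worker would then guard its poll loop with the gate, e.g.:
//   if (healthGate.isHealthy()) pollAndHandleMessages() else backOff()
```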

When Redis is lost, Orca will stop processing work because the queue is gone. New work will continue to be accepted, but it will remain in a NOT_STARTED state. When connectivity is re-established the queue will be empty, so those NOT_STARTED executions won’t begin on their own. For this scenario, I wrote a hydrate queue admin API, which lets you re-populate the queue based on the existing state in SQL.
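The shape of that operation, sketched against the hypothetical schema above (the queue and message types here are stand-ins, not Keiko’s actual API): find everything still sitting in NOT_STARTED and push a start message for each.

```kotlin
// Sketch of the "hydrate queue" idea using the hypothetical schema above.
// WorkQueue and StartExecutionMessage are stand-ins, not Keiko's actual API.
import java.sql.DriverManager

data class StartExecutionMessage(val executionId: String)

interface WorkQueue {
  fun push(message: StartExecutionMessage)
}

fun hydrateQueue(jdbcUrl: String, user: String, password: String, queue: WorkQueue) {
  DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
    conn.prepareStatement("SELECT id FROM pipelines WHERE status = ?").use { stmt ->
      stmt.setString(1, "NOT_STARTED")
      stmt.executeQuery().use { rs ->
        while (rs.next()) {
          // Re-enqueueing should be idempotent: starting an already-started
          // execution needs to be treated as a no-op by the handler.
          queue.push(StartExecutionMessage(rs.getString("id")))
        }
      }
    }
  }
}
```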

Migration

For Netflix, bringing Orca down (and thus, Spinnaker) to migrate from Redis to SQL is a non-starter, so we needed a way to perform a zero-downtime migration. We already had agents for migrating from one Redis master to a new one while both were live, and the RedisExecutionRepository knew how to route read & write traffic accordingly while in dual-write mode.

When running a migration, currently active executions keep going to the old Redis, while new executions go to the new Redis. Executions are only shoveled over to the new Redis instance once they reach a completed state (be it successful, terminal…). I like this strategy a lot: it’s easy to reason about and works in both roll-forward and rollback scenarios.

I refactored the migration agents to be agnostic of the ExecutionRepository implementation, so they now support migrating from anything to anything following the same pattern. I also created a backend-agnostic DualExecutionRepository with the same routing characteristics I outlined above. You can migrate from Redis to SQL, SQL to Redis, or SomethingElse to SomethingElse2.
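Here’s a trimmed-down sketch of the routing idea; Orca’s real ExecutionRepository interface is far larger than the two methods shown, so treat this as the concept rather than the actual class:

```kotlin
// Trimmed-down sketch of the dual-repository routing described above: new
// executions land on the primary, in-flight executions stay on whichever
// backend already owns them, and reads fall back from primary to previous.
data class Execution(val id: String, val status: String)

interface ExecutionRepository {
  fun retrieve(id: String): Execution?
  fun store(execution: Execution)
}

class DualExecutionRepository(
  private val primary: ExecutionRepository,   // e.g. SQL
  private val previous: ExecutionRepository   // e.g. Redis
) : ExecutionRepository {

  override fun retrieve(id: String): Execution? =
    primary.retrieve(id) ?: previous.retrieve(id)

  override fun store(execution: Execution) {
    // Keep in-flight executions on the backend that already owns them so a
    // rollback stays possible; everything new goes to the primary.
    if (primary.retrieve(execution.id) == null && previous.retrieve(execution.id) != null) {
      previous.store(execution)
    } else {
      primary.store(execution)
    }
  }
}
```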

Originally we had written a migrator that did live migrations and used some fancy write-conflict resolution, but when push came to shove, I rewrote the migrator last week for simplicity. Migration went from a fairly complex orchestration to, “Run this other Orca instance in migration mode until it’s not doing work anymore, then shut it down.”

Performance

As I mentioned earlier, we wanted performance to compete with Redis, and I had a personal goal to 10x the read performance; I like the 10x principle a lot when it comes to refactors. Orca’s write performance isn’t really a huge deal: at peak we only do hundreds of writes per second but thousands of reads. To compare apples to apples, I wrote an instrumented execution repository so we could compare the two strategies end-to-end, rather than only at driver call sites.

Is this the SQL impl or Redis? Well, both. It’s the DualExecutionRepository.
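The instrumentation itself is just a decorator over the repository interface that times every call, so both backends can be measured end-to-end under identical traffic. A sketch, reusing the trimmed-down interface from the migration section above; the real thing would publish to a metrics registry such as Spectator rather than a plain callback:

```kotlin
// Sketch of the "instrumented repository" idea: a decorator that times each
// call so two backends can be compared end-to-end under identical traffic.
// Reuses the trimmed-down Execution/ExecutionRepository types sketched earlier.
class InstrumentedExecutionRepository(
  private val delegate: ExecutionRepository,
  private val backendName: String,
  private val record: (metric: String, nanos: Long) -> Unit
) : ExecutionRepository {

  private inline fun <T> timed(method: String, block: () -> T): T {
    val start = System.nanoTime()
    try {
      return block()
    } finally {
      record("$backendName.$method", System.nanoTime() - start)
    }
  }

  override fun retrieve(id: String): Execution? = timed("retrieve") { delegate.retrieve(id) }
  override fun store(execution: Execution) = timed("store") { delegate.store(execution) }
}
```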

Redis is fast: few would argue otherwise, but it isn’t fast for how Orca’s schema is designed. Retrieving a single execution by ID takes between 2 and 5 Redis commands, and those are round-trip, non-pipelined commands. If you’re trying to get the most recent execution for an application, it’s another handful of round trips. This round-trip charade really starts to add up when you need to load a bunch of executions (even with the pipelining we do for list operations).
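To illustrate the shape of that access pattern (the key layout below is made up for illustration, it is not Orca’s actual Redis schema), assembling a single execution looks something like this:

```kotlin
// Illustrative only: the key layout is invented, not Orca's real Redis schema.
// The point is the number of dependent round trips needed to assemble one
// execution, versus the single indexed SQL query described below.
import redis.clients.jedis.Jedis

fun loadExecutionFromRedis(
  jedis: Jedis,
  executionId: String
): Pair<Map<String, String>, List<Map<String, String>>> {
  // Round trip 1: the root execution hash.
  val execution = jedis.hgetAll("pipeline:$executionId")

  // Round trip 2: the set of stage ids belonging to this execution.
  val stageIds = jedis.smembers("pipeline:$executionId:stageIndex")

  // Round trips 3..N: one hash fetch per stage (pipelining helps, but the
  // results still have to be reassembled in application code).
  val stages = stageIds.map { stageId -> jedis.hgetAll("pipeline:$executionId:stage:$stageId") }

  return execution to stages
}
```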

For SQL, a read is a single indexed query, which blows the Redis implementation out of the water, especially when a request asks for more executions. We’ve seen improvements of up to 20x in various 99.9th-percentile operations. Pages like the Tasks view, which for our larger customers may cover 30,000 executions, dropped from loading in 40+ seconds to ~4 seconds.

It’s not all rosy, though: read performance still needs love, and object serialization costs a lot. In a separate, future effort I’ll be looking to switch from JSON to a binary format that’s faster than Jackson, with the added benefit of less network transfer time.

When I deployed SQL, I took a moment to overlay RedisExecutionRepository invocation timings against the SqlExecutionRepository’s. The graph below is rendered logarithmically and compares Redis (red) against SQL (blue), both handling the same volume of production traffic.

red: Redis, blue: SQL

Testing & Deployment

I’ve really enjoyed testing this work. Since this was a project borne out of reliability concerns, I got to break things a lot, and who doesn’t like taking a hammer to something they’ve worked on?

Aside from the TCK work, my first objective was getting SQL deployed into our test environment. We essentially have three primary environments: Test, Staging, and Main. No one really cares about Test; you could think of it as our long-lived experimental environment. I deployed here first, once all of the TCK methods were passing and things were generally working on my laptop. The migration code was just getting started at this point, so we simply deleted all of our old execution data. This environment caught a whole host of bugs and also let me kick the tires on reliability tests while we iterated on migration strategies.

Using the spinnaker-performance repository and a bunch of 1-minute cron pipelines, I was able to sustain ~1,200 active executions of varying complexity (from simple wait-stage pipelines to full server group deployments). During this time I would pull the plug on random things, watch server logs and Atlas, and poke around in the UI & API. My main questions were:

  1. What happens when I fail over the MySQL master?
  2. What happens when I blackhole network traffic to MySQL or Redis?
  3. What happens when I flush MySQL or Redis?

Eventually, all of the bugs that came up got squashed and I was able to focus more on making things faster, specifically around failover behavior and minimizing impact. Once the migration code was written, I was able to go through this whole process again at varying stages of the migration.

We then deployed to the staging environment with the migration code. Staging is on the critical path for prod deployments and is where all of our automated testing continually runs. Throwing the performance repo at this environment, on top of the automated and manual testing that’s always happening there, sniffed out some more bugs. Once the environment was stable, sometime in late May, I called for running in staging for all of June: if there were no major bugs, we’d go to production in July. It took longer than that. My co-worker Asher was a huge help tuning things while we were in staging.

While we were doing this work, a new customer was asking us for SLA numbers on new execution throughput. I had some fun thrashing our staging environment (much smaller than production):

“…sending 100 pipelines per second over 120 seconds … Of the 11,992 pipelines submitted over 120 seconds, 100% of them were successfully received and executed. During this test, we started with 11 executions running and peaked at 5,630 active executions...”

I later turned it up to 11 for my own late-night giggles until the system finally caved at 45,000 active executions (almost an order of magnitude more than our highest production peak).

Remaining Work

The SQL work isn’t open source yet. What gives? Well, there’s some more work yet that I want to do before delivering it.

I want to let SQL bake in our production environment longer. Asher and I are currently iterating on better monitoring and tackling some rough edges that I want buttoned up first.

We’ve delivered an ungodly number of undocumented features, but this is one that I want to deliver with a bow on it. That means docs for setup, our Aurora/MySQL configuration, MariaDB driver setup, operations, and monitoring. It’s a lot of work, but fortunately much of it is just making public-friendly versions of material we already have. I’ve also never done any work with Halyard, so I need to learn how to do that and get the configuration exposed there for you Halyard users.

Retrospective

If I were doing this all over again, I’d definitely do it all behind closed doors again: we were able to make aggressive changes to the implementation without worrying about early adopters.

The schema we chose was one of familiarity and speed of development. However, if I were to do it again, I’d likely switch to an insert-only, event-sourcing model, snapshotting aggregates on execution completion. I wouldn’t rule out this design becoming a thing in the future.

Wrap Up

Hopefully this was an interesting read for you. When can you expect it for use? I’m planning to open the initial set of PRs into Orca in late August / early September.

So, the rest of Spinnaker is still running on Redis. Are we looking at using SQL anywhere else? Maybe; we’ve kicked around the idea for Front50 but haven’t committed to anything. Make no mistake, I still love Redis and think it’s the right fit in quite a few places within Spinnaker, even as we continue to make more architectural adjustments in the future!

Footnotes

  1. You will see this level of support in future platform changes that we’re making across Spinnaker. I suspect it’s an inevitability that Spanner support will come in the future as well.
  2. For the multi-region story, we’ve written the schema to support regional partitioning. We haven’t implemented this bit yet.
  3. JSON serialization is slow. I plan to switch this to a form of binary serialization in the not-too-distant future.
