Spinnaker Orchestration

Author: Rob Fletcher

When the Spinnaker project first started more than two years ago we implemented Orca — Spinnaker’s orchestration engine µservice — using Spring Batch. It wasn’t an entirely unreasonable fit at the time. It gave us atomic, compartmentalized units of work (tasks in a Spinnaker pipeline), retry semantics, the ability to define branching and joining workflows, listeners that could get notified of progress and many other things we needed. However, in the long run, that choice—and our implementation on top of it—imposed a number of constraints.

For a long time some of the operational constraints of Orca have been a source of frustration and not something we were proud of or keen to advertise.

Yesterdayland

The most obvious constraint was that Orca was a stateful service—it pinned running pipelines to a single instance. Halting that instance, whether due to a failure or a normal red-black deploy, would unceremoniously stop the pipeline in its tracks with no easy way to continue it.

In addition, Orca locked a thread for the entire duration of the pipeline, which although typically minutes long, are not infrequently hours or days. It did this even when the pipeline was doing nothing more than polling every so often for a change, waiting for a predefined duration or even awaiting manual judgment before continuing.

When deploying a new version of Orca we’d have to allow work to drain from the old server groups. Although we automated this process (monitoring instances until they were idle before shutting them down) it wasn’t uncommon for a handful of instances to be hanging around for days, each one draining one or two long running canary pipelines.

Because of the way we mapped pipelines to Spring Batch jobs we had to plan the entire execution in advance, which is very limiting. We were forced to jump through all kinds of hoops to build functionality like rolling push deployments on top of such a static workflow model. It was also very hard to later implement the ability for users to restart pipelines after a stage failed or to automatically restart pipelines dropped in the case of instance failure as the mapping of pipeline to Spring Batch job initially wasn’t idempotent.

As an aside, I should point out that most of the constraints we struggled with are not inherent limitations of Spring Batch. It’s very good at what it does. But it’s really not intended as a general-purpose workflow engine and certainly not designed with distributed processing in mind.

Sabrina’s Christmas Wish

Despite these issues, things hung together well enough. We were aware of the limitations and chafed against them, but they never bit us so badly that we prioritized tackling them over some of the new features we were working on. Although, as the engineer who implemented most of Orca in the first place, I was desperate to fix what I saw as being my own mess.

I finally got that chance when the volume of internal use at Netflix hit a point that we decided it was time to tackle our resiliency and reliability concerns. The fact that Orca is ticking over when running between 2000 and 5000 pipelines per day (peaking at over 400,000 individual task executions some days) is not too shabby. However, with Spinnaker existing as the control plane for almost the entire Netflix cloud and the project growing in use in the wider community we felt it was time to harden it and make sure resiliency was something we had real confidence in.

To that end, we recently rolled out a significant change we dubbed “Nü Orca”. I’d like to take some time to introduce the changes and what makes them such an improvement.

Brand New Couch

Instead of using Spring Batch to run pipelines, we decided to implement our own solution using a simple command queue. The queue is shared across all the instances in an Orca cluster. The queue API has push operations for immediate and delayed delivery and a pop operation with acknowledgment semantics. Any message that goes unacknowledged for more than a minute gets re-delivered. That way, if we lose an Orca instance that’s in the process of handling a message, that message is simply re-delivered and will get picked up by another instance.

Messages on the queue are simple commands such as “start execution”, “start stage”, “run task”, “complete stage”, “pause execution”, etc. Most represent desired state changes and decision points in the execution while “run task” represents the atomic units of work that pipelines break down into. The intention is that messages should be processed quickly — in the order of seconds at most when running tasks that talk to other services such as CloudDriver.

We use a worker class — QueueProcessor — to pop messages from the queue. It is invoked by Spring’s scheduler with a 10ms delay between polls. The worker’s only job is to hand each message off to the appropriate MessageHandler. Once a handler has processed a message without any uncaught exceptions, the worker acknowledges the message. The call to the handler and the acknowledgment of the message happen asynchronously using a dedicated thread pool so they do not delay the queue polling cycle.

Message handlers can add further commands to the queue. For example:

  • StartStageHandler identifies the sequence of sub-stages and tasks and then sends StartStage or StartTask commands to set them running.
  • StartTaskHandler records that a task is running then queues a RunTask command.
  • RunTaskHandler executes a task once and then either queues the same RunTask command with a delay if the task is not complete (for example, a task polling until a server group reaches a desired number of healthy instances) or a CompleteTask if the execution can move on.
  • CompleteStageHandler figures out what downstream stages are able to run next and queues a StartStage message for each or a CompleteExecution message if everything is finished.

…and so on.

This design allows work to naturally spread across the Orca cluster. There’s no requirement for a queued message to be processed by any particular Orca instance. Because un-acknowledged messages are re-queued, we can tolerate instance failure and aggressively deploy new versions of the service without having to drain work from older server groups. We can even turn Chaos Monkey loose on Orca!

Message handlers can also emit events using Spring’s application event bus that keep other listeners informed of the progress of a pipeline. We use this for sending email / Slack notifications and triggering downstream pipelines — things Orca was doing already. We’ve also added a log of the activity on a particular pipeline since trying to track distributed work using server logs will be next to impossible. Processing of these pub/sub events does currently happen in-process (although on a different thread).

We have in-memory, Redis and SQS queue implementations working. Internally at Netflix we are using the Redis implementation but having SQS is a useful proof-of-concept and helps us ensure the queue API is not tied to any particular underlying implementation. We’re likely to look at using dyno-queues in the long term.

Why Redis? Partially because we’re using it already and don’t want to burden Spinnaker adopters with further infrastructure requirements. Mainly because Redis’ speed, simple transactions and flexible data structures give us the foundation to build a straightforward queue implementation. Queue data is ephemeral — there should be nothing left once a pipeline completes — so we don’t have concerns about long-term storage.

Fundamentally the queue and handler model is extremely simple. It maps better to how we want pipelines to run and it gives us flexibility rather than forcing us to implement cumbersome workarounds.

Yes And

Ad-hoc restarts are already proving significantly easier and more reliable. Pipelines can be restarted from any stage, successful or not, and simultaneous restarts of multiple branches are no problem.

We have also started to implement some operational capabilities such as rate limiting and traffic shaping. By simply proxying the queue implementation, we can implement mechanisms to back off excessive traffic from individual applications, prioritize in-flight work or urgent actions like rollbacks of edge services, or pre-emptively auto-scale the service to guarantee capacity for upcoming work.

Let’s Find Out

If you want to try out the queue based execution engine in your own pipelines, you can do so by setting queue.redis.enabled = true in orca-local.yml.

Then, to run a particular pipeline with the queue, set “executionEngine”: “v3” in your pipeline config JSON.

To run ad-hoc tasks (e.g. the actions available under the “server group actions” menu), you can set the configuration flag orchestration.executionEngine to v3 in orca-local.yml.

Within Netflix, we are migrating pipelines across to the new workflow engine gradually. Right now, Nü Orca exists alongside the old Spring Batch implementation. Backward compatibility was a priority as we knew we didn’t want a “big bang” cut-over. None of the existing Stage or Task implementations have changed at all. We can configure pipelines individually to run on either the old or new “execution engine”.

As we gain confidence in the new engine and iron out the inevitable bugs, we’ll get more and more high-risk pipelines migrated. At some point soon I’ll be able to embark on the really fun part — deleting a ton of old code, kludgy workarounds and cruft.

After the Party

Once we’re no longer running anything on the old execution engine, there are a number of avenues that open up.

  • More efficient handling of long-running tasks.
  • Instead of looping, the rolling push deployment strategy can “lay track in front of itself” by adding tasks to the running pipeline.
  • Cancellation routines that run to back out changes made by certain stages if a pipeline is canceled or fails can be integrated into the execution model and surfaced in the Spinnaker UI.
  • The implementation of execution windows (where stages are prevented from running outside of certain times) can be simplified and made more flexible.
  • Determination of targets for stages that affect server groups can be done ad-hoc, giving us greater flexibility and reducing the risk of concurrent mutations to the cloud causing problems.
  • Using a state convergence model to keep an application’s cloud footprint in line with a desired state rather than simply running finite duration pipelines.

I’m sure there will be more possibilities.

I’m relieved to have had the opportunity to make Orca a better fit for Netflix’s distributed and fault-tolerant style of application. I’m excited about where we can go from here.

Show your support

Clapping shows how much you appreciated Netflix Technology Blog’s story.