
Long, long ago, in an internet that I barely remember, I wrote about monitoring Orca. I haven’t managed to take the time to write another post about a specific service — it’s a lot of work! Instead of going deep this time around, I want to paint with broader strokes: What are the key metrics we can track that help quickly answer the question, “Is Spinnaker healthy?”

Spinnaker is composed of about a dozen open source services that may vary widely based on configuration, and as such, there’s no single metric to rule them all. This makes “Is Spinnaker healthy?” a particularly bothersome question, since not all services are equally important. If Igor (the service responsible for monitoring CI/SCM systems) is unable to communicate with Jenkins, Spinnaker will be in a degraded state, but its core behavior is still healthy. …
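To make the difference between "degraded" and "unhealthy" concrete, here is a minimal sketch in Python. The tiering of services into core and supporting, the choice of Orca and Clouddriver as the core set, and the function name are all assumptions made for illustration. This is not how Spinnaker itself reports health.

```python
from enum import Enum
from typing import Mapping


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


# Hypothetical tiering: which services count as "core" is an assumption
# made for this sketch, not an official Spinnaker classification.
CORE_SERVICES = {"orca", "clouddriver"}


def overall_health(service_up: Mapping[str, bool]) -> Health:
    """Roll per-service up/down flags into a single status."""
    down = {name for name, up in service_up.items() if not up}
    if down & CORE_SERVICES:
        # A core service is down: core behavior is affected.
        return Health.UNHEALTHY
    if down:
        # Only supporting services (e.g. Igor) are down.
        return Health.DEGRADED
    return Health.HEALTHY


if __name__ == "__main__":
    print(overall_health({"orca": True, "clouddriver": True, "igor": False}))
```

The point of the sketch is only that a single boolean is not enough: an Igor outage and an Orca outage should not produce the same answer.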

Airbnb recently came out with a cool article about their adoption story and how they’ve been extending Spinnaker in the process. If you haven’t already, I’d recommend checking it out before continuing!

If you lurk in Spinnaker’s OSS development, you’ll know there’s an active effort between Netflix and Armory to introduce a true, complete plugin model for the project. While these are still very early days, I thought it would be a fun exercise to enumerate many (though not all) of the ways that Netflix has extended open source Spinnaker.

What Extensions Has Netflix Built?

I’ve mentioned before that we have ~30 engineers on the Netflix Delivery Engineering team (about two-thirds of whom work on Spinnaker). That’s a big team. In addition to the OSS services, we maintain another 4 or 5 internal services, plus a ton of custom code layered on top of OSS Spinnaker. …

Back to Foundations

2019 is a year of both personal and professional rebuilding. As with any rebuilding, we need to revisit the foundations. Spinnaker is already a great product but it’s also 5 years old and some parts are starting to weather. Spinnaker can be even bester, but to get there we need to focus on the basics and rebuild the foundation where necessary to sail into the next 5 years. This isn’t an official roadmap, it’s my own — but you’re welcome to help.

A building can have any number of corners, but I like my buildings bland so my foundation has four…

Clouddriver is a crucially important service within Spinnaker: Its charter is to track and mutate state of the various cloud providers. If I had to rank services by importance, I’d say it is number two right behind Orca.

As far as scaling Spinnaker itself goes, however, Clouddriver is the most important service to keep running smoothly, yet at Netflix’s scale, it is also currently the hardest to operate. In this post, I’m going to give a small crash course in Clouddriver’s architecture, its persistence, the various stages of deployment topologies we’ve gone through, and where we’re headed for the future. Strap in!

Clouddriver at 30,000 feet

In my mind, there are three parts to Clouddriver: Its cache, atomic operations, and the API. …

2018 has been a huge year of change for me, and for Spinnaker it’s no different! I’d like to recap a few of my highlights as they relate to Spinnaker, both internally at Netflix and in the community at large.


The most exciting change for me is OSS Governance. We’re not a part of a foundation, but governance is an important first step to that possibility. It establishes the framework for how Spinnaker will be managed in a more open manner, and provides both a sense of security for corporations to adopt and invest and a clear path towards greater roles within the project. …

The second inaugural Spinnaker Summit is just a little over a week away! I’m going, a whole lot of my co-workers are going, and I hope you are too. More than that, I wanted to plug a few things.

First, Rob Fletcher and I will be speaking on Tuesday at 10:45am on Declarative Spinnaker. There’s a panel talk by Google on Monday about Spinnaker and Borg: It’s quite related. I’d recommend you attend both if you’re into the declarative delivery scene.

Second, I’ll be at two Office Hours on Monday. …

In Q2 and Q3, the Netflix Spinnaker team worked on developing and releasing a new SQL storage backend for Orca, our orchestration service.

Strap in, this may be my longest post yet. My goal in this post is to outline end-to-end how the sausage is made on some larger efforts within Spinnaker, as well as sort of advertise how extensible it can be. I’ll go over why we did this, how we did it and why we chose that approach, then put a bow on it with some retrospection.

First, a big thanks to Asher and Chris for their help! …

It’s been a while since my last update. If I’m not telling people what I’m doing, am I really doing anything at all? Let’s talk Spinnaker.

Q2, Q3

I’ve been in NYC this week hanging out with the Google Spinnaker team and meeting users who are based out of NYC. There are a lot of common strands in the feedback that I get, mostly around performance and observability. That’s good, I think, because it validates what I believe is important for me to focus on above all else.

As I mentioned in my last post, performance and reliability weigh heavily on our priorities. We’ve done a lot of great work here, and even this week had a nice win on the performance side of things just via configuration: By reducing our Redis batch command size, we tightened our Redis replication lag in Clouddriver, so performance is more predictable. It’s not necessarily faster, yet predictability goes a real long way for reliability. …

Q2 has been all about performance and reliability improvements for the Netflix Spinnaker team. This may come as a surprise after my last post, where I said I’d be focusing on Declarative.

First, a note on 5.1. Last weekend I wrote an Ed 5, then deleted it shortly after publishing. In summary, it outlined how we’ve had to de-prioritize feature development in favor of a strict focus on performance, reliability and overall quality improvement. I deleted the article because I felt the post needed to be reworked and re-presented; I also wasn’t enthralled with how little I actually said. …

In a recent post to my dev log, I mentioned I wanted to write about scaling strategies for Redis, our primary storage engine, within Spinnaker. But before we can jump into that, a far more important topic is necessary: Monitoring and alerting. Without measuring your applications, how can you actually be sure they’re behaving correctly, let alone know what part of the system needs your attention to continue growing?

So, we’ll learn about monitoring Spinnaker first, service by service, while taking a look at graphs of (as far as I know) the largest Spinnaker installation: Netflix’s production deployment. Irrespective of your Spinnaker’s deployment footprint, the metrics I’ll detail in this series will be valuable to you. …


Rob Zienert

Sr Software Engineer @ Netflix
