Dev Journal, Ed. 3

Rob Zienert
4 min read · Feb 23, 2018


We’re over halfway through 2018Q1. Yikes! As it pertains to my quarterly objectives, I’m currently at the bargaining stage of grief.

Redis & Dynomite Refactor

A lot of progress here over the past few weeks. I’ve standardized Redis connection configuration into kork and started upgrading each of the services to this new library. Services will no longer be forced to use a single connection pool, but will have named clients that can be connected to different Redis clusters as necessary, and drivers (Redis, Redis Sentinel, Redis Cluster or Dynomite) can be switched without downtime. This flexibility will be really helpful as we continue down the path of improving our scalability. Neat!
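
To make the named-client idea concrete, here’s a minimal sketch in Kotlin of what this buys us. It’s very much not kork’s actual API, just the shape of the idea:

```kotlin
// A minimal sketch of the named-client idea; not kork's actual API.
// Each logical consumer asks for a client by name, and which cluster
// and driver backs that name is purely a matter of configuration.

enum class RedisDriver { REDIS, REDIS_SENTINEL, REDIS_CLUSTER, DYNOMITE }

data class RedisClientConfig(
  val name: String,       // logical name, e.g. "taskRepository"
  val driver: RedisDriver,
  val connection: String  // e.g. "redis://cache-01:6379"
)

class NamedRedisClientRegistry(configs: List<RedisClientConfig>) {
  private val byName = configs.associateBy { it.name }

  fun config(name: String): RedisClientConfig =
    byName[name] ?: error("No Redis client named '$name' is configured")
}

fun main() {
  val registry = NamedRedisClientRegistry(
    listOf(
      RedisClientConfig("default", RedisDriver.REDIS, "redis://redis-main:6379"),
      RedisClientConfig("taskRepository", RedisDriver.DYNOMITE, "dyno://dynomite-tasks:8102")
    )
  )
  // Moving "taskRepository" to a different cluster or driver is a config
  // change only; no calling code has to change.
  println(registry.config("taskRepository"))
}
```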

We’ve seen some resilience issues in our services while moving onto Dynomite, so I’m working on improving handling of client disconnects and the like. I’ll be putting in a best effort to land these improvements at a high enough abstraction that they benefit all drivers where reasonable.
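
For a rough feel of what “high enough abstraction” means here: something like a single retry wrapper that every driver’s operations flow through, so Redis, Sentinel, Cluster and Dynomite all benefit. All names below are made up:

```kotlin
// A hand-wavy sketch of handling disconnects above the driver layer.

class TransientConnectionException(cause: Throwable) : RuntimeException(cause)

fun <T> withRetries(maxAttempts: Int = 3, backoffMs: Long = 50, op: () -> T): T {
  var lastError: Exception? = null
  repeat(maxAttempts) { attempt ->
    try {
      return op() // non-local return: success short-circuits the loop
    } catch (e: TransientConnectionException) {
      lastError = e
      Thread.sleep(backoffMs * (attempt + 1)) // linear backoff between attempts
    }
  }
  throw lastError!!
}
```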

While I want to perform some testing around failure conditions and Dynomite cluster scale-ups, my semi-realistic goal is for Clouddriver to go to production with Dynomite late next week. We have ~7 Redis read replicas, each of which gets dedicated, partitioned traffic. My plan is to run active-active caching clusters (one on Redis, one on Dynomite) and then migrate one read replica at a time. I’m not sure yet when I’ll migrate the task repository.
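
Roughly, the migration shape looks like this; `CacheStore` and friends are stand-ins I’ve invented for illustration, not Spinnaker code:

```kotlin
// Active-active during migration: writes go to both stores, reads come
// from whichever store a given shard has been cut over to.

interface CacheStore {
  fun write(key: String, value: String)
  fun read(key: String): String?
}

class DualWriteCacheStore(
  private val redis: CacheStore,
  private val dynomite: CacheStore,
  private val readFromDynomite: (key: String) -> Boolean // per-shard cutover flag
) : CacheStore {
  override fun write(key: String, value: String) {
    redis.write(key, value)    // keep the old store warm for rollback
    dynomite.write(key, value) // while populating the new one
  }

  override fun read(key: String): String? =
    if (readFromDynomite(key)) dynomite.read(key) else redis.read(key)
}
```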

Scaling Spinnaker Guide

Scaling Spinnaker isn’t an easy task, especially if you’re not an engineer developing and operating it day in and day out.

With that in mind, I’m going to start writing a bit more on the operations side of things, since I often gravitate to that area. It’s very clear that the greater community doesn’t really know how to scale Spinnaker beyond smaller deployments, and there’s a lot I can write to help people avoid learning the hard way.

The first topic I’ve already started working on is scaling Redis. I plan to write it as a guide and contribute it directly to spinnaker.io.

There are a lot of “hidden” features we’ve built into Spinnaker for operating it, and more still that I’d love to migrate from Netflix-only Python/bash scripts into admin APIs. These things need documentation so people know they exist, as well as how and when to use them.

I’d also like to talk a bit about the inevitable incidents we hit while operating Spinnaker. Speaking of which…

Igor & Docker Incident Review

We had an incident two days ago: we mistakenly re-indexed all of our Docker repositories, which re-fired every pipeline with a Docker trigger. Not a good thing, especially since teams have pipelines that deploy based off those triggers.

So, what happened? The short of it is that we changed some configuration on our Docker registries in Clouddriver, which is responsible for caching Docker repositories. Igor is responsible for reading this data out of Clouddriver and then calculating which tags are new. When a new tag is found, it notifies Echo to trigger any pipelines that are interested.
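
In simplified Kotlin, the indexing loop looks something like the sketch below; every type here is a stand-in for the real Clouddriver/Igor/Echo interactions, not actual Igor code:

```kotlin
interface ClouddriverClient { fun cachedTags(repository: String): Set<String> }
interface EchoClient { fun triggerPipelinesFor(repository: String, tag: String) }
interface TagIndex {
  fun knownTags(repository: String): Set<String>? // null if the index key is missing
  fun record(repository: String, tags: Set<String>)
}

class DockerMonitor(
  private val clouddriver: ClouddriverClient,
  private val echo: EchoClient,
  private val index: TagIndex
) {
  fun poll(repository: String) {
    val cached = clouddriver.cachedTags(repository)
    val known = index.knownTags(repository)
    // This is where the incident hid: if the index key can't be found,
    // `known` is null and every cached tag looks brand new.
    val newTags = if (known == null) cached else cached - known
    newTags.forEach { tag -> echo.triggerPipelinesFor(repository, tag) }
    index.record(repository, cached)
  }
}
```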

Unfortunately, and unbeknownst to us at the time, Igor uses some non-identifiable information as part of its index keys: information that we changed during the Clouddriver config change. Once we were trying to read index keys that didn’t exist, Igor assumed every tag was new!

To figure out who was affected, we found all of the applications (from Front50) that have pipelines configured with Docker triggers, then looked up the owner field on each of those applications. That gave us our list of people to notify about the incident, which we did.
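
The blast-radius query itself is simple enough to sketch; the types below are made up for illustration, not Front50’s actual model:

```kotlin
data class Trigger(val type: String)
data class Pipeline(val application: String, val triggers: List<Trigger>)
data class Application(val name: String, val owner: String)

// Collect the owners of every application that has at least one
// pipeline with a Docker trigger.
fun ownersToNotify(pipelines: List<Pipeline>, applications: List<Application>): Set<String> {
  val affectedApps = pipelines
    .filter { p -> p.triggers.any { it.type == "docker" } }
    .map { it.application }
    .toSet()
  return applications
    .filter { it.name in affectedApps }
    .map { it.owner }
    .toSet()
}
```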

What’re we doing now? Well, a few things:

  1. We’re adding circuit breakers to Igor. If a cache cycle finds more items in a delta than a normative upper bound (statically defined in Igor’s configuration), the breaker opens and indexing stops. The breaker is attached to a metric which we’ll alert on; when the alert fires, we’ll be able to investigate and resolve whatever the problem is. Heavy-handed, but having simple protections in place fast is better than something more elegant that’ll take weeks. The Travis CI, Jenkins, Docker and GitLab CI monitors all get the same protection (see the first sketch after this list).
  2. We’re changing the key pattern for Igor’s indexes to no longer include non-identifiable information. Reviewing the code, the key doesn’t look like it’d have a problem, except that “repository” is a URI. When I change the key scheme, reads will first try the new scheme, then fall back to the old scheme if the new format doesn’t exist. As part of the migration, I’ll write an on-start hook that scans all old keys and TTLs them at around 30 days, as rollback protection (see the second sketch below).
  3. An operational API to close the circuit breaker from #1, as well as another API to re-index without sending notifications. Re-indexing isn’t really a problem, but sending notifications for things that were already notified certainly is.
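
Here’s a sketch of the breaker from item 1, with made-up names; the `onOpen` hook is where the metric we alert on would hang off:

```kotlin
// Latching circuit breaker: once a delta exceeds the configured bound,
// indexing stops until an operator explicitly closes the breaker.
class DeltaCircuitBreaker(
  private val maxDeltaSize: Int,         // static config, per item 1
  private val onOpen: (Int) -> Unit = {} // e.g. bump the metric we alert on
) {
  @Volatile
  var open: Boolean = false
    private set

  /** Returns true if the delta is safe to process. */
  fun allow(deltaSize: Int): Boolean {
    if (open) return false
    if (deltaSize > maxDeltaSize) {
      open = true
      onOpen(deltaSize)
      return false
    }
    return true
  }

  /** Operator action, exposed via the ops API from item 3. */
  fun close() { open = false }
}
```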

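And a sketch of the migration read path and rollback TTL from item 2, again with a made-up store API rather than Igor’s actual Redis commands:

```kotlin
interface KeyValueStore {
  fun get(key: String): String?
  fun expire(key: String, seconds: Long)
  fun scan(pattern: String): Sequence<String>
}

const val ROLLBACK_TTL_SECONDS = 30L * 24 * 60 * 60 // roughly 30 days

// New scheme first, old scheme as a fallback while the migration runs.
fun readIndex(store: KeyValueStore, newKey: String, oldKey: String): String? =
  store.get(newKey) ?: store.get(oldKey)

// On-start hook: old keys aren't deleted outright, they expire after the
// rollback window so a rollback still has data to work with.
fun ttlOldKeysOnStart(store: KeyValueStore, oldKeyPattern: String) {
  store.scan(oldKeyPattern).forEach { store.expire(it, ROLLBACK_TTL_SECONDS) }
}
```
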
We’re also kicking around the idea of forcing users to treat Docker tags as immutable, even though they aren’t. Then, if we could find a cost-efficient way of caching Docker manifests, we could further protect ourselves by auto-ignoring old tags.

As a funny aside, a co-worker did swing by and mention, “Well, we know your notification service scales well, because I got like 600 emails instantly.” That’ll be another thing to fix: it definitely smells like a problem if we need to send the same person a ton of emails in a short window.

The last lesson I’ll cover in this post: we notified everyone by email, but it would’ve been a prudent time to page the on-call for each affected application. Netflix’s Spinnaker has tight integration with PagerDuty, where each app must have a PagerDuty key assigned, which makes getting hold of the right people when things go sideways a breeze.
