Stories by Rob Zienert on Medium

Monitoring Spinnaker: SLA Metrics

Rob Zienert — Mon, 03 Feb 2020 20:57:30 GMT

Long, long ago, in an internet that I barely remember, I wrote about monitoring Orca. I haven’t managed to take the time to write another post about a specific service — it’s a lot of work! Instead of going deep this time around, I want to paint with broader strokes: What are the key metrics we can track that help quickly answer the question, “Is Spinnaker healthy?”

Spinnaker is composed of about a dozen open source services that may vary widely based on configuration, and as such, there’s no singular metric to rule them all. This makes “Is Spinnaker healthy?” a particularly bothersome question, since not all services are equally important. If Igor (the service responsible for monitoring CI/SCM systems) is unable to communicate with Jenkins, Spinnaker will be in a degraded state, but its core behavior is still healthy. Should Orca’s queue processing drop to zero, however, it’s time for an elevated heart rate and a quick remedy.

Service Metrics

The Service Level Indicators for our individual services can vary depending on configuration. For example, Clouddriver has cloud provider-specific metrics that should be tracked in addition to its core metrics. For the sake of this post’s length, I won’t be going into any cloud-specific metrics.

Universal Metrics

All Spinnaker services are RPC-based, and as such, the reliability of inbound and outbound requests is supremely important: If the services can’t talk to each other reliably, someone will be having a poor experience.

For each service, a controller.invocations metric is emitted, which is a PercentileTimer including the following tags:

status: The HTTP status code family, 2xx, 3xx, 4xx...
statusCode: The actual HTTP status code value, 204, 302, 429...
success: If the request is considered successful. There’s nuance here in the 4xx range, but 2xx and 3xx are definitely all successful, whereas 5xx definitely are not.
controller: The Spring Controller class that served this request
method: The Spring Controller method name, NOT the HTTP method

Similarly, each service also emits metrics for each RPC client that is configured via okhttp.requests. That is, Orca will have a variety of metrics for its Echo client, as well as its Clouddriver client. This metric has the following tags:

status: The HTTP status code family, 2xx, 3xx, 4xx...
statusCode: The actual HTTP status code value, 204, 302, 429...
success: If the request is considered successful
authenticated: Whether or not the request was authenticated or anonymous (if Fiat is disabled, this is always false)
requestHost: The DNS name of the client. Depending on your topology, some services may have more than one client to a particular service (like Igor to Jenkins, or Orca to Clouddriver shards).

Example of our 24/7 request fanout from Gate. One interesting tidbit: The sudden increase in traffic at 9am is the increased traffic to Clouddriver (bottom) from Chaos Monkey starting its daily light mayhem!

Having SLOs (and consequently, alerts) around failure rate (determined via the success tag) and latency for both inbound and outbound RPC requests is, in my mind, mandatory across all Spinnaker services.

As a real world example, the alert Netflix uses for Orca to all of its client services is:

nf.cluster,orca-main.*,:re,
name,okhttp.requests,:eq,:and,
status,(,Unknown,5xx,),:in,:and,
statistic,count,:eq,:and,
:sum,
(,nf.cluster,),:by,
0.2,:gt,3,
:rolling-count,3,:ge

So, for people who can’t read Atlas expressions, if we have more than 0.2 failing/unknown RPS to a specific service over 3 minutes, we’ll get an alert.
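For readers who prefer code to stack expressions, here is a minimal Java sketch of what the tail of that expression does: compare each datapoint to 0.2, count how many of the last 3 datapoints exceeded it, and fire when that count reaches 3. The class and method names are hypothetical (they are not part of Atlas), and the sketch assumes the caller feeds it one summed datapoint per step for a single cluster.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Fires when a value exceeds a threshold in at least minHits of the last `window` datapoints. */
class RollingCountAlert {
    private final double threshold;
    private final int window;
    private final int minHits;
    // One flag per datapoint: did it exceed the threshold? (the ":gt" step)
    private final Deque<Boolean> recent = new ArrayDeque<>();

    RollingCountAlert(double threshold, int window, int minHits) {
        this.threshold = threshold;
        this.window = window;
        this.minHits = minHits;
    }

    /** Records one datapoint of failing requests per second; returns true if the alert fires. */
    boolean record(double failingRps) {
        recent.addLast(failingRps > threshold);
        if (recent.size() > window) {
            recent.removeFirst(); // keep only the last `window` datapoints
        }
        int hits = 0;
        for (boolean exceeded : recent) {
            if (exceeded) {
                hits++; // the ":rolling-count" step
            }
        }
        return hits >= minHits; // the ":ge" step
    }

    public static void main(String[] args) {
        RollingCountAlert alert = new RollingCountAlert(0.2, 3, 3);
        double[] failingRps = {0.5, 0.3, 0.25, 0.1};
        for (double value : failingRps) {
            System.out.println(value + " -> " + alert.record(value));
        }
    }
}
```

With the values in main, the alert fires only on the third datapoint, once three consecutive datapoints have exceeded 0.2, and clears again as soon as one datapoint falls back under the threshold.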

Service-specific Metrics

Most of our services have an additional metric to judge operational health, but in/out RPC monitoring will go far if you’re just starting out.

Echo
echo.triggers.count tracks the number of CRON-triggered pipeline executions fired. This value should be pretty steady, so any significant deviation is an indicator of something going awry (or the addition/retirement of a customer integration).
echo.pubsub.messagesProcessed is important if you have any PubSub triggers. Your mileage may vary, but Netflix can alert if any subscriptions drop to zero for more than a few minutes.

Orca
task.invocations.duration tracks how long individual queue tasks take to execute. While it is a Timer, for an SLA Metric, its count is what’s important. This metric’s value can vary widely, but if it drops to zero, it means Orca isn’t processing any new work, so Spinnaker is dead in the water from a core behavior perspective.

Clouddriver: Each cloud provider is going to emit its own metrics that can help determine health, but two universal ones I recommend tracking are related to its cache.
cache.drift tracks cache freshness. You should group this by agent and region to be granular on exactly what cache collection is falling behind. How much lag is acceptable for your org is up to you, but don’t make it zero.
executionCount tracks the number of caching agent executions; combined with the status tag, it lets us track how many specific caching agents are failing at any given time.
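To make the "group by agent and region" advice concrete, here is a small Java sketch that takes drift readings, keeps the worst reading per agent and region, and reports the groups that exceed an acceptable lag. Everything here is illustrative: the Sample type, the agent names, and the choice of milliseconds as the unit are assumptions, not Clouddriver’s actual metric schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CacheDriftCheck {
    /** One drift reading, tagged with the agent and region it came from (hypothetical shape). */
    static final class Sample {
        final String agent;
        final String region;
        final long driftMillis;

        Sample(String agent, String region, long driftMillis) {
            this.agent = agent;
            this.region = region;
            this.driftMillis = driftMillis;
        }
    }

    /** Returns the "agent/region" groups whose worst drift exceeds maxDriftMillis, sorted by name. */
    static List<String> staleGroups(List<Sample> samples, long maxDriftMillis) {
        Map<String, Long> worstDrift = new TreeMap<>();
        for (Sample s : samples) {
            // Keep the largest drift seen for each agent/region pair.
            worstDrift.merge(s.agent + "/" + s.region, s.driftMillis, Long::max);
        }
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : worstDrift.entrySet()) {
            if (entry.getValue() > maxDriftMillis) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

Grouping before thresholding is the point: a single fleet-wide number would hide the one collection that is falling behind.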

Here, one collection for a specific AWS service in our largest region was getting stale. In this case, while AWS availability was fine for Clouddriver, Edda was having trouble refreshing.

It’s OK that there are failures in agents: As stable as we like to think our cloud providers are, it’s still another software system and software will fail. Unless you see sustained failure, there’s not much to worry about here. This is often an indicator of a downstream cloud provider issue.

Igor
pollingMonitor.failed tracks the failure rate of CI/SCM monitor poll cycles. Any value above 0 is a bad place to be, but is often a result of downstream service availability issues such as Jenkins going offline for maintenance.
pollingMonitor.itemsOverThreshold tracks a polling monitor circuit breaker. Any value over 0 is a bad time, because it means the breaker is open for a particular monitor and it requires manual intervention.

Product SLAs at Netflix

We also track specific metrics as they pertain to some of our close internal customers. Some customers care most about latency reading our cloud cache, others have strict requirements in latency and reliability of ad-hoc pipeline executions.

In addition to tracking our own internal metrics for each customer, we also subscribe to our customers’ alerts against Spinnaker. If internal metrics don’t alert us of a problem before our customers are aware something is wrong, we at least don’t want to wait for our customers to tell us.

Continued Observability Improvements

Since Spinnaker is such a large, varied system, blog posts such as these are fine, but really are meant to get the wheels turning on what could be possible. It also highlights a problem with Spinnaker today: A lack of easily discoverable operational insights and knobs. No one should have to rely on a core contributor to distill information like this into a blog post!

There’s already been a start to improving automated service configuration property documentation, but something similar needs to be started for metrics and matching admin APIs as well. A contribution that documents metrics, their tags, purpose and related alerts would be of huge impact to the project and something I’d be happy to mentor on and/or jumpstart.

Of course, if you want to get involved in improving Spinnaker’s operational characteristics, there’s a Special Interest Group for that. We’d love to see you there!

Monitoring Spinnaker: SLA Metrics was originally published in The Spinnaker Community Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Netflix Has Extended Spinnaker

Rob Zienert — Wed, 30 Oct 2019 15:32:02 GMT

Airbnb recently came out with a cool article about their adoption story and how they’ve been extending Spinnaker in the process. If you haven’t already, I’d recommend checking it out before continuing!

If you lurk Spinnaker’s OSS development, you’d know there’s an active effort to introduce a true, complete plugin model for the project between Netflix and Armory. While this is still very early days, I thought it would be a fun exercise to enumerate many (yet, not all) of the ways that Netflix has extended open source Spinnaker.

What Extensions Has Netflix Built?

I’ve mentioned before we have ~30 engineers on the Netflix Delivery Engineering team (2/3 of that work on Spinnaker). That’s a big team. In addition to the OSS services, we have an additional 4 or 5 internal services we maintain, plus a ton of custom code layered on top of OSS Spinnaker.

As my team has mentioned before, we consume OSS Spinnaker JARs as a library and layer custom code atop. This allows us to run the same code that everyone else does while giving us full ability to add and modify functionality where we need.

First, a crash course of Spinnaker services: They’re all written on the JVM using Spring Boot. There’s a lot of power in both of those tools: If you want to do something, you can probably do it and do it relatively easily once you’re set up. Spinnaker doesn’t use a ton of the sugary add-ons of Spring Boot, but it heavily utilizes its dependency injection, which affords developers great latitude in customizing or replacing standard functionality. For example…

netflixplatform

We have a shared library, similar to kork, for integration of Spinnaker into the Netflix runtime “paved road”. Things like auto-wiring our services to send metrics to Atlas, register with Eureka, read dynamic configuration from Fast Properties, and perform secrets decryption and RPC auth with Metatron.

Most of this is accomplished through simply adding more code, then adding @Configuration classes that wire things into Spring’s Environment.

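package com.netflix.spinnaker.platform.atlas;

import com.netflix.spectator.api.Registry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
// AtlasRegistry, SpectatorContext, and AtlasPluginManager are Netflix
// library classes; their imports are omitted here.

@Configuration
public class AtlasConfiguration {
  // The registry that Spinnaker services record their metrics against.
  @Bean
  public Registry registry() {
    AtlasRegistry reg = new AtlasRegistry();
    SpectatorContext.setRegistry(reg);
    return reg;
  }

  @Bean
  public AtlasPluginManager atlasPluginManager(Registry registry) {
    return new AtlasPluginManager(registry);
  }
}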

Nothing exciting to see here: we’re just wiring up some internal libraries, but now any metrics produced by Spinnaker services will be correctly collected into our internal metric store, Atlas.

What’s especially interesting is that the customizations and extensions I’m about to outline are all enabled and wired up in similar ways: Just implementing interfaces, creating factory configuration classes, and dropping the jar into the classpath. For Java developers who have used Spring, this process should be boring levels of accessible.

Adam Jordens wrote, long ago, how we extend Spinnaker. It’s the same patterns today, even if the example repos haven’t been updated.

clouddriver-nflx

For a while, the Titus integration was internal-only. We had an entire cloud provider that was an extension. It’s now open source, but migrating it into OSS was just a lift-and-shift task.

We did all of the Clouddriver SQL backend development as an extension as well. Similar to Titus, open sourcing this work was also a lift-and-shift operation once we felt it was stable enough after running in production for a few weeks.

We also have an ElasticSearch integration for Docker. This is fairly specific to Netflix’s use cases, but it lets us more efficiently index and search for Docker tags within our registries via ElasticSearch. This is just implemented as a new CacheProvider, similar in implementation to ProjectClustersCachingAgent.

Clouddriver has the concept of preprocessors, which mutate the operations it’s supposed to perform before they execute. We’ve implemented custom checks that enforce AWS Security Groups rules, specifically around some of our own internal team security requirements. It also supports validators, which we’ve used to restrict security groups from allowing 0.0.0.0/0 ingress (requiring that these ingress rules be added through our Cloud Network team’s tooling) and to simply enforce server group name lengths to our preferences.
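As a rough illustration of the two validators described above, here is a self-contained Java sketch. It deliberately does not use Clouddriver’s real validator interfaces: the class name, method names, error messages, and the idea of passing the maximum name length as a parameter are all hypothetical stand-ins for the checks themselves.

```java
import java.util.ArrayList;
import java.util.List;

class IngressAndNameValidators {
    static final String OPEN_TO_WORLD = "0.0.0.0/0";

    /** Rejects any ingress rule that opens a security group to the whole internet. */
    static List<String> validateIngress(List<String> ingressCidrs) {
        List<String> errors = new ArrayList<>();
        for (String cidr : ingressCidrs) {
            if (OPEN_TO_WORLD.equals(cidr.trim())) {
                errors.add("Ingress from " + OPEN_TO_WORLD
                    + " is not allowed here; request it through the network team's tooling.");
            }
        }
        return errors;
    }

    /** Rejects server group names longer than the limit the organization prefers. */
    static List<String> validateServerGroupName(String name, int maxLength) {
        List<String> errors = new ArrayList<>();
        if (name.length() > maxLength) {
            errors.add("Server group name '" + name + "' is " + name.length()
                + " characters; the maximum is " + maxLength + ".");
        }
        return errors;
    }
}
```

Returning a list of messages rather than throwing on the first problem lets a caller report every violation in one pass, which tends to be friendlier for the person fixing the request.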

One cool customization we’ve created is a Lambda integration with our Security team which enforces each application gets its own AWS Instance Profile. If the Instance Profile doesn’t exist, the Lambda will create it from a blessed company default. Applications cannot use an instance profile created for another application.

Finally, we have an extension for ALBs & NLBs to auto-attach some custom security rules.

deck-nflx

Admittedly, I don’t have a lot of insight into the UI. We’ve built a HUGE amount of custom views within Deck. You’ll have to ask some frontend folks about this. 😬 Sorry!

echo-nflx

Echo handles all events within Spinnaker and is the source for all triggering of executions. Auditing is very important for Spinnaker, so we have an integration point that sprays all events to Chronos, our central SRE auditing system as well as big data portals.

The biggest integration we’ve added is a new trigger type, which integrates into our Rocket build system, an internal CI system.

fiat-nflx

We source roles for authorization from an internal source-of-truth service; this is a pretty simple implementation that provides our own UserRolesProvider.

front50-nflx

For Netflix, we require some additional validation for applications, so we’ve added additional validators that are run when someone saves an application.

We’ve also had to perform migrations on applications, pipelines, and so on. Rather than cat wrangle all of the teams, we’ve written custom Migrations that are scheduled to incrementally roll out adoption of new features or configurations without our users having to do anything.

gate-nflx

Gate has seen a lot of integrations. For a lot of the custom work done in Deck, we also have a lot of internal-only web controllers, associated services and configuration. We also have X509 auth extensions to extract additional user data from our internal certificate manager which allows us finer-grained permission control over our inbound traffic.

igor-nflx

Nothing too crazy here: Our Jenkins servers use internal security services for client auth, so we wire our own keystores into the Jenkins clients.

orca-nflx

I’m not going to count, but we have something like 15 custom stages that integrate with various internal services. For example, our Resilience team has integrated ChAP as a first-class stage. Adoption of Spinnaker at Netflix isn’t prescriptive by any means, but simple and tight integrations like this make using Spinnaker ever-more compelling.

One interesting integration is OpenConnect: Our world-wide CDN. They perform their firmware delivery to datacenters around the world through Spinnaker and that’s orchestrated through a custom integration within Orca.

We also have an integration to automate creation of JIRA tickets for releases, if necessary, so users don’t need to create custom Deploy Strategies to automate JIRA creation or resolution. In our implementation, it’s entirely invisible to the users, but you could also use Preprocessors to automatically add stages, or build entirely arbitrary pipelines: This is how Pipeline Templates (v1 and v2) are built, actually… it’s just a preprocessor.

Platform extensions

Extending Spinnaker services directly is one thing, but doesn’t tell the whole story. Netflix uses Spinnaker as the control plane for our clouds, so we do mostly API-driven traffic, too. These integrations can be summarized as domain-specific orchestrations.

Similar to Airbnb with Deployboard, a few organizations within Netflix have written services that offer a specialized view and in some cases, deep extension features atop our orchestration primitives.

Many organizations hit the Spinnaker APIs from their applications to read their own operational footprint at runtime. One of my favorite integrations here uses our API to create and orchestrate on top of an internal spot market of Instance Type Reservations. The most widely-known integration, however, is likely Chaos Monkey.

Lean Core, Fat Ecosystem

Being able to extend Spinnaker is super powerful, but it does come with requiring people to reason about a lot of Spinnaker internals: It’s a high bar and we need to level up. The new plugin model will allow for a more federated development approach, and will eventually serve a crucial role in lowering the bar to contributing to Spinnaker, both for open source and for your internal use cases. Plugins will initially be in-process JVM, but we have plans to expand plugin contracts to remote plugins (RPC, containers) in the future.

More on this stuff later as it continues to shake out, but you can get started with my epic-level proposal of Spinnaker as a Platform. Of course, you can come into #sig-platform on Slack if you’re interested in helping with early development / testing.

Spinnaker Summit 19

Are you interested in this kind of stuff? Come to Spinnaker Summit in San Diego on Nov 15–17! It’s just before Kubecon, so since you’re probably headed that direction anyway, what’s another couple days? 😄

There will be a talk from the Armory folks on plugins, and Adam Jordens and I will be giving a talk on the evolution of operations and internals of Spinnaker at Netflix.

Eager to see you there and to say hello to both new and old faces!

How Netflix Has Extended Spinnaker was originally published in The Spinnaker Community Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

My “Roadmap”: Spinnaker 2019

Rob Zienert — Fri, 01 Mar 2019 18:22:16 GMT

Back to Foundations

2019 is a year of both personal and professional rebuilding. As with any rebuilding, we need to revisit the foundations. Spinnaker is already a great product but it’s also 5 years old and some parts are starting to weather. Spinnaker can be even bester, but to get there we need to focus on the basics and rebuild the foundation where necessary to sail into the next 5 years. This isn’t an official roadmap, it’s my own — but you’re welcome to help.

A building can have all number of corners, but I like my buildings bland so my foundation has four corners:

Code health
Error handling
Extensibility
Scalability

Before we get into it:

I’ll be doing a short talk at RedisConf this year in San Francisco about Spinnaker’s use of Redis. Come say hi if you’re going! I may have swag (stickers, books).
If you’re hosting a conference and want something on Spinnaker, get a hold of me: I’ll maybe come and do a thing, or recruit someone from my team.

Code Health

Like I said, Spinnaker is 5 years old! As far as codebases go, that’s good longevity, but now we need to improve on lessons learned, clean up cruft, and address technical debt. There are a lot of areas within Spinnaker that do not spark joy. This is natural in a project experiencing rapid iteration and experimentation. Now we’ve learned lessons and need to apply those learnings more broadly!

The API, for one, is hard to use. It’s largely undocumented, inconsistent, dynamic — and as a result — aggravating. At Netflix, we have a myriad of teams who have built entire applications and platforms on the Spinnaker API and we understand how painful it can be today. Again, the API exists today as a result of experimentation, so we need to iterate on lessons learned.

I believe we can do better despite a deeply polymorphic domain model, especially now as primitive feature sets are solidifying. I’d like to see more effort put into a V3 API where strong typing, automatic documentation, and tooling are focuses where third-party programmers are the customers (because they are). People shouldn’t need to open their browser’s Network Inspector to reverse-engineer API operations.

Another couple of examples, both of which are underway:

Spring Boot 2 upgrade. Spinnaker is still on 1.5! This means we’re behind on critical bug fixes and Spring Boot 1.x is EOL this year.
Echo Scheduler refactor. Just one case of simplification that needs to happen across our services. This will not only simplify the codebase but also greatly improve the operational story of Echo.

We need to iterate our RPC story as well (gRPC), continue to incrementally remove Groovy, and improve our developer tools. A whole lot of stuff that is ripe for the community to bite off (if you so choose).

Thank you, old code, for all that you’ve done for us.

Error Handling

Stage failed (No reason provided).

We need to improve error handling. When errors happen, I want to equip customers to fix their own problem, not to come to “the Spinnaker team” to figure out why a pipeline failed. Spinnaker should give them context and guide them to resolution if the resolution can’t be automated.

We need a more standardized way of propagating errors across service boundaries and understanding the difference between operator- and user-facing errors. Furthermore, we need a better story for distinguishing between 1) system errors, 2) integration errors, and 3) user errors within code. System errors are ones where Spinnaker did something bad because of bugs (probably my fault). Integration errors come from third-party services & plugins; for an end-user, these will typically manifest as system errors, yet operators need this additional dimension.

For system errors and integration errors, we need to beef up resilience to these failures wherever possible and either retry or have fallbacks. For user errors, we should be able to directly identify the problem and offer suggestions for remediation. We also need to include more upfront validation so these runtime errors occur less frequently.
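One way to picture this taxonomy is a small Java sketch that maps each kind of error to the response described above and to what the end user would see. This is purely illustrative: Spinnaker does not expose these types, and the enum and method names are hypothetical.

```java
class ErrorTriage {
    /** The three kinds of error named in the text. */
    enum Kind { SYSTEM, INTEGRATION, USER }

    /** The broad response proposed for each kind. */
    enum Response { RETRY_OR_FALLBACK, SUGGEST_REMEDIATION }

    /** System and integration errors get resilience; user errors get guidance. */
    static Response responseFor(Kind kind) {
        switch (kind) {
            case SYSTEM:
            case INTEGRATION:
                return Response.RETRY_OR_FALLBACK;
            case USER:
                return Response.SUGGEST_REMEDIATION;
            default:
                throw new IllegalArgumentException("Unknown kind: " + kind);
        }
    }

    /** What an end user perceives: integration failures look like system failures to them. */
    static Kind userFacingKind(Kind kind) {
        return kind == Kind.INTEGRATION ? Kind.SYSTEM : kind;
    }
}
```

Keeping the operator-facing kind separate from the user-facing kind is what preserves the extra dimension operators need without leaking it into user messaging.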

Extensibility

I recently had a meeting with a company in the community that was surprised to learn that anyone can extend Spinnaker via code without upstreaming the customizations. Netflix Spinnaker is built atop OSS, but it has a lot of customizations to integrate with other systems within Netflix. This is not a special sauce only Netflix can make! We need to be more obvious and explicit about the insane amount of extensibility Spinnaker offers today. This needs to be documented well, and we need to offer guidance on how to add new extension points when necessary.

Not everyone has the time, expertise, or resources to create custom builds of Spinnaker, however. We need a better drop-in system. The preconfigured Docker stage is a step in the right direction: It allows you to write arbitrary code in a container and run it as a native stage. This needs better documentation and we need to open up other areas where it makes sense for similar extension points.

We, as a community, should work to support a marketplace of plugins that people can use to discover, download, install, and configure new features without bloating the core product. I’d love to see proposals come from the community on this. Don’t force me to start a side project. 😬

Scalability

My favorite “-ility”, so I saved it for last: Continued investment in performance, reliability, and availability. By this I mean making individual services faster, making the system more resilient to internal and external failure, and supporting a strong multi-location deployment topology.

We’ve been focusing very heavily on this foundation over the last 6–8 months, and it’ll continue to be my personal primary focus. Month-over-month, we have been getting more reliable and faster, but what can I say, it’s not good enough for me yet.

Help!

Spinnaker needs you! If any of this sounds interesting to you, I’d be happy to talk. Join a SIG or propose one if you think one needs to be created, create an RFC or help solidify one that already exists, knock out issues that you’ve found, or add enhancements you think would be valuable. Pull requests are welcome and the Reviewers and Approvers are here to help.

Interested? Come join the #dev channel on the Spinnaker team Slack!

My “Roadmap”: Spinnaker 2019 was originally published in The Spinnaker Community Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling Clouddriver at Netflix

Rob Zienert — Sun, 06 Jan 2019 23:14:37 GMT

Clouddriver is a crucially important service within Spinnaker: Its charter is to track and mutate state of the various cloud providers. If I had to rank services by importance, I’d say it is number two right behind Orca.

As far as scaling Spinnaker itself, however, Clouddriver is the most important to keep running smoothly, yet at Netflix’s scale, it is also currently the hardest to operate. In this post, I’m going to give a small crash course to Clouddriver’s architecture, its persistence, the various stages of deployment topologies we’ve gone through, and where we’re headed for the future. Strap in!

Clouddriver at 30,000 feet

In my mind, there are three parts to Clouddriver: Its cache, atomic operations, and the API.

For any Clouddriver deployment, there’s going to be dozens — if not hundreds or thousands — of CachingAgents: Background processes that index the state of the world for a cloud provider, saving the state into the cache (Redis). For example, there would be a caching agent for AWS ELBs in us-east-1 for the test account and yet another for us-west-2. While indexing, relationship graphs are kept either by explicit cloud provider relationships or implicitly via Spinnaker’s Moniker naming convention (default implementation via Frigga).

AtomicOperations are initiated via Orca business logic and perform discrete actions, such as creating a new server group, upserting security group rules, and so on. These AtomicOperations are a bit of a misnomer, however: They’re not really atomic, as there can be many underlying API requests to the cloud provider, but they should be idempotent, so they can be run more than once in the case of a transient downstream error like AWS rate limits, or in the case of a Clouddriver instance being Chaos Monkey’d.
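The value of idempotency is easiest to see next to a retry loop. The Java sketch below is not Clouddriver code: the TransientFailure exception, the upsertRule helper, and runWithRetry are hypothetical names chosen for illustration, and a real implementation would add backoff between attempts.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

class IdempotentOperationSketch {
    /** Stand-in for a transient downstream failure such as a rate limit. */
    static class TransientFailure extends RuntimeException {
        TransientFailure(String message) {
            super(message);
        }
    }

    /** Idempotent by construction: adding the same rule twice leaves one copy. */
    static void upsertRule(Set<String> rules, String rule) {
        rules.add(rule);
    }

    /** Runs the operation up to maxAttempts times, retrying only on TransientFailure. */
    static <T> T runWithRetry(Supplier<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (TransientFailure e) {
                last = e; // safe to try again because the operation is idempotent
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        Set<String> rules = new TreeSet<>();
        int[] calls = {0};
        Integer ruleCount = runWithRetry(() -> {
            calls[0]++;
            upsertRule(rules, "tcp:443:10.0.0.0/8");
            if (calls[0] < 3) {
                throw new TransientFailure("rate limited");
            }
            return rules.size();
        }, 5);
        System.out.println("attempts=" + calls[0] + ", rules=" + ruleCount);
    }
}
```

In main, the upsert runs three times because the first two attempts fail after the write, yet the rule set still holds a single entry. That is the property that makes blind retries safe.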

Finally, the API provides a thin abstraction around common cloud concepts, namely server groups, firewalls, and load balancers. These APIs are hit at all manner of frequencies and oftentimes return a lot of data.

Clouddriver’s main bottleneck is its persistence (or, really, cache). It’s a heavy read and write system and the application code has been written expecting high consistency of the underlying data. To make matters worse, Redis is not good at storing relational data; this isn’t really an issue until larger deployments but when you’re indexing millions of cloud resources, it really starts to show. You can add read replicas to Clouddriver’s topology, but due to the consistency assumptions, doing it incorrectly will result in unnecessarily higher error rates.

Scaling Clouddriver

When people first evaluate Spinnaker, I often see all services colocated on a single monolith server, with each service sharing a single Redis. That’s fine for evaluation: It isn’t great, but it’s easy to get started. The first step after that is to give every service its own dedicated persistence / cache servers (Redis, SQL, etc).

As a guiding rule, Clouddriver’s Redis doesn’t need a lot of resources: Netflix’s Redis footprint is only 40 GB to store state on millions of cloud resources. Your main scaling factor will be network throughput and, eventually, managing replica lag.

Single Clouddriver, Single Master

The most common deployment. You can scale Clouddriver’s application servers out horizontally without any additional configuration. The CachingAgents will be scheduled across the cluster automatically.

For the sake of example, we’ll say our cluster is named clouddriver.

Dedicated Caching Cluster

Running CachingAgents is an expensive, necessary operation. They use a lot of threads for logic & IO. Moving these CachingAgents to their own dedicated cluster and disabling caching on an API cluster will have two benefits:

You can change your Clouddriver deploy pipeline to first deploy caching agents to populate any new caches, then deploy an API cluster to start serving this new data. Useful when upgrading, or switching to a new Redis master.
Your API requests won’t be competing for system resources with CachingAgents, which will improve performance consistency.

We’ll have two clusters: clouddriver and clouddriver-api. In the clouddriver-api cluster, we’ll set caching.writeEnabled to false.

Again, both clusters can scale horizontally. We still have a single Redis master at this point. You won’t need many caching agent servers; just make sure you’re not running out of memory, which is mainly impacted by the number of caching agents you’re running on a server at any given time. That number is controlled by the redis.agent.maxConcurrentAgents config (we set ours to 75, YMMV).

Readonly API Clusters

With enough time and use, you may need to scale your API read requests more and route different classes of read-only traffic to their own API clusters & dedicated Redis replicas. Operationally, things get a little more interesting here. At Netflix, we currently have 7 different read replica clusters, all serving different classes or shards of data.

We’ll need to create a Redis read replica, and split the clouddriver-api cluster into clouddriver-api and clouddriver-api-readonly. We'll route requests from Gate to the readonly cluster, since no write operations go directly to Clouddriver as they’re all orchestrated through Orca!

I won’t go into how to set up Redis replicas; there’s plenty of good documentation on that. We’ll just say that the Redis master is accessible at redis-master:6379 and the replica at redis-api-readonly:6379.

Aside from changing the redis.connection config, there are no changes for Clouddriver. Instead, you’ll point Gate and any other service (Igor, for example) at the readonly cluster and Orca at the clouddriver-api cluster. The clouddriver-api cluster will still point to the Redis master.

Readonly Deck Cluster

Similar to Readonly API Clusters, having a dedicated cluster for Deck can be important if you have a lot of UI customers. Deck performs a lot of polling operations on backing services, which can add quite a lot of base overhead during working hours. Moving that traffic to its own Redis replica may help improve the quality of life of your customers quite dramatically, as well as isolate the UI from other slower Clouddriver shards and vice versa.

To do this, you’ll set up a new cluster, clouddriver-api-readonly-deck, and follow similar configuration as other readonly clusters.

To route Deck-only traffic to this cluster, Gate will need to change:

services:
  clouddriver:
    dynamicEndpoints:
      deck: https://clouddriver-prod-api-readonly-deck.example.com

That’s it. Gate knows what requests come from Deck, and will correctly route anything originating from Deck to your new cluster, while sending everything else to your other readonly API cluster.
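Conceptually, the routing decision is a lookup keyed by where the request came from, with a fallback to the regular endpoint. The Java sketch below shows that idea only. It is not Gate’s implementation: the class, the sourceApp parameter, and the way the origin is identified are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class DynamicEndpointRouter {
    private final String defaultBaseUrl;
    private final Map<String, String> dynamicEndpoints;

    DynamicEndpointRouter(String defaultBaseUrl, Map<String, String> dynamicEndpoints) {
        this.defaultBaseUrl = defaultBaseUrl;
        this.dynamicEndpoints = new HashMap<>(dynamicEndpoints);
    }

    /** Returns the Clouddriver base URL for a request that originated from sourceApp (may be null). */
    String baseUrlFor(String sourceApp) {
        if (sourceApp == null) {
            return defaultBaseUrl; // unknown origin: use the regular readonly cluster
        }
        return dynamicEndpoints.getOrDefault(sourceApp, defaultBaseUrl);
    }
}
```

With a map containing the single entry "deck", requests identified as coming from Deck go to the dedicated cluster and every other caller keeps using the default.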

Readonly Orca Cluster

Similar to Deck, if you’re doing a lot of cloud state mutations through Spinnaker, you’ll likely end up wanting to send Orca to its own dedicated Clouddriver cluster(s). Remember when I was mentioning Netflix has 7 readonly clusters? 5 of those are for different shards of Orca executions.

Orca’s request routing capabilities for Clouddriver are more advanced than Gate through (pluggable) ServiceSelectors. For any given Orca Execution (Pipelines or Orchestrations), we can use the Execution’s context to figure out what Clouddriver cluster to talk to. As of this writing, there’s a handful of different selectors you can choose from:

ByApplicationServiceSelector: Route based on the application the Execution is running against.
ByAuthenticatedUserServiceSelector: Route based on the authenticated user who started the Execution.
ByExecutionTypeServiceSelector: Route based on the Execution Type (PIPELINE or ORCHESTRATION).
ByOriginServiceSelector: Route based on whether the Execution was started via an API client, a UI user, or something else...

Let’s just pretend we have a large widget team who have a dozen or so applications that provide their organization with a more purpose-built PaaS built on top of Spinnaker’s API, so they’re doing a lot of automated orchestrations. We’ll send them to their own readonly shard. Again, just like in Readonly API Clusters, set up a new replica and a new clouddriver-api-readonly-orca-widget cluster. (Our naming conventions are long, but at least we know exactly what they are for!)

In our Orca config this time:

clouddriver:
  readonly:
    baseUrls:
    - baseUrl: https://clouddriver-api-readonly-orca-widget.example.com
      priority: 10
      config:
        selectorClass: com.netflix.spinnaker.orca.clouddriver.config.ByApplicationServiceSelector
        applicationPattern: widget.*

In this configuration, any application whose name starts with widget will be routed to this new Clouddriver cluster. All of the selector rules can be given different priorities, allowing you to build up more complex rule sets as necessary.
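To make the priority idea concrete, here’s a minimal sketch of how a set of application-pattern rules could be evaluated. This is not Orca’s implementation: the class and function names are made up, and I’m assuming that the rule with the largest priority value wins and that the pattern must match the whole application name.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorRule:
    """One hypothetical routing rule, shaped like a baseUrls entry."""
    base_url: str
    priority: int
    application_pattern: str

    def matches(self, application):
        # Assumption: the pattern has to match the entire application name.
        return re.fullmatch(self.application_pattern, application) is not None


def select_base_url(application, rules, default_url):
    """Pick the matching rule with the highest priority, else the default."""
    candidates = [rule for rule in rules if rule.matches(application)]
    if not candidates:
        return default_url
    # Assumption: a larger priority value beats a smaller one.
    return max(candidates, key=lambda rule: rule.priority).base_url


RULES = [
    SelectorRule("https://clouddriver-api-readonly-orca-widget.example.com", 10, "widget.*"),
    SelectorRule("https://clouddriver-api-readonly-orca.example.com", 1, ".*"),
]
```

With these rules, an application called widgetfactory lands on the widget shard, while anything else falls through to the lower-priority catch-all.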

Local Redis

The most recent development for scaling Clouddriver at Netflix has been the introduction of Redis replicas co-located on Clouddriver server VMs. Spinnaker at Netflix gets a tremendous amount of UI usage, enough that multiple replicas are needed. Furthermore, network latency can be a killer, so having the replica co-located means there are fewer bits shuffling back and forth across the datacenter.

There’s no fancy configuration needed for this, yet it’s the only time we’ve daisy-chained replicas: A redis-replica-deck will be the master of N local deck replicas. The clouddriver-api-readonly-deck servers just talk to localhost:6379.

We haven’t explored this model for other clusters. My guess is it wouldn’t work out so hot: API requests from Deck are okay with a bit of inconsistency; however, our API customers generally expect consistency in responses and Orca definitely expects consistency.

One interesting note about this deployment is that the servers shouldn’t come up healthy until the locally running Redis’ replication is in sync. This means that deployments of this type will take longer, but we find that a worthwhile trade-off at our scale for much better performance under load.
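One way to gate health on replication is to inspect the output of Redis’ INFO replication command on the local replica. The sketch below only parses that text; how it gets wired into a healthcheck is left out, and the function names are hypothetical. The field names (role, master_link_status, master_sync_in_progress) are the ones Redis reports for a replica.

```python
def parse_info(info_text):
    """Turn 'key:value' lines from Redis INFO output into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blanks and section headers such as "# Replication".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_in_sync(info_text):
    """True only for a replica whose link is up and not mid-resync."""
    fields = parse_info(info_text)
    return (
        fields.get("role") == "slave"
        and fields.get("master_link_status") == "up"
        and fields.get("master_sync_in_progress") == "0"
    )
```

A server would report unhealthy until replica_in_sync returns True for its localhost Redis, which is why these deployments take longer to come up.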

Launching Clouddriver Processes

As you can imagine, this topology can get fairly crazy from a YAML configuration standpoint. This is why naming conventions are so important for us: We can derive some configuration at boot-up rather than hardcoding things. We do this by customizing the bash script that launches the java process. An example of what that sorta looks like:

#!/bin/bash -x

# a bunch of env stuff like deriving a CLUSTER var from EC2, etc...

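# By default, enable the caching profile and cache writes.
CACHING="caching,"
CACHE_WRITE=true
# API clusters (anything with "-api" in the name) turn both off.
if [[ ${CLUSTER} == *"-api"* ]]; then
  CACHING=""
  CACHE_WRITE=false
fi

# Default to the Redis master; readonly clusters point at their own replica.
REDIS_CONNECTION="redis://redis-master:6379"
if [[ ${CLUSTER} == *"-api-readonly" ]]; then
  REDIS_CONNECTION="redis://redis-replica-api:6379"
elif [[ ${CLUSTER} == *"-api-readonly-deck" ]]; then
  REDIS_CONNECTION="redis://redis-replica-deck:6379"
fi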

# ... and so-on...

JAVA_OPTS="-Dspring.profiles.active=${CACHING}${ACCOUNT},${STACK} \
  -Dcaching.writeEnabled=$CACHE_WRITE \
  -Dredis.connection=$REDIS_CONNECTION"

# Launch the app...

Clouddriver Resource Usage

A note on Clouddriver resource usage, as this is a pretty common question. Clouddriver is a hungry beast; the hungriest of the services. It spends a lot of time shoveling data back and forth between Redis, serializing data and deserializing it again, etc. It’s inevitable that as your cloud footprint expands, so too will Clouddriver’s — that isn’t necessarily the case for the other services.

Unfortunately, I can’t tell you what the “right” settings are for Clouddriver because different deployments are going to have different needs. Having said that, here are some tips as well as a hand-wavy snapshot of Netflix’s Clouddriver deployment from two different environments.

Allocate more CPU cores: The more concurrent API requests or caching agents you have, the more cores you will need.
Allocate more memory: We’re processing the state of the world here. I’d say 8GB is sufficient for the smallest deployments and 16GB when you’re getting into the thousands of cloud resources? Totally pulling numbers out of the air here. ¯\_(ツ)_/¯

We run Clouddriver with the G1 collector and found better performance over CMS once we went to heap sizes of 32GB and higher.

clouddriver-main

This environment is our monolith deployment that indexes everything within Netflix. Although we could, we don’t autoscale, which means we’re typically over-provisioned. [ed: We’re OK with over-provisioning if it means we can spend more time working on and stabilizing long-term improvements.]

We run m5.4xlarge (16 vCPU, 64GB RAM) for all Clouddriver clusters. We have 6 caching servers and 36 API servers distributed across 6 different API shards. There is 1 Redis master and 7 replicas, excluding the daisy chained Local Redis replicas (of which 4 of our API servers are deployed using this configuration). The 7th replica, again, is the replica all Local Redis deploys are chained to.

clouddriver-test

One of our lower environments. As far as a deployment pipeline for Spinnaker services internally goes, this is always bleeding edge and is the environment where we first test large changes (like Orca SQL and Fiat). We only index other test accounts from this environment and it doesn’t get a lot of regular use, so this may be more representative of a smaller production deploy.

We run 4 caching servers on c3.2xlarge (8 vCPU, 15GB RAM), and have a total of 4 API servers (two clusters, a write and a readonly) on m3.xlarge (4 vCPU, 15GB RAM). For Redis, we’re running 1 master and 1 replica on r3.large (2 vCPU, 15GB RAM).

We really need to update our instance families in test; that’ll be an easy New Year’s task. I think m5.xlarge across the board would work fine in this env.

Scaling for the Future

This pattern of Redis replicas can only bring us so far and doesn’t provide a great availability story. In 2019, we’ll be exploring some different strategies to see us jump to the next magnitude of scale and availability, ideally with less operational burden. I’ll definitely write up something about the results of this when we’re closer to something presentable.

But what about rate limits?

“You didn’t talk at all about rate limits!” True, I didn’t. Clouddriver is a hungry, hungry hippo when it comes to rate limits on your cloud providers. Our caching agents in Clouddriver are polling based, which means Clouddriver is going to be re-requesting the state of the world even if nothing changes. We recently started experimenting with the Titus Streaming API internally, with the hopes of getting a streaming pattern paved for other cloud providers. Of course, my dream would be to get state changes streamed from AWS. A Netflix skunkworks project which may be of interest to the world is Historical, which aims to do exactly this, and something I’d love to spend some 20% time on integrating. I’m a dreamer and one of my dreams is that in 2019 we can have a better story for streaming updates from cloud providers where supported.

Like this post and want more? There’s an #operations channel in the Spinnaker Slack, come join and talk shop.

Spinnaker Year in Review: 2018

Rob Zienert — Thu, 13 Dec 2018 18:50:38 GMT

2018 has been a huge year of change for me, and for Spinnaker it’s no different! I’d like to recap a few of the highlights for me as it relates to Spinnaker both internally at Netflix as well as the community at large.

Governance

The most exciting change for me is OSS Governance. We’re not a part of a foundation, but governance is an important first step to that possibility: It establishes the framework for how Spinnaker will be managed in a more open manner, and it provides both a sense of security for corporations to adopt and invest and a clear path towards greater roles within the project.

We currently have two SIGs, but as time goes on, more will be created as they’re proposed!

Spinnaker Summit

We hosted our second annual Spinnaker Summit in Seattle! We took a lot of lessons away and we’ll be having it again in 2019. I’m hoping to see everyone again as well as welcome newcomers.

Didn’t make it? All of the talks are on YouTube.

Spinnaker Book

The Netflix and Google teams collaborated to create an O’Reilly handbook on continuous delivery and Spinnaker. You can download it for free or get in contact with the Netflix team in person for a hardcopy.

Major OSS Contributions

I’m going to miss some major things and I’m sorry, but these stick out to me:

Orca SQL Backend. SQL will likely be seeing an ever-increasing role in the greater ecosystem as a more stable out-of-the-box persistence layer than what we currently provide.
Amazon ECS Support. Initially contributed by the fine folks at Lookout; Amazon has since added some new functionality to the provider, including Fargate support.
Amazon Lambda Support. This just landed a couple days ago and isn’t ready for prime time, but FaaS support is on the way!
Kayenta. The Netflix and Google collaboration to bring Automated Canary Analysis to the open source world was released.
Swabbie. The successor to Janitor Monkey, Swabbie is now a first-class Spinnaker service dedicated to cleaning up resources. It currently only supports AMI cleanup, but is extensible to clean up any cloud or non-cloud resource.
Titus Cloud Provider. Titus was open sourced this year and, along with it, the Spinnaker Titus cloud provider.
Amazon Pubsub Triggers. Pipelines can now be triggered off of SNS/SQS messages.
Custom Docker Stage. You can now (as of last week!) create pre-configured stages that execute an arbitrary Docker image and consume outputs into a Pipeline’s context. It’s only been tested on Titus, but should work fine for Kubernetes, ECS, or even be adapted for Lambda.

Netflix Advancements

While this doesn’t really help the OSS community at this point, I do want to call things out because we do a lot of work outside the view of the open source community that may inspire future open source work or potentially be candidates for promotion into upstream.

OpenConnect support. OpenConnect is Netflix’s global CDN for video and metadata. This team now uses Spinnaker to orchestrate deployments of new OpenConnect Appliance firmware.
Teams Page. An internal replacement for the Projects page, which provides a higher-level view of the world, including a nice view of build provenance and delivery lifecycle. Furthermore, we also now have a cost center view that lets teams track how much each application’s server and infrastructure is costing Netflix month-over-month and calls out how much is spent on inactive server groups compared to the total.
Library support. Applications aren’t the only thing that Spinnaker facilitates delivering at Netflix: Library owners can now manage library versions, monitor usage and create deprecation campaign cycles.
Data delivery support. Last year we introduced pipeline rollout support for Fast Properties, our key/value dynamic config service. This year we’ve expanded this data delivery capability to “medium” data; delivering MB to GB of data to different systems through custom pipelines and stages.

Growing Netflix Spinnaker Team

Netflix continues to grow and invest in Spinnaker. Shocking, I know, but we’ve grown quite a lot and I want to take the time to welcome our new members from 2018.

Greg Comstock. Our new designer!
Alan Quach. UI Engineer.
Erik Munson. UI Engineer.
Eric Chiang. Engineering Manager.
Michael Galloway. Engineering Manager.
Michael Graff. Backend Engineer.
Daniel Reynaud. Backend Engineer and Doctor.
Cheryl Potter. Technical Writer.
Mark Vu. Backend Engineer. Actually doesn’t start until the new year, but welcome anyway!

We’re like, kinda big now. 3 full teams! Welcome all of you new blood, I’m elated to work with such a stunning group of people.

Armory Series A

Armory met a big milestone this year in successfully raising a Series A. This is really important for the community, as many large companies that are interested in Spinnaker want to have a corporate entity that will offer support and help them establish best practices through the organization. I’m looking forward to seeing more contributions from them and their bringing more companies into the Spinnaker community.

Looking Forward

2019 is going to be a great year for Spinnaker. Internally at Netflix, we’re currently looking to prioritize the following areas of improvement:

Reliability and performance. We’re actively working on clouddriver, but we’ll be going through all services and evaluating our architecture to ensure we’re set up for continued success into the future.
Declarative delivery. We now have an internal team dedicated to the effort, which is neat-o.
Spinnaker as a Platform. We want to make it easier to extend and modify Spinnaker without having to actually make changes to Spinnaker itself.

Spinnaker Year in Review: 2018 was originally published in The Spinnaker Community Blog on Medium.

Spinnaker Summit 2018

Rob Zienert — Sun, 30 Sep 2018 03:46:11 GMT

The second inaugural Spinnaker Summit is just a little over a week away! I’m going, a whole lot of my co-workers are going, and I hope you are too. More than that, I wanted to plug a few things.

First, Rob Fletcher and I will be speaking on Tuesday at 10:45am on Declarative Spinnaker. There’s a panel talk by Google on Monday about Spinnaker and Borg: It’s quite related. I’d recommend you attend both if you’re into the declarative delivery scene.

Second, I’ll be at two Office Hours on Monday. The first is on Operationalizing Spinnaker at 2:15pm, and the other on Declarative Delivery at 3:30pm.

That’s all of my written responsibilities for the Summit. I’ll be attending a handful of talks, but if you see me, say hello. I’d love to talk about how you currently use or want to use Spinnaker, competitors in the space, missed opportunities, successes and woes, and future roadmap stuff. We’re also hiring for the Netflix Spinnaker team; you can talk to me about that, too. You’ll be able to find me by the Netflix swag and how I look like me:

If I survive the night, of course.

I’m really excited for my talk to be done. I don’t know why conferences always put me on the morning after the party. Rude.

I haven’t posted many technical bits on Medium in a while. I’m laser focused on reliability of Spinnaker, namely on an active/active multi-region story. There’s a lot to talk about there but nothing yet I’m willing to lift out of a proposal and into a blog.

Adding SQL to Spinnaker

Rob Zienert — Wed, 15 Aug 2018 20:06:15 GMT

In Q2 and Q3, the Netflix Spinnaker team worked on developing and releasing a new SQL storage backend for Orca, our orchestration service.

Strap in, this may be my longest post yet. My goal in this post is to outline end-to-end how the sausage is made on some larger efforts within Spinnaker, as well as sort of advertise how extensible it can be. In this post, I’ll be going over why we did this as well as how and why we did it the way we did, then put a bow on it with some retrospection.

First, a big thanks to Asher and Chris for their help!

Background

To date, Spinnaker has used Redis for its primary datastore across all microservices. Admittedly, it’s an odd choice, so why did we do it that way? To keep complexity down: Spinnaker is a large enough beast as it is and requiring a handful of databases to get started would be detrimental to evaluation and onboarding. That said, I don’t think any core committer believes that Redis is the right tool for the job in a lot of cases. Redis’ strengths and shortcomings, specifically around resilience, are well understood.

Spinnaker stores all active and historical execution (pipelines and tasks) data within Orca; losing Orca’s Redis causes running executions to become orphaned (thereby failing) and all history of what has happened in the past to be lost. It’s highly disruptive to customers and can potentially cause production incidents. Having Redis as a backing datastore for Orca is a ticking time bomb that lies in wait to ruin your day when you just want to be productive. We’ve gotten quite good at operationalizing our own Redis, but our implementation isn’t very transferable to the rest of the open source community, nor does Redis lend itself well to our regionally active-active efforts. On top of all this, even though we are good at Redis operations at Netflix, it doesn’t escape the fact that Redis simply isn’t the right tool for the job in a production environment.

Goals

Execution state of Orca needs to be consistent and persistent but we can deal with some availability loss: We’re okay with a little bit of degraded service, but not okay with data loss.

The database we chose also needed to be managed. We run our own Redis instances because we found operating them ourselves far more reliable than Elasticache. Unfortunately, this higher reliability comes at the cost of more operational burden. Our new solution must be a managed service, which narrowed things down to basically RDS, DynamoDB and Cassandra (Netflix’s Cloud Database team manages Cassandra as a service).

Performance must be as good as or better than Redis. Personally, I set myself a goal of a 10x read latency improvement. For larger customers, who create tens of thousands of executions a day, getting execution history via Redis was painful.

Lastly, whatever we do should be beneficial to the open source community. We open sourced Spinnaker to benefit from innovations from the community, but we must also continue to give back!

Why SQL?

When I first used Spinnaker when it was open sourced (I’m a community hire), I had a lot of questions, many of them shared by the newcomers today, such as, “Why Redis?” I’ll always answer, more or less, like I did above. But why SQL? A common suggestion by people in Slack is, “What if we used Redis for running executions, and then shoveled executions to S3 for cold storage?” A well-intentioned idea, but S3 is prohibitively slow.

Being beneficial to the OSS community at large immediately ruled out something like DynamoDB. Something like Cassandra could still very well meet our goals, but it is non-trivial to manage (coming from someone who made their paycheck managing Cassandra for a couple years).

SQL knowledge is fairly ubiquitous, checks our core requirements, and has the added benefit of query flexibility, which Cassandra doesn’t really offer. Query flexibility isn’t a hard requirement, but it does give us and the community some room to develop internal-only schema extensions fairly easily. On top of this, offerings similar to RDS or Aurora check the box on reliability for us.

All this said, I personally would not reject a well-considered Cassandra implementation. I think Cassandra could definitely have a place in Spinnaker’s ecosystem long-term, but it can be a burdensome leap.

Design & Implementation

Taking lessons learned from my work on making clouddriver Dynomite-capable, I elected to do all of the SQL development behind closed doors. Doing so would allow me to tighten the development loop and keep broken code away from the community while we iterate and allow us to deliver a hardened implementation to GitHub. This also had the added benefit of forcing me into putting stronger consideration into making the persistence backend truly pluggable: With this work, you can now provide your own storage backend, with TCKs to validate your logic against if Redis or SQL don’t match your needs¹.

The first few days of development were spent removing Orca’s core dependency on Redis. I also took this as an opportunity to change the ID generator from UUIDv4 to ULID. This was done so we could have the option of sorting executions by their ID rather than a separate column. With this work out of the way, we were in a position to start developing and deploying SQL without cluttering OSS releases.
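If you haven’t run into ULIDs before, the trick is that the first 48 bits are a millisecond timestamp and the remaining 80 bits are random, all encoded as 26 Crockford base32 characters. Because the timestamp leads, sorting the strings sorts by creation time. Here’s a small standalone sketch of the encoding; it’s for illustration and isn’t the library Orca uses.

```python
import os
import time

# Crockford base32: digits, then letters without I, L, O and U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms, randomness):
    """Encode a 48-bit millisecond timestamp plus 10 random bytes."""
    if not 0 <= timestamp_ms < 2 ** 48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid():
    """Generate a ULID for the current time."""
    return encode_ulid(int(time.time() * 1000), os.urandom(10))
```

An ID minted one millisecond later always sorts after an earlier one, no matter what the random bits are, which is what makes ordering by ID a stand-in for ordering by start time.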

We elected to use MySQL, specifically Aurora MySQL, as our initial target. Why MySQL and not PostgreSQL, which is “clearly” better? Well, for Aurora, Postgres really isn’t better: The MariaDB JDBC driver is the only Aurora “smart client” which takes care of the legwork around clustering and master failovers, meaning we didn’t need to write and maintain that logic. That alone wins it for me. Aurora is not MRMM (Multi-region multi-master) like Spanner is, but we’ll be able to fail over masters in the same region and perform our own regional failovers if need be, which is a good step in the right direction².

The SQL schema will raise SQL purist eyebrows so I’ll get it out of the way now: We serialize objects as JSON³ and store them as blobs! Executions are stored as a pair of tables: one for the root execution object and another for that execution’s stages. For both tables, the only additional columns we have are for indexing and making performant lookups.
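To give a feel for that shape, here’s a toy version using Python’s built-in sqlite3 module so it runs anywhere. The table and column names are invented for illustration and are not Orca’s real schema, and SQLite stands in for MySQL. The point is the pattern: a JSON body column plus a few columns that exist purely for lookups.

```python
import json
import sqlite3


def create_schema(conn):
    """Two tables: one row per execution, one row per stage."""
    conn.executescript("""
        CREATE TABLE executions (
            id          TEXT PRIMARY KEY,
            application TEXT NOT NULL,   -- lookup column
            status      TEXT NOT NULL,   -- lookup column
            body        TEXT NOT NULL    -- serialized JSON
        );
        CREATE INDEX executions_app_idx ON executions (application, id);
        CREATE TABLE stages (
            id           TEXT PRIMARY KEY,
            execution_id TEXT NOT NULL REFERENCES executions (id),
            body         TEXT NOT NULL   -- serialized JSON
        );
        CREATE INDEX stages_execution_idx ON stages (execution_id);
    """)


def store_execution(conn, execution):
    """Serialize the root object and each stage into their own rows."""
    root = {k: v for k, v in execution.items() if k != "stages"}
    with conn:  # one transaction for the execution and its stages
        conn.execute(
            "INSERT INTO executions VALUES (?, ?, ?, ?)",
            (root["id"], root["application"], root["status"], json.dumps(root)),
        )
        conn.executemany(
            "INSERT INTO stages VALUES (?, ?, ?)",
            [(s["id"], root["id"], json.dumps(s)) for s in execution["stages"]],
        )


def load_execution(conn, execution_id):
    """Hydrate the whole execution, stages included."""
    row = conn.execute(
        "SELECT body FROM executions WHERE id = ?", (execution_id,)
    ).fetchone()
    if row is None:
        return None
    execution = json.loads(row[0])
    stage_rows = conn.execute(
        "SELECT body FROM stages WHERE execution_id = ? ORDER BY id",
        (execution_id,),
    ).fetchall()
    execution["stages"] = [json.loads(r[0]) for r in stage_rows]
    return execution
```

Everything the application cares about lives inside body, so adding a field to an execution is an application-level change rather than a schema migration.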

This design was chosen to keep most of the schema in the application layer, where we already have well proven patterns for performing live data migrations and support for proprietary data extensions. Furthermore, the data stored within an execution and its stages varies wildly from one to another (and based on cloud provider) and we didn’t want to go down the rabbit hole of a bunch of data modeling and, as a result, more complex queries.

The costs of serializing objects and having denormalized data are worthwhile to us. Due to Orca’s data access patterns, we also don’t need to worry about write conflicts or data corruption by two processes writing to the same rows concurrently. Orca is not currently written to support partial object loading, so we’ll be hydrating an entire execution just like we are doing today in Redis-land. We’ll probably need to add support for this in the future, however, so that we can make UI loading processes more efficient, but we don’t see the current schema strategy blocking this effort.

From a programming perspective, I have a very strong distaste for Hibernate and anything that tries to build queries for the developer. I chose jOOQ, as it provides a typesafe query building API. I steered away from using the code generation stuff jOOQ optionally provides, which I feel wouldn’t provide much benefit considering our odd schema. JDBIv3 was also briefly considered, but not having an abstraction over the actual SQL query generation would’ve made adding other SQL vendors besides MySQL more difficult. Database migrations are handled via Liquibase. The default connection pool is HikariCP, but it’s optional. We don’t use it at Netflix, since it doesn’t play well with MariaDB’s Aurora functionality; instead, we use MariaDB’s built-in connection pool. We won’t be shipping MariaDB with Orca when we open source the SQL module due to conflicting licenses, but I’ll write some documentation on how we wire up MariaDB or just make a separate GitHub repo showing our exact code.

SQL and the Work Queue

The work queue (Keiko) will still be backed by Redis, and I have no desire to back the work queue with SQL. This means that even when using SQL, you’ll still have a Redis instance. From an operations perspective, the work queue is transient data so we don’t need to lose sleep over losing the Redis, but I’d still like to get rid of operational burden wherever possible, so I’m working on an SQS backend for Keiko.

This topology does raise a couple new failure scenarios, however:

1. Connectivity to SQL fails, but not Redis.
2. Connectivity to Redis fails, but not SQL.

In the event that SQL connectivity is interrupted, Orca will automatically stop processing work off the queue on a per-process basis. Orca will start processing again automatically once SQL connectivity is re-established. This is done via a write healthcheck (so it’ll also detect master failovers) similar to how a load balancer works. Of course, there will be some messages that will fail in the timeframe that Orca is discovering its failure condition, but the queue already handles message failure & redelivery well.
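The behavior is easier to picture as a tiny state machine: a periodic write probe flips a flag, and the queue poller checks that flag before taking any work. The sketch below is a simplification with hypothetical names, not Orca’s code; the write probe is whatever cheap write you can attempt against the database.

```python
from collections import deque


class HealthGatedWorker:
    """Only takes work off the queue while the last write probe passed."""

    def __init__(self, write_probe, handler):
        self.write_probe = write_probe  # callable that raises if SQL is unwritable
        self.handler = handler          # callable invoked for each message
        self.enabled = True

    def check_health(self):
        """Run the write probe; a read-only or unreachable DB disables us."""
        try:
            self.write_probe()
            self.enabled = True
        except Exception:
            self.enabled = False
        return self.enabled

    def poll(self, queue: deque):
        """Process one message if healthy; leave the queue untouched otherwise."""
        if not self.enabled or not queue:
            return False
        self.handler(queue.popleft())
        return True
```

Because the probe is a write, a master failover that leaves this process attached to a read-only node gets caught the same way a dropped connection does. Messages that fail between the outage and the next probe rely on the queue’s existing redelivery.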

When Redis is lost, Orca will stop processing work because the queue is gone. New work will continue to be accepted, but will remain in a NOT_STARTED state. When connectivity is re-established, the queue will be empty and any NOT_STARTED executions will begin immediately. For this scenario, I wrote a hydrate queue admin API, which will allow you to re-populate the queue based on the existing state in SQL.
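Conceptually, hydrating the queue means walking the executions that SQL says are in flight and putting a message back on the queue for each one that no longer has one. Here’s a heavily simplified sketch; the message shape and function name are hypothetical, and the real thing has to work out which stage or task each execution should resume from, which I’m skipping entirely.

```python
from collections import deque


def hydrate_queue(executions, queue: deque):
    """Re-enqueue a message for every RUNNING execution missing from the queue.

    executions: iterable of dicts with "id" and "status" keys (from SQL).
    Returns the ids that were re-enqueued.
    """
    queued_ids = {message["execution_id"] for message in queue}
    hydrated = []
    for execution in executions:
        if execution["status"] != "RUNNING":
            continue  # finished or not-yet-started work needs no rescue here
        if execution["id"] in queued_ids:
            continue  # already has a message, don't duplicate it
        queue.append({"execution_id": execution["id"], "type": "resume"})
        queued_ids.add(execution["id"])
        hydrated.append(execution["id"])
    return hydrated
```

Running it twice in a row is harmless, since the second pass finds every running execution already represented on the queue.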

Migration

For Netflix, bringing Orca down (and thus, Spinnaker) to migrate from Redis to SQL is a non-starter, so we needed to build a way to perform a zero-downtime migration. We already had agents to active-active migrate from one Redis master to a new one and the RedisExecutionRepository knew how to route read & write traffic accordingly while in a dual write mode.

When running a migration, currently active executions continue going to the old Redis, with new executions going to the new Redis. Executions are only shoveled over to the new Redis instance once the execution has a completed state (be it successful, terminal…). I like this strategy a lot: It’s easy to reason about and works in both a roll-forward and rollback scenario.

I refactored the migration agents to be ExecutionRepository implementation agnostic, so now they support migrating from anything to anything following the same pattern. I also created a backend agnostic DualExecutionRepository which has the same routing characteristics I outlined above. You can migrate from Redis to SQL, SQL to Redis or SomethingElse to SomethingElse2.
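Here’s a stripped-down sketch of that routing and the shovel step, using in-memory dicts in place of real backends. The class names, method names and the set of completed statuses are illustrative assumptions, not the actual ExecutionRepository interface.

```python
# Assumed set of "done" statuses for this sketch.
COMPLETED = {"SUCCEEDED", "TERMINAL", "CANCELED"}


class InMemoryRepository:
    """Stand-in for any backend (Redis, SQL, SomethingElse)."""

    def __init__(self):
        self.executions = {}

    def store(self, execution):
        self.executions[execution["id"]] = dict(execution)

    def retrieve(self, execution_id):
        return self.executions.get(execution_id)

    def delete(self, execution_id):
        self.executions.pop(execution_id, None)

    def all(self):
        return list(self.executions.values())


class DualRepository:
    """Routes between the new (primary) and old (previous) backends."""

    def __init__(self, primary, previous):
        self.primary = primary
        self.previous = previous

    def store(self, execution):
        # Executions that already live in the old backend stay there.
        if self.previous.retrieve(execution["id"]) is not None:
            self.previous.store(execution)
        else:
            self.primary.store(execution)

    def retrieve(self, execution_id):
        found = self.primary.retrieve(execution_id)
        return found if found is not None else self.previous.retrieve(execution_id)


def migrate_completed(dual):
    """Shovel finished executions from the old backend to the new one."""
    moved = []
    for execution in dual.previous.all():
        if execution["status"] in COMPLETED:
            dual.primary.store(execution)
            dual.previous.delete(execution["id"])
            moved.append(execution["id"])
    return moved
```

Since an execution is only ever written to one backend at a time, rolling back is just swapping which repository is primary and running the same shovel in the other direction.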

Originally we had written a migrator that did live migrations and used some fancy write conflict resolution but when push came to shove, I rewrote the migrator last week for simplicity. Migration went from a fairly complex orchestration to, “Run this other Orca instance in migration mode until it’s not doing work anymore, then shut it down.”

Performance

As I mentioned earlier, we wanted performance to compete with Redis and I had a personal goal to 10x the read performance (I like the 10x principle a lot when it comes to refactors). Orca’s write performance isn’t really a huge deal: At peak, we only do hundreds of writes per second but thousands of reads. To compare apples-to-apples, I wrote an instrumented execution repository so we could compare the two strategies end-to-end, rather than only at driver call-sites.
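The idea behind the instrumented repository is plain delegation: wrap whichever backend you’re measuring and time every method call from the caller’s point of view, so serialization and multi-command round trips are all included. A minimal sketch, with hypothetical names and an in-process list standing in for a real metrics client:

```python
import time
from collections import defaultdict


class InstrumentedRepository:
    """Wraps a repository and records how long each method call takes."""

    def __init__(self, delegate, name, clock=time.perf_counter):
        self._delegate = delegate
        self._name = name      # e.g. "redis" or "sql", used as a metric tag
        self._clock = clock
        self.timings = defaultdict(list)  # (name, method) -> [seconds, ...]

    def __getattr__(self, method_name):
        # Only reached for attributes not defined on the wrapper itself.
        target = getattr(self._delegate, method_name)
        if not callable(target):
            return target

        def timed(*args, **kwargs):
            start = self._clock()
            try:
                return target(*args, **kwargs)
            finally:
                # Recorded even when the call raises.
                elapsed = self._clock() - start
                self.timings[(self._name, method_name)].append(elapsed)

        return timed
```

Wrapping both backends with the same class and different names gives you two series of timings for the same logical operations, which is what makes an overlay of the two fair.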

Is this the SQL impl or Redis? Well, both. It’s the DualExecutionRepository.

Redis is fast: Not many would argue this, but it isn’t fast for how Orca’s schema is designed. To retrieve a single Execution by ID, it takes between 2 and 5 Redis commands: That’s round-trip, non-pipelined commands. If you’re trying to get the most recent execution for an application, it’s another handful of round-trip commands. This roundtrip charade really starts to add up when you need to load a bunch of executions (even with the pipelining we do for list operations).

For SQL read operations, it’s one indexed query, which blows Redis’ implementation out of the water, especially when your requests ask for more executions. We’ve seen improvements up to 20x in various 99.9th-percentile operations. Pages like the Tasks view, which for our larger customers may have 30,000 executions, dropped from loading in 40+ seconds to ~4 seconds.

It’s not all rosy, though. Read performance still needs love. Object serialization costs a lot. In a separate, future effort I’ll be looking to switch from JSON to a binary format that is faster than Jackson and will come with the added benefit of less network transfer time.

When I deployed SQL, I took a moment to overlay RedisExecutionRepository invocation timings against the SqlExecutionRepository. This graph is rendered logarithmically. This is comparing Redis (red) vs SQL (blue), both of which are handling the same volume of production traffic.

red: Redis, blue: SQL

Testing & Deployment

I’ve really enjoyed testing this work. Since this was a project born out of reliability, I got to break things a lot, and who doesn’t like taking a hammer to things you’ve worked on?

Aside from the TCK work, my first objective was getting SQL deployed into our test environment. We essentially have 3 primary environments: Test, Staging and Main. No one really cares about Test, you could think of it as our long-lived experimental environment. I deployed here first once we had all of the TCK methods passing and things were generally working on my laptop. The migration code was just being started at this point so we just deleted all of our old execution data. This environment caught a whole host of bugs and also let me kick the tires on reliability tests while we iterated on migration strategies.

Using the spinnaker-performance repository and a bunch of 1-minute cron pipelines, I was able to sustain ~1200 active executions of varying complexity (from simple wait stage pipelines to full server group deployments). During this time I would pull the plug on random things and watch server logs, Atlas and poke around in the UI & API. My main questions were:

What happens when I failover the MySQL master?
What happens when I blackhole network traffic to MySQL or Redis?
What happens when I flush MySQL or Redis?

Eventually, all of the bugs that were coming up got squashed and I was able to focus more on making things faster, specifically around failover behavior and minimizing impact. Once we had the migration code written, I was able to go through this whole process again at varying stages of the migration.

We deployed to the staging environment with the migration code. Staging is on the critical path for prod deployments and is where all of our automated testing is continually run. Throwing the performance repo at this environment plus the automated and manual testing that’s always happening here was able to sniff out some more bugs. Once the environment was stable, sometime in late May, I called for running in staging for all of June: If there were no major bugs, we’d go to production in July. It took longer than that. My co-worker Asher was a huge help while we were in staging helping tune things.

While we were doing this work, a new customer was asking us for SLA numbers on new execution throughput. I had some fun thrashing our staging environment (much smaller than production):

“…sending 100 pipelines per second over 120 seconds … Of the 11,992 pipelines submitted over 120 seconds, 100% of them were successfully received and executed. During this test, we started with 11 executions running and peaked at 5,630 active executions...”

I later turned it up to 11 for my own late-night giggles until the system finally caved at 45,000 active executions (almost an order of magnitude more than our highest production peak).

Remaining Work

The SQL work isn’t open source yet. What gives? Well, there’s some more work yet that I want to do before delivering it.

I want to let SQL bake in our production env longer. Asher and I are currently iterating on some better monitoring and tackling some rough edges which I want buttoned up first.

There’s an ungodly number of undocumented features that we’ve delivered, but this is one that I want to deliver with a bow on it. This means docs for setup, our Aurora/MySQL configuration, MariaDB driver setup, operations, and monitoring. This is a lot of work, but fortunately it’s mostly a matter of making stuff we already have public-friendly. I’ve also never done any work with Halyard, so I need to learn how to do that and get configurations exposed in there for you Halyard users.

Retrospective

If I were doing this all over again, I’d definitely do it all behind closed doors again: We were able to make aggressive changes to implementation without worrying about early adopters.

The schema we chose was one of familiarity and speed of development. However, if I were to do it again, I’d likely switch to an insert-only, event sourcing model, snapshotting aggregates on execution completion. I wouldn’t rule out this design being a thing in the future.

Wrap Up

Hopefully this was an interesting read for you. When can you expect it for use? I’m planning to open the initial set of PRs into Orca in late August / early September.

So, the rest of Spinnaker is still running on Redis. Are we looking at using SQL anywhere else? Maybe. We’ve kicked around the idea for Front50 but haven’t committed to anything. Make no mistake, I still love Redis and think it’s the right fit in quite a few places within Spinnaker, even as we continue to make more architectural adjustments in the future!

Footnotes

You will see this level of support in future platform changes that we’re making across Spinnaker. I suspect it’s an inevitability that Spanner support will come in the future as well.
For the multi-region story, we’ve written the schema to support regional partitioning. We haven’t implemented this bit yet.
JSON serialization is slow. I plan to switch this to a form of binary serialization in the not-too-distant future.

Dev Journal, Ed. 6

Rob Zienert — Thu, 02 Aug 2018 16:35:59 GMT

It’s been a while since my last update. If I’m not telling people what I’m doing, am I really doing anything at all? Let’s talk Spinnaker.

Q2, Q3

I’ve been in NYC this week hanging out with the Google Spinnaker team and meeting users who are based out of NYC. There are a lot of common strands in the feedback that I get, mostly around performance and observability. That’s good, I think, because it validates what I believe is important for me to focus on above all else.

As I mentioned in my last post, performance and reliability weigh heavily on our priorities. We’ve done a lot of great work here and even this week had a nice win on the performance side of things just via configuration: By reducing our Redis batch command size, we tightened our Redis replication lag in Clouddriver so performance is more predictable. It’s not necessarily faster, yet predictability goes a real long way for reliability.

Specifically, we dropped multi-op size from 500 to 200, scan size from 25000 to 200, batch size from 5000 to 200. You can find these configs here: RedisCacheOptions (“caching.redis.batchSize” for example).
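To make the trade-off concrete, here is a minimal sketch of what a smaller batch size means in practice. It is not Clouddriver’s code: `chunked` and `command_count` are hypothetical helpers, and the 5,000-key workload is made up for illustration. The only numbers taken from above are the old and new batch sizes.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def command_count(total_keys: int, batch_size: int) -> int:
    """Number of batched commands needed to cover total_keys (ceiling division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return -(-total_keys // batch_size)


if __name__ == "__main__":
    keys = [f"key:{i}" for i in range(5000)]  # hypothetical workload
    for size in (5000, 200):  # old and new batch sizes from the text
        batches = list(chunked(keys, size))
        print(f"batch size {size}: {len(batches)} commands, "
              f"largest has {max(len(b) for b in batches)} keys")
```

With the old batch size the whole write goes out as one large command; with the new one it becomes 25 smaller commands. Redis executes commands one at a time, so each smaller command occupies the server for less time, which is presumably where the improved predictability comes from.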

Q2 had a lot of wins, but we’re not calling the successes of that quarter good enough. What about now?

Organizational changes

The Netflix Spinnaker team has been growing a lot: we’re at 20 people now and still growing. We could be an entire company unto ourselves: 3 managers and 17 engineers. It’s often repeated that if we doubled our size, we’d still not have enough people to get everything done that we want to. I don’t think it’s hyperbole, and it’s a great problem to have, yet with all this growth come new challenges.

For Q3, we’re experimenting with a new internal org structure where we’re now three focused teams: Experience (design, frontend), Integrations (feature development, ancillary service ownership) and Platform (core architecture and operations). Previously, we were split into two teams, experience and backend. What we’ve found is that effectively operating Spinnaker and fielding incidents requires deep understanding across the breadth of the product. It’s unreasonable to expect everyone on the team to carry the burden of this knowledge, while also being expected to move forward on product commitments. Triaging inbound requests and metering what we commit to requires having a laser focus, so having a team dedicated to building functionality atop a solid foundation is paramount to continued success and ensuring customers feel their needs are taken care of.

I’ve landed on the platform team and as a result passed off technical ownership of the Declarative Spinnaker effort to Rob Fletcher. I’m really happy with this transition specifically because he’s technically a lot more capable than I am and the end result will be better because of that. That frees me (and my team) up to focus on core technical direction and implementation.

Whether or not this structure stays long-term is purely dependent on whether the team as a whole thinks it’s working well. Not everyone is as enthusiastic as I am about it (there are other good candidates for how to structure ourselves), but it’s something to try and we’ll iterate until we land on something right. Incremental improvement and iteration aren’t just for the things we work on, they’re for how we work and organize ourselves, too.

Stuff I’m Doing

Most of my hands-on-keyboard time has been with Orca SQL. I have a rather long post queued about the development of this feature, the design decisions, lessons learned and all that for later (I want to release the feature first so no last-minute changes void the post). It’s important to note again that Redis support isn’t going anywhere: SQL is optional.

Using Redis as the primary (persistent!) datastore in Spinnaker raises eyebrows. And why shouldn’t it? It’s not in Redis’ wheelhouse to be used the way we use it. Although, tangentially, I think it’s a wonderful testament to its power and flexibility.

Why then did we make such an odd choice? It’s versatile and easy to get running. Spinnaker is involved enough to get going on its own; people shouldn’t have to spend hours getting this-and-that database set up just to check Spinnaker out. That said, it’s obviously not a fit for everything. Orca is the first of the services where I’m breaking any hard dependencies on Redis to allow providing new data backends: Expect at some point in the future more movement in this space for the other services.

Observability

I need to find a better way to articulate this, but I hold a belief that if you’re not monitoring your system well, you don’t really have a production (suitable) application. We’ve been making incremental improvements to our monitoring and I’m currently working on an SLO dashboard on a per-service basis. In doing so, I’ve been finding gaps. One such facepalmy gap is that while we have a metric for Orca task status, we don’t have one for overall execution status. So we know the percentage of successful tasks inside of an execution, but no realtime insight into overall execution success rates across a Spinnaker deployment.

It was asked, “Why would we want to track execution status as a system-level metric? Executions can fail for all sorts of reasons that aren’t a result of something wrong in Spinnaker, like a user not configuring a pipeline correctly.” Indeed, but if we were to assume 90% of our pipeline executions are successful even with user errors, and adding this metric showed us we’re actually closer to 50%, it would be a strong indicator that we need a concerted effort to improve error messaging, validation, and diagnostic tools to help users get their jobs done. Furthermore, it’d be interesting to see how, if at all, performance regressions in other services impact execution success rates. Although, I would argue we need better error messaging, validation and diagnostic tools even without having data to back me up: I use Spinnaker, too. :)
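The gap between task-level and execution-level success is easy to see with a toy calculation. This is only a sketch with made-up data: the sample executions are hypothetical, and it assumes (as a simplification) that an execution succeeds only when every one of its tasks succeeds.

```python
from typing import Dict, List

# Hypothetical sample: execution id -> per-task outcome (True = task succeeded).
SAMPLE: Dict[str, List[bool]] = {
    "exec-a": [True, True, True, True, False],
    "exec-b": [True, True, True, True, True],
    "exec-c": [True, True, False, True, True],
    "exec-d": [True, True, True, True, True],
}


def task_success_rate(executions: Dict[str, List[bool]]) -> float:
    """Fraction of all tasks, across all executions, that succeeded."""
    tasks = [ok for results in executions.values() for ok in results]
    return sum(tasks) / len(tasks) if tasks else 0.0


def execution_success_rate(executions: Dict[str, List[bool]]) -> float:
    """Fraction of executions in which every task succeeded (simplifying assumption)."""
    if not executions:
        return 0.0
    succeeded = sum(1 for results in executions.values() if all(results))
    return succeeded / len(executions)


if __name__ == "__main__":
    print(f"task success: {task_success_rate(SAMPLE):.0%}")
    print(f"execution success: {execution_success_rate(SAMPLE):.0%}")
```

In this sample, 18 of 20 tasks succeed but only 2 of 4 executions do, so a healthy-looking task metric can hide a much worse experience for users.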

It’s all well and good that we’re doing this, but I realize it’s not immediately helpful to community members if we add all of this telemetry but don’t document what we’re adding, what the metrics mean, and which ones are important for end-users. Maybe it’s totally cool that a metric spikes every now and then, but maybe it’s catastrophic: without an understanding of the internals there’s no way to make an informed decision. I will not be making a Part 2 of my Monitoring Spinnaker series, but instead will work to aggregate all of this information into spinnaker.io (where it should be) as operator guides for the array of services we have.

Active-Active

If you’re unfamiliar, active-active is the way we talk about a multi-regional deployment where N regions are actively taking and processing live traffic. In such a world, if a region falls over, we can fail out of the bad region and happily continue our work in the other active region(s).

As it exists today, Spinnaker is not capable of an active-active topology, yet this is one of the things we’re currently designing. One goal that we’re tentatively going after is a locality concept, where if your operation is actuating a server group in us-east-1, the Spinnaker deployment in us-east-1 would be the preferential system to perform that action. I’m really excited for this work, as it will only help to increase people’s confidence in Spinnaker as a system they can depend on to work when they need it, regardless of whatever fresh hell has graced our or our dependencies’ systems.
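As a thought experiment, the locality preference could look something like the sketch below. Nothing here reflects an actual design: `pick_deployment`, the health map, and the second region name are all hypothetical, and the fallback order is simply the order the deployments are listed in.

```python
from typing import Dict, Sequence


def pick_deployment(target_region: str,
                    deployments: Sequence[str],
                    healthy: Dict[str, bool]) -> str:
    """Prefer the Spinnaker deployment in the target's region, else any healthy one."""
    if target_region in deployments and healthy.get(target_region, False):
        return target_region
    # The local deployment is missing or unhealthy: fail out to another region.
    for region in deployments:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy Spinnaker deployment available")


if __name__ == "__main__":
    regions = ["us-west-2", "us-east-1"]  # us-west-2 is a made-up second region
    print(pick_deployment("us-east-1", regions,
                          {"us-west-2": True, "us-east-1": True}))
    print(pick_deployment("us-east-1", regions,
                          {"us-west-2": True, "us-east-1": False}))
```

The first call stays local to the server group’s region; the second falls back to the other active region because the local deployment is marked unhealthy.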

We won’t get there overnight. I’m currently working on a story working backwards from our desired end goals to introduce incremental improvement until one day we flip a bit and ta-da, a truly active-active system.

Wrapping Up

A small plug: The Slack community for Spinnaker is booming with over 5000 members. Even just a year ago it was roughly half that. You should join if you’re not there already. If you’re interested in chatting with me and like-minded people, we’ve recently split off a new #operations room to get more focused discussion on operating Spinnaker now that you have evaluated and gotten past the setup aspect. I’d love to long-form an answer for you if I can. :)

Thanks for reading!

Dev Journal, Ed 5.1

Rob Zienert — Fri, 04 May 2018 22:44:53 GMT

Q2 has been all about performance and reliability improvements for the Netflix Spinnaker team. This may come as a surprise after my last post, where I said I’d be focusing on Declarative.

First, a note on 5.1. Last weekend I wrote an Ed 5, then deleted it shortly after publishing. The summary of it was essentially outlining how we’ve had to de-prioritize feature development in favor of strict focus on performance, reliability and overall quality improvement. I deleted the article because I felt the post needed to be reworked and re-presented; I also wasn’t enthralled with how little I actually said. Take two!

As I’ve mentioned in previous posts, Netflix Spinnaker is always teasing the boundaries of what Spinnaker is capable of; we regularly orchestrate hundreds of thousands of pipelines & tasks every day and these must be dependable. New and old customers will start exercising our product in new ways and volumes we haven’t yet seen, which leads us to put out inevitable fires. This rather painful process for us is one of the tremendous value propositions for adopting Spinnaker for your own organization: We are constantly battle-hardening against Netflix’s ever increasing scale.

In a blossoming startup’s lifecycle, there (hopefully!) comes a time when the focus needs to transition from driving customer acquisition through feature development to focusing on quality. Unfortunately, we have stayed focused on feature development for too long to the detriment of our customers through gradually degrading reliability. While we continue to eat new worlds within Netflix, we’ve certainly achieved and surpassed critical mass. Performance, reliability and quality are now the most important features above all else.

We’ve recently been having more and worse performance problems, which have occasionally cascaded into reliability issues that could lead to impacting Netflix-proper, so we’ve refocused to pay down some long-needed tech debt, working on tactical changes in order to focus on bigger, more strategic architectural and team process changes that will pave the road for continued success both within and for the company.

Without further ado…

What’s Wrong?

Some hard things and some easy things. Andy Glover, my boss, mentioned that it’s easy to replace an engine when the car is stopped, but a different matter altogether when the car is already running and can’t stop.

An example of an easy thing to fix is that some of our non-core services aren’t highly available: Igor and Echo have configurations that can’t run with multiple instances. When these services fail, it’s mostly inconvenient but can still impact users. Many of our core services have implemented their own locking mechanisms to help with this. We’re working on a pluggable consensus interface for kork, our standard library, which will have a Redis-backed implementation to start. I’d love to see ZK, etcd and/or Consul implementations as well (for Netflix, ZK makes the most sense). Our applications highly value consistency, so more support for consensus just makes sense to me.

Hard things? We’ve got a few big ones that we’re currently tackling.

Front50 Eventual Consistency

Front50 has eventual consistency issues that our services sometimes do not account for, which leads to awkward bugs (ex: “I just saved my pipeline and immediately ran it but it used the previous config version”). Tactically, we’ve added some multiple-reads to verify the latest version is on a higher percentage of Front50 servers; it’s non-ideal but gives us breathing room for a longer-term fix. We’re looking at 3 viable longer-term candidate solutions: SQL, Redis write-through cache, Front50-side consensus. No strong opinions from me.
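One way to picture the multiple-reads stopgap is the sketch below: read the same object a few times (each read may land on a different server) and keep the newest copy seen. This is an illustration only, not the actual change; `read_newest`, the `fetch` callable, and the `lastModified` field name are all assumptions.

```python
from typing import Any, Callable, Dict


def read_newest(fetch: Callable[[], Dict[str, Any]],
                attempts: int = 3) -> Dict[str, Any]:
    """Call fetch several times and return the copy with the highest lastModified."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    newest = fetch()
    for _ in range(attempts - 1):
        candidate = fetch()
        # Assumed field: a monotonically increasing modification timestamp.
        if candidate["lastModified"] > newest["lastModified"]:
            newest = candidate
    return newest


if __name__ == "__main__":
    # Simulate three servers, one of which already has the new pipeline config.
    responses = iter([
        {"lastModified": 1, "config": "old"},
        {"lastModified": 2, "config": "new"},
        {"lastModified": 1, "config": "old"},
    ])
    print(read_newest(lambda: next(responses), attempts=3)["config"])
```

More reads raise the odds of seeing the latest write, at the cost of extra load, which is why this is only breathing room rather than a fix.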

Orca Redis Persistence

Orca is the bread and butter of Spinnaker and has undergone significant reliability changes over the last year. Just a year ago, we were EAP testing Orca v3, which was replacing a stateful Spring Batch engine with Keiko, our distributed queue library. Today, Orca still gives us heartburn because it stores all of its state, including pipeline execution history, in Redis and losing a Redis causes a percentage of running pipelines to fail. It’s an embarrassing way to fail.

I’ve made Orca’s persistence backend pluggable. We’re currently iterating on an internal SQL implementation that we plan to deploy against Aurora for higher availability and simpler consistency guarantees. As a bonus, I’m expecting a fairly dramatic improvement in performance of read APIs due to some internal refactors that I’m approaching with, “Well, while I’m in here, I may as well.” All of this is being built in private so we can iterate more quickly, but I’m experimenting with using GitHub Projects to offer up a public-facing progress report. Once we’ve migrated our production install and worked out the operational kinks, we’ll be open sourcing this work along with our zero-downtime migration code.

We’re looking to generally reduce operational burden, so getting off Redis entirely for a service may not be bad for us. Keiko was built from the beginning to have different persistence backends. It’s way, WAY lower priority for us, but we’ve been entertaining an SQS backend.

Clouddriver Persistence

We’re experiencing pain when our large customers do big volume things. In Clouddriver, higher load often manifests as stale data and slow pipeline execution. In some places, we’ve just written inefficient code that interacts with persistence, while in other cases our data model just isn’t a good fit in our current topology and scale.

We’re outgrowing write performance of a single Redis master. We’re currently doing exploratory overhaul work on quite a few fronts. The one that I’ve been working on, Dynomite, is currently on hold while we try some other competing & conflicting ideas and I’m doing that Orca SQL work.

We’d all like to have Clouddriver support a streaming cache strategy instead of our current polling strategy. The main blocker here is getting events for all of the things we need to cache and in a latency that’s acceptable. These issues are continually being improved and are starting to come close to what I would consider minimally required to start evaluating. I’m excited!

Traceability

Distributed system development is hard and not having good visibility into what’s happening is simply asking for trouble. We’ve got good and ever-improving telemetry and logs, but no distributed tracing. Work is currently underway to add Spring Cloud Sleuth (Zipkin) into our services. Having this will make it a lot easier to hunt down and squash problems.

Optimizations

While a handful of us are working on these strategic initiatives, most of the team is focusing on the tactical work of tracking down inefficiencies and fixing them. This is the stuff actually digging us out of the current problem. Many of the teams I’ve been on in the past have been fairly binary in focusing on either tactical- or strategic-only scaling strategies. We’ve got a nice blend going on right now and it’s really great to see these smaller fixes making a big impact.

A lot of performance improvements have been getting dumped into clouddriver and orca in the last few weeks. While we’re mostly focusing on AWS and Titus cloud providers, some changes are making positive impacts across the board. We recently pushed out a change that has cut some of our deploy timings in half!

Operations

We’re improving the operations of the system as well. A new (and yet production tested) quality of service extension is being added to Orca. This system currently is a binary on/off system, but is planned to eventually meter inbound orchestration work in response to the rest of Spinnaker’s health. It’s inevitable we’ll be caught out by some unforeseen performance issue, so we want to have systems in place to automatically enforce a minimum quality of service.

We’re also testing out Redis Sentinel for our existing Redis instances. It would likely surprise you we’ve been performing our own failovers to date, which hasn’t been “the worst”, but we’re hoping to add some automatic safety here. Minimizing the burden of running in production is always a good thing.

Process Changes

We currently measure our reliability and performance success criteria on gut feeling instead of something objective: We have good monitoring and alerts, but we don’t have SLI/SLO/SLAs. By the time an alert fires, it’s well too late to be preventative. As we identify what these should be, we’ll end up working on our own quality objectives and with our customers to nail down their expectations as agreements. Having these will help keep us more accountable and afford some more scientific means of budgeting errors vs feature development.

UI

Last but certainly not least, the UI. I have no business talking about this because I’m hardly an expert here, so I won’t. The team has been working on migrating from Angular to React, but amidst this, we’re also working on API calls to reduce payload sizes, cutting out superfluous requests and hunting down load time improvements. That’s all I can say, they’re doing wonderful work, though.

What Next?

We have a huge list of things we want to improve just for performance and reliability, let alone all other quality aspects. Right now, for me, I’m leaning towards improving some of our existing test harnesses and introducing some new ones.

There’s so much to do, it’s sort of a dream for my delighted ADD brain.

I should note, we don’t really know which of these efforts will work / work the best / totally not work. I’d be doing a terrible disservice to you if I didn’t make a followup at some point to talk about what didn’t work, and where we’ve landed on some things, so look out for that in the future!

I haven’t forgotten my series on Monitoring Spinnaker. I’d like to wait to continue it until some of this dust settles, as I suspect our monitoring for Clouddriver will change. Stay tuned!

Monitoring Spinnaker, Part 1

Rob Zienert — Wed, 21 Mar 2018 05:03:13 GMT

In a recent post to my dev log, I mentioned I wanted to write about scaling strategies for Redis, our primary storage engine within Spinnaker. But before we can jump into that, a far more important topic is necessary: Monitoring and alerting. Without measuring your applications, how can you actually be sure they’re behaving correctly, let alone know what part of the system needs your attention to continue growing?

So, we’ll learn about monitoring Spinnaker first, service by service, while taking a look at graphs of (as far as I know) the largest Spinnaker installation: Netflix’s production deployment. Irrespective of your Spinnaker’s deployment footprint, the metrics I’ll detail in this series will be valuable to you. At the end of the series, I’ll detail the alerts that we have set up, along with the remediation steps we have for each.

Monitoring Orca

What can I say? I wanted to start off with my favorite service. For me, Orca is the most interesting of them all. Moreover, making sure it’s happy is critical to making sure your users are happy.

From Orca’s README:

Orca is the orchestration engine for Spinnaker. It is responsible for taking a pipeline or task definition and managing the stages and tasks, coordinating the other Spinnaker services.

In other words, Spinnaker is known and often loved for enabling accessible, highly configurable and composable workflows through its pipelines and tasks. This is largely thanks to the code that makes up Orca.

It wasn’t random. It was orca-strated. rimshot

A super-fast internals crash course

The primary domain model is an Execution, of which there are two types: PIPELINE and ORCHESTRATION. The PIPELINE type is, you guessed it, for pipelines while ORCHESTRATION is what you see in the “Tasks” tab for an application. At this point in Spinnaker’s life, these two types are basically the same, but it wasn’t always that way. For our purposes, that’s all you need to know.

Orca uses a purpose-built distributed queue library, Keiko, to manage its work. A Message is a granular unit of work that is re-deliverable. For example, StartStage will verify that all upstream branches are complete and then plan the next stage; it won’t actually do anything as far as a user is concerned. A RunTask is really the only message that actually performs work that a user would be familiar with. Some message types are redelivered and duplicated often, whereas others are not.

If the queue is unhappy, everyone is unhappy.

The queue could be broken into two parts: The QueueProcessor, and its Handlers. Orca uses a single thread (in a thread pool shared with all other Spring-scheduled threads) to run QueueProcessor, which polls the oldest messages off the queue. A separate worker thread pool is used for Handlers. Orca’s ability to scale to process its work is dependent on availability of threads in the worker pool, which can be tuned either by adjusting the worker thread pool size, or by increasing the number of Orca instances: Ignoring a persistence or downstream service bottleneck, Orca can be scaled out horizontally to meet the work demands it is given.

If the queue starts to back up, Pipelines and Tasks will begin to take longer to start or complete. Sometimes the queue will back up because of downstream service pressure (Clouddriver), but oftentimes it will be due to Orca not having enough threads. You’ll know real fast whether Orca or Clouddriver is the problem based on Clouddriver metrics (ha, I’m such a tease, come back for Part 2!)

For all of these graphs, unless otherwise mentioned, I’ve chosen a time window starting at 1pm, ending at 3pm of a normal activity day.

Active Executions

executions.active (gauge, grouped by executionType)

This is really just good insight for answering the question of workload distribution. Since adding this metric, we’ve never seen it crater, but if that were to happen it’d be bad. For Netflix, most ORCHESTRATION executions are API clients. Disregarding what the execution is doing, there’s no baseline cost difference between a running orchestration and a pipeline.

Controller Invocation Time

controller.invocations (timer, grouped by statusCode, controller, method)

Run of the mill graph here, to be honest. If you’ve got a lot of 500’s, check your logs. When we see a spike in either invocation times or 5xx errors, it’s usually one of two things: 1) Clouddriver is having a bad day, 2) Orca doesn’t have enough capacity in some respect to service people polling for pipeline status updates. You’ll need to dig elsewhere to find the cause.

Similarly, we have another graph that is the same metric, but filters on status=5xx, which is useful to see if things are really going sideways.

Task Invocations

task.invocations (gauge, grouped by executionType)

This will look similar to the first metric we looked at, but this is directly looking at our queue: This is the number of Execution-related Messages that we’re invoking every second. If this drops, it’s a sign that your QueueProcessor may be starting to freeze up. At that point, check that the thread pool it’s on isn’t starved for threads.

Running Task Invocations by Application

task.invocations (gauge, filter status=RUNNING, grouped by application)

This is handy to see who your biggest customers are, from a pure orchestration volume perspective. Oftentimes, if we start to experience pain and see a large uptick in queue usage, it’ll be due to a large submission from one or two customers. In the graph above, it’s all normal business for us, but if we were having pain, we could bump our capacity, or look to adjust some rate limits.

Here’s an example of it being a little more dicey from the day before. You can see our teal green friends causing a bit of a stir at ~16:10. In this case, Orca was still perfectly healthy, but had we not been aware ahead of time, we could’ve easily sourced where our newfound load was coming from.

Message Handler Executor Usage

threadpool.* (gauge), orca.nu.worker.pollSkippedNoCapacity

This is our thread pool for the Message Handlers, plus another metric from the QueueProcessor. Green is good, blue is actual active usage, and red is when a thread is blocked. Red (blockingQueueSize) is bad, especially because the yellow metric (pollSkippedNoCapacity) should always prevent blockingQueueSize from rising above 0.

If you’ve got green, it means you have capacity to take on more orchestration work. If you’re seeing yellow, it means your threadpools are saturated and the QueueProcessor has skipped trying to take on more work.

A note about the QueueProcessor: there are some default-off config options we use at Netflix (dubbed “turbo orca”) that will cause the QueueProcessor to be more aggressive. Enabling them may cause us to see more yellow, but that’s what we want: We want to be processing as fast as possible, but not going overboard on capacity.

An occasional blip of yellow isn’t bad: the QueueProcessor polls on each Orca instance (by default) every 10ms. Missing a poll cycle isn’t the end of the world by any means, but sustaining a value over 0 is bad: Scale up or make your threadpool bigger.
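The relationship between thread pool capacity and skipped polls can be sketched as a single poll cycle. This is a simplified model, not Keiko’s implementation: `poll_once` and `PollStats` are hypothetical, and it assumes the default behavior takes one message per cycle while the aggressive mode fills every free worker thread.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class PollStats:
    # Stand-in for the pollSkippedNoCapacity counter described above.
    poll_skipped_no_capacity: int = 0


def poll_once(ready: Deque[str], active: int, pool_size: int,
              stats: PollStats, fill_executor: bool = False) -> List[str]:
    """Run one poll cycle and return the messages handed to worker threads."""
    free = pool_size - active
    if free <= 0:
        # No worker capacity: skip this cycle instead of queueing blocked work.
        stats.poll_skipped_no_capacity += 1
        return []
    # Assumption: one message per cycle by default, all free threads when aggressive.
    limit = free if fill_executor else 1
    taken: List[str] = []
    while ready and len(taken) < limit:
        taken.append(ready.popleft())
    return taken


if __name__ == "__main__":
    stats = PollStats()
    queue = deque(["m1", "m2", "m3", "m4"])
    print(poll_once(queue, active=20, pool_size=20, stats=stats))
    print(poll_once(queue, active=18, pool_size=20, stats=stats))
    print(poll_once(queue, active=18, pool_size=20, stats=stats, fill_executor=True))
    print(stats.poll_skipped_no_capacity)
```

Skipping the cycle when the pool is full is what keeps work from piling up behind blocked threads, so the skip counter rises before the blocking queue does.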

Queue Messages Pushed / Acked

queue.pushed.messages (gauge, grouped by instance)

queue.acknowledged.messages (gauge, grouped by instance)

These two graphs come as a couple. If messages pushed is outpacing acked, you’re presently having a bad time. Most messages will complete in the blink of an eye; only RunTask will really take much time. If you see an uptick in messages pushed, but not a correlating uptick in acked, it’s a good indicator you’ve got a downstream service issue that’s preventing message handlers completing: Take a look at Clouddriver, it probably wants your love and attention.

Queue Depth

queue.depth (gauge, gray), queue.ready.depth (gauge, yellow), queue.unacked.depth (gauge, blue)

I like this graph, especially when it looks like this. Keiko supports setting a delivery time for messages, so you’ll always see queued messages outpacing in-process messages if your Spinnaker install is active. Things like wait tasks, execution windows, retries, and so-on all schedule message delivery in the future, and in-process messages are usually in-process for a handful of milliseconds.

Operating Orca, one of your life missions is to keep ready messages at 0. A ready message is a message that has a delivery time of now or in the past, but it hasn’t been picked up and transitioned into processing yet: This is a key contributor to a complaint of, “Spinnaker is slow.” As I’ve mentioned before, Orca is horizontally scalable. Give Orca an adrenaline shot of instances if you see ready messages over 0 for more than two intervals so you can clear the queue out.

Here’s an example from just yesterday where we got hit with a bunch of orchestration work. You can see that we hit a couple bumps of ready messages, but cleared them out quickly. Once the ready messages were at a steady state, we scaled up and the ready messages dropped away and the world went back to a happy place.
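The “more than two intervals” rule of thumb translates into a small check over recent samples of the ready depth. This is a sketch only: `should_scale_up` is a hypothetical helper, and what counts as an interval depends on how your metrics system samples the gauge.

```python
from typing import Sequence


def should_scale_up(ready_depth: Sequence[int], max_intervals: int = 2) -> bool:
    """True if ready depth stayed above 0 for more than max_intervals samples in a row."""
    streak = 0
    for depth in ready_depth:
        streak = streak + 1 if depth > 0 else 0
        if streak > max_intervals:
            return True
    return False


if __name__ == "__main__":
    print(should_scale_up([0, 3, 0, 5, 0]))  # isolated bumps that clear on their own
    print(should_scale_up([0, 2, 4, 1]))     # three consecutive non-zero samples
```

Isolated bumps that clear by the next sample do not trip the check; a sustained run of non-zero samples does.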

If you want to see what it looks like when it’s really bad, take a look at this post from my personal blog.

Queue Errors

queue.retried.messages (counter, yellow), queue.dead.messages (counter, orange), queue.orphaned.messages (gauge, red)

It sounds worse than it is. Retried is a normal error condition by itself. Dead-lettered occurs when a message has been retried a bunch of times and has never been successfully delivered. A dead message gets dropped into its own sorted set in Redis which, if you want, can be redriven with some Redis-fu, but there’s no current tooling around that. We know we need better tooling around dead messages — but we typically do not redrive messages.

Orphaned messages are bad. They’re messages whose message contents are in the queue, but do not have a pointer in either the queue set or unacked set. This is a sign of an internal error, likely a troubling issue with Redis. It “should never happen” if your system is healthy, and likewise “should never happen” even if your system is really, really overloaded. It’s worth a bug report.
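Conceptually, the orphan check is a set difference. The sketch below uses plain in-memory sets with hypothetical names; it does not reflect Keiko’s actual Redis key layout.

```python
from typing import Iterable, Set


def find_orphans(message_ids: Iterable[str],
                 queue_set: Set[str],
                 unacked_set: Set[str]) -> Set[str]:
    """Messages whose contents exist but that neither set points to."""
    return set(message_ids) - queue_set - unacked_set


if __name__ == "__main__":
    stored = ["a", "b", "c"]  # ids of message contents held in the queue
    print(find_orphans(stored, queue_set={"a"}, unacked_set={"b"}))
```

In a healthy system every stored message is referenced by one of the two sets, so the result is empty.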

Message Lag

queue.message.lag (timer)

This is a measurement of the difference between a message’s desired delivery time and its actual delivery time: Smaller and tighter is better. This is a timer measurement of every message’s (usually very short) life in a ready state. When your queue gets backed up, this number will grow. We consider this one of Orca’s key performance indicators.

A mean message lag of anything under a few hundred milliseconds is fine. Don’t panic until you’re getting around a second. Scale up, everything should be fine.
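The arithmetic behind the lag metric is just actual delivery time minus desired delivery time, averaged over messages. The sketch below is illustrative: the helper names are hypothetical and the one-second default simply encodes the “don’t panic until around a second” guidance.

```python
from statistics import mean
from typing import Sequence, Tuple

# Each delivery is (desired_delivery_time, actual_delivery_time) in seconds.
Delivery = Tuple[float, float]


def mean_lag_seconds(deliveries: Sequence[Delivery]) -> float:
    """Average of (actual - desired) across all delivered messages."""
    if not deliveries:
        return 0.0
    return mean([actual - desired for desired, actual in deliveries])


def needs_scale_up(deliveries: Sequence[Delivery],
                   threshold_seconds: float = 1.0) -> bool:
    """True once mean lag reaches the threshold (about a second, per the text)."""
    return mean_lag_seconds(deliveries) >= threshold_seconds


if __name__ == "__main__":
    calm = [(10.0, 10.25), (11.0, 11.25)]
    backed_up = [(0.0, 1.5), (1.0, 2.0)]
    print(mean_lag_seconds(calm), needs_scale_up(calm))
    print(mean_lag_seconds(backed_up), needs_scale_up(backed_up))
```

The first sample averages a quarter of a second of lag, which is fine; the second averages over a second, which is the point to add capacity.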

Read Message Lock Failed

Look at that boring graph. Good thing, too. I’m not sure what this graph would look like if you have “turbo orca” disabled (enable with keiko.queue.fillExecutorEachCycle=true), but this graph is an indication that a poll cycle had messages in a ready state to acquire a lock on, but couldn’t. Scaling out should get you out of this situation, but if it doesn’t, then open a Pull Request because the setting that dictates this is currently hard coded (sorry).

Other Things

The other thing to monitor is your Redis instance. Running out of Redis memory on Orca is generally considered a really bad thing because running executions will fail and you’ll lose your queue state. If you’re using echo to publish Spinnaker events to a log (like Kafka), you can rebuild the state, but it’s generally good to just… avoid having that happen.

At Netflix, we have a replica that we’ll promote in the event of a failure, or can recover from S3, or alternatively, re-hydrate from our big data store. We’re currently working on moving to Dynomite so instance failures are less of an ordeal on this front.

To manage your Redis memory size, especially at a scale like ours, you need to have some background jobs running to keep the memory size down. Orca ships with just those things! Here’s what we have enabled:

pollers.oldPipelineCleanup.enabled=true: Clean up old pipeline executions over a certain threshold. It has some additional logic to keep past executions over the threshold for low-frequency pipelines. Source & configs are here.
pollers.topApplicationExecutionCleanup.enabled=true: Clean up orchestrations from especially busy applications. We have some apps that run tens of thousands of orchestrations a day: A nice recipe for Redis OOM by themselves. Source and configs are here.
tasks.daysOfExecutionHistory=180: We clean out any execution that’s over 180 days old. Period.
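The age-based rule is the simplest of the three to picture. The sketch below shows only the cutoff arithmetic; `expired_executions` and its inputs are hypothetical, and Orca’s real cleanup jobs have additional logic (such as the low-frequency pipeline exception mentioned above) that is not modeled here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_executions(end_times: Dict[str, datetime],
                       now: datetime,
                       days_of_history: int = 180) -> List[str]:
    """Ids of executions that ended before the history cutoff, sorted by id."""
    cutoff = now - timedelta(days=days_of_history)
    return sorted(eid for eid, ended in end_times.items() if ended < cutoff)


if __name__ == "__main__":
    now = datetime(2018, 3, 21, tzinfo=timezone.utc)
    ends = {
        "old": now - timedelta(days=181),
        "recent": now - timedelta(days=10),
    }
    print(expired_executions(ends, now))
```

Only the execution that finished more than 180 days before `now` is selected for cleanup.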

If you’re in a company that must meet things like HIPAA, PCI compliance or… any of those things that need long-term auditing, just make sure you’re shipping events to a long-term store via echo’s event firehose.

Wrap up

I hope this was useful for you. Getting started with Spinnaker is a daunting task, and keeping it running well is a hard task, especially as demand grows. From an orchestration perspective, this should give you enough to at least know when things are going sideways before your customers do, and hopefully be able to stay ahead of the curve before issues even become apparent to your users.

Next up: Clouddriver!

Monitoring Spinnaker, Part 1 was originally published in The Spinnaker Community Blog on Medium.