Scaling Clouddriver at Netflix

Rob Zienert
9 min read · Jan 6, 2019

Clouddriver is a crucially important service within Spinnaker: Its charter is to track and mutate the state of the various cloud providers. If I had to rank services by importance, I’d say it is number two, right behind Orca.

As far as scaling Spinnaker itself, however, Clouddriver is the most important to keep running smoothly, yet at Netflix’s scale, it is also currently the hardest to operate. In this post, I’m going to give a small crash course to Clouddriver’s architecture, its persistence, the various stages of deployment topologies we’ve gone through, and where we’re headed for the future. Strap in!

Clouddriver at 30,000 feet

In my mind, there are three parts to Clouddriver: Its cache, atomic operations, and the API.

For any Clouddriver deployment, there are going to be dozens — if not hundreds or thousands — of CachingAgents: Background processes that index the state of the world for a cloud provider, saving that state into the cache (Redis). For example, there would be a caching agent for AWS ELBs in us-east-1 for the test account, and yet another for us-west-2. While indexing, relationship graphs are kept either by explicit cloud provider relationships or implicitly via Spinnaker’s Moniker naming convention (default implementation via Frigga).
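To make the fan-out concrete, here’s a minimal sketch of what an AWS provider block might look like in a Clouddriver config file (the file name, account names, and account IDs here are made up for illustration); each account/region pair would get its own set of caching agents for ELBs, server groups, and so on:

aws:
  enabled: true
  accounts:
    - name: test                 # hypothetical account
      accountId: "123456789012"  # hypothetical AWS account id
      regions:
        - name: us-east-1
        - name: us-west-2
    - name: prod                 # hypothetical account
      accountId: "210987654321"  # hypothetical AWS account id
      regions:
        - name: us-east-1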

AtomicOperations are initiated via Orca business logic and perform discrete actions, such as creating a new server group, upserting security group rules, and so on. These AtomicOperations are a bit of a misnomer, however: They’re not really atomic, as there can be many underlying API requests to the cloud provider, but they should be idempotent, so they can be run more than once in the case of a transient downstream error like AWS rate limits, or in the case of a Clouddriver instance being Chaos Monkey’d.

Finally, there’s the API, which provides a thin abstraction around common cloud concepts, namely server groups, firewalls, and load balancers. These APIs are hit with all manner of frequency and oftentimes return a lot of data.

Clouddriver’s main bottleneck is its persistence (or, really, its cache). It’s a read- and write-heavy system, and the application code has been written expecting high consistency of the underlying data. To make matters worse, Redis is not good at storing relational data; this isn’t really an issue for smaller deployments, but when you’re indexing millions of cloud resources, it really starts to show. You can add read replicas to Clouddriver’s topology, but due to the consistency assumptions, doing it incorrectly will result in unnecessarily higher error rates.

Scaling Clouddriver

When people first evaluate Spinnaker, I often see all services colocated on a single monolithic server, with every service sharing a single Redis. That’s fine for evaluation: it isn’t great, but it’s easy to get started. The first step after that is to give every service its own dedicated persistence / cache server (Redis, SQL, etc.).

As a guiding rule, Clouddriver’s Redis doesn’t need a lot of resources: Netflix’s Redis footprint is only about 40 GB to store state on millions of cloud resources. Your main scaling factors will be network throughput and — eventually — managing replica lag.

Single Clouddriver, Single Master

The most common deployment. You can scale Clouddriver’s application servers out horizontally without any additional configuration. The CachingAgents will be scheduled across the cluster automatically.

For the sake of example, we’ll say our cluster is named clouddriver.

Dedicated Caching Cluster

Running CachingAgents is an expensive but necessary operation: They use a lot of threads for logic & IO. Moving these CachingAgents to their own dedicated cluster and disabling caching on an API cluster will have two benefits:

  1. You can change your Clouddriver deploy pipeline to first deploy caching agents to populate any new caches, then deploy an API cluster to start serving this new data. Useful when upgrading, or switching to a new Redis master.
  2. Your API requests won’t be competing for system resources with CachingAgents, which will improve performance consistency.

We’ll have two clusters: clouddriver and clouddriver-api. In the clouddriver-api cluster, we’ll set caching.writeEnabled to false.
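As a sketch, the clouddriver-api cluster’s override could be as small as this (assuming you carry per-cluster settings in a clouddriver-local.yml or equivalent; it’s the same property the launch script later in this post sets via -Dcaching.writeEnabled):

# clouddriver-api: serve API traffic only, don't run caching agents
caching:
  writeEnabled: false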

Again, both clusters can scale horizontally. We still have a single Redis master at this point. You won’t need many caching agent servers; just make sure you’re not running out of memory, which is mainly a function of the number of caching agents running on a server at any given time, controlled by the redis.agent.maxConcurrentAgents config (we set ours to 75, YMMV).
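For the caching cluster, that one knob looks like this in config form (75 is just the value we happen to use; tune it against your own memory headroom):

# clouddriver (caching cluster): cap how many agents run concurrently per server
redis:
  agent:
    maxConcurrentAgents: 75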

Readonly API Clusters

With enough time and use, you may need to scale your API read requests more and route different classes of read-only traffic to their own API clusters & dedicated Redis replicas. Operationally, things get a little more interesting here. At Netflix, we currently have 7 different read replica clusters, all serving different classes or shards of data.

We’ll need to create a Redis read replica, and split the clouddriver-api cluster into clouddriver-api and clouddriver-api-readonly. We'll route requests from Gate to the readonly cluster, since no write operations go directly to Clouddriver as they’re all orchestrated through Orca!

I won’t go into how to set up Redis replicas; there’s plenty of good documentation on that. We’ll just say that the Redis master is accessible at redis-master:6379 and the replica at redis-api-readonly:6379.

Aside from changing the redis.connection config, there are no changes for Clouddriver. Instead, you’ll point Gate and any other service (Igor, for example) at the readonly cluster, and Orca at the clouddriver-api cluster. The clouddriver-api cluster will still point to the Redis master.
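As a sketch, the only property that differs on the readonly cluster (using the replica hostname from above):

# clouddriver-api-readonly: read from the replica instead of the master
redis:
  connection: redis://redis-api-readonly:6379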

Readonly Deck Cluster

Similar to Readonly API Clusters, having a dedicated cluster for Deck can be important if you have a lot of UI customers. Deck performs a lot of polling operations on backing services, which can add quite a lot of base overhead during working hours. Moving that traffic to its own Redis replica may help improve the quality of life of your customers quite dramatically, as well as isolate the UI from other slower Clouddriver shards and vice versa.

To do this, you’ll set up a new cluster, clouddriver-api-readonly-deck, and follow similar configuration to the other readonly clusters.

To route Deck-only traffic to this cluster, Gate will need to change:

services:
  clouddriver:
    dynamicEndpoints:
      deck: https://clouddriver-prod-api-readonly-deck.example.com

That’s it. Gate knows what requests come from Deck, and will correctly route anything originating from Deck to your new cluster, while sending everything else to your other readonly API cluster.

Readonly Orca Cluster

Similar to Deck, if you’re doing a lot of cloud state mutations through Spinnaker, you’ll likely end up wanting to send Orca to its own dedicated Clouddriver cluster(s). Remember when I mentioned Netflix has 7 readonly clusters? 5 of those are for different shards of Orca executions.

Orca’s request routing capabilities for Clouddriver are more advanced than Gate’s, thanks to (pluggable) ServiceSelectors. For any given Orca Execution (Pipelines or Orchestrations), we can use the Execution’s context to figure out which Clouddriver cluster to talk to. As of this writing, there’s a handful of different selectors you can choose from:

  • ByApplicationServiceSelector: Route based on the application the Execution is running against.
  • ByAuthenticatedUserServiceSelector: Route based on the authenticated user who started the Execution.
  • ByExecutionTypeServiceSelector: Route based on the Execution Type (PIPELINE or ORCHESTRATION).
  • ByOriginServiceSelector: Route based on if the Execution was started via an API client, a UI user, or something else...

Let’s pretend we have a large widget team with a dozen or so applications that provide their organization with a purpose-built PaaS on top of Spinnaker’s API, so they’re doing a lot of automated orchestrations. We’ll send them to their own readonly shard. Again, just like in Readonly API Clusters, set up a new replica and a new clouddriver-api-readonly-orca-widget cluster. (Our naming conventions are long, but at least we know exactly what they’re for!)

In our Orca config this time:

clouddriver:
  readonly:
    baseUrls:
      - baseUrl: https://clouddriver-api-readonly-orca-widget.example.com
        priority: 10
        config:
          selectorClass: com.netflix.spinnaker.orca.clouddriver.config.ByApplicationServiceSelector
          applicationPattern: widget.*

In this configuration, any application whose name starts with widget will be routed to this new Clouddriver cluster. All of the selector rules can be given different priorities, allowing you to build up more complex rule sets as necessary.
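For example, a more specific rule can be layered on top of the widget catch-all by giving it a different priority (the second cluster name and pattern here are hypothetical, and I’m assuming the higher-priority rule wins when both patterns match):

clouddriver:
  readonly:
    baseUrls:
      # catch-all for the widget org
      - baseUrl: https://clouddriver-api-readonly-orca-widget.example.com
        priority: 10
        config:
          selectorClass: com.netflix.spinnaker.orca.clouddriver.config.ByApplicationServiceSelector
          applicationPattern: widget.*
      # a particularly busy application gets its own shard (hypothetical cluster)
      - baseUrl: https://clouddriver-api-readonly-orca-widgetfactory.example.com
        priority: 20
        config:
          selectorClass: com.netflix.spinnaker.orca.clouddriver.config.ByApplicationServiceSelector
          applicationPattern: widgetfactory.*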

Local Redis

The most recent development for scaling Clouddriver at Netflix has been the introduction of locally-colocated Redis replicas on Clouddriver server VMs. Spinnaker at Netflix gets a tremendous amount of UI usage, enough that multiple replicas are needed. Furthermore, network latency can be a killer, so having the replica co-located means there’s fewer bits shuffling back and forth across the datacenter.

There’s no fancy configuration needed for this, yet it’s the only time we’ve daisy chained replicas: A redis-replica-deck will be the master of N local deck replicas. The clouddriver-api-readonly-deck servers just talk to localhost:6379.
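Config-wise, that just means the deck readonly cluster points its Redis connection at localhost (a minimal sketch):

# clouddriver-api-readonly-deck: talk to the co-located replica
redis:
  connection: redis://localhost:6379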

We haven’t explored this model for other clusters. My guess is it wouldn’t work out so hot: API requests from Deck are okay with a bit of inconsistency, however our API customers generally expect consistency in responses and Orca definitely expects consistency.

One interesting note about this deployment is that the servers shouldn’t come up healthy until the locally running Redis’ replication is in sync. This means that deployments of this type will take longer, but we find that a worthwhile trade-off for much better performance under load.

Launching Clouddriver Processes

As you can imagine, this topology can get fairly crazy from a YAML configuration standpoint. This is why naming conventions are so important for us: we can derive some configuration at boot-up rather than hardcoding things. We do this by customizing the bash script that launches the java process. An example of what that sorta looks like:

#!/bin/bash -x

# a bunch of env stuff like deriving a CLUSTER var from EC2, etc...

# API-serving clusters don't run caching agents or write to the cache
CACHING="caching,"
CACHE_WRITE=true
if [[ ${CLUSTER} == *"-api"* ]]; then
  CACHING=""
  CACHE_WRITE=false
fi

# readonly clusters point at a replica; everything else talks to the master
REDIS_CONNECTION="redis://redis-master:6379"
if [[ ${CLUSTER} == *"-api-readonly" ]]; then
  REDIS_CONNECTION="redis://redis-replica-api:6379"
elif [[ ${CLUSTER} == *"-api-readonly-deck" ]]; then
  REDIS_CONNECTION="redis://redis-replica-deck:6379"
fi

# ... and so-on...

JAVA_OPTS="-Dspring.profiles.active=${CACHING}${ACCOUNT},${STACK} \
  -Dcaching.writeEnabled=$CACHE_WRITE \
  -Dredis.connection=$REDIS_CONNECTION"

# Launch the app...

Clouddriver Resource Usage

A note on Clouddriver resource usage, as this is a pretty common question. Clouddriver is a hungry beast; the hungriest of the services. It spends a lot of time shoveling data back and forth between Redis, serializing data and deserializing it again, etc. It’s inevitable that as your cloud footprint expands, so too will Clouddriver’s — that isn’t necessarily the case for the other services.

Unfortunately, I can’t tell you what the “right” settings are for Clouddriver, because different deployments are going to have different needs. Having said that, here are some tips, as well as a hand-wavy snapshot of Netflix’s Clouddriver deployment from two different environments.

  • Allocate more CPU cores. The more concurrent API requests or caching agents you have, the more cores you will need.
  • Allocate more memory: We’re processing the state of the world here. I’d say 8GB is sufficient for the smallest deployments and 16GB when you’re getting into the thousands of cloud resources? Totally pulling numbers out of the air here. ¯\_(ツ)_/¯

We run Clouddriver with the G1 collector and found it performed better than CMS once we went to heap sizes of 32GB and higher.

clouddriver-main

This environment is our monolith deployment that indexes everything within Netflix. Although we could, we don’t autoscale, which means we’re typically over-provisioned. [ed: We’re OK with over-provisioning if it means we can spend more time working on and stabilizing long-term improvements.]

We run m5.4xlarge (16 vCPU, 64GB RAM) for all Clouddriver clusters. We have 6 caching servers and 36 API servers distributed across 6 different API shards. There is 1 Redis master and 7 replicas, not counting the daisy-chained Local Redis replicas (4 of our API servers are deployed in that configuration). The 7th replica, again, is the one all Local Redis deploys are chained to.

clouddriver-test

One of our lower environments. In our internal deployment pipeline for Spinnaker services, this environment is always bleeding edge and is where we first test large changes (like Orca SQL and Fiat). We only index other test accounts from this environment and it doesn’t get a lot of regular use, so this may be more representative of a smaller production deploy.

We run 4 caching servers on c3.2xlarge (8 vCPU, 15GB RAM), and a total of 4 API servers (two clusters, a write and a readonly) on m3.xlarge (4 vCPU, 15GB RAM). For Redis, we’re running 1 master and 1 replica on r3.large (2 vCPU, 15GB RAM).

We really need to update our instance families in test; that’ll be an easy new year’s task. I think m5.xlarge across the board would work fine in this env.

Scaling for the Future

This pattern of Redis replicas can only bring us so far and doesn’t provide a great availability story. In 2019, we’ll be exploring some different strategies to see us jump to the next magnitude of scale and availability, ideally with less operational burden. I’ll definitely write up something about the results of this when we’re closer to something presentable.

But what about rate limits?

“You didn’t talk at all about rate limits!” True, I didn’t. Clouddriver is a hungry, hungry hippo when it comes to rate limits on your cloud providers. Our caching agents in Clouddriver are polling-based, which means Clouddriver will re-request the state of the world even if nothing changes. We recently started experimenting with the Titus Streaming API internally, with the hope of paving a streaming pattern for other cloud providers. Of course, my dream would be to get state changes streamed from AWS. A Netflix skunkworks project that may be of interest to the world is Historical, which aims to do exactly this; it’s something I’d love to spend some 20% time integrating. I’m a dreamer, and one of my dreams is that in 2019 we can have a better story for streaming updates from cloud providers where supported.

Like this post and want more? There’s an #operations channel in the Spinnaker Slack, come join and talk shop.
