2017 Spinnaker Summit Debrief

Sep 15, 2017

200 engineers and engineering leaders from over 50 companies attended the first ever Spinnaker Summit this week, hosted at Netflix in Los Gatos, CA, with happy hours sponsored by Google and Armory.

Below is an abstract of the talks from the Summit, with presentations, summaries, and videos, as available (I’ll add more as I compile them).

PS: Graphs, graphs, graphs! Fifty-four people filled out the pre-Summit survey, indicating how they use Spinnaker. I graphed some of the responses at the bottom of this post.

Opening Keynote

ANDY GLOVER, NETFLIX and STEVEN KIM, GOOGLE

Andy and Steven gave a brief keynote to kick the conference off. Andy’s main point was that Spinnaker isn’t a software deployment platform. It’s a delivery platform. It’s about the delivery of software, not just deployments. Some examples:

ACA — Automated Canary Analysis, which plugs into Spinnaker within Netflix (and will soon be open-sourced).
FastProperties are Runtime properties at Netflix. “This is how we flip content live, by having dynamic properties and proposed using deployment pipelines [in Spinnaker] to deliver the properties.”
Some teams [at Netflix] are using Spinnaker to validate library dependencies.

“Most people see Spinnaker as a deployment platform. But it’s much more of a delivery platform.” — Andy Glover, Netflix

Andy’s other main point was that the open source community is at the starting line with Spinnaker; that this is a point of inflection. The community has been growing fast, but he and Steven would love to see an increasing number of contributors, saying “it’s a win-win when we innovate, even if it’s kept private, because conversations and ideas are shared within the community.”

Spinnaker at Target

EDWIN AVALOS, TARGET

Abstract: Spinnaker is one of the foundations for our “Target Application Platform”, TAP for short. Why Spinnaker was the choice a year ago, why we continue to invest in it, why we are building the rest of the TAP. And how Target’s model is something that larger companies might want to emulate to provide the ubiquitous experience across multiple compute providers.

Presentation and/or video:

Reap unneeded pods in Kubernetes: https://github.com/target/pod-reaper
We also have this for reaping pods when they go into an unwanted state: https://github.com/k8guard/k8guard-start-from-here (here you can write rules that deal with problems in general or violations)

Collaborative notes from audience:

No audience notes were captured from this talk

Continuous Delivery at Cloudera

CHUKA OKOYE, CLOUDERA

Abstract: At Cloudera we have a preference for immutable infrastructure components and also believe in the “everything as code” mantra. This talk will provide more context on where, how, and why we’ve used some foundational components (Spinnaker, Jenkins, Terraform, CloudFormation and Lambda) to drive Continuous Delivery. We’ll also discuss how we’ve extended Spinnaker to manage non-instance related resources in a secure manner.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

No audience notes were captured from this talk

Scaling Spinnaker at Netflix

ADAM JORDENS, NETFLIX

Abstract: A deep dive on how Netflix utilizes Spinnaker to deploy thousands of times per day across dozens of accounts and two cloud providers (AWS and Titus). This talk will cover how we operate and extend Spinnaker to meet the needs of more than a thousand internal applications. We will share details on how Netflix has managed to integrate Spinnaker with numerous internal systems that are not themselves open source (hint: we do not fork!).

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Went from 400 deployments per day to 4,000 per day

“We don’t want individual engineers at Netflix to worry about where to deploy.” — Adam Jordens, Netflix

Why did Netflix open source Spinnaker?

We get a lot of benefits from the community
We wanted a provider model to push to multi cloud (Titus, AWS, GCE)
We’ve made better decisions because we OSS
Recruitment and retention for Netflix

Why not just use Puppet, Chef, Ansible?

We needed instant scaling

Spinnaker at Netflix

No forks or any prebuilt public images
We are version locked with OSS
We extend it, and the community can do it too, just takes effort
Every service is deployed independently
See for Gradle file of version locking
Every service has its own Redis
Metrics are grabbed from each service and Redis too
We also want zero downtime for Redis

Clouddriver:

10k ops/s (500k+ if you flatten multi key)
Dynomite might allow us to scale linearly
Orca used to have read/only and mutating
Orca now will shard read only req to dedicated clouddriver-orca clusters
… by execution types (orchestration vs pipeline)
… by origin (UI vs API), allows UI to take precedence over API
… by Application, one of the applications was bogging down global traffic
WIP: authenticated users
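
The read-only sharding described above can be pictured as a small routing function. This is only a sketch under assumed names: the `Request` fields, the cluster names, the `NOISY_APPLICATIONS` set, and the order of the rules are all hypothetical, and it is not how Orca actually selects a Clouddriver cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    read_only: bool
    execution_type: str  # "orchestration" or "pipeline"
    origin: str          # "ui" or "api"
    application: str


# Hypothetical: applications that generate enough traffic to get their own shard.
NOISY_APPLICATIONS = {"noisy-app"}


def choose_cluster(req: Request) -> str:
    """Return a (hypothetical) Clouddriver cluster name for a request."""
    if not req.read_only:
        return "clouddriver-mutating"
    if req.application in NOISY_APPLICATIONS:
        # Isolate one heavy application so it cannot slow down everyone else.
        return f"clouddriver-readonly-{req.application}"
    if req.origin == "ui":
        # Interactive traffic gets its own shard so it is not starved by API calls.
        return "clouddriver-readonly-ui"
    return f"clouddriver-readonly-{req.execution_type}"
```

Checking the application before the origin is a design choice of this sketch: it keeps a single heavy application away from the UI shard as well.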

Metrics (Atlas):

includes dashboarding and alerting
Logs: ELK stack, ~1 week’s worth

Testing and Promotion:

Test/Prestage/main
prestage + main: run validation on pipelines on cron. like promotion
always red/black, keeping the old one around for about an hour
… they have a reaper script that looks for disabled server groups older than an hour, but it resizes them to 0
30–40 min from code commit → staging (TravisCI is a huge portion), but we’re trying to shrink it
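
The reaper script mentioned in the notes above could look roughly like the following. It is a minimal sketch working on in-memory records; the `ServerGroup` fields and the `reap` function are hypothetical, and a real script would query and resize server groups through Spinnaker or the cloud provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ServerGroup:
    name: str
    disabled: bool
    disabled_at: Optional[datetime]
    desired_capacity: int


def reap(groups: List[ServerGroup], now: datetime,
         max_age: timedelta = timedelta(hours=1)) -> List[str]:
    """Resize to 0 every server group disabled for longer than max_age.

    Returns the names of the groups that were resized.
    """
    reaped = []
    for group in groups:
        if not group.disabled or group.disabled_at is None:
            continue
        if now - group.disabled_at > max_age and group.desired_capacity > 0:
            group.desired_capacity = 0  # resize, do not delete
            reaped.append(group.name)
    return reaped
```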

War stories:

Jenkins masters were flaky sometimes and we didn’t have circuit breakers on them
lost orca redis a few times, some accidents, some thresholds
we now have alerts that are 3–4 weeks ahead of time
extremely large clusters
large objects in s3 (front50).
… some pipelines and mutations were crashing the cache
… solved using s3 events (sns topics), front50 listens to these events to keep cache updated
… GCE might have this issue, but hasn’t seen any issues yet.

Cloud Security:

Spinnaker works in its own account
Uses IAM roles
User Data Signing
Netflix OSS: Lemur (cert management)
diagram of cert trusts

Capacity Planning:

spot market and cost reductions
encoding team sends 10–20k orchestrations per day to scale fleet up/down
market usage is at in the late evening

Other points re: Spinnaker at Netflix:

ChaosMonkey works with Spinnaker
2500 applications with > 1 server group
9,500(!!!) pipelines
… hopes pipeline templates will drop this
67% have deploy stages
33% use expressions
… fixed a bunch of security issues

Netflix Spinnaker Team:

12 people, from 8 people when it was OSS’d

Future:

more focus on containers and other deployment targets, like CDN

Questions:

Are containers used for Spinnaker or App developers?

app developers on Titus

Do you use halyard?

no, we don’t.

How do you manage the single point of failure of one AWS account?

There are talks about it
Security concerns about baking in different accounts

Is Spinnaker running in multiple regions?

no, not yet, but baking is. We don’t copy bakes across regions

Matt Duftler is a great resource for Rosco

Orca v3

ROB FLETCHER, NETFLIX

Abstract: We recently did a major rewrite of the foundations of Orca in order to improve the service’s operational capabilities. This talk will discuss the limitations of the old Orca, the changes we made, how we operationalize the new implementation, and the opportunities we now have to take things further.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Orca 101:

All stages are “synthetic” stages composed of multiple stages, e.g. deploy → deploy us-west-1, us-west-2
… Deploy in us-west-2: create server group → monitor → wait for up instance → disable old server group
Canary stages at Netflix sometimes run for days, which caused huge issues for V1

History of Orca (V1):

V1 used Spring Batch for workflow
Was built after Asgard, and its main goal was to uncouple from AWS
Worked for 2 years, but wanted to swap Spring Batch out for something else.
… See “Mismatch on intent” in video
Chaos Monkey didn’t work on Orca V1
Couldn’t red/black Orca V1 because Spring boot was stateful

“Netflix doesn’t really do stateful services”

Goals for V2:

resilient to instance loss, able to restart pipelines easily, red/black Orca, distributed work across the cluster.

V2 issues:

Don’t change any behavior, but isolate Spring batch because it leaked

Orca V3:

replace spring batch with a message/queue system.
Tried Arca, but ran into issues of testing, integration into Netflix workflow
Supports SQS, but they’re using Redis
Queues: queue, message, processing
… Queue.push: push message to message, push message_id to queue.
… Queue.poll
uses a single redis
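
The queue / message / processing split noted above can be illustrated with a small in-memory stand-in. This is a sketch, not Orca's implementation: the class name, the `ack` and `requeue_unacked` methods, and the use of Python containers instead of Redis structures are all assumptions made for illustration.

```python
import uuid
from collections import deque


class InMemoryQueue:
    """Toy queue: payloads live in `messages`, only ids travel through `queue`."""

    def __init__(self):
        self.queue = deque()     # message ids waiting to be delivered
        self.messages = {}       # message id -> payload
        self.processing = set()  # ids handed to a worker but not yet acknowledged

    def push(self, payload):
        # Push the message into the message store, then its id onto the queue.
        message_id = str(uuid.uuid4())
        self.messages[message_id] = payload
        self.queue.append(message_id)
        return message_id

    def poll(self):
        # Hand out the next id and remember that it is being processed.
        if not self.queue:
            return None
        message_id = self.queue.popleft()
        self.processing.add(message_id)
        return message_id, self.messages[message_id]

    def ack(self, message_id):
        # Hypothetical: the worker finished, so forget the message.
        self.processing.discard(message_id)
        self.messages.pop(message_id, None)

    def requeue_unacked(self):
        # Hypothetical: after a worker is lost, make its messages deliverable again.
        for message_id in list(self.processing):
            self.processing.discard(message_id)
            self.queue.append(message_id)
```

Tracking unacknowledged ids separately is what would let another instance pick up work after an instance loss, which matches the resilience goal listed for V2.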

The future of v3:

inflight work is not prioritized, they’re all flat right now. will change soon.
v2 → v3 migration is still unclear.

Spinnaker on Oracle Bare Metal Cloud

OWAIN LEWIS, ORACLE

Abstract: An overview of Oracle’s Bare Metal Cloud. We’ll look at why Spinnaker is important to Oracle, our experiences contributing a cloud provider, and the long term roadmap for contributing to the Spinnaker community.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Oracle BMC is Oracle’s 2nd-generation cloud; Oracle hired a bunch of engineers from AWS, Azure, and GCP to re-imagine a next-gen cloud

Goals of Oracle BMC:

Radically improved cloud security and governance
Network performance → customers have said 10x performance increase
Network design → only 2 hops max
Choice of bare metal or VM
52 core X7s … good for animation; video rendering

Why Oracle + Spinnaker:

Started investigating CI/CD at start of year; instead of building own tool, decided Spinnaker was the best product; so decided to contribute to that
For external and internal customers wanting to deploy to Bare Metal
Oracle is massively investing in Kubernetes; Support for k8s was a big reason

Progress:

Team of 4 people; growing but still small
Most of the initial cloud providers merged
Packer builder for Bare Metal merged yesterday; waited 7 months for that
Halyard support for easy install
Front50 support for Bare Metal

Work left to go:

Finish work on load balancers (Oracle doesn’t have autoscaling on LBs), UI to TypeScript, Rosco changes

Does Oracle use Spinnaker internally?

Not currently; trials starting soon
Software deployments at Oracle are diverse and complex; Spinnaker can help
Some internal teams are familiar with and use Spinnaker on other platforms

Delivery challenges at Oracle:

Lots of legacy processes; very diverse software delivery w/ challenging requirements
Tens of thousands of developers; consistent tooling is very important
Largest Artifactory instance in the world; hundreds of TBs of software
Large binary file based deployments (10+ GB) alongside traditional application deployments that don’t fit the Spinnaker model
Oracle Continuous Delivery
Traditional simple web applications
Binary delivery
… Data center components
… RPMs
… Custom images (for PaaS services)
… Software patches
… On-prem software
… Highly secure environments w/ no external network access
Would get strong adoption inside Oracle if Spinnaker would work for:
… non-traditional use cases
… common unit of deployment is non standard (not always a web application)
… software needs to be installed on premise; datacenter software etc
… could use webhooks or similar to integrate

How can Oracle help Spinnaker?

helpful contributors
driving adoption
contribute to core

Roadmap for Oracle + Spinnaker:

Cloud provider completed
start testing cloud provider w/ internal teams
contribute documentation and tutorials
CITest integration
Scale up team to contribute more to the community
Become more active in design and development of core features

Spinnaker at Schibsted

GARD RIMESTAD, SCHIBSTED

Abstract: It has been almost two years since we started out with Spinnaker in Schibsted. We are doing 250+ production deployments each week against AWS, k8s, ECS, Mesos and our internal paas. In this talk we will go through what parts of Spinnaker we are using and what we have learned along the way. How we are handling rpm and deb packages and how we would like to improve it. How the webhook stage has unblocked with k8s deployments and made it possible to have reliable deployments to our internal PaaS. Our experience with declarative pipelines and some examples of how we are using it.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Lesson learned: Pick one OS and version and stick with it!

Ubiquitous Delivery: Plugging Into Spinnaker

CHRIS THIELEN, JEREMY TATELMAN, NETFLIX

Abstract: In this talk, Chris Thielen, UI Engineer, and Jeremy Tatelman, UX Designer, will share their experiences integrating disparate tools into a low-friction, optimized user experience within Spinnaker. Topics include “knowing when to integrate,” “development best-practices” and “UX do’s and don’ts.”

Presentation and/or video:

We’ve created a repository with a custom build of Deck, which is based on how Netflix builds it internally: https://github.com/spinnaker/deck-customized

Collaborative notes from audience:

No audience notes were captured from this talk

Cerner’s DC/OS-Spinnaker

WILL GORMAN, ROBERT FARR, CERNER

Abstract: Will from Cerner will discuss DC/OS features and Robert will go over how Spinnaker’s being managed for multiple teams at Cerner as well as some of the Splunk and monitoring features they’ve added to Spinnaker.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

A number of stages and operations supported on DC/OS

Server group exposes same settings as Marathon app

Jobs via Metronome (like Jenkins)

Run Job limitations

Metronome doesn’t support secrets
No Docker parameters = no Splunk logging
Permissions issues w/ sandbox file access in strict mode

Ability to have Spinnaker be single point to look at status of deployments multi-region has been extremely useful

Change cluster just by changing region
Set up in Clouddriver

Load Balancer Woes

DC/OS doesn’t have an official built-in load balancer to handle external traffic and LB to clusters
Recommended LB from Mesosphere is Marathon-lb, but you can’t assume that’ll be present in every DC/OS cluster
LB rule changes require redeploying server groups
Dealbreaker w/ Marathon-lb: No way to put v000 and v001 in the same backend
Traefik works better; built w/ microservices in mind

Challenges with Spinnaker for DC/OS at Cerner:

Security:

Jenkins Stage authorization
Audit trail

Authorization:

Spinnaker deploys an admin instance
Then separate instances for Electronic Medical Records, Population Health and internal tools

Auditing:

Extended Echo to create echo-splunk

Monitoring:

Added statsd metric store

Image management:

One of the more difficult challenges with AWS
Encrypted AMIs don’t share
Does anyone else have this problem w/ Encrypted AMIs??? If so, log a feature request w/ AWS and tag it onto Cerner’s
Getting around it at the moment by hijacking the bake stage; created a set of scripts for managing images

Future plans:

Create first-class stages for image management
Deploy DC/OS to OpenStack w/ Spinnaker
Add support for DC/OS service account key rotation
Pipeline triggers from external Spinnaker instance
Restrict masters for the Jenkins stage by account

What Armory Has Learned

DRODIO, BEN MAPPEN and ISAAC MOSQUERA, ARMORY

Abstract: Armory has engaged with over 300 engineering leaders on their CI/CD pain points and installed Spinnaker across multiple enterprise customers. In that process Armory has learned how to operationalize Spinnaker in vastly differing environments, what the biggest blockers are to adoption, and what features are most valuable to (and most requested by) enterprise customers. They’ll share their learnings, along with tactical pro-tips to help global companies succeed with Spinnaker.

Presentation and/or video:

Armory's Spinnaker Summit Slides: “Our Playbook to Win with Spinnaker in Your Enterprise” (docs.google.com)

Collaborative notes from audience:

No audience notes were captured from this talk

Spinnaker at Gogo Air

ALEX KING, DOUG CAMPBELL, JOEL VASALLO, AND STEVE BASGALL, GOGO AIR

Abstract: Gogo has embraced use of Spinnaker for deployments across its infrastructure over the past 18 months or so, sometimes in innovative ways. This session will be an overview of how we utilize Spinnaker, and dive into foremast, static content deployments, and lambda.

Presentation and/or video:

Spinnaker at Gogo (docs.google.com)

Foremast repo for anyone interested in learning more: https://github.com/gogoair/foremast

Collaborative notes from audience:

No audience notes were captured from this talk

Kubernetes + Spinnaker

LARS WANDER, GOOGLE

Abstract: Since the first iteration of the Kubernetes provider was announced over a year and a half ago, we’ve seen the Kubernetes ecosystem mature, and the requirements of the Kubernetes provider in Spinnaker change. Lars led the initial implementation of the Kubernetes provider, and has written a proposal for its next iteration. In this talk we’ll be focusing on how the Kubernetes provider will be improved, and what new workflows it unlocks.

Presentation and/or video:

Kubernetes + Spinnaker Special Interest Group: https://groups.google.com/forum/#!forum/spinnaker-sig-kubernetes

Collaborative notes from audience:

History

we first implemented it in the “Spinnaker” way
it had a lot of cool features, but there were costs
Manifest based deployments
It had a lot of cool features, but there were issues that wouldn’t work

Problems

How do we avoid restricting a user’s resource naming?

Why?

Annoying
Replica Sets must be suffixed uniquely (-v000, -v001, …) for deployment strategies to work
Naive renaming at deploy time breaks non-label relationships (e.g. autoscaling)
Cluster relationship is not always inferable from deployment … makes canary harder
Relationships between disabled server groups and load balancers

Generify integration with Frigga within Spinnaker

app-stack-detail
Spinnaker relationships diagram
relates clusters, server groups, applications, load balancers, security groups
Annotate explicitly rather than infer from naming
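
To see why inferring relationships from names is restrictive, here is a rough sketch of what name-based inference involves. It is a simplified parser for the app-stack-detail-vNNN shape mentioned above, not Frigga's actual grammar; the function name and the returned keys are hypothetical.

```python
import re


def parse_name(name: str) -> dict:
    """Split a server group name of the form app[-stack[-detail]][-vNNN]."""
    match = re.fullmatch(r"(.+?)(?:-v(\d{3,}))?", name)
    cluster, sequence = match.group(1), match.group(2)
    parts = cluster.split("-", 2)  # detail may itself contain dashes
    return {
        "app": parts[0],
        "stack": parts[1] if len(parts) > 1 else "",
        "detail": parts[2] if len(parts) > 2 else "",
        "cluster": cluster,
        "sequence": int(sequence) if sequence is not None else None,
    }
```

A resource named without this shape (or whose application name contains a dash) is misclassified by such a parser, which is the motivation for annotating relationships explicitly instead.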

How do we specify which artifacts in a manifest need to get to be updated in a pipeline?

Artifact Decoration
… More generic definition of what an artifact is within Spinnaker
… type, name, version (e.g. type: docker, name: myapp, version: 1.0.0)
Multiple artifacts changing at once
Existing spinnaker triggers don’t support multiple artifacts
Triggers should be able to support multiple artifacts
… User needs to be able to handle missing artifacts
… fail
… use string constant/SPEL expression
… use prior execution’s artifact (e.g. latest supplied artifact version)
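
The missing-artifact options listed above (fail, fall back to a constant, or reuse the prior execution's artifact) can be sketched as a small resolver. The `Artifact` class, the `resolve` function, and the order in which the fallbacks are tried are assumptions for illustration, not the provider's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Artifact:
    type: str
    name: str
    version: str


def resolve(expected_type: str, expected_name: str,
            supplied: Iterable[Artifact],
            default: Optional[Artifact] = None,
            prior: Optional[Artifact] = None) -> Artifact:
    """Find the expected artifact among those supplied by the trigger."""
    for artifact in supplied:
        if artifact.type == expected_type and artifact.name == expected_name:
            return artifact
    if default is not None:   # constant configured by the user
        return default
    if prior is not None:     # artifact used by the previous execution
        return prior
    raise LookupError(f"missing artifact {expected_type}/{expected_name}")
```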

How do we deploy manifests stored in git (or GCS or S3 or …)

What about updating docker images & config maps?

expected artifacts
2 docker images
configuration map files in github
Version your config maps to avoid issues with rollbacks
Resolves dependency graph

Implementation Roadmap:

Deploy and annotate a Kubernetes manifest supplied as JSON
Cache new Kubernetes resources and surface them in the UI
Expand manifests using supplied artifacts
Deploy manifests stored in Git, GCS, etc…
Deploy multiple manifests at once
Jsonnet

Questions

Q: Will this be backwards compatible?
A: No, the current (old) version does not comply with the kubernetes client spec
Q: How should secrets be managed?
A: Use a separate service for secret storage, have it as a dependency

Halyard Deep Dive

LARS WANDER, GOOGLE

Abstract: Spinnaker is a difficult system to run and operate; it takes many organizations weeks to get a PoC running, and whole teams to keep Spinnaker operational. We wanted to reduce this operational complexity both in onboarding and maintenance, and came up with “Halyard” to do so. Lars led the Halyard implementation, and will go into the weeds of what it does and how it works.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Halyard documentation is auto-generated
Halyard works off of a CLI and daemon, with plans to have a UI component
Halyard Validation
… validates gce scopes
there’s support for version, so it’ll change syntax on newer Spinnaker versions

Deploying Spinnaker with Spinnaker:

first deploys a bootstrap (smallest bit of spinnaker), clouddriver, orca, redis
then create custom pipelines for each
There’s some special cases
… Orca, you need to disable it
… Rosco, you need to wait for draining
(not currently) but hal should be able to flush redis

Hal does log collection
configuration monitoring (Spinnaker monitoring sidecar)
hal config metric-store collect
(coming soon) easy scaling of Spinnaker Services
i.e. clouddriver readonly, polling, mutating

Questions?

What’s it look like to migrate old managed to new hal?

No way to do it yet, but there should be docs to get to parity.
Make sure to point it to existing resources

How can you run hal from a CI system?

hal backup create: this takes the existing config and puts it in a tarball
Lars will write up a doc on this.

Does hal work for aws?

no, not unless someone wants to write it

Can hal be used only for config generation?

yes, there’s docs to do it

I saw there’s some mac stuff on Friday?

yep, it’s coming soon.

Is there a command to deploy just a single service?

yes, there’s some docs on how to do it. Hal isn’t smart enough yet because of some weird settings

Spinnaker Release Process

JAKE KIEFER, STEVEN KIM, GOOGLE

Abstract: Spinnaker microservices used to be released on their own cadences without cross-service compatibility validation for any set of service versions. Jake led the design and implementation of the new release process used to drive the Spinnaker OSS versioned releases, which are distinct sets of service versions validated through integration tests (citest), and installed via Halyard. Steven will talk about motivations for the release process, and Jake will go into the technical details of the release process design and how it solved problems existing previously.

Presentation and/or video:

Collaborative notes from audience:

Old Release Process:

ad hoc updates due to ad hoc releases often pulling unwanted changes
no way to identify and track microservice versions from one update to the next
bug reports required reporting several microservice versions
no broadcast notifications of new features or fixes — changes communicated informally to bug reporters
rollbacks required lots of manual intervention
no insurance that deployed microservice versions work together
no two deployments looked the same
no easy rollback (have to go back and read commits to figure out what to roll back, and how far)

What we want in a release process:

way to identify/version components that work together
validation prior to artifact publication
release documentation
encapsulated fixes
easy rollback/upgrade
easy configuration and management

New release process:

Build, publish, artifacts & accounting/metadata — New OSS release process
configuration, deploy, rollback with Halyard

Bill of materials (BOM):

contains metadata about the microservices included in a release
version
artifact location
commit
creation date
assigned an overarching “top-level” version == Spinnaker version
provides all context necessary for Halyard to configure and manage a Spinnaker deployment
can reconstruct source from a BOM
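
As a way to picture the bill of materials described above, the sketch below models the listed fields and shows how two BOMs could be compared to see which services a rollback or upgrade touches. The class and field names are hypothetical and do not reflect the actual BOM file format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceEntry:
    version: str
    artifact_location: str
    commit: str
    created: str


@dataclass(frozen=True)
class BillOfMaterials:
    version: str                       # top-level Spinnaker version
    services: Dict[str, ServiceEntry]  # microservice name -> metadata


def changed_services(old: BillOfMaterials, new: BillOfMaterials) -> List[str]:
    """Names of services whose metadata differs between two releases."""
    names = set(old.services) | set(new.services)
    return sorted(n for n in names if old.services.get(n) != new.services.get(n))
```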

Release branch:

upstream git branch created on each major/minor release for each microservice
can cherry-pick or backport fixes from master into release branches to address issues
patch fixes added to release branches are released as patch releases
isolates fixes to releases from rapid development on master
patch releases contain fixes only

Changelog (Snippet)

Requires special format of commit messages

Questions:

What’s the fail rate? How many point releases don’t make it thru integration tests.

We are still ad hoc on making releases. It’s hard to say. Out of the 7 nightly builds last week, 3 passed.

What do the integration tests look like?

We’re deploying to GCP, EC2, Azure, and Kubernetes. All using Halyard. All of the integration tests are open source in the repo.

Can you share thoughts around the types of testing you do before moving to production?

We have several types of integration tests. Generic smoke tests — create a load balancer, create a server group, test pipelines.

How satisfied are you with the current state of integration tests?

We don’t test Deck. It’s an opportunity for other folks in the community. There are extensive unit tests and mocks. But those tests won’t guarantee they will work on every provider.

Kayenta: Automated Canary Analysis from Google and Netflix

MICHAEL GRAFF, NETFLIX; MATT DUFTLER, GOOGLE

Abstract: Netflix and Google teams have been working together to build Spinnaker’s new general purpose canary capability. We will shortly be releasing a new Spinnaker microservice, Kayenta, that will be capable of integrating with a host of metric stores (e.g. Atlas, Stackdriver, Prometheus, OpsMx) and allow for pluggable canary judges. Netflix’s internal Automated Canary Analysis (ACA) service served as the inspiration for this new service, and the ACA canary judge is configured by default. In this talk we will cover the overall architecture of Kayenta, and explore some use cases end-to-end.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

You’ll need to setup a canary config that’s owned by Kayenta
There’s swagger and entry points to Kayenta

How does it work?

fetch the baseline metrics
transform to common format (GCS, s3, in memory, …)
its not in front50, the goal was to move fast
Merge the metrics
perform the analysis
store the analytics
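
The merge and analysis steps above can be illustrated with a toy judge. To be clear, this is an assumption-laden sketch and not the ACA judge or Kayenta's API: the `judge` function, the relative-tolerance rule, and the 0 to 100 score are invented here purely to show the shape of the data flow.

```python
from statistics import mean
from typing import Dict, List


def judge(baseline: Dict[str, List[float]],
          canary: Dict[str, List[float]],
          tolerance: float = 0.1) -> dict:
    """Compare canary and baseline samples metric by metric.

    A metric passes when the canary mean is within `tolerance` (relative)
    of the baseline mean. The score is the percentage of passing metrics.
    """
    results = {}
    for name in sorted(set(baseline) & set(canary)):  # merge on shared metric names
        base_mean, canary_mean = mean(baseline[name]), mean(canary[name])
        results[name] = abs(canary_mean - base_mean) <= tolerance * abs(base_mean)
    score = 100.0 * sum(results.values()) / len(results) if results else 0.0
    return {"metrics": results, "score": score}
```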

First class spinnaker integration:

new canary stage defined in orca
the goal is to get both Google and Netflix to use it internally, then release to public
Full UI + API support
(WIP) drilling down into the results

Demo:

there’s a new tab in the app view “Canary”
there’s a UI and there’s a JSON editor for the canary config
You can do multiple canary runs.
Finding the base group and canary group requires expressions, but there should be helper functions soon

Is there a separate process to do cleanup?

There’s a separation of ACA and canary stages
ACA — relies on existing resources i.e. deploy first, then look
Canary stage — create the baseline and canary groups
with the retro stage, you can run it on any metrics found from the store, potentially for fine-tuning
(WIP) different type of metrics, error metrics, …

Questions?

When is it ready?

hopefully soon, after it’s been tested internally at Netflix
There are going to be more advanced features coming soon

What does Kayenta mean?

there’s a coal mine in AZ, and some other towns. But there’s nothing in UrbanDictionary
This will be used 100% in Netflix

How do you plan to migrate Netflix canary to Kayenta?

There are some challenges with the way we generate the metrics, but we plan on having them run simultaneously. They’re going to A/B the canary at Netflix.

How do we make it deploy a baseline and canary group?

Kayenta will be able to synthesize those stages and send it to orca to do the work.

Spinnaker at Under Armour

KEVIN CHUNG, STEPHEN SCHMIDT, UNDER ARMOUR

Abstract: One of the key benefits of Spinnaker is how it fit in with allowing us to automate the creation of development environments that mirror our staging/production environments for our teams (we combine terraform, kops and Spinnaker to have a self-service approach for our dev teams where they can spin up a new k8s cluster and see our stack deployed out in about an hour as well as reset/refresh an environment automatically by way of Spinnaker).

Presentation and/or video:

Collaborative notes from audience:

No audience notes were captured from this talk

Extending Spinnaker

CAMERON FIEBER, NETFLIX

Abstract: This talk addresses how to manage the packaging and deployment of custom plugins and what extension points exist in the various services within Spinnaker.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Fleet Management with Dominator

RICHARD GOOCH, SYMANTEC

Abstract: This talk will describe the design of a robust, reliable and efficient architecture which can scale to the very largest fleet of machines (physical or virtual). The design target is that a single system and administrator can manage the content of at least 10,000 systems with negligible performance impact on the managed systems, fast response to global changes and nearly atomic changes on those systems. The software system that implements this architecture is called the Dominator.

Presentation and/or video:

Code is here: https://github.com/Symantec/Dominator

Collaborative notes from audience:

Problem Statement:

Managing a large fleet of stateful services that are not deployed frequently

Guiding Principles:

Immutable Infrastructure, no logging in to configure/fix
Golden images
Fast, robust transitions, keep your risk window (the time while a change is being deployed) small and rare
Puppet could take up to 20 minutes to apply changes at times, making the risk window large

Components:

Subject (sub) — Server that is a member of the fleet
Image — Image containing kernel, OS, cloud foundation stack, and configuration, application(s)
Filesystem Tree Representation
Filters file (regex of files to not touch e.g. fstab, /tmp)
Triggers (JSON file, contains filepaths and service(s) to restart when those filepaths change)
Image Server — Stores images
Machine Database (MDB) — Source of truth for which machines should be using which images
Dominator — Central server that enforces all subs

Architecture:

Dominator checks MDB for desired state, checks actual state from subs, forcing convergence by applying changes from Image Server
Same system is used to ‘birth’ new machines
File Generators are used by the MDB
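
The convergence step described above, together with the filters and triggers from the component list, can be sketched as two small functions. This is an illustration only: the real Dominator is written in Go, and the function names, the path-to-hash dictionaries, and the trigger layout (`"paths"` and `"service"` keys) are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def plan_changes(desired: Dict[str, str], actual: Dict[str, str],
                 filters: List[str]) -> Tuple[List[str], List[str]]:
    """Return (paths to write, paths to delete) to make a sub match its image.

    `desired` and `actual` map file path -> content hash; `filters` are
    regexes for paths that must never be touched.
    """
    skip = [re.compile(pattern) for pattern in filters]

    def filtered(path: str) -> bool:
        return any(regex.match(path) for regex in skip)

    to_write = sorted(p for p, h in desired.items()
                      if not filtered(p) and actual.get(p) != h)
    to_delete = sorted(p for p in actual
                       if not filtered(p) and p not in desired)
    return to_write, to_delete


def services_to_restart(changed_paths: List[str], triggers: List[dict]) -> List[str]:
    """Services whose trigger patterns match at least one changed path."""
    services = set()
    for trigger in triggers:
        patterns = [re.compile(p) for p in trigger["paths"]]
        if any(regex.match(path) for regex in patterns for path in changed_paths):
            services.add(trigger["service"])
    return sorted(services)
```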

Performance:

Limit network usage to 10% capacity
Limit disk throughput to 2%
1000 machines per second
Bottleneck tends to be Image Server
Scanning a system to determine if it needs changes takes longer, ~17 minutes on HDD, ~2 minutes on SSD

Safety Features:

Intrusion Detection
Because you know the desired current filesystem state, and all previous desired states, and you’re constantly scanning for changes, if you see a state that was never desired you can use that as a lightweight tripwire.
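
The tripwire idea above reduces to a set-membership check. A minimal sketch, assuming file states are represented as content hashes and that the history of desired hashes per path is available (both the representation and the function name are hypothetical):

```python
from typing import Dict, List, Set


def unexpected_files(observed: Dict[str, str],
                     desired_history: Dict[str, Set[str]]) -> List[str]:
    """Paths whose observed hash was never a desired state for that path."""
    return sorted(path for path, digest in observed.items()
                  if digest not in desired_history.get(path, set()))
```

A file still at an older desired hash is merely out of date, while a hash that was never desired is the suspicious case. Paths excluded by the filters file would need to be left out of `observed` before calling this.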

Computed Files:

Some files do need to be different between machines:
Network configuration files
Hypervisor configuration files
Machine certificates
e.g. /etc/hosts, /etc/passwd, /etc/resolv.conf, /etc/ssl/CA.pem
These cases can be solved with computed files:
You can write a simple server to generate the files on demand
Data can be sourced from a local file (or files, such as a directory with one per machine)
Template files can be used to generate data from MDB information
Arbitrary algorithmic generation of data (e.g. SSL certs)
There is an easy-to-use Go library that takes care of the ugly details, you just provide a simple generator
GoLang templates

Imageserver Content Distribution Network:

Uploads to local region replica (or master if it’s local)
Master does hash collision detection
Replicates (diffs only) to local and remote replicas / master

Upcoming features:

High performance content builder
Spinnaker “deploy” stage
Differences from stripped down Puppet/Chef
Safety
Faster transitions
Won’t run out of disk
Very low chance of half completed updates
Aggressively ‘corrects’ any changes made by someone logging into the server

Organizational Usage at Symantec

One team is responsible for managing base images and pipelines to build off of those

Spinnaker Security Deep Dive

TRAVIS TOMSU, GOOGLE

Abstract: Security is a cross-cutting concern across the entire Spinnaker application architecture. Using the Spring Security framework, this talk details how a user can plug-and-play various authentication and authorization mechanisms, and how a developer can easily hook new features into the security model.

Presentation and/or video:

Collaborative notes from audience:

FIAT is the security server for Spinnaker
Spinnaker’s security model is unrestricted by default
Whitelist if any restrictions are specified — specified in clouddriver config file
User must have at least 1 role specified in the restriction
FIAT re-resolves users every 10 minutes
Supported platforms
… Google Groups
… Github teams
… LDAP
… File-based implementation
FIAT is backed by Redis
Service Accounts (Robot Users)
… Used for automated pipeline templates
FIAT client library is baked into Gate, Orca, Clouddriver, and Front50
… These microservices all run into a FIAT intercept before the HTTP call gets to the business logic
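
The access rule in these notes (unrestricted by default, otherwise the user needs at least one of the listed roles) can be sketched as below. This is an illustration of the rule only, not FIAT's API; the function name and arguments are hypothetical.

```python
def is_authorized(user_roles, required_roles):
    """Apply the rule from the notes to one resource.

    No restriction configured means everyone is allowed. Otherwise the
    user must hold at least one of the roles named in the restriction.
    """
    if not required_roles:
        return True
    return bool(set(user_roles) & set(required_roles))


print(is_authorized(["dev"], []))               # no restriction configured
print(is_authorized(["dev"], ["ops", "dev"]))   # one matching role is enough
print(is_authorized(["qa"], ["ops", "dev"]))    # no matching role
```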

Questions:

What are the future plans for Auth Z?

Travis: I wanted to implement the minimum amount needed to secure an application. I’m waiting for feedback from the community to drive the future roadmap, e.g. per-pipeline or per-stage permissions. I don’t have any plans to go beyond application-level permissions until I hear more from the community.

What are the restrictions on account ACL?

With read permission you can see that they all exist
With write permission, you can mutate them

Any plans to add admin functionality?

Added placeholder for admin for now. Haven’t implemented.

Metrics and Monitoring

ERIC WISEBLATT, GOOGLE

Abstract: This talk provides an overview of how Spinnaker deployments themselves are instrumented and monitored. It will discuss a developer perspective for adding additional instrumentation into the Spinnaker codebase, adding support for a custom monitoring system, and consuming the metrics for operational monitoring.

Presentation and/or video:

Collaborative notes from audience:

Spinnaker monitoring overview:

Each micro service has a “Spectator”, which can be found at https://github.com/Netflix/spectator

Spectator Vocab:

Meter: A device that collects measurements, e.g. gauge, counter, timer, distribution summary. Meters have unique Ids.
Id: A named collection of tag = value bindings
Measurement: A meter value at a given time.
Registry: The collection of known Meters. The main abstract interface, acts as a factory.
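
A toy model of this vocabulary may help. Spectator itself is a Java library; the Python classes below only mirror the terms (Id, Measurement, meter, Registry) and are not its API.

```python
import time
from collections import namedtuple

# An Id is a name plus a set of tag=value bindings.
Id = namedtuple("Id", ["name", "tags"])
# A Measurement is a meter value at a given time.
Measurement = namedtuple("Measurement", ["id", "timestamp", "value"])


def make_id(name, **tags):
    return Id(name, frozenset(tags.items()))


class Counter:
    """One kind of meter: counts how often something happened."""

    def __init__(self, meter_id):
        self.id = meter_id
        self.count = 0

    def increment(self, amount=1):
        self.count += amount

    def measure(self):
        return [Measurement(self.id, time.time(), self.count)]


class Registry:
    """The collection of known meters; also acts as their factory."""

    def __init__(self):
        self._meters = {}

    def counter(self, name, **tags):
        meter_id = make_id(name, **tags)
        if meter_id not in self._meters:
            self._meters[meter_id] = Counter(meter_id)
        return self._meters[meter_id]

    def measurements(self):
        return [m for meter in self._meters.values() for m in meter.measure()]


registry = Registry()
registry.counter("bakesRequested", status="ok").increment()
registry.counter("bakesRequested", status="ok").increment(2)
```

Asking the registry twice for the same name and tags returns the same meter, which is why Ids must be unique.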

Instrumenting: Spinnaker Strategy:

Spinnaker services contain an HTTP endpoint for polling
Allows users to use off the shelf builds with their own monitoring system
Runtime uses default “in memory” metrics
Spinnaker-monitoring daemon mediates services and external monitoring systems
Can be extended for new systems
Existing support for Datadog, Prometheus, Stackdriver

Daemon Vocabulary:

Metric: Named list of measurements of a “kind”
Counter, Gauge, Timer
Measurement: Tagged, timestamped values
In practice, these form a time-series for each set of tag bindings within the backing monitoring storage system.

Special case handling:

Spectator timers in Spinnaker have two measurements
statistic="count" denotes number of timings
statistic="totalTime" denotes nanos elapsed
Spinnaker-monitoring daemon transforms timers
Removes “statistic” tag entirely
Creates two new counters (post-fixed with double ‘_’)
*__count
*__totalTime
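
The timer transformation above can be sketched like this. The measurement layout (a dict with tags, timestamp and value) is an assumption for the example, not the daemon's actual data model.

```python
def transform_timer(name, measurements):
    """Split timer measurements into '<name>__count' and '<name>__totalTime'.

    Each measurement is assumed to look like:
    {"tags": {...}, "timestamp": int, "value": number}
    """
    result = {}
    for m in measurements:
        tags = dict(m["tags"])
        statistic = tags.pop("statistic")   # the tag is removed entirely
        metric_name = f"{name}__{statistic}"
        result.setdefault(metric_name, []).append(
            {"tags": tags, "timestamp": m["timestamp"], "value": m["value"]}
        )
    return result


measurements = [
    {"tags": {"statistic": "count", "controller": "X"}, "timestamp": 1, "value": 4},
    {"tags": {"statistic": "totalTime", "controller": "X"}, "timestamp": 1, "value": 2000},
]
out = transform_timer("controller.invocations", measurements)
```

With both counters available, a monitoring system can derive an average duration by dividing the change in totalTime by the change in count over an interval.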

Integration: Daemon:

Python runtime can be invoked different ways
As monitor (with or without webserver)
As webserver (default port 8008)
As CLI
Typical use case is to run as monitoring daemon
Embedded webserver provides some tools

Integration: Daemon Configuration

A registry directory contains one or more yml files
/opt/spinnaker-monitoring/registry
clouddriver.yml
orca.yml
…
Each yml file contains the endpoint to poll
metrics_url: http://localhost:7002/spectator/metrics
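
As a rough sketch of how such a registry directory could be read, the snippet below collects the metrics_url from every yml file. It is not the daemon's own loader: the function name load_endpoints is hypothetical, and the line-based parsing is a simplification that only understands a top-level "metrics_url: ..." line (a real implementation would use a YAML parser).

```python
import os


def load_endpoints(registry_dir):
    """Map service name (file stem) to the metrics_url in each .yml file."""
    endpoints = {}
    for filename in sorted(os.listdir(registry_dir)):
        if not filename.endswith(".yml"):
            continue
        with open(os.path.join(registry_dir, filename)) as handle:
            for line in handle:
                # Split on the first colon only, so the URL stays intact.
                key, _, value = line.partition(":")
                if key.strip() == "metrics_url":
                    endpoints[filename[:-len(".yml")]] = value.strip()
    return endpoints
```

Calling load_endpoints("/opt/spinnaker-monitoring/registry") on a directory laid out as above would yield one entry per service, keyed by file name.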

Operating Spinnaker at Netflix

Batteries Not Included
Need to find solutions for:
Logging (ELK, Splunk, SumoLogic)
Dashboarding (Prometheus)
Alerting (Prometheus)
At Netflix, we use ELK + Atlas

We have dashboards, but we don’t look at them. We want alerts to tell us when something’s happened.

Quick Tip:

Make sure your dashboards can be programmatically generated.
Example, have dashboard automagically pick up all the clouddriver nodes you’ve launched
Have logs inject searchable keys like PID, et al.
Same with alerts!
(link)Our Logstash Config

Rate Limiting:

see https://www.spinnaker.io/guides/runbooks/api-rate-limiting/
Orca “queue:trafficShaping:*”
Gate “rateLimit:*”

Useful Metrics Cheat Sheet at Netflix that we look at:

All Services

controller.invocations (by controller, method and status code)
hystrix.rollingCountFallbackSuccess

Gate

rateLimit.throttling

Orca

task.invocations (by execution type and application)
task.invocations.duration (by bucket: 5m, 15m, 30m)
queue.pushed.messages
queue.acknowledged.messages

Clouddriver

executionCount (by instance)
cache.drift (by agent and region)
google.api (by resource and failure)
google.safeRetry (# of retries)

Front50

google.storage.invocation

Rosco

bakesRequested
bakesCompleted

Managed Pipeline Templates

KATRIEL TRAUM, WAZE

Abstract: In this talk we’ll discuss how the brand new pipeline templates were put to use in production at Waze. We’ll discuss the benefits of pipeline templates and patterns we found useful, both in code reuse for pipelines, and using pipeline templates to implement multi-cloud/provider infrastructure as code

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

About Waze:

80 million monthly active users (users who drive 5 days a week)
100+ micro services
500+ pipelines
Multiple cloud providers: GCP and AWS (Waze was acquired by Google 4 years ago)

Problem Statement of Spinnaker Pipelines

Hundreds of Spinnaker clusters and pipelines (lots of unique but similar pipelines)
No Code reuse
Hard to maintain
Hard to automate

Managed Pipeline Templates:

source controlled in git
Code review and auditing of ‘Infrastructure as Code’
Provides a Paved Road for developers, significantly eases adoption

Architecture:

Uses Jinja
Pipeline Template + Configuration X = Runnable pipeline
Pipeline template contains stage(s)
These stages use Jinja templating to interpolate variables into the stage options.
Pipeline Configuration
Points to a specific Pipeline Template
Defines variables used by the Pipeline Template’s Stage(s)
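
The "template + configuration = runnable pipeline" idea can be sketched as below. A small regex substitution stands in for Jinja here, and the field names (id, stages, config, variables) are illustrative only; the real schema is defined in the pipeline templates spec.

```python
import re


def render(value, variables):
    """Recursively replace '{{ name }}' placeholders in strings, lists, dicts."""
    if isinstance(value, str):
        return re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda match: str(variables[match.group(1)]),
            value,
        )
    if isinstance(value, list):
        return [render(item, variables) for item in value]
    if isinstance(value, dict):
        return {key: render(item, variables) for key, item in value.items()}
    return value


# Hypothetical template shared by many services.
template = {
    "id": "deploy-template",
    "stages": [
        {"id": "bake", "type": "bake", "config": {"package": "{{ package }}"}},
        {"id": "deploy", "type": "deploy", "config": {"region": "{{ region }}"}},
    ],
}
# Hypothetical per-service configuration pointing at that template.
configuration = {
    "template": "deploy-template",
    "variables": {"package": "myapp", "region": "us-east-1"},
}

pipeline = render(template, configuration["variables"])
```

Because the template itself is never modified, publishing a new version of it changes every pipeline that is rendered from it.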

Publishing templates using Thin Spinnaker CLI

https://github.com/spinnaker/roer
roer -v pipeline-template publish template.yml

Video Demo:

3 pipelines using the same template
Template is updated
Template change is published, all 3 pipelines are updated with new stage
you can “inject” stages before another stage. Spinnaker will manage the order.
Templating the ‘when’ conditional for a ‘destroy server group’ stage using a pipeline template variable
Configure Template UI in Deck shows the new template variable, updating that value adds/removes the conditional stage from the Runnable Pipeline

Design Patterns:

Versioning
Have a master v1 template
New changes will be on v2, tested
After it works, roll it out to older templates
Shim Template
Utilize the template inheritance to swap the entire fleet from one version to the next for rolling out a template change
Per-Service config
Template as a starting point
Inject, Replace or Delete stages
Override defaults where necessary

New service onboarding:

copy/paste configuration
create the pipeline template
create the runnable

Automate Pushing of Templates:

user → git → jenkins (https://github.com/spinnaker/roer#pipeline-template Spinnaker Thin CLI) → in production spinnaker

Used to update OS across all apps

Supported with Orca execution engine v2 (alpha for v3)

Questions:

Do you store the template with the microservice or alone?

Pipeline templates live in a separate repository

Can you talk about how to roll out to the entire fleet?

Do application dev run the pipeline?

they run the pipeline, but they don’t manage the pipeline
Links to blog posts and tools:
Sample Templates: https://github.com/spinnaker/roer/tree/master/examples
Pipeline templates have 100% coverage with stages in spinnaker

Where is the converter?

look at roer for the converter
Orca does the validation
running dry run on roer will validate it

Why YAML?

https://github.com/spinnaker/dcd-spec/blob/master/PIPELINE_TEMPLATES.md#q-why-yaml
Ping Rob (@rz on spinnakerteam Slack) if there are issues with pipeline templates

Canary Analysis at Netflix

CHRIS SANDEN, GREG BURRELL, NETFLIX

Abstract: In this talk we’ll discuss how Netflix approaches canary releases. Towards this end, we’ll discuss how we have been able to automate the process. In addition, we will talk about how our canary release process has evolved and discuss some of the lessons learned along the way.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

We do 1,200 judgements every day
What is canary analysis? A deployment strategy in which a decision is made at each checkpoint. It is not a replacement for testing; it adds complementary value and assesses risk in the environment
We don’t have a simulation environment, we use canary analysis to fill the gap
Observability: measure the behavior of the system; you need to be able to separate out your metrics
Humans are prone to confirmation bias, and decisions can differ from one human to the next
repeatable and reproducible

Assessing risk:

gather metrics, data validation, data cleaning, metric comparison, compute score
nonparametric statistics (Mann-Whitney U test): for each metric, is it similar or dissimilar?
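
For readers unfamiliar with the test, here is a small pure-Python sketch of a Mann-Whitney U comparison between baseline and canary samples of one metric. It is not Netflix's implementation: the normal approximation without tie or continuity correction, the 0.05 threshold, and the function names are all assumptions for the example.

```python
import math


def mann_whitney_u(baseline, canary):
    """Return (U, two-sided p-value) using the normal approximation.

    Ties get the average rank. No tie or continuity correction is applied,
    so treat the p-value as approximate.
    """
    combined = sorted((v, i) for i, v in enumerate(list(baseline) + list(canary)))
    ranks = [0.0] * len(combined)
    start = 0
    while start < len(combined):
        end = start
        while end + 1 < len(combined) and combined[end + 1][0] == combined[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1   # ranks are 1-based
        for k in range(start, end + 1):
            ranks[combined[k][1]] = average_rank
        start = end + 1
    n1, n2 = len(baseline), len(canary)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u, p_value


def classify(baseline, canary, alpha=0.05):
    """Label one metric as 'similar' or 'dissimilar'."""
    _, p_value = mann_whitney_u(baseline, canary)
    return "dissimilar" if p_value < alpha else "similar"


baseline = [10, 11, 12, 13, 14, 15, 16, 17]
canary = [20, 21, 22, 23, 24, 25, 26, 27]
print(classify(baseline, canary))
```

The test only compares ranks, so it makes no assumption about the metric being normally distributed, which is why a nonparametric test suits latency and error data.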

Spinnaker:

in canary report we list metrics and differences between them
pipelines that have failed

Developer canary pipelines are also available

Fast Property Canary

Canary is used also for CDN and ChAP

Best Practices:

pipeline configuration — scale up your canaries after N minutes
Too large a canary cluster: 6 instances vs. a canary of 3 instances and 25% of traffic; a better solution would be to put it behind an ELB or proxy to steer smaller bits of traffic
There might not be enough traffic to make a determination; run the canary during the day and compare it against similar daytime traffic
use execution windows and cron triggers
warming up canaries and instances, busy filling up caches and GC

Metrics selection: Garbage In, Garbage Out:

Shy away from sparse metrics and noisy metrics, which can give you false positives
Avoid metrics that always differ between canary and baseline
metric transformations, such as cpu/rps
metric weighting, error (50%), system (25%), latency (25%)
it’s important to explain the scoring
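
The weighting above can be illustrated with a short sketch. The 50/25/25 weights come from the notes; how each group's result is expressed (here, the fraction of its metrics judged similar) and the function names are assumptions.

```python
WEIGHTS = {"error": 0.50, "system": 0.25, "latency": 0.25}


def canary_score(group_pass_ratio, weights=WEIGHTS):
    """Combine per-group pass ratios (0.0 to 1.0) into a 0-100 score."""
    return 100 * sum(weights[g] * group_pass_ratio[g] for g in weights)


def explain(group_pass_ratio, weights=WEIGHTS):
    """One line per group, so the final score can be explained."""
    return [
        f"{g}: {group_pass_ratio[g]:.0%} similar x weight {weights[g]:.0%}"
        for g in weights
    ]


ratios = {"error": 1.0, "system": 0.5, "latency": 1.0}
print(canary_score(ratios))
for line in explain(ratios):
    print(line)
```

Printing the per-group breakdown next to the score is one simple way to make the result explainable to the team that owns the service.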

validate your configuration A/A testing

Advancing Spinnaker at Lookout

BRANDON LEACH, LOOKOUT

Abstract: Lookout recently replaced multiple home grown software delivery tools with Spinnaker. Now that we have a standard software delivery platform, we are extending Spinnaker to help reduce years of accumulated tech debt and optimize infrastructure costs. This talk will cover the work we have done so far with service development standardization (one-click) and containerization (AWS ECS).

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Tried to do CD many times before (six examples of failed attempts)

What are we doing differently?

Treat software delivery as an internal product
How: Internal customer requirements. Define MVP and success metrics. POC, Beta, GA. Internal customer champions.
Use off-the-shelf supported tooling
How: Do not reinvent the wheel. Leverage industry standard.
Iterative approach

Migrated to Spinnaker with help of Armory

Early Metrics
Steps to deploy to production: From 25 to 1–3.
Engineer time to deploy: From 60min to <1min
Automation time to deploy: From 60min to 31min
Engineer time to patch: From 5 days to ZERO
Onboarding time: From 3+ days to 30min.

Migrated 33%, targeting 100% by end of Q1.

Tech Debt

mostly in the form of legacy, monolithic services
Replacing with microservices

Microservice Snowflakes

Need to make sure microservices are standards-based and avoid unique instances

We’re building “1-click project”

Devs can create a new project in one click.
repos in github
jenkins jobs
artifact repo
Slayer was built by Lookout to automate and publish SLAs
foremast creates a default pipeline
<10 minutes to deploy a “Hello world” app with full SLA support.

Infrastructure costs

switch some services to use containers instead of VMs
we evaluated K8s and ECS
Spinnaker has stable K8s support but we’d have to manage our own K8s on ECS
ECS is what we’re already on, but no driver for Spinnaker
we decided to build a Spinnaker driver for ECS instead of creating a K8s team. We plan to release it in Q4 2017

Questions:

How long have you been working on ECS integration?

3 months ago we did a tech spec
didn’t start writing code until 3 weeks ago
2 engineers working on it full-time

What were your biggest concerns with going with ECS?

Main concern was around Amazon joining CNCF. What’s going to happen with ECS? We had convos with Amazon about that and they eased our fears on that.

What do you think was the biggest driver to get teams off of their old tooling on to Spinnaker and what’s the biggest challenge with getting new teams on Spinnaker?

One of our challenges was to write a self-service onboarding guide.

Are you planning to open source “1-click”?

Yes, soon!

Spinnaker at Capital One

SRI CHADALAVADA, CAPITAL ONE

Abstract: Capital One believes that winners in banking will have the capabilities of a world-class software company. Since our business is about building awesome digital experiences, we must be great at a) acquiring/retaining talent, b) building software / technology at pace and with quality, and lastly c) baking the right engineering practices to be the leaders in digital banking. Driving excellence in what we do required us to substantiate 3 core tenets, a) #SlayTheMonolith — MicroServices, b) #NoFearReleases — Automate Everything, and c) #YouBuildItYouOwnIt — Agile + DevOps. The focus of today’s conversation is to show you what we’ve done as an organization to drive #NoFearReleases, i.e., mature our delivery process via CI/CD pipelines, automated testing and release validation to improve speed, quality, predictability, and auditability of our software delivery.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

Top 10 in banking, checking, credit cards
The leading digital bank
“CapitalOne has realized that the world around us has been changing at a rapid pace.”
“We have established a strategy to power an ecosystem of innovative products via investments in great talent, technology and tools.”
Key transformations:
Full stack agile engineering teams: “YBYO” (you build, you own)… application developers used to just pump out features and toss them over the wall to ops teams.
Monoliths to microservices
Now look at Open source first
Move to public cloud (AWS)… used to have huge data centers; from 9 to 5; shrinking more in next few years
Matured DevOps practices
#NoFearReleases via Certified Continuous Delivery pipelines
“We’ve invested in a pipeline manifesto to drive uniformity and standardization across the organizational deployments.”
Accelerate ability to deliver tested, valuable changes to customers easily
Facilitate quality, compliance and well-engineered to deliver long-term value
Drive innovation and creativity
Power of community over individuals
Early stage in delivery pipeline as code

CapitalOne previously had 20 pipelines with names… and 700 w/o names

pipelines had varying level of maturity

Before choosing Spinnaker, looked at / used multiple options.

Started cloud journey 2 years back.

Cloud OneView (COVE)

Internal application powered by Spinnaker
customized to meet CapOne needs
single pane view into all environments an application passes through in its delivery lifecycle
deployment pipelines that run tests, spin up & down infrastructure, monitor rollouts
Took Spinnaker services and created custom modules on top
Peer Review, enforced code & pipelines
Role based auth

CapOne is on container journey; deploying to ECS

Spinnaker at Scopely

AVRAM LYON, SCOPELY

Abstract: Scopely is using Spinnaker to help diverse game studios implement clean and reliable deployment patterns. In the past 9 months, we’ve moved three game backends and dozens of our platform services to deploy with Spinnaker in production. The move has helped us simplify the stage-QA-canary-scale-switch deployments of game teams and radically reduce human error and complexity. We will share how we built pipelines to canonicalize our standard practices for teams, as well as the deploy strategies for multi-tier game servers and for streaming data applications, and the pitfalls and benefits we’ve encountered.

Presentation and/or video:

Spinnaker at Scopely, Avram Lyon, Spinnaker Summit, September 12, 2017 (docs.google.com)

Collaborative notes from audience:

No audience notes were captured from this talk

Panel Discussions

Spinnaker’s Kubernetes Integration

Abstract: Panelists include engineers from Google, Under Armour, Cisco, Target, and Schibsted. Moderated by Isaac from Armory.

Presentation and/or video:

No presentation or video (yet) from this talk

Collaborative notes from audience:

How is your company using Spinnaker and K8s?

Julian(?) at Cisco: We’re moving from hardware solutions to software solutions. We’re skipping VMs, going straight to Kubernetes. We go to a customer site and we want a standard way to deploy to their infrastructure. That’s how we want to use Spinnaker. We’ve already decided to use K8s but we’re not there in production yet. We’re doing the things we have to do to prepare for it. Our customers have their solutions mainly on-prem. We deliver an appliance with Spinnaker, Ansible, etc. We also put a layer of UI on top of Spinnaker. We don’t want to expose Spinnaker to our customers — just a button to deploy.
Edwin at Target: We use K8s at a very large scale — 430 namespaces per cluster. 8 large clusters.
We do not use Spinnaker for that.
Too many issues with current Clouddriver implementation of K8s.
Devs got kubectl access to their clusters and they are used to it. No value add yet, they want to control all aspects of their YAML deployments. We’re eagerly awaiting Kubernetes v2.
50% of Target.com is now deployed on K8s. “We’re finding it a joy to use.” Not dealing w/ cloud-provider specific problems
Gard at Schibsted: We started with K8s about a year ago but it’s only over this summer that we started using it in prod. We are using the native Spinnaker / K8s deployments and we’re using replica sets. We have training wheels around K8s integration. We have a webhook stage that maps config maps to K8s. We have a smooth solution for K8s. We’ve onboarded 3 teams now in production.
Kevin at Under Armour: in prod for 1 1/2 yrs for Spinnaker. We onboarded onto Spinnaker with Kenzan. We’ve been working in prod for a year with K8s.
Have e-commerce website but also many other properties managed w/ K8s clusters… mobile shopping app…
It’s been about ease of use to roll things out w/ Spinnaker
When we made move to microservices it was too much to run on a dev’s workstation. Spinnaker spins up environments for our dev teams on K8s.
Each team is provisioned a K8s cluster; spun up by Spinnaker
Kevin was originally the only DevOps engineer at UA; now 4 people but “we couldn’t do things w/o automated tooling.”
Steven at Google: I’m on the panel as a provider, not end user.
Comment about K8s space: You always hear people say “our use of K8s is unique”
Based on how quickly k8s has grown; lots of tooling around k8s
Spinnaker’s approach stands on its own — not necessarily in a good way. Looking to have an increasingly clear view on approach
Lots of time passed before Spinnaker started working w/ k8s; now there’s a lot of ingrained opinions; hard to walk them all back.
#1 request we get is better support for manifest-based deploys.
For example: w/ Drone, a k8s customer updates manifests manually; this is not Spinnaker’s approach; Spinnaker has automated approach
Cisco: We have a tool that converts k8s manifests to Spinnaker pipelines… Support for stateful and replica sets is very important.
Helping Lars upstream this; hope to have it out soon
Lars at Google: The manifests do add value. Natively supporting the manifests means that new features are supported out of the box.
Looking for ways of finding support of manifests while getting benefits of automation
Isaac from Armory: How does Spinnaker play into moving from legacy infrastructure to Kubernetes? What’s the best way to convince your company to go with Spinnaker even though there might be some holes in the Kubernetes integration.
Gard: The most powerful part of spinnaker is the orchestration engine. You need to orchestrate your deployments and track artifacts to production. Start with pipeline orchestration then you’ll be in a good position when v2 provider is released.
Isaac: Why not just use Jenkins pipelines?
Gard: Spinnaker orchestration is totally different than other CI tools.
CI tools like GoCD or Bamboo don’t have this kind of orchestration support
Edwin: For organizations with complex or large number of deployment patterns:
You get the benefits of Spinnaker’s opinions around deployments; no more wild west from various Chef etc based methods
If you start your K8s journey on Spinnaker it’s great b/c Spinnaker gives you the opinions.
Cisco: Can you do canary, etc. w/ Ansible & other scripts? Sure you can, but then you have to maintain that; lots of bugs to deal with
Isaac: Presented w/ 2 ways of deploying — deployment object in k8s vs. deployment strategy in Spinnaker. Which one is best to use when?
Edwin: That’s one of the reasons we’re not using Spinnaker for K8s deployments.
Under Armour: When we first engaged with Kenzan, we started with a more Spinnaker approach. Now we have a utility that takes a manifest file and translate into a Spinnaker pipeline. It runs on a cron. Our devs are very comfortable writing a Kubernetes manifest file.
Gard: Deployment objects move the responsibility from Spinnaker to Kubernetes itself. If you’re using Deployment Objects, everything gets strange.
Steven: I also think that the deployment … We have a good convo with the K8s team. The deployment object is the K8s team’s opinion. We don’t necessarily agree with that. They’ve asked us for a long time to support the deployment object.
Lars: the kubectl deployment object is great for text-based edits. If you make edits in a text editor and want to roll out to your cluster, kubectl apply is great for that.
But when you have the luxury of doing more advanced deployment strategies, it’s not great for that; this is where Spinnaker is valuable.
Isaac: How does a container go from developer’s machine to production?
Kevin at Under Armour: Developer will re-tag container . We have different environments. We tag each container and run it thru automated tests in each environment, then re-tag it.
Gard: All of our deployable artifacts need to be unique versions. Each container has a unique tag and we track it thru the pipeline — to enable fast rollbacks. If you have a previous version of a RS, you can scale up a rollback that was previously running in production.
Edwin: Drone builds every container on a PR dry run. On push to master we build a container with tag of git commit hash. We do a GCR re-tag on the image. On a tagged release we’ll take the tag, place on container, re-write K8s manifest, promote release.
Lars: Spinnaker encourages that you generate a new tag on deploy to ensure immutability. A better way might be to treat tags as release streams. When you deploy, it’s very easy to take the digest with a hash. You have safe rollbacks.
Cisco: I like the find image feature in Spinnaker so you don’t have to go back to Artifactory.
Isaac: What’s the best way for the community to get involved in Kubernetes?
Steven: Lars started a Kubernetes sig — in the form of a Slack channel and google group. You can get access to design docs and proposals. That’s a good start and the best way to ask your questions. We want everyone’s voice and opinions to be heard.
Tomas: Can you talk more about your docker registry strategies and how you deal with multi-region?
Lars: with revamped artifact support, say you have kubernetes clusters distributed across the world and registries in various regions and you want to make sure you don’t send an image across regions if you don’t have to…for every registry, an image with same tag, same digest, this replica set is the one that I use for this region. That way you ensure the image you pull is geographically located near the registry.
Why don’t we setup a proper forum? Slack doesn’t save old messages and Google doesn’t index.
How do you share the allocation for CPU, memory for images running on QA, Stage, prod?
Lars: In v2, it’ll be up to you to define these in the manifest file.
Andy: Does the cluster view UI still make sense in a container world?
Steven: yes, it maps pretty well.
Edwin: I think the representation is great. Green/red is great for pods. It makes sense.
Lars: One small improvement would be to break down pods into containers.
Gard: With AMIs you have meta data about the bake.
Andrew: What has been your experience deploying K8s clusters? Specifically credentials to the clusters that you want to deploy to.
Lars: I don’t deploy K8s clusters with Spinnaker
Edwin: Our k8s clusters are deployed by Spinnaker but we don’t deploy things into them yet. I think the base keys are in Vault. Calls out to Vault to get keys.
Under Armour: Halyard has been very helpful in this regard. We use KOPS + Halyard to plugin as KOPS generates credentials. We can plug them in dynamically.

Operating Spinnaker

Abstract: Panelists include engineers from Netflix, Capital One, Under Armour, Armory, Target, and Schibsted. Moderated by Andy from Netflix and Steven from Google.

Presentation and/or video:

Here’s the codebase that we use (mocked interaction with Metatron) for doing in-memory secret decryption in Spinnaker services. It should be fairly straightforward to follow along; it supports searching for secrets in both the classpath and the filesystem: https://github.com/robzienert/spinnakerext-encryptedconfig

Collaborative notes from audience:

Panelists:
Rob Fletcher: Delivery engineering team. 3 yrs at Netflix. Wrote Orca.
Sri: Technology Director at Capital One. We started using Spinnaker 2 yrs ago.
Andrew: We help other companies operate Spinnaker.
Rob Z: Delivery Eng team at Netflix. Working on Spinnaker for a year. Passionate about automation.
Doug Campbell: Eng at Gogo. Been using Spinnaker since early 2016.
Kevin Chung: Under Armour
Gard: Schibsted
Edwin (Target): We’ve used 4 different cloud providers with Spinnaker
Andy: What’s the biggest learning in operating Spinnaker?
Edwin: don’t lose pipeline history.
Rob Z: Have monitoring setup for the metrics that Spinnaker emits and log aggregation for your services.
Sri: Resiliency. Esp when you rehydrate an AMI. We’re upgrading to Orca 3 now.
Andrew: the biggest problem we see is AWS rate limiting. You can have Spinnaker limit the calls it emits, but the side effect is that you may have delays in your pipelines.
Doug: understanding what each component is doing. The services were black boxes for a while.
Kevin: Spinnaker was greenfield. we were moving towards microservices, instead of migrating an existing app to Spinnaker. Working with dev teams to understand overall software development cycle.
How do you secure Spinnaker?
Gard: We use Okta as login service. Spinnaker supports SAML in combination with VPN. We use FIAT to control access to various applications.
Edwin: We currently employ SAML for authn/z. We use Google Groups for another spinnaker instance.
“We trusted engineers for a long time”
Doug: We lock down front-end with SAML and x.509 for all API access.
We lock down all pipeline edits thru the UI. All edits have to go thru our tooling, foremast.
Rob Z: We use SAML for authentication and lock down apps based on Google Groups. We have x509 certs.
Andrew: We use Github OAuth. Other users we see have a wide range of security measures. Some don’t need anything except OAuth. That means all users can see all other apps. We also see companies that have multiple Spinnakers.
What’s the cost to run Spinnaker?
Doug: We have a team of 8. 2–3 of us are 50% of time dedicated to Spinnaker. Getting it setup was hard and took the whole team, but once it’s setup it’s not that much maintenance.
Kevin: We run Spinnaker within Kubernetes. It’s just another app that we’re deploying out.
Edwin: we have 1000 application team users on Spinnaker.
120 apps w/ a number of components
“Our first Spinnaker installation didn’t get touched for 8 months and it kept going and going”
6 engineers dedicated to upstream OSS bugfixes on OpenStack
Spent 6–8 months learning how to configure Spinnaker and do production HA deployments
Rob Fletcher: We have 12 people. One person is always on call. We monitor Slack, can get paged. Most of the time it’s not something we dread — being on call. The big difference between Netflix and other companies is that we’re not pinned to OSS, we’re on the bleeding edge. We deploy Spinnaker about 10 times per day.
Kevin: team of 4 maintaining Spinnaker. Our pipelines are pretty simple. We work with platform engineering team to understand what types of apps are being deployed. We translate K8s manifest files into Spinnaker pipelines.
Doug: We spend more time educating engineers on how to use Spinnaker than operating it. Common questions we get are: “Spinnaker is slow” or “Spinnaker is broken” (even tho their app is usually broken).
Steven: Can you talk about the scale of your Spinnaker setup.
Rob Z: We have four installations of Spinnaker. Single server for PCI compliant account, some staging ones, and production account. Prod instance is about 75 servers. c3 or c4.xl.
Doug: We have one Spinnaker cluster. Each microservice has 2–3 instances. Most of those are t2.mediums or c4.larges. We run read-only instances behind a load balancer and multiple clouddrivers for multiple accounts. Ideally we have a clouddriver for each account.
Kevin: We run in K8s. t2.mediums.
Gard: We run with separate redis for all services. FIAT has one write cluster and one read cluster. clouddriver has one cluster that does indexing — 6 large instances. 3 instances for read/write cluster.
Edwin: “you should all be running Spinnaker on Kubernetes, seriously.”
We give clouddriver 3 gigs of RAM. 2 replica sets. clouddriver is the only thing I’ve seen with performance issues.
Andrew: 3 m4.xls with 300 pipelines running per day sits at 15% utilization; more than enough to keep you happy.
Sri: 3 node cluster in prod, 3 AZs. We use Elasticache for redis.
How did you install Spinnaker and how do you patch and upgrade it?
Edwin: configuring spinnaker is the hardest part. we override defaults provided by upstream. we did k8s deployments with manifest files. we were using the upstream disk commands to create the containers. then we uploaded and reuse in k8s cluster. halyard made our lives easier. Our GCP spinnaker runs pretty close to head.
Doug: we stick with whatever’s in the latest version.
Sri: Using own Ansible-based scripting, cadence is to upgrade about quarterly
Kevin: Kenzan helped us onboard initially. Now we use Halyard.
Gard: Using mix of local forks and upstream packages, looking forward to using Halyard.
Andrew: We use TF to install Spinnaker and we use Spinnaker to deploy Spinnaker.
Steven: We felt like Spinnaker 1.0 included most of what everyone needed. What else is missing?
Heph: some documentation is missing after 1.0 launch.
Edwin: Would love a breakout to talk about documenting how to use Halyard.
AWS support for Halyard
Doug: I want to encourage everyone to open source your internal tooling around Spinnaker.
On Spinnaker Slack: #dev for dev stuff on slack, but #general for kicking around ideas.

Extending Spinnaker

Abstract: Panelists include engineers from Google, Armory, Netflix, and Oracle. Moderated by Dave Stenglein from Kenzan.

Presentation and/or video:

Collaborative notes from audience:

Target has run into problems with authentication, trying to parse out roles from X.509 certificates. It is extendable, but would really like it to take less code to do it. Also run into problems with how Spinnaker (on packer in rosco) parses .deb and .rpm names.
Would be nice to extend to allow package managers to do a “push” to Spinnaker when packages have been updated, rather than polling.
Could cloud integrations be split out into separate modules for each cloud provider, to avoid cross-package muddying? (We also need to split out clouddriver soon, but don’t know what the final solution is yet.)
Re: Azure: Core team not likely to put cycles into Azure, because they’re not using Azure
Suggest pulling Azure stubs out to clean the code, but argues for the component-ing of Clouddriver so people can add/maintain providers without interfering with core code.

Automated Canary Analysis

Abstract: Panelists include engineers from OpsMX, Armory, Netflix, and Google. Moderated by Vinay Shah from Netflix.

Presentation and/or video:

Collaborative notes from audience:

Why is it not as common as simple monitoring?

Isaac: lots of folks still struggle with observability.
must also create additional orchestration to actually do it.
Chris: willingness to take that risk to do the canary. Canary does introduce some risk. As a company, you have to be willing to take that little bit of risk.
Gopal: easy to onboard, trustable.

Questions:

If you have a complete suite of unit tests and integration tests, what value does canary testing add? What do you expect to find?

Canarying doesn’t replace testing. There’s no substitute for production traffic. Canaries provide safer rollouts overall.
Make it consumable and easy to use.
Isaac: Building an integration test to reduce risk is very expensive. Canarying is an additional tool.

How long should a canary live? Are there hard requirements, how do you draw the line on that?

Chris: It depends. Look at the service and ask “when do I have enough traffic to sufficiently reduce risk?” We recommend a minimum of 30 minutes.
Some critical patches were canary’d for 96 hours.
Gopal: We’ve found about 2 hours gives us confidence in results, but 30 minutes minimum.
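
Chris's question about having enough traffic can be turned into a rough calculation. The sketch below is illustrative only: the function name, the `min_requests` target, and the traffic figures in the usage note are assumptions, and only the 30-minute floor comes from the panel.

```python
import math

MIN_CANARY_MINUTES = 30  # floor recommended on the panel


def canary_minutes(requests_per_minute: float, min_requests: int) -> int:
    """Return how many minutes a canary should run to see min_requests.

    min_requests is a hypothetical per-service target, not a panel figure.
    The result never drops below the 30-minute floor.
    """
    if requests_per_minute <= 0:
        raise ValueError("a canary needs traffic before it can be analyzed")
    needed = math.ceil(min_requests / requests_per_minute)
    return max(MIN_CANARY_MINUTES, needed)
```

Under these assumptions, a busy service at 1,000 requests per minute that needs 5,000 requests still runs for the 30-minute floor, while a quiet service at 10 requests per minute that needs 1,200 requests runs for 120 minutes.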

Given the longer times, can you talk about how this falls into a model where every git commit triggers a new deployment? We’re getting pressure from above to get code in front of customers as soon as possible.

Jarrod: we have varying canary times. Our minimum canary time is 5 minutes for a really small change. You can also configure days or hours. It depends on how risky the change is.
Gopal: to speed it up, you can increase your sampling rate.
Chris: There’s a challenge. Canary analysis should be used to assess risk. You have to weigh the pros and cons of how long you want to run the canary for vs. how fast you want to deploy. If you have endpoints that are only used a few times a week, running a very short canary is not likely to be representative of the impact.
Isaac: canarying is just one tool to reduce risk. You could consider shorter canaries combined with a more gradual roll-out plan that allows you to roll back more quickly.

How automated can a canary analysis system be?

Jarrod: Google’s system has a set of defaults. Users can add more and customize. Getting users to trust the results is a continuous battle.
Isaac: For the most part, you can trust it. Back to observability: if you can catch it, you will. Your application changes over time.
Gopal: There’s good, bad, and unsure. Create a range for the “unsure” to start building trust.

What is your strategy when it fails?

Gopal: when a canary fails, we give the human the ability to do a manual judgement.
Jarrod: It’s the caller’s responsibility to figure out what to do with the result. It’s either pass, fail, or unsure.
Chris: Users can go into a manual judgement stage if it’s unsure. In a failure scenario, it will fail the canary and clean up.
Matt: It’s all configurable, and you should set it up to work the way your team needs it to work, but take advantage of the mindshare to get ideas and don’t reinvent the wheel.
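
The pass, fail, or unsure outcome the panelists describe can be sketched as a three-way classification. This is a minimal illustration, not how any panelist's system works: the 0 to 100 score, the threshold values, and the function names are all assumptions.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNSURE = "unsure"


def classify(score: float, fail_below: float = 50.0, pass_at: float = 80.0) -> Verdict:
    """Map a hypothetical 0-100 canary score to a verdict.

    Scores between fail_below and pass_at land in the "unsure" range.
    Both thresholds are illustrative assumptions.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= pass_at:
        return Verdict.PASS
    if score < fail_below:
        return Verdict.FAIL
    return Verdict.UNSURE


def next_step(verdict: Verdict) -> str:
    """Return what the caller does with each verdict, following the panel notes."""
    return {
        Verdict.PASS: "continue rollout",
        Verdict.FAIL: "fail the canary and clean up",
        Verdict.UNSURE: "manual judgement",
    }[verdict]
```

Widening the gap between the two thresholds sends more results to a human, which is one way to build trust early on before narrowing the unsure range.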

Any thoughts on doing a canary on a load test?

Chris: We have a framework called Citrus that squeezes our load tests.

Q: Is it blue/green or red/black?

It’s red/black. Even if it’s blue/green.

Any thought around looking at previous canaries? (boiling frog problem)

Jarrod: We’ve thought about it
Isaac: Should have bounds configured as alerts, which would catch (eventually) continually-degrading canaries.
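
Isaac's point about bounds can be shown with a small simulation. In this sketch every release passes when compared only against the previous one, yet a fixed bound eventually trips. The latency values, the 5% tolerance, the 130 ms bound, and the function names are all assumptions made for illustration.

```python
from typing import List, Optional


def relative_pass(baseline: float, canary: float, tolerance: float = 0.05) -> bool:
    """Canary passes if it is at most `tolerance` worse than its own baseline."""
    return canary <= baseline * (1 + tolerance)


def first_breach(latencies: List[float], upper_bound: float) -> Optional[int]:
    """Return the index of the first release whose latency exceeds the fixed bound."""
    for index, latency in enumerate(latencies):
        if latency > upper_bound:
            return index
    return None


# Hypothetical history: each release is 4% slower than the one before it.
latencies = [100.0]
for _ in range(10):
    latencies.append(latencies[-1] * 1.04)

# Every release passes against its predecessor, so a relative check alone
# never notices the slow degradation.
all_relative_checks_pass = all(
    relative_pass(prev, cur) for prev, cur in zip(latencies, latencies[1:])
)

# A fixed bound (here 130 ms, an assumed alert threshold) eventually fires.
breach_index = first_breach(latencies, upper_bound=130.0)
```

The relative check and the fixed bound answer different questions, so running both covers the case where each individual canary looks fine but the service keeps getting slower.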

Are there any anti-patterns to look out for?

Isaac: there are certain applications that have unequally distributed load — one server will behave more erratically than another.
Chris: beginners will use too large of a canary deployment.
Jarrod: Another mistake is to throw too many metrics at the system. Garbage in, garbage out.

What happens when your baseline isn’t healthy?

Chris: We see it frequently. Sometimes you get bad servers. It’s a hard problem to solve because we assume our baseline is the gold standard. We’re working on it.

Where do you see the future of canary testing heading?

Gopal: “Ops as code”
Jarrod: Our goal is less user configuration by doing more automatically, and machine learning (system should just figure out when things have gone badly)
Less user configuration; instead look at what’s important to the user; what alerts have they set?
Chris: “Sentience” (ha!) Creating a community around OSS canary analysis and get feedback from them to figure out where to go
Isaac: Canary will fill in gaps where things aren’t covered by existing testing and practices
Matt: We were in a time where automated unit tests weren’t ubiquitous, I think we’re getting to a place where automated canary analysis will be ubiquitous.

Avram: Some of the most valuable data we want to have is not attributed to the existing fleet. We’re looking at the impact of downstream apps. How do you deal with the limited information from downstream systems?

Chris: we do sticky canaries where canaries are annotated and that data passes all the way downstream for analysis.
Isaac: this might come to building defensibility into your app. Assume your application will fail.

Gard: the next evolution of CD is feature flags. You still need to run a canary after a feature flag. Do you have any thoughts on this?

Jarrod: yes, you should do it. we have a separate feature service that does this and calls into our canary service.
Chris: we have the same idea with fast properties. You can canary feature flags.

Gard: Will feature flagging become part of Spinnaker?

Andy: That’s fast properties for us. That’s Netflix specific though.

Can canaries be implemented to deploy Spinnaker itself?

Duftler: yes.

Graphs of Survey Responses:

Written by DROdio