Move fast and don’t let the cloud break things: CoreOS + Backplane

Laura Franzese
Backplane.io Blog · Jul 27, 2017

During a live 15-minute demo that started with zero infrastructure, CoreOS engineers deployed a web app across two different cloud providers and three separate geographical regions using Tectonic, CoreOS’s self-driving Kubernetes product.

They tore one region down completely to simulate an outage. The application continued to hum along, with Backplane intelligently redistributing load to the regions and clouds that remained healthy and operational.

Is the cloud your single point of failure? Tectonic will fix it

Multi-cloud infrastructure makes apps far more resilient to the outages that inevitably occur. Without it, you’re locked into your vendor and have no control over your app’s uptime.

With it, your app runs at global scale, across multiple failure domains. Teams can focus on optimizing performance and stop worrying about surviving the next incident.

Never worry about the cloud as a single point of failure again

In February 2017, Amazon S3 went down for four hours at an estimated cost to its customers of $310M in aggregate. Slack, Nest, Adobe, Salesforce.com — all had products down for the entire four-hour period.

Most apps and services are built with an “all-eggs-in-one-basket” design. They’re built for a single cloud platform run from a single location like AWS US-East.

Engineers often don’t have the resources to redeploy to other cloud platforms or regions. Configuring and managing stand-by installations to fail over to is costly and time-consuming, requiring hours of manual labor if and when downtime occurs. When a region goes down, as it did for S3, entire products go down and stay down until the cloud comes back online.

Eliminating the Single Point of Failure for Tectonic Deployments

CoreOS ensures engineers can run container infrastructure anywhere compute, network, and storage exist. This means users can run Tectonic on Amazon Web Services (AWS), Microsoft Azure, and bare metal.

The whole system is built on Terraform, which creates the underlying infrastructure on each of these platform providers. Because Tectonic is multi-platform, you’re not locked into any single cloud platform, so no one cloud becomes a single point of failure.

Deploying Tectonic clusters across AWS US-West (Oregon), AWS EU (Frankfurt) and Azure US-East (Virginia) with Terraform (via CoreOS demo)

With a few configuration files, teams can deploy Tectonic clusters across multiple cloud platforms and regions using the same basic command. Each individual Tectonic cluster is highly available and resilient to availability-zone failure. By deploying across multiple regions and cloud platforms, you expand the failure domain you can survive from a single availability zone to an entire region, an entire cloud, or even multiple clouds. No matter where a given infrastructure incident occurs, the application can continue to be served.
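The "same command, different config file" workflow described above can be sketched roughly as follows. The cluster names, file paths, and flags here are illustrative assumptions, not CoreOS's actual Tectonic installer interface:

```python
# Hypothetical sketch of the multi-cloud deploy loop: one base Terraform
# command, parameterized only by a per-cluster variable file. Names and
# paths are illustrative, not the real Tectonic installer CLI.

CLUSTERS = [
    {"name": "aws-us-west",   "platform": "aws",   "region": "us-west-2"},
    {"name": "aws-eu",        "platform": "aws",   "region": "eu-central-1"},
    {"name": "azure-us-east", "platform": "azure", "region": "eastus"},
]

def deploy_command(cluster):
    """Build the same basic command for every cluster; only the
    per-cluster config (and hence platform/region) differs."""
    return ["terraform", "apply",
            f"-var-file=configs/{cluster['name']}.tfvars",
            f"-state={cluster['name']}.tfstate"]

commands = [deploy_command(c) for c in CLUSTERS]
for cmd in commands:
    print(" ".join(cmd))
```

The point of the sketch is that nothing but the variable file changes between clouds, which is what keeps a three-cluster, two-cloud deployment manageable for a small team.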

Rather than administer each cloud instance individually, everything is managed through a centralized web interface and via the command line. Load balancers, auto-scaling deployments, volumes, configuration, secrets, jobs — they’re all managed through federated services rather than at the individual cluster level.

Launching a web app across multiple clouds via the federation control plane (via CoreOS demo)

After deploying a federated control plane and a distributed key-value store (etcd) across all of your cloud platforms, you can run kubectl commands just as you did on a single platform. Now each command executes across all of your platforms, around the whole world.
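Conceptually, a federated control plane turns one kubectl-style command into a fan-out across every registered cluster. A minimal sketch of that idea, with an entirely made-up in-memory API (this is not the real Kubernetes federation API):

```python
# Toy model of federation: one apply() call is replayed against every
# registered cluster. Classes and methods here are illustrative only.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.objects = {}          # object name -> spec, per cluster

    def apply(self, name, spec):
        # Stand-in for a call to this cluster's API server.
        self.objects[name] = spec

class FederationControlPlane:
    def __init__(self, clusters):
        self.clusters = clusters

    def apply(self, name, spec):
        # One command, executed against all clusters around the world.
        for c in self.clusters:
            c.apply(name, spec)

clusters = [Cluster("aws-us-west"), Cluster("aws-eu"), Cluster("azure-us-east")]
fed = FederationControlPlane(clusters)
fed.apply("webapp", {"image": "example/webapp:1.0", "replicas": 3})
```

After the single `fed.apply(...)` call, every cluster holds the same `webapp` object, which is the property the demo relies on.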

Survive Catastrophic Failure by Balancing Load in a Multi-Cloud Environment

Typical load balancing solutions don’t work intelligently in a multi-cloud environment. AWS Elastic Load Balancer won’t talk to Azure Load Balancer and vice versa. Load balancers are vendor specific, tailored to the cloud infrastructure that they run on.

Load balancers themselves can become single points of failure. For ELB, you need to instantiate your load balancer across multiple availability zones. But even then you’re reliant on a single cloud platform.

Just as your app should be multi-cloud, your load balancer should be multi-cloud as well.

Using Backplane to load balance across AWS and Azure

The CoreOS team demonstrated application load balancing with Backplane because it enables complex load-balancing rules without being tied to any specific cloud provider.

Backplane set up to balance the load across three clusters running across two cloud platforms and three geographical regions (via CoreOS demo)

Backplane allows you to balance load to multiple cloud platforms as if you were using a single cloud. Each instance of a web app simply appears as a backend in Backplane.

When a backend comes online, it starts receiving load. With Backplane running silently in the background, if an AWS or Azure cluster goes down and takes a group of backends with it, those backends drop out of rotation and load is automatically and intelligently redistributed to the healthy, operational backends that remain.
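The redistribution behavior described above can be sketched in a few lines. The data model is an assumption for illustration, not Backplane's actual API; the idea is simply that load only ever flows to backends that are currently healthy:

```python
# Minimal sketch of health-based redistribution: unhealthy backends drop
# out of rotation and the remaining share grows. Illustrative only.

def healthy_backends(backends):
    return [b for b in backends if b["healthy"]]

def distribute(load, backends):
    """Split load evenly across whichever backends are currently healthy."""
    live = healthy_backends(backends)
    share = load / len(live)
    return {b["name"]: share for b in live}

backends = [
    {"name": "aws-us-west-1", "healthy": True},
    {"name": "aws-eu-1",      "healthy": True},
    {"name": "azure-east-1",  "healthy": True},
]

before = distribute(900, backends)   # 300 per backend across three clusters

# An AWS cluster goes down and takes its backend with it:
backends[0]["healthy"] = False
after = distribute(900, backends)    # 450 each across the two survivors
```

Real balancers weight by capacity and latency rather than splitting evenly, but the failover property is the same: a dead backend's share moves to the survivors without operator intervention.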

Failover demo

At CoreOS Fest, CoreOS engineers Quentin Machu and Alex Somesan got two AWS clusters and one Azure cluster up and running with Tectonic. After generating a benchmarking workload against the site, they tore down one of the AWS clusters completely, and Backplane automatically failed over to the remaining AWS and Azure clusters, intelligently balancing the load across the six remaining backends.

Even in the face of the total catastrophic failure of an AWS region, the Tectonic console remained healthy and the app continued to be available.

Optimize capacity for the best user experience

When your application is set up as multi-cloud and multi-region, it’s running at global scale. Your business can survive a major outage, and that frees you to optimize capacity.

Within Backplane, backends can carry labels that correspond to the cloud and region on which the backend is running. These labels let you create rules to optimize your capacity and deliver your users the very best experience.

Our web app running on nine backends — three replicas running on each of the three regions, with us-west-1 taken offline — with each backend labelled by cloud and region (via CoreOS demo)

Global scale means that you can use rules and labels to do the following:

  • Improve response time by routing requests based on location: With a few simple rules in Backplane, you can route requests coming from the United States to your US clusters and requests coming from Europe to your EU clusters.
  • Flatten out traffic spikes by scheduling capacity based on time of day: While usage spikes during the morning in Europe, your clusters in the US might sit dormant while people are sleeping. Backplane can balance that load across your EU and US clusters to flatten the spike and minimize response time.
  • Migrate capacity based on demand: Demand might be vendor specific, for example when the DDoS attack on Dyn affected AWS users. Backplane can reroute your load through your other cloud platforms like Azure to ensure that your app is responsive to your users.
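The first rule above, geographic routing by label, can be sketched as a simple filter over labelled backends. The label keys, rule table, and `route` function are hypothetical illustrations of the idea, not Backplane's actual rule syntax:

```python
# Hypothetical sketch of label-based routing: each backend carries
# cloud/region labels, and a per-origin rule selects matching backends.

BACKENDS = [
    {"name": "us-1", "labels": {"cloud": "aws",   "region": "us-west"}},
    {"name": "us-2", "labels": {"cloud": "azure", "region": "us-east"}},
    {"name": "eu-1", "labels": {"cloud": "aws",   "region": "eu-central"}},
]

# One predicate per request origin; illustrative rule table.
RULES = {
    "US": lambda labels: labels["region"].startswith("us"),
    "EU": lambda labels: labels["region"].startswith("eu"),
}

def route(origin, backends=BACKENDS):
    """Return backends whose labels satisfy the rule for this origin;
    fall back to all backends if no rule matches."""
    rule = RULES.get(origin)
    matched = [b for b in backends if rule and rule(b["labels"])]
    return matched or backends

us_targets = [b["name"] for b in route("US")]   # backends in US regions
eu_targets = [b["name"] for b in route("EU")]   # backends in EU regions
```

The same mechanism covers the other two bullets: a time-of-day or vendor-outage rule is just a different predicate over the same labels.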

You have the flexibility to pick and choose among the best-of-breed offerings available and to completely eliminate downtime. Rather than letting your users’ experience be determined in part by the cloud platform you choose, you design the highest-quality experience for your users.
