Coal, Hyperscale Computing and Cloud Workload Automation

Zev Schonberg
The Spot to be in the Cloud
5 min read · Sep 11, 2019

Around 250 years ago, the Industrial Revolution ushered in an unprecedented era of productivity and growth. Society transitioned from painstaking, manual, hand-made craftsmanship of all goods to automated, machine-driven mass manufacturing of everything.

This incredible shift was fueled by the invention and enhancements of coal-powered steam engines.

Today, we are living in a technological revolution driven by hyperscale cloud computing, and while the changes I personally have seen in the last 30+ years are amazing, I believe this is just the beginning.

In 1865, English economist William Stanley Jevons was researching the future supply of coal and noted, “Whatever, therefore, conduces to increase the efficiency of coal, and to diminish the cost of its use, directly tends to augment the value of the steam-engine, and to enlarge the field of its operations.”

Simply put, the more efficient people become at consuming a resource, the more of that resource is consumed.

Coming back to 2019, today’s coal is cloud computing. The more cost-effective it is AND the easier it is to use, the more it will be used.

But just how easy is cloud computing? Let’s take a look at some areas in the AWS cloud that can be challenging.

The Finest Blend of EC2 On-Demand, Reserved and Spot Instances… in One Cluster

Many companies are seeking a platform that automatically places workloads on the most suitable EC2 instances, both in terms of pricing models AND best matching CPU / Memory for the workload.

On the cost side of things, if you purchased Reserved Instances, you are locked in for either 1 or 3 years. If projects end sooner than projected, you end up with fully paid-for (or contractually committed), completely unutilized resources.

If you dream of using Spot Instances, which can slash EC2 pricing by up to 90%, ensuring availability is a real challenge (AKA a cluster@#$%^&*!) as AWS can pull the plug at any time with just a two-minute warning.
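
For context on how short that warning is: AWS publishes the pending interruption in the instance metadata service, and the workload has to notice it and drain itself in time. Here is a minimal sketch of such a watcher, assuming stdlib Python running on the instance itself (this is not part of any Spot product; the endpoint is the one documented by AWS):

    import time
    import urllib.error
    import urllib.request

    NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_notice():
        """Return the interruption notice body, or None if none is scheduled."""
        try:
            with urllib.request.urlopen(NOTICE_URL, timeout=1) as resp:
                return resp.read().decode()  # e.g. {"action": "terminate", "time": "..."}
        except urllib.error.URLError:
            return None  # 404 (no interruption scheduled) or metadata unreachable

    while True:
        notice = interruption_notice()
        if notice:
            print("Two-minute warning received, start draining:", notice)
            break
        time.sleep(5)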

So by and large, you end up paying On-Demand rates and crying at the end of each month when you get the bill.

Keep reading to find out how you can utilize all your RIs and run mission-critical workloads on Spot Instances.

Container Requirements Should Determine Infra Deployment and Workload Placement

Wouldn’t it be great if infra deployment and workload placement were driven by actual application or system requirements? Unfortunately, this is not the case.

Take, for example, Elastic Kubernetes Service (EKS) and other orchestration tools. Their autoscalers are typically driven by node metrics, which can be completely misleading and unrelated to the actual resource consumption of tasks or pods.

In Kubernetes, the master node is responsible for maintaining the desired state of the cluster. Even if the cluster has a surplus of resources overall, once a single Pod requires more CPU or RAM than any single node has available, the Pod won’t start and you’ll receive an error message such as:
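
    0/4 nodes are available: 4 Insufficient cpu.

(The node count above is just illustrative; the key part is that the scheduler’s FailedScheduling event reports every node as lacking the CPU or memory the Pod requested.)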

Similarly, in ECS, imagine your Auto Scaling Group has a policy to scale up any time average Task CPU goes over 70%.

If average Task CPU sits at 65%, there is no scale-up, even if the ECS scheduler has a new Task that requires, say, 40% CPU. This scenario results in the Task not being scheduled.
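
To make the gap concrete, here is a minimal sketch of that scenario with the same made-up numbers (illustrative arithmetic only, not ECS or CloudWatch code):

    # Utilization-based policy: scale up only when average CPU > 70%.
    SCALE_UP_THRESHOLD = 0.70
    node_utilization = [0.65, 0.65]   # two container instances at 65% CPU
    pending_task_cpu = 0.40           # new Task requests 40% of one instance

    avg = sum(node_utilization) / len(node_utilization)
    scale_up = avg > SCALE_UP_THRESHOLD            # False, since 0.65 < 0.70

    # Can the pending Task actually fit on any existing instance?
    fits = any(1.0 - u >= pending_task_cpu for u in node_utilization)

    print(f"scale up? {scale_up}, task placeable? {fits}")
    # -> scale up? False, task placeable? False  (the Task just sits there)

The policy only looks at average utilization, so it never sees the Task that cannot be placed.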

Clearly, Tasks or Pods not running because of this auto-scaling gap is a major issue that must be handled to ensure your workloads run when they should.

Surprise Ending — There is a Solution That Can Handle it All

Of course, I wasn’t going to leave you hanging with all these problems. In these final paragraphs, I’ll briefly outline how Spot allows you to easily launch clusters with Spot Instances, Reserved Instances and, if desired, On-Demand. I’ll also share how Spot can abstract away the infra challenges that are inherent in all container orchestration platforms.

Launch Spot, On-Demand and Reserved Instances in a Single Cluster

While AWS Spot Fleet is a great option for a nicely blended cluster (it allows you to define the percentage of On-Demand vs. Spot Instances in your cluster), if any Spot Instances are interrupted, Spot Fleet will not replace them with On-Demand Instances to ensure high availability.

Additionally, even if you have available Reserved Instances, Spot Fleet will not actively seek to launch instances of the same type, thereby missing out on additional savings and cost efficiency.

This is where Spot’s flagship product, Elastigroup, takes your cluster to the next level.

For example, with Spot’s Elastigroup you can:

  • Launch your cluster with 100% spot instances and if there is any interruption, Elastigroup will automatically launch other spot instance types or on-demand instances to ensure availability.
  • Fully utilize any pre-purchased Reserved Instances, as Elastigroup automatically launches matching instance types and will do so before spinning up Spot Instances (a rough sketch of this launch priority follows this list).
  • Increase cluster availability (beyond what Spot Fleet can do) by fulfilling Spot requests across all subnets in selected Availability Zone(s).
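
To make that launch priority concrete, here is a rough, purely illustrative sketch of the ordering described above (this is not Elastigroup’s actual API or algorithm):

    def choose_pricing_model(unused_reserved: int, spot_available: bool) -> str:
        """Pick the cheapest capacity that keeps the cluster available."""
        if unused_reserved > 0:
            return "reserved"    # already paid for, so always use RIs first
        if spot_available:
            return "spot"        # deepest discount while capacity exists
        return "on-demand"       # fallback that guarantees availability

    # e.g. two idle RIs left: the first two launches consume them,
    # the rest go to Spot (or On-Demand if Spot capacity disappears).
    for i in range(4):
        print(i, choose_pricing_model(unused_reserved=2 - i, spot_available=True))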

NOTE: While Spot Fleet and Spot both provide a percentage distribution of Spot Instances vs. On-Demand, it’s NOT at the level of the Task/Pod but only at the infrastructure level of the cluster. Once AWS adds lifecycle to the service definition, this capability will become feasible.

Say Goodbye to “No Nodes are available” with Tetris Scaling

As we mentioned above, the available container orchestration platforms don’t properly handle scaling based on Pod or Task requests, which often results in workloads not running as scheduled.

Here too, Spot provides a comprehensive scaling solution, AKA Tetris scaling, as part of Ocean by Spot, a “serverless container” deployment option for Kubernetes and ECS.

Ocean automatically detects when a Pod or Task has specific CPU or RAM requirements and will spin up the appropriate instance size or type.
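
As a rough illustration of that idea (made-up instance types and pod sizes, not Ocean’s actual implementation), the decision starts from the pending Pods’ requests rather than from node metrics:

    INSTANCE_TYPES = {               # (vCPU, GiB RAM) -- illustrative subset
        "m5.large":   (2, 8),
        "m5.xlarge":  (4, 16),
        "m5.2xlarge": (8, 32),
    }

    def smallest_fit(pending_pods):
        """Smallest instance type whose capacity covers all pending requests."""
        need_cpu = sum(cpu for cpu, _ in pending_pods)
        need_ram = sum(ram for _, ram in pending_pods)
        for name, (cpu, ram) in sorted(INSTANCE_TYPES.items(), key=lambda kv: kv[1]):
            if cpu >= need_cpu and ram >= need_ram:
                return name
        return None  # nothing big enough; split pods across several new nodes

    # Two pending pods: (1 vCPU, 6 GiB) and (2 vCPU, 4 GiB) -> "m5.xlarge"
    print(smallest_fit([(1, 6), (2, 4)]))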

Additionally, Ocean by Spot offers right-sizing (Vertical Pod Autoscaling, or VPA), which looks at actual Pod or Task resource utilization and, based on that, decides whether to increase or decrease allocated CPU and RAM.
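
A minimal sketch of the right-sizing idea (illustrative only, not Ocean’s actual VPA logic): compare what a Pod really uses over time with what it requests, and suggest a new request with a little headroom:

    def suggest_request(observed_usage, current_request, headroom=1.2):
        """Suggest a new resource request based on peak observed usage."""
        suggestion = round(max(observed_usage) * headroom)
        direction = "decrease" if suggestion < current_request else "increase"
        return suggestion, direction

    # A Pod requests 2000m CPU but never uses more than ~600m:
    cpu_millicores = [420, 510, 605, 580]
    print(suggest_request(cpu_millicores, current_request=2000))
    # -> (726, 'decrease')  i.e. shrink the request and free up cluster capacity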

These two features complement each other. As the Pods and Tasks themselves become more optimized, the Pod/Task-driven scaling can act more precisely for even greater cost savings and more efficient placement of workloads in your cluster.

Let’s Hyperscale Your Cloud

Circling back to the beginning of this blog post, Spot is providing cloud consumers with an easier and more affordable way to automate workloads on any of the popular cloud providers (AWS, GCE, Azure, etc.), which in turn is driving our new era of technological innovation and growth.

In closing, here is Jevons’ famous economic principle, slightly modified for our purposes: “Whatever, therefore, conduces to increase the efficiency of cloud-computing, and to diminish the cost of its use, directly tends to augment the value of the cloud, and to enlarge the field of its operations.”

Explore Spot today!

Originally published at https://spot.io on September 11, 2019.
