Dear DevOps: Why You Need Cloud Garbage Collection in 2017

I believe cloud cost management in the new world is the equivalent of program memory management in the old world, so it’s important to deploy cloud garbage collection mechanisms to ensure you are not leaking dollars ($$$$).
In the cloud world, programmers and DevOps engineers build applications from cloud resources through an API model. Because the cloud offers a pay-per-use model, resource usage takes on a new dimension: dollars ($$$$), rather than memory, become the primary attribute. As engineers in this new world, we are responsible for both programming and production.
One of the most common complaints I hear from Botmetric customers is that their engineers are not de-provisioning (or deallocating) cloud resources after use, which has a direct financial impact as companies adopt the cloud at a rapid pace. It’s a painful problem: as companies scale their usage across teams and business units, the cost leakage spirals out of control. So in this post, I want to share our learnings and experience in building cloud garbage collection management using Botmetric.
Cloud garbage collection essentially boils down to the following 5 key areas:
Unused Cloud Resources: Based on our analysis of data from hundreds of Botmetric customers, we have observed that 2% to 5% of customers’ cloud spend is wasted on unused cloud resources. These are the compute, storage, IPs, services and databases that are provisioned but not used.
Underused Cloud Resources: Our observation from Botmetric data is that 95 out of 100 cloud customers have provisioned more capacity than they need in terms of servers, storage and databases. For example, engineers provision Large or Extra Large virtual machines for Dev, QA or even production use cases, but use hardly 25% to 50% of the capacity.
Idle RI Capacity: Many companies plan their cloud spend around a budgeting cycle and purchase reserved capacity, i.e. reserved instances (RIs), from cloud vendors. Purchased RI capacity should be used efficiently, as each purchase is tied to a specific region/zone, operating system, instance family and size, etc. However, application workloads in the cloud world are dynamic, so reservations can end up idle when workloads no longer match what was purchased.
DEV & QA Workload Usage: In 95% of companies, development and QA resources are not used during off hours, such as 10 PM to 6 AM and weekends. During these hours, the resources sit mostly idle yet are still billed.
Instance Family Upgrades: Most cloud providers are expanding their virtual compute offerings by slicing and dicing instance families and sizes in different ways. Many customers who have been using the cloud for 2 or more years started with older instance families and continue to use them, despite cloud providers launching better-performing and slightly lower-cost alternatives.
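To make a couple of these areas concrete, here is a minimal Python sketch of how a team might flag unused and underused resources and estimate what idle Dev and QA hours cost per week. This is not how Botmetric implements it. The Resource record, the function names, and the thresholds (under 1% average utilization counts as unused, under 50% as underused) are all illustrative assumptions; only the 10 PM to 6 AM plus weekends window comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    environment: str        # e.g. "dev", "qa", or "prod"
    avg_utilization: float  # average utilization, from 0.0 to 1.0
    hourly_cost: float      # on-demand price in dollars per hour


def classify(resource, unused_below=0.01, underused_below=0.5):
    """Label a resource using illustrative (assumed) utilization thresholds."""
    if resource.avg_utilization < unused_below:
        return "unused"
    if resource.avg_utilization < underused_below:
        return "underused"
    return "ok"


def weekly_off_hours(start_hour=22, end_hour=6):
    """Off hours per week: weekday hours outside the working window plus full weekends."""
    weekday_off = (24 - start_hour) + end_hour  # 8 hours for 10 PM to 6 AM
    return weekday_off * 5 + 24 * 2


def weekly_idle_cost(resources):
    """Dollars billed per week for Dev/QA resources left running during off hours."""
    hours = weekly_off_hours()
    return sum(
        r.hourly_cost * hours
        for r in resources
        if r.environment in ("dev", "qa")
    )


if __name__ == "__main__":
    fleet = [
        Resource("build-agent", "dev", 0.30, 0.10),
        Resource("orphaned-db", "qa", 0.00, 0.25),
        Resource("web-frontend", "prod", 0.70, 0.40),
    ]
    for r in fleet:
        print(r.name, classify(r))
    print("off hours per week:", weekly_off_hours())
    print("weekly idle Dev/QA cost ($):", round(weekly_idle_cost(fleet), 2))
```

With the default window, the off hours add up to 88 of the 168 hours in a week, which is why scheduling Dev and QA resources to stop outside working hours is usually one of the first things worth trying.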
The detailed post can be read on LinkedIn Pulse or the Botmetric Blog.
