How to Manage the Costs of Cloud Computing
Hi my name is SEEK; it’s been 12 months since my last bill shock
This is part 5 of a series about the evolution of DevOps @ SEEK. In this post I’ll be talking about how to manage the costs of cloud computing. Read the previous post here.
When we dove head-first into AWS we were aware that managing costs was a thing that we needed to do. But we took what was, in hindsight, a highly optimistic and naïve approach to it. Rather than establishing controls over how to use the cloud, or doing more than a smattering of training, we brought in a cost dashboarding tool, suggested everyone use it to track costs and usage, and figured they would just “get it”.
They didn’t get it.
When we opened up AWS access for Product Delivery Streams to start spinning up development and testing environments in the cloud, they kept creating more environments than they needed. This was due in part to the environments breaking all the time because of their complexity, but also because many of the users were on a steep learning curve, so most of their time was spent learning how AWS actually worked. It didn’t help that teams were continually scaling up EC2 instances because the default sizes were too small, or that most instances were left running all night and over weekends because we hadn’t enforced automated start-up and shutdown from the beginning.
So when the invoices started coming in, delivering a very nasty surprise for the end-of-year budgets, we were faced with some difficult and slightly uncomfortable conversations about how we were using the cloud. It wasn’t that the problem was invisible; invoices way over budget do tend to capture people’s attention. The issue was that we weren’t making the costs visible while they were being incurred. We also weren’t policing usage to prevent people from hurting themselves, and we were lacking in training and communication on how to use the cloud properly.
But what really made matters worse was when we started breaking down the invoices to work out who was costing us the most money, we actually didn’t know! We had very little tagging or other metadata information that could tell us where the costs were being generated.
So we had to learn fast how to manage costs and compute usage, and how to build our cloud solutions in a way that would fit with our ever-changing and evolving delivery culture. The next few entries will focus on a number of things we have done and learned on this journey with AWS. This entry in particular will focus on Shared and Non-Shared Account Models and the various cost, architecture and security challenges that came with them. Note that this is what has and hasn’t worked for us; your experiences may be different.
The Shared Account Model
When we first started we envisioned that we would need, at most, about 10 AWS accounts to meet all our Product Delivery Stream and Team needs. Effectively they would cover:
1) Development and Testing of our cloned Production environment;
2) Sandboxing concept projects;
3) Staging before going to a Production Account;
4) Production itself;
5) Shared Services to manage common tools, DNS, Active Directory, Direct Connects; and
6) Billing to manage account consolidation and Reserved Instance purchasing and management.
We also reasoned that it would be fine if every Product Delivery Stream and Team shared these accounts. We’d use IAM roles to control usage, make use of subnets and VPC Peering to demarcate teams, and use resource tagging so we would know who was using what and could employ a chargeback model.
In hindsight there was a strong data centre oriented mindset at work here that led to these decisions.
The problems with the Shared Account model
Put simply, it became a serious pain to manage. If you don’t put cost and security controls in place before you let teams use shared accounts, retrofitting them later takes a lot of communication and planning effort, because by then you are dealing with numerous CFN stacks spinning up and down, complex IAM configurations and a proliferation of S3 buckets. Another big gotcha for us was creating large VPC CIDR ranges in these accounts from the outset; in hindsight we would have made them much, much smaller.
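To put some numbers on the CIDR gotcha, here is a minimal Python sketch. It assumes every VPC draws its range from a single private 10.0.0.0/8 block, which is a hypothetical address plan for illustration and not a description of SEEK's network. It shows how quickly large ranges use up the space, and how to spot overlapping ranges, which matters because AWS will not peer two VPCs whose CIDR ranges overlap.

```python
import ipaddress

# Hypothetical address plan: one private /8 shared by every VPC in the organisation.
ORG_SPACE = ipaddress.ip_network("10.0.0.0/8")


def vpcs_that_fit(prefix_len: int) -> int:
    """Number of non-overlapping VPCs of the given prefix length that fit in ORG_SPACE."""
    if prefix_len < ORG_SPACE.prefixlen:
        raise ValueError("VPC range is larger than the organisation's address space")
    return 2 ** (prefix_len - ORG_SPACE.prefixlen)


def overlapping_pairs(cidrs):
    """Return every pair of CIDR strings that overlap and so could not be peered."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]


if __name__ == "__main__":
    for prefix in (16, 20, 24):
        size = ipaddress.ip_network(f"10.0.0.0/{prefix}").num_addresses
        print(f"/{prefix}: {size} addresses each, room for {vpcs_that_fit(prefix)} VPCs")
    print(overlapping_pairs(["10.0.0.0/16", "10.0.128.0/20", "10.1.0.0/16"]))
```

Under that assumption, a /16 per VPC leaves room for only 256 VPCs in the whole block, while smaller ranges leave far more headroom for new accounts.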
Shared Accounts can restrict innovation
Shared Account models generally do not promote autonomy for teams. By their nature governance and control of resources become centralised and get managed top-down. These practices have been known in the past to limit innovation through making organisations more rigid and less fluid in delivery processes. Knowing how and where to apply governance is a bit of an art-form, and it is very dependent on your delivery culture, to ensure you avoid nasty surprises but don’t step on everyone’s toes at the same time.
Shared Account cost management
In a shared account model where multiple teams are delivering projects together, yet are budgeted under different cost centres and sanction IDs, knowing where to apportion costs when the bill comes in becomes a complex discipline. The most effective way to do this is by tagging all your cloud resources, baking this metadata into your infrastructure code and enforcing its usage with open source tools like Janitor Monkey or the various SaaS options now emerging on the market. Enforcement ensures that resources are automatically terminated if they do not meet the requirements of the tagging policy, and that wasted and unused resources get cleaned up. At SEEK we use these three mandatory tags, enforced with Janitor Monkey:
· Stream : The Product Delivery Team name or Operations Team
· Project : The name of the project that required the resource to be created
· Owner : A valid SEEK email address of the person to contact if there are issues
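As a rough illustration of what a check against this policy might look like, here is a minimal Python sketch. It is not Janitor Monkey's rule engine or configuration. The function name `tag_violations` is made up for the example, and the email pattern (including the `seek.com.au` domain) is an assumption standing in for whatever "valid" means in your organisation.

```python
import re

# The three mandatory tags listed above.
REQUIRED_TAGS = ("Stream", "Project", "Owner")

# Assumed pattern for a valid owner address; the domain is a guess for illustration.
OWNER_PATTERN = re.compile(r"^[^@\s]+@seek\.com\.au$", re.IGNORECASE)


def tag_violations(tags: dict) -> list:
    """Return a list of human-readable problems with a resource's tags."""
    problems = []
    for key in REQUIRED_TAGS:
        # A tag that is absent or blank counts as missing.
        if not tags.get(key, "").strip():
            problems.append(f"missing tag: {key}")
    owner = tags.get("Owner", "").strip()
    if owner and not OWNER_PATTERN.match(owner):
        problems.append("Owner is not a valid email address")
    return problems


if __name__ == "__main__":
    print(tag_violations({"Stream": "Search", "Project": "Reindex"}))
```

A resource with an empty result would pass the policy; anything else would be a candidate for a warning to the owner or, eventually, termination.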
But you can’t tag everything
Not every resource in AWS can be tagged; Direct Connect, network and bandwidth charges are examples. In a Shared Account model these accumulated costs still need to be paid for. A simple model would be to have a separate cost centre or code against which to apportion these costs on a monthly basis. A more complicated model would be to charge these costs back to the teams using the Shared accounts in proportion to their share of the tagged usage. At SEEK we employ the latter, as it is fairer across cost centres and encourages people to use resources more efficiently.
The concept is simple enough. Consider an account which has 4 teams using it. The total invoice for the account is $50K, of which $10K is untaggable costs, leaving $40K of tagged costs:
Team A charged $20K or 50% of the tagged costs
Team B charged $10K or 25% of the tagged costs
Team C charged $5K or 12.5% of the tagged costs
Team D charged $5K or 12.5% of the tagged costs
Therefore, when distributing the unallocated costs that amounted to $10K:
Team A is charged $5K (50%)
Team B is charged $2.5K (25%)
Team C is charged $1.25K (12.5%)
Team D is charged $1.25K (12.5%)
Thus the final total charged for each team is:
Team A = $25K
Team B = $12.5K
Team C = $6.25K
Team D = $6.25K
Using this model Team A will have a greater incentive to reduce their total charged costs next month in order to pay a smaller percentage of the unallocated costs.
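The worked example above can be sketched in a few lines of Python. The function name `apportion` and the team labels are just for illustration; the only logic is that each team pays its tagged costs plus a share of the untaggable costs in proportion to its share of the tagged total.

```python
def apportion(tagged_costs: dict, untagged_total: float) -> dict:
    """Charge each team its tagged costs plus a proportional share of untagged costs."""
    tagged_total = sum(tagged_costs.values())
    if tagged_total <= 0:
        raise ValueError("there are no tagged costs to apportion against")
    return {
        team: cost + untagged_total * cost / tagged_total
        for team, cost in tagged_costs.items()
    }


if __name__ == "__main__":
    # Figures from the example: $40K tagged, $10K untaggable.
    tagged = {"Team A": 20_000, "Team B": 10_000, "Team C": 5_000, "Team D": 5_000}
    for team, total in apportion(tagged, 10_000).items():
        print(f"{team}: ${total:,.2f}")
```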
And then there is the impact on the finance department
In the Data Centre world of IT infrastructure and operations you’d make the business case to buy a lot of hardware, get it approved, raise your POs and then send the invoices on to Finance. Finance would process it as a CAPEX expense and then depreciate it over time. Any costs incurred from running it (power, cooling etc.) would of course be treated as OPEX.
All standard stuff.
But it doesn’t work like that in the cloud. You don’t actually “own” anything. You’re just paying to “use” it. And the waters get even muddier when you reserve cloud compute by paying an upfront amount for a reduced monthly rate. These differences mean that traditional Finance departments, used to the standard business case approval and invoicing process, will need to adjust their workflows for the cloud and work out which accounts, or tags, to treat as CAPEX and which as OPEX. They will also need good insight into cloud spending, and will need to stay on top of the number of small projects that are generating costs, to keep the organisation well-informed.
Because processing those invoices is not easy
For SEEK, processing those original AWS bills was an ugly business. In the beginning we manually processed the DBR (Detailed Billing Report) files from AWS to correlate them to the invoices, and it took a very, very long time to complete. Thankfully this area of cloud management is maturing quickly and there are a number of vendors competing for your business to take the hard work out of processing the usage data. We use Cloudability at SEEK to take the pain out of bill processing and also to get near-real-time insight into how costs are being generated across all of our accounts.
Our advice? Use them. The cost is worth it for your time and sanity.
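To give a feel for what the manual processing involves, here is a minimal Python sketch that totals a billing CSV by one tag column. The column names `user:Stream` and `UnBlendedCost` are assumptions about the report layout, so check them against your own files, and note that a real report may also contain summary rows that need filtering out first. This is not how Cloudability or any other vendor does it.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def cost_by_tag(report, tag_column="user:Stream", cost_column="UnBlendedCost"):
    """Sum the cost column per tag value; rows without the tag go to '(untagged)'."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(report):
        tag = (row.get(tag_column) or "").strip() or "(untagged)"
        totals[tag] += Decimal(row.get(cost_column) or "0")
    return dict(totals)


if __name__ == "__main__":
    # Tiny made-up sample standing in for a real billing file.
    sample = io.StringIO(
        "ProductName,UnBlendedCost,user:Stream\n"
        "EC2,1.50,Search\n"
        "S3,0.25,\n"
        "EC2,2.00,Search\n"
    )
    print(cost_by_tag(sample))
```

Even this toy version shows where the pain comes from: every untagged row lands in a bucket that someone has to chase down by hand.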
So are there actually any benefits to a Shared model?
It all depends on your perspective. Alongside the drawbacks we talked about earlier, there are benefits to centralising control of security, risk and cost. For customer-facing sites and services, stricter and automated governance over security and usage may be a better model for very risk-averse organisations, especially where technical teams have lower levels of cloud skills and maturity. Ultimately this can be a safer road to take, so long as controls are in place to stop people accidentally blowing away entire AZs or Regions and there is a plan to move away from this model easily should your delivery culture change. Which it should, and most likely will.
Non-Shared Accounts — a better way?
Ultimately the answer is yes. But there are caveats to this which we’ll discuss further on.
In a Non-Shared account world, you create multiple AWS accounts to support development and production workloads. There are many ways in which this can be done, from having separate development and production accounts for each project to going up a level and creating them for business units or domains. Ultimately it has to be what makes sense for you and your delivery processes, and for the skills and capability of the people using it.
So what are the benefits of Non-Shared accounts?
There are a few.
They can enable your delivery teams to achieve greater autonomy with the technology solutions they are building, as they are able to design and build in isolation with less controlling oversight. In mature organisations this can lead to more refined and efficient solutions for customers, provided appropriate governance and design principles are well understood with respect to cost, reliability and security.
They make it easier to keep development and production workloads separate, so costs incurred from each can be easily split along CAPEX and OPEX lines. This can avoid the effort involved in implementing, enforcing and maintaining a complex tagging policy and will make life a lot easier for Finance.
They provide a natural reduction in blast radius in the event of catastrophic failure, as business sites and services are contained and managed within their own account.
Drawbacks of Non-Shared account models
Depending on how you deliver your products there can be a number of them. Let’s take a look at three specific areas:
· Account proliferation
· Account Integration and Architecture
· Security
Account proliferation
Separate accounts still need to consider routing, security, integration, costing and much more. In the Data Centre world development teams usually had little to do with these specific areas, but with their own accounts they will be directly responsible for them. This overhead comes at a cost, one borne directly by the account holders, and it therefore affects traditionally held views of software delivery schedules and development lifecycles.
Integration challenges
Connecting different accounts to use services without needing to traverse the Internet can be achieved in AWS using VPC Peering. This can seem harmless enough when connecting a handful of accounts, but it becomes very ugly, very quickly, when more and more require integration. You could contain peering blowout by following a strict hub and spoke model and relegating spoke-to-spoke integration to secured public endpoints, but as the number of integration requirements grows you will be left maintaining a rather large spreadsheet of CIDR ranges, IAM roles and endpoint certificates.
What can start out innocently enough
Can lead to this very quickly in a loosely governed Non-Shared Account model
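To see why it gets ugly so quickly, here is a small Python sketch comparing the number of peering connections needed when every account is peered with every other (a full mesh) against a strict hub and spoke layout. The function names are just for illustration, and the full mesh is the worst case rather than something anyone would set out to build.

```python
def full_mesh_peerings(accounts: int) -> int:
    """Peering connections needed if every VPC is peered with every other VPC."""
    return accounts * (accounts - 1) // 2


def hub_and_spoke_peerings(accounts: int) -> int:
    """Peering connections needed if every VPC peers only with a single hub VPC."""
    return max(accounts - 1, 0)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(f"{n} accounts: mesh={full_mesh_peerings(n)}, "
              f"hub and spoke={hub_and_spoke_peerings(n)}")
```

The mesh grows with the square of the number of accounts while hub and spoke grows linearly, which is the difference between the two pictures above.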
Cisco do have what is called a Transit VPC solution. It works, but it is effectively a network-level solution to an enterprise integration challenge, and may be more suited to integrating different regions than accounts in the same AWS Region.
We’ve spoken to a number of organisations about this, including AWS, yet ultimately there is no silver bullet for this problem. There are ways to mitigate the worst of the effects: the answer lies in your delivery culture and, at the least, in limiting the number of Production AWS Accounts you have. One suggestion was to create one Production account per Business Domain, not per Project or Feature request, and to put no limits on development accounts, an approach which sounds entirely sensible.
Meanwhile we are just hopeful AWS have some API Gateway/peered VPC Trusted Account thing-a-ma-jiggy solution on their Roadmap to make managing this challenge a little easier.
Architecture challenges
AWS do have SLAs, usually set at 99.95% uptime per month. In effect these SLAs are telling you that you must design for around 21 minutes of downtime or significant service degradation per month.
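The arithmetic behind that figure is easy to check. A minimal Python sketch, assuming a 30-day month (the function name is just for illustration):

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes per month a service can be down while still meeting the SLA."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        print(f"{sla}% uptime allows about {allowed_downtime_minutes(sla):.1f} minutes down")
```

For 99.95% over 30 days this works out to roughly 21.6 minutes, and a little more in a 31-day month.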
This kind of SLA is a very rare scenario to have to plan for in a well architected data centre with full backup, redundancy and active-active setups. Therefore, in the Non-Shared account world, this can have wide ranging and complicated effects if systems are not architected correctly for the cloud. A few things we’ve learned are:
1) Don’t design for the data centre, design for scale out/in and make your solutions as elastic as possible. The smaller the instance size the better.
2) Build degradation tolerance into your solutions. Always expecting responses in less than a second (especially from DNS) is probably going to cause you pain.
3) Don’t bet the farm on vendor COTS solutions that need to be uploaded and installed/configured in the cloud, especially ones that don’t scale well.
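As one small example of degradation tolerance, here is a Python sketch of retrying a slow or flaky call with exponential backoff instead of assuming it will answer first time. The helper name `call_with_retries` and its defaults are made up for illustration; in practice you would tune the attempts and delays to the dependency, and most AWS SDKs already do something similar for their own API calls.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call operation(); on failure wait base_delay, then double it, and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error


if __name__ == "__main__":
    responses = iter([TimeoutError("slow DNS"), TimeoutError("slow DNS"), "10.0.0.12"])

    def flaky_lookup():
        result = next(responses)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_retries(flaky_lookup))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests.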
Security challenges
As account proliferation accelerates, creating a landscape of multiple S3 buckets, VPCs, API Gateways and more Lambda functions than you can poke a stick at, Security starts getting very, very challenging. Back in the Data Centre day your Security team would have sat in some secluded room inside the ivory tower, worrying about anti-virus, perimeter and edge detection, network scanners and whatever software the CEO installed on his or her laptop at some café with free Wi-Fi.
The cloud changes this.
As accounts proliferate and PaaS and SaaS solutions continually broaden and mature, old ways of managing Security gradually become slow, expensive and ineffectual. Your security team will need to become a lot more consultative, involved in the designs of solutions and use more sophisticated software that is optimised for the cloud to check for unsecured endpoints, data-stores with open permissions, people mistakenly exposing access and making secret keys public and so on.
We have found that making good use of AWS CloudTrail becomes essential in this new world, along with tools like Security Monkey.
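To make "data-stores with open permissions" a little more concrete, here is a minimal Python sketch that flags statements in an S3 bucket policy document which allow access to any principal. It is a deliberately simplified check and not how Security Monkey or any AWS service evaluates policies: it ignores bucket ACLs, and it treats any statement with a Condition as restricted, which is an assumption that will not always hold.

```python
def public_statements(policy: dict) -> list:
    """Return the Allow statements in a bucket policy that apply to any principal."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # Principal can be "*" or a mapping such as {"AWS": "*"} or {"AWS": [...]}.
        aws = principal.get("AWS") if isinstance(principal, dict) else principal
        values = aws if isinstance(aws, list) else [aws]
        # Simplification: treat any Condition as enough of a restriction.
        if "*" in values and "Condition" not in statement:
            flagged.append(statement)
    return flagged


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "OpenRead", "Effect": "Allow", "Principal": "*",
             "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
        ],
    }
    print([s.get("Sid") for s in public_statements(policy)])
```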
Summary
In this post, we have looked at Shared and Non-Shared account models and the pros and cons of each, especially as they relate to cost, architecture and security. In my next post, we’re going to focus on cloud hygiene and the tricky science behind Reserved Instances.