DevOps@SEEK — a 4 year evolution (part 6)
“Ultimately, the cloud is the latest example of Schumpeterian creative destruction: creating wealth for those who exploit it; and leading to the demise of those that don’t.” — Joe Weinman
The cloud is designed for consumption; it wants you to consume as much of it as possible, all the time. It has completely changed the face of the IT industry, much like Uber did for taxis, and the cloud services provided by the major vendors are evolving and growing faster than ever. At the 2016 AWS Re:Invent conference in Las Vegas, no fewer than 24 major new services were introduced: an extraordinary number given what the AWS platform offered only five years ago. This rapid evolution of services will see businesses build new solutions that continue to evolve architecturally from how they would have been built less than 12 months ago, all the while exploiting new platform services to improve security, monitoring, automation and auditing to eliminate waste and increase efficiency.
In our halcyon days of AWS at SEEK, when we were rapidly adopting AWS for everything we built, we had a novel approach to eliminating waste. Plainly speaking we didn’t really do it very well. But over time we have gotten very good at it. Really good in fact. But rather than subject you to another walk down memory lane we’ll instead use this entry to go over some of the basic lessons we have learnt on managing our usage of the cloud.
Managing Usage Efficiency
One great benefit of the cloud is the near-infinite reserves of storage and compute that it possesses. Scaling from tens to thousands to tens of thousands of machines is possible, but as your usage grows, so does the need to be more efficient with these resources. Regular monitoring of CPU utilisation is very important, as is identifying bottlenecks and runaway auto scaling groups (ASGs), to ensure that you are not over- or under-provisioning when scaling up and down with load.
Here are some of the things we tend to pay the most attention to.
Number of running instances.
In a Data Centre you run as many instances as you can, pretty much 24x7. Sure, there are some additional power consumption costs associated with this approach, but it is your hardware: you paid for it, you are going to depreciate it over time, so you may as well use it as much as you can. In the cloud, however, that is a very bad model to adopt, and it is going to cost you a lot of money.
Why is this a bad model to adopt? The answer is quite simple: because you pay for everything you use in hourly blocks (on AWS), you want to run only the minimum number of instances that still ensures your production workloads meet your standards of service.
And that same logic applies for any AWS resources used to develop your products too.
Here are a few simple things we do at SEEK to keep our resource costs in AWS at a minimum:
- All development environments are automated to shut down around 18:15 Monday to Friday.
- All development environments are automated to shut down over weekends and public holidays.
- Reporting tooling is used to check how many instances are running, both in real time through monitoring systems such as PRTG and Datadog, and through the AWS Detailed Billing Reports (DBR) consumed by Cloudability.
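The shutdown schedule in the first two points can be reduced to a simple predicate; a scheduled job (cron or a Lambda function) would then stop any instance tagged as development whenever it returns True. This is an illustrative sketch, not SEEK's actual tooling, and the morning start-up time is an assumption:

```python
from datetime import datetime, time

# Assumed working window for development environments; the 18:15
# shutdown comes from the list above, the morning start is a guess.
WORK_START = time(7, 30)
SHUTDOWN_AT = time(18, 15)

def should_be_stopped(now: datetime, public_holidays: set) -> bool:
    """True if development instances should be off at `now`.

    Rules: off all day on weekends and public holidays; off outside
    the working window on weekdays.
    """
    if now.date() in public_holidays:
        return True
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= now.time() < SHUTDOWN_AT)
```

A real implementation would pair this with a tag filter and a `stop_instances` call against the matching instance IDs.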
But it’s not just about how many are running, it’s how you use them
As discussed earlier, in the Data Centre instances run 24x7, and probably most or all of your development environments do too, because, well, you just can. The cost is usually not significant enough to change behaviour (depending on the agreement you have with your provider or service management vendor), and it’s usually just more convenient to do it this way.
But in the cloud you pay per instance, whether or not you use all of the CPU, RAM and I/O that has been provisioned for you. That makes knowing the utilisation of the resources you use important, to ensure you are not provisioning more than you need.
For example, if you are running an m4.10xlarge EC2 instance that is barely utilising its CPU, hovering at or around 1 percent utilisation, you could reasonably question whether it is the most appropriate instance type, or whether something smaller like an m4.2xlarge would do.
If utilisation is low, it may suggest that the architectural solutions being produced by delivery teams are not appropriately focused on scale-out and good elasticity. On the flip side, there can be good reasons too: larger instances provide better network performance and more RAM, which a particular solution may genuinely need. You won’t know this from the raw numbers alone, so it’s important to ask delivery teams when things don’t look quite right.
Most good SaaS tools will provide enough visibility to prompt further conversations around utilisation, which are always worthwhile discussions to have to ensure the best possible solutions are being built.
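As a flavour of the kind of check these tools perform, here is a hypothetical helper that flags instances whose average CPU utilisation sits below a threshold. The function name and record shape are assumptions; in practice the figures would come from CloudWatch or a monitoring SaaS:

```python
# Illustrative sketch: flag instances whose average CPU utilisation
# suggests they may be over-provisioned. Each record is assumed to be a
# dict with 'id', 'type' and 'avg_cpu_pct' keys.
def flag_low_utilisation(instances, threshold_pct=5.0):
    """Return the instances averaging below `threshold_pct` CPU."""
    return [i for i in instances if i["avg_cpu_pct"] < threshold_pct]
```

A flagged instance is a conversation starter, not a verdict; as noted above, the team may be paying for network throughput or RAM rather than CPU.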
Why is utilisation important?
In the past, scaling resources up was a fast (and lazy) way to wallpaper over performance problems arising from poor architectural solutions that did not scale in the Data Centre. In the cloud, with infinite resources on tap and the ability to use techniques such as infrastructure-as-code, there is simply no need to create large, bloated web or API servers. Ideally, solutions should be built with good resiliency, such that anyone could terminate an instance at any time and the system would simply create a new one and balance the load automatically. It’s always better to lose one small instance out of ten than one large instance out of two: a 10 percent loss of capacity versus 50 percent.
Secondly, if you cannot blow away an entire development environment and re-create it in a matter of minutes, it would be right to question the architectural strength of the solutions being built and the disciplines being followed in development.
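The small-versus-large trade-off above is just arithmetic, but it is worth making explicit:

```python
def capacity_lost_pct(fleet_size: int, instances_lost: int = 1) -> float:
    """Percentage of total capacity lost when instances fail or are
    terminated, assuming a homogeneous fleet behind a load balancer."""
    return 100.0 * instances_lost / fleet_size
```

Losing one of ten small instances costs 10 percent of capacity; losing one of two large instances costs 50 percent, which is why scaling out beats scaling up for resilience.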
Keeping it clean
The more you use the cloud, the more waste you are liable to generate. EBS volumes left lying around are a common issue (we once uncovered several thousand!), but other costs can rack up too, such as huge amounts of data being transferred in and out of S3 buckets, poorly built Lambda functions that run for too long, and much more. Practising good hygiene in the cloud is important to control your costs and ensure there is no unwanted bill shock at the end of the month.
In the past you needed to manage this yourself. Luckily, Netflix developed its Simian Army library of tools, which came with very good features from the get-go (we have used Janitor Monkey a lot), but the marketplace here is much more mature now. Several SaaS providers can, granted a level of access you are comfortable with, manage a lot of this wastage for you. And of course there are the cloud platform offerings as well, such as AWS Config, which lets you build custom compliance automation. Many of these tools will deliver value right away, and decent vendors will let you trial their products first.
At SEEK we use a mix of Janitor Monkey, AWS Config and some custom solutions, many of which have evolved over time. As our usage of the platform grows, no doubt some of these will be either replaced or upgraded — especially with the recent announcements of more platform services from AWS.
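As a flavour of what tools like Janitor Monkey automate, here is a minimal sketch of the orphaned-volume check. In a real script the records would come from boto3's `ec2.describe_volumes()` call; here we simply filter pre-fetched dicts:

```python
# Minimal hygiene-check sketch. An EBS volume in the 'available' state is
# attached to nothing but is still billed every month; in practice the
# records would come from boto3: ec2.describe_volumes()["Volumes"].
def unattached_volumes(volumes):
    """Return the volumes not attached to any instance."""
    return [v for v in volumes if v["State"] == "available"]
```

A janitor job would then report or snapshot-and-delete whatever this returns, subject to whatever grace period and tagging exemptions you set.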
Reserved Instances: the tricky science
With the rise and rise of serverless computing, it remains to be seen whether reserved instance (RI) computing will still be as desirable in the future as it is at the moment. For now, it is still an excellent way to manage costs with AWS services such as EC2, RDS and ElastiCache. Here are some of the ways we handle RIs at SEEK.
We purchase RIs, on average, every quarter. We start each financial year with a pre-determined budget based on expected growth and draw down on that as needed. The reason for purchasing every quarter is two-fold:
- Buying smaller amounts at a time means that we don’t need to go through very large approval processes. It also lowers the complexity of the effort needed to buy them.
- You retain greater flexibility as the AWS platform evolves, which is handy given the number of new EC2 instance types just released at Re:Invent.
Buying reserved instances is complex. A good SaaS provider (and even AWS itself, with Premium Support enabled) will be able to auto-generate recommendations for you. Doing it alone would be an arduous, error-prone task requiring a number of validation activities at each step. Our advice here is simple: save yourself a world of pain and use a SaaS provider. We use both Cloudability and AWS Premium Support.
But you can’t rely on SaaS tools alone. These tools can only make recommendations based on the data you have generated, with some predictive analytics thrown in. In other words, it is all based on your current state. They have no view into your product and feature roadmap, what your delivery teams are going to do next, or where the cloud vendors’ roadmaps are headed. In summary, they provide a very good baseline, but checking in with your delivery teams is very important to validate the forecast you’ve been given.
Here is a real-world example.
At one point in the not too distant past, our SaaS tool was recommending we purchase 16 c4.4xlarge machines, but after checking with the delivery teams we quickly realised such a large allotment of expensive machines was only going to be needed for another four months. Buying RIs for them would not have been a good financial decision! Granted, we could have split these down into smaller types, but as the instances were running an operating system we don’t use much anymore, we would still have been left with a sizeable hole in our back pocket when they stopped being utilised.
Savings coverage: how to stay safe in the margins
Reserved Instances will save you around 30 percent when used well. To make that work as effectively as possible, they need to be bought for instances that will be on for at least 70 percent of the time in a calendar year, which works out to around 17 hours per day. Therefore, we focus our RI purchasing on Production accounts only, not Development. Development accounts are automated to shut down at night, on weekends and on public holidays, so we never use those resources enough to make reserving them financially viable: the “break even” point would never be reached.
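The break-even arithmetic is straightforward. With hypothetical prices (the real figures come from the AWS price list and your RI term), an RI only pays off once expected utilisation clears the ratio of its effective hourly rate to the on-demand rate:

```python
def break_even_utilisation(ri_effective_hourly: float,
                           on_demand_hourly: float) -> float:
    """Fraction of the time an instance must run before the RI's
    amortised hourly cost beats paying on-demand."""
    return ri_effective_hourly / on_demand_hourly

def worth_reserving(expected_utilisation: float,
                    ri_effective_hourly: float,
                    on_demand_hourly: float) -> bool:
    """True if expected utilisation clears the break-even point."""
    return expected_utilisation >= break_even_utilisation(
        ri_effective_hourly, on_demand_hourly)
```

With an illustrative RI effective rate of $0.07/hr against $0.10/hr on-demand, break-even is 70 percent utilisation: easily cleared by an always-on production fleet, never reached by a development account that is off at night, on weekends and on holidays.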
As an added layer of protection, we only buy RIs that we expect to save us at least 35 percent over the equivalent on-demand price; this ensures we get the maximum amount of usage out of them. We don’t bother with instances predicted to save only a small amount, as we run the risk of them not being well utilised in the future.
As with all financial concepts such as ROI and NPV, there is inherent risk in making future predictions on the time value of money, or in this case, cloud resources. Analysis of SEEK’s usage of RIs has shown that making buying decisions at savings greater than 30 percent ensures they are always utilised effectively. Your situation, however, may be different.
What purchase type do we use?
We favour buying Partial Upfront over Full Upfront, as we can monitor usage through the billing cycle via what is known as the Injected Line Item, and gain valuable insight into our RI coverage through Cloudability and Amazon’s EC2 utilisation reports.
We also allocate them to a region, given the recent announcement of regional RI benefits, as it saves us more money; but we don’t use Convertible RIs, as we believe a three-year commitment is too much.
Reserved Instance maintenance
We regularly check the utilisation of our RIs to ensure our investment is tuned to achieve the maximum savings. Both AWS and SaaS providers will give you detailed information on your RI utilisation, and good SaaS providers will also analyse this data and make recommendations on how you should modify your RIs to make the most of your investment. At SEEK we modify our RIs frequently — at least weekly.
This concludes our brief two-part look at how we manage the cost of the cloud. In this post and my previous one, we have discussed at length how we manage bill processing, architecture, security and finance, along with the trickier aspects of measuring and monitoring usage and buying reserved capacity.
Also, if you missed the announcement of AWS Organizations at Re:Invent last week, it is worth checking out; we will almost certainly be making use of this excellent feature in the near future.