These days, people really value the instant gratification they get from Infrastructure as a Service. Use cases that would otherwise take significant effort to solve in house are available on demand. Yes, we are talking about the power of cloud computing, which today's software product and service companies use extensively.
All of it works perfectly well and keeps your life easy until one fine day you start talking about things at SCALE. You start dealing with problems like storing and organising data at large scale, making internal as well as customer-facing systems highly available, and setting up efficient monitoring, logging and alerting. With a public cloud, all of these can really be a matter of a few clicks.
As the legends say, With Great Power Comes Great Responsibility. The engineering-level responsibilities are something one can understand and work around, but the next big thing that hits individuals and organisations is the bill for all the resources you use on the public cloud. Like it or not, your cloud costs can be a bomb if you don’t plan well.
I work extensively with AWS. To be a little more specific, at Indix HQ, a Data as a Service company, we are building the world’s largest cloud catalog of structured marketplace product information. The data runs to hundreds of TBs, which we deliver to our customers through APIs and bulk feeds. As a DevOps and infrastructure person, I was responsible for designing, developing and maintaining a stable, reliable and secure cloud infrastructure for our internal and customer-facing apps on AWS.
But AWS, like any other public cloud provider, comes at a cost. When your infrastructure serves and processes data at large scale, you are sure to use cloud resources extensively, and nothing is cheap at that point: the bills run into thousands of dollars.
This is where your Finance and Engineering ninjas adopt cost control measures and try to standardise the use of cloud resources in line with common best practices. Some of the standards that we follow at Indix are:
- Choosing the right instance types for deploying applications, so that the compute power offered by the selected instance type is not under-utilised.
- Using Spot Instances for systems which are not mission critical and Reserved Instances for mission critical ones.
- Proper tagging of all the resources. This reduces the risk of anything going unmonitored.
- Keeping track of service usage through Cost Explorer.
One of our biggest challenges in controlling costs on AWS, despite following all the standard best practices, was that we weren’t able to track, and hence respond to, an unknown event. To elaborate, think of the following possibilities:
- An unexpected autoscaling event at 3 AM which scales your cluster up and then back down, and which you are unlikely to be tracking for cost control.
- Human error of EC2 instances being left idle.
- Untracked jobs which have the potential to incur huge costs due to high data transfer.
Although Cost Explorer helps us analyse cost very efficiently, we were not sure whether we could use it programmatically to build our own custom solutions on top of it. One fine day we came across an article by Jeff Barr of AWS which talks about how the AWS Cost and Usage Report (CUR) can be used to analyse the cost distribution on AWS. The blog says it all, but we wanted to build an automated system which could help us track our costs at whatever time granularity we wanted.
CUR, though just a CSV file, is a really complex thing, and writing our own parser to extract the cost data for our use case was going to take ages. Instead, we imported the CSV data into Amazon Redshift, a managed data warehouse with a PostgreSQL-compatible SQL interface, where we can store large-scale data without caring about the underlying infrastructure. This way our CUR data lives in a SQL table, and we can fetch what we need from it using simple or complex queries. It was a fair win over the effort of writing parsing scripts for CUR.
On top of that, AWS lets us do this entire import in just a few clicks. It allows you to store your billing reports in an S3 bucket and import them into Redshift.
Once the reports are configured, CURs start landing in your S3 bucket. We also configured our reports for loading into a Redshift cluster, which gives you a table schema for the report (the options for this are in the same billing console). This meant we just needed a new Redshift cluster and we were done with the cost data loading part.
Once we had the data in our Redshift cluster, we started running queries on it right away.
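To give a feel for those first queries, here is a minimal sketch in Python that builds a cost-per-service query. The helper name and the example table name are hypothetical, and `lineitem_productcode` is assumed to be the Redshift column that holds the CUR product code; check the schema of your own report before relying on it.

```python
import re

# Only plain SQL identifiers are accepted, because a table name cannot be
# passed to the database as a bound parameter.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def cost_by_service_query(table_name):
    """Return SQL that totals unblended cost per AWS service (hypothetical helper)."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return (
        "select lineitem_productcode, "
        "sum(cast(lineitem_unblendedcost as float)) as cost "
        f"from {table_name} "
        "group by lineitem_productcode "
        "order by cost desc;"
    )


if __name__ == "__main__":
    # "cur_table" is a placeholder for the table created for your report.
    print(cost_by_service_query("cur_table"))
```

The resulting string can be run from any SQL client connected to the cluster.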
Everything we expected after reading the blog article was working fine for us. The journey was only half complete, though. What we needed was an end-to-end automated system which could do the above jobs for us and also alert our engineers or engineering managers whenever a cost threshold is crossed for monitored resources on AWS.
We then decided to leverage the power of AWS Lambda, the serverless compute service provided by AWS. We had the following design for Lambda functions:
- Lambda Function #1 : Automates the process of fetching CUR from S3 and uploading it to Redshift.
- Lambda Functions #2 and #3 : Read a set of specified configurations from a JSON file, frame queries from that information, compare the costs returned by the queries with threshold data, and then send Slack alerts if the thresholds are crossed.
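The load step of Lambda function #1 can be sketched roughly as below. This is not our exact code: the function name and all argument values are placeholders, and it assumes the report is delivered gzipped with a Redshift manifest file. It only builds the Redshift `COPY` statement; running it needs a database connection, which is left out here.

```python
import re

# Table names cannot be bound as query parameters, so accept only plain identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_copy_statement(table_name, manifest_url, iam_role_arn, region):
    """Build a Redshift COPY statement that loads a gzipped CUR through its manifest."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    if not manifest_url.startswith("s3://"):
        raise ValueError("manifest_url must be an s3:// URL")
    # The values below end up inside single-quoted SQL literals.
    for value in (manifest_url, iam_role_arn, region):
        if "'" in value:
            raise ValueError(f"unexpected quote in {value!r}")
    return (
        f"copy {table_name} from '{manifest_url}' "
        f"iam_role '{iam_role_arn}' region '{region}' "
        "gzip csv ignoreheader 1 timeformat 'auto' manifest;"
    )


if __name__ == "__main__":
    # Every value here is a made-up placeholder.
    print(build_copy_statement(
        "cur_table",
        "s3://my-billing-bucket/cur/report-RedshiftManifest.json",
        "arn:aws:iam::123456789012:role/redshift-copy",
        "us-east-1",
    ))
```

The IAM role passed in has to be attached to the Redshift cluster and allowed to read the billing bucket.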
Let’s take a deep dive into Lambda functions #2 and #3. These are nothing but cron-based functions which are responsible for alerting people on Slack (we use it for our internal office communication) whenever a resource on AWS incurs more than a specified cost.
The image above shows the kind of configuration our Lambda functions #2 and #3 read to frame queries. For the configs specified above, the query formed goes something like this:
select sum(cast(lineitem_unblendedcost as float)) from #TABLE_NAME where #TAG_NAME='#TAG_VALUE';
The result of this query gives us a sum, which is compared against the specified Threshold value. If the threshold is crossed, an alert is sent to the specified Channel on Slack, tagging the concerned Engineering Manager (EM_Name). Following is a sample alert:
So this is it :) If the cost of our monitored resources behaves unexpectedly over the day, we get to know for sure :)
The overall architecture looks something like this:
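The config-to-alert flow of Lambda functions #2 and #3 can be sketched in Python as below. Treat it as an illustration under assumptions rather than our actual code: the config keys are modelled on the placeholders in the query above, the function names are hypothetical, the `%s` placeholder assumes a psycopg2-style database driver, and the alert is assumed to go out as a simple `channel` plus `text` payload. Running the query and posting to Slack are left out.

```python
import re

# Table and column names cannot be bound as parameters, so validate them.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_cost_query(rule):
    """Return (sql, params) for the total unblended cost of one tagged resource."""
    for key in ("TABLE_NAME", "TAG_NAME"):
        if not _IDENTIFIER.match(rule[key]):
            raise ValueError(f"unsafe identifier for {key}: {rule[key]!r}")
    sql = (
        "select sum(cast(lineitem_unblendedcost as float)) "
        f"from {rule['TABLE_NAME']} where {rule['TAG_NAME']} = %s;"
    )
    # The tag value is passed separately so the driver can escape it.
    return sql, (rule["TAG_VALUE"],)


def build_alert(rule, cost):
    """Return a Slack message payload if cost exceeds the threshold, else None."""
    # sum() yields NULL (None in Python) when no rows match the tag.
    if cost is None or cost <= rule["Threshold"]:
        return None
    text = (
        f"@{rule['EM_Name']} cost for {rule['TAG_NAME']}={rule['TAG_VALUE']} "
        f"is ${cost:.2f}, above the threshold of ${rule['Threshold']:.2f}"
    )
    return {"channel": rule["Channel"], "text": text}


if __name__ == "__main__":
    # Example rule; every value is made up for illustration.
    rule = {
        "TABLE_NAME": "cur_table",
        "TAG_NAME": "resourcetags_user_team",
        "TAG_VALUE": "search",
        "Threshold": 100,
        "Channel": "#cost-alerts",
        "EM_Name": "jane",
    }
    print(build_cost_query(rule))
    print(build_alert(rule, 150.0))
```

Keeping the query builder and the threshold check as separate pure functions makes each rule easy to test without a Redshift cluster or a Slack workspace.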
I named it PLUTUS :) Soon enough, at OpsLyft, we are going to release this as a full-fledged product which you can use within your AWS environment.
Our vision at OpsLyft is to achieve Digital Transformation with DevOps. We are a set of experienced engineers who help organisations achieve their DevOps goals. Please reach out to us at email@example.com for any help you need in solving problems and enhancing your cloud.