Automate monitoring of inactive cache clusters
Monitor cache clusters in AWS to make sure they don’t eat up your resources.
Organizations use cache servers to improve data retrieval performance, narrow the latency gap in IOPS, and reduce cost at scale. In fact, organizations that handle millions of requests per second have little choice. It takes considerable engineering effort to build a reliable cache management system that can be deployed, managed, and monitored efficiently.
To support these engineering efforts, companies like Amazon and Google provide infrastructure to deploy and manage cache servers (or clusters). However, there is one crucial area where even these companies cannot fully help you: monitoring cache clusters that have become inactive and are just consuming resources. There is no direct way to infer which cluster has become inactive, because that depends entirely on how an organization has structured its infrastructure. So how can we monitor cache clusters over time?
The answer lies in the core metrics. You have to dig deep and observe metrics such as the rate of CacheHits and CacheMisses, the CPU utilization over a specific period, whether usage shows sudden peaks, and so on. Using this as a base, an organization can design its own logic and study the data over a period of time to discover which clusters are no longer being used.
Since there is no built-in automated way to achieve this, companies that use AWS infrastructure rely on the ElastiCache and CloudWatch dashboards. They manually observe the trends in the dashboards and pick out the inactive clusters. This method needs careful investigation and too much manual work, and repeating it by hand every week or month is out of step with what automation makes possible. We need a better approach that is both automated and reliable.
To solve this problem of selecting the right metrics and then automating the check, we at Postman used one of our own tools, Postman Collections. The approach we picked is based on the metrics that AWS provides. If you are using any other service (or your own), the underlying principles used here can be applied to that as well ;)
How the solution is designed
- Picking the right metrics.
- Fetching the metrics data and processing it.
- Beautifying the result and sending it to Slack. (Beautifying here means attaching the reasons why the process marked a cluster as inactive.)
All these steps are already written and configured, and are available as a Postman Collection template that can be run directly in the Postman Desktop App. Selecting inactive clusters is subjective, but there are a few core metrics/parameters (explained below) that can be considered every time because they give reliable results.
The goal is to build an automated workflow that can be reconfigured at any time.
Postman Collections are groups of saved requests that you can organize into folders. Requests can be linked together via scripts. You can even write pre-request scripts and tests to process the API requests.
Understanding the process
A Postman Collection first interacts with AWS ElastiCache to get information about the cache clusters and their internal details. It then uses CloudWatch APIs to gather data about every cluster for multiple types of metrics.
A metric is a fundamental concept of CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch and monitored over a specific time interval to derive statistics about a system. For example, the CPU usage of a particular EC2 instance is one metric provided by Amazon EC2. AWS services send their data to CloudWatch, where it forms the metrics. (All metrics)
Below are the different types of metrics which are being used in this Postman Collection.
CPUUtilization is a host-level metric. The threshold for this type of metric depends on the infrastructure designer, and is based on how many cores you have and even on pricing. In the Postman Collection template, you can set this threshold yourself.
CacheHits is the number of successful read-only key lookups in the main dictionary. Generally, there is no threshold for this metric: if your cluster has even one cache hit, it is in use. However, you can still set your own threshold for it.
CurrItems is the number of items in the cache in the specified time range. The threshold concept is the same as for CacheHits; the only difference is that the collection gathers CurrItems data using the Sum statistic type, while for CacheHits it uses the Average statistic type.
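To make the threshold idea concrete, here is a minimal sketch in Python of how the three metrics could be combined. It is not the collection's actual script: the names `Thresholds` and `inactivity_reasons`, the default threshold values, and the rule that a cluster is flagged only when every metric sits at or below its threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults; tune them for your own infrastructure.
    cpu_utilization_avg: float = 5.0  # percent
    cache_hits_avg: float = 0.0
    curr_items_sum: float = 0.0


def inactivity_reasons(cpu, cache_hits, curr_items, thresholds=None):
    """Return the reasons a cluster looks inactive, or [] if it looks active.

    cpu and cache_hits are lists of per-period Average datapoints;
    curr_items is a list of per-period Sum datapoints.
    """
    t = thresholds or Thresholds()
    observed = [
        ("CPUUtilization", mean(cpu) if cpu else 0.0, t.cpu_utilization_avg),
        ("CacheHits", mean(cache_hits) if cache_hits else 0.0, t.cache_hits_avg),
        ("CurrItems", sum(curr_items), t.curr_items_sum),
    ]
    reasons = [
        f"{name} is {value:.2f}, at or below the threshold {limit:.2f}"
        for name, value, limit in observed
        if value <= limit
    ]
    # Assumed rule: flag the cluster only if every metric is under its threshold.
    return reasons if len(reasons) == len(observed) else []
```

Returning the reasons rather than a plain boolean keeps the explanation available for the Slack message later on. A stricter or looser rule (for example, flagging on any single metric) is an equally valid design choice.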
Getting and Running the collection
- Download Postman Desktop App from here.
- Read about Collections and Environments first.
- Use this published template — AWS ElastiCache Utilization Checker
- Configure the environment variables. To make the process easy, a default environment is already present, with default values for some variables.
Environment Variables —
secretAccessKey: the secret part of the AWS credentials, used together with the access key ID.
period: the length of time (in seconds) associated with a specific Amazon CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time.
slackWebHookURL: the Slack WebHook you create for the channel where you want the results to be published.
days: how far back from the current day you want to monitor the ElastiCache clusters. For example, with a value of 14, the collection aggregates data points from today back over the last 2 weeks.
region: the region where your clusters are.
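The way days and period interact can be sketched in Python as follows. This is an illustration only, and the function name `query_window` is hypothetical. It assumes standard-resolution metrics, where the period is a multiple of 60 seconds, and it relies on the CloudWatch limit of 1,440 data points per GetMetricStatistics call.

```python
from datetime import datetime, timedelta, timezone

MAX_DATAPOINTS = 1440  # CloudWatch limit for a single GetMetricStatistics call


def query_window(days, period, now=None):
    """Return (start, end, expected_datapoints) for the last `days` days."""
    if period < 60 or period % 60 != 0:
        raise ValueError("period must be a multiple of 60 seconds")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    datapoints = (days * 86400) // period
    if datapoints > MAX_DATAPOINTS:
        raise ValueError(
            f"{datapoints} data points requested; increase period or reduce days"
        )
    return start, end, datapoints
```

With days set to 14 and a period of 86400 seconds (one day), the window yields 14 data points per metric. A period of 60 seconds over the same 14 days would exceed the limit, so the sketch rejects it up front.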
A walk through all APIs
You don’t need to make any changes to the params (unless required). Just set the environment variables and you are good to go!
Fetching all clusters in a region —
This API uses ElastiCache’s DescribeCacheClusters. The maximum number of records you can fetch in one request is 100.
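If a region holds more than 100 clusters, the remaining ones have to be fetched page by page using the marker returned with each response. Below is a minimal Python sketch of that loop. `fetch_all_clusters` and the `fetch_page` callable are hypothetical names, and the response is assumed to be a dict with a "CacheClusters" list and an optional "Marker", which is the shape boto3 returns.

```python
def fetch_all_clusters(fetch_page, page_size=100):
    """Collect every cluster by following the Marker returned with each page.

    fetch_page is a hypothetical callable taking (marker, max_records) and
    returning a dict with a "CacheClusters" list and an optional "Marker".
    """
    clusters = []
    marker = None
    while True:
        page = fetch_page(marker, page_size)
        clusters.extend(page.get("CacheClusters", []))
        marker = page.get("Marker")
        if not marker:
            # No marker means this was the last page.
            return clusters
```

With boto3, `fetch_page` could wrap the ElastiCache client's `describe_cache_clusters` call, passing `Marker` only when it is set.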
Fetching CacheHits statistics —
This request is done using a CloudWatch API, GetMetricStatistics. Metric used here is CacheHits with statistics type Average. (You can change this type if you want)
Fetching CPUUtilization statistics —
This request is done using a CloudWatch API, GetMetricStatistics. Metric used here is CPUUtilization with statistics type Average. (You can change this type if you want). This metric is one of the most important metrics and can signify a lot about the usage of clusters.
Fetching CurrItems statistics —
This request is done using a CloudWatch API, GetMetricStatistics. Metric used here is CurrItems with statistics type Sum. (You can change this type if you want).
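The three metric requests differ only in the metric name and the statistic type. As a rough Python sketch, the parameters could be built as shown below. The helper name `metric_request` is hypothetical, and the use of the AWS/ElastiCache namespace with a CacheClusterId dimension is an assumption about how the collection scopes each request. The keys follow the parameter names of GetMetricStatistics.

```python
from datetime import datetime

# Metric name -> statistic type, as described in this walkthrough.
METRIC_STATISTICS = {
    "CacheHits": "Average",
    "CPUUtilization": "Average",
    "CurrItems": "Sum",
}


def metric_request(cluster_id: str, metric: str, start: datetime,
                   end: datetime, period: int) -> dict:
    """Build GetMetricStatistics parameters for one cluster and one metric."""
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric,
        # Assumed dimension: scope the query to a single cache cluster.
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        "Statistics": [METRIC_STATISTICS[metric]],
    }
```

Keeping the statistic types in one mapping makes it easy to change them in a single place, in the same spirit as editing the type in the collection's request params.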
Note: This collection is using CloudWatch API version 2010-08-01.
How to run this collection?
- Hit the ‘RUNNER’ button present at the top bar in the desktop app. This will launch the Runner in a new window.
- Select the imported collection AWS ElastiCache Utilization Checker and set the environment to AWS ElastiCache Utilization Checker.
- Click ‘Start Run’.
The Runner keeps a record of every request, which can be inspected in the Runner tab itself. After successful completion, the collection sends a notification to Slack via the WebHook URL you specified in the environment.
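The notification step can be approximated in Python as shown below. This is a sketch, not the collection's own script: the function names and the message layout are assumptions. It relies only on the fact that a Slack incoming WebHook accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_slack_payload(inactive):
    """Build a WebHook payload; inactive maps cluster id -> list of reasons."""
    if not inactive:
        text = "No inactive ElastiCache clusters found."
    else:
        lines = ["Inactive ElastiCache clusters:"]
        for cluster_id, reasons in sorted(inactive.items()):
            lines.append(f"*{cluster_id}*")
            lines.extend(f"  - {reason}" for reason in reasons)
        text = "\n".join(lines)
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to the WebHook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Separating message building from sending keeps the formatting easy to check without touching the network.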
Example Slack Result —
This article proposes an automated workflow to monitor inactive cache clusters. The metrics used here may not be enough if you need greater accuracy. You can use this Postman Collection template as a base and include more metrics.