Misunderstood Misfits — The T2 Instance.

We are going to take a simple but often misunderstood AWS feature and drag it out into the unapologetic sunlight of my way of looking at things: The T2 Instance, an Architect’s User’s Guide.

T2 instances help us get to the elusive “Goldilocks” zone in our lower-level capacity mapping in the AWS Cloud. It’s the architectural equivalent of sharing a meal with your significant other: done right, everyone walks away happy, feeling like they nailed their order, and that they (for once) didn’t overeat. However, if you’ve ever tried to “share” something when the both of you are ravenous, you know just how well that works out (as in, it doesn’t), and inevitably someone ends up with a fork stuck in the top of their hand like a bad Tom & Jerry cartoon.

The major benefit of the T2 instance class is that you get a percentage of an instance’s CPU power to leverage against your workloads, at a very low price. As we all operate in a give-and-take world, the consideration here is that you aren’t allowed to constantly consume more than your allocated portion of the processor time. It’s sorta like bogarting too many fries off that shared plate: you can get away with it only so many times before, BAM!, your hand is slapped.

Getting to the technicalities, each T2 instance size allows a different level of sustained CPU use. That level is called the Baseline.

The diagram above shows the baseline expectations for each T2 instance size. To make sure that you are using the right instance size, work out which baseline your workload can operate within.

For example, if you have an application/workload that is going to use very little CPU, equal to or less than an average of 5% of one processor for an hour, then a t2.nano is a good fit. If you have an application/workload that will take up less than an average of 20% of one processor, but more than 10%, then a t2.small is the T2 for you.
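As a rough sketch of that sizing rule, the lookup below picks the smallest single-vCPU T2 size whose baseline covers an average CPU figure. The 5% and 20% baselines come from the examples above; the 10% figure for the t2.micro is inferred from the "more than 10%" cutoff, and the function name is made up for illustration.

```python
# Baselines as a percentage of one vCPU. The nano and small values come from
# the text; the micro value is inferred from the "more than 10%" cutoff.
SINGLE_VCPU_BASELINES = [
    ("t2.nano", 5.0),
    ("t2.micro", 10.0),
    ("t2.small", 20.0),
]


def smallest_fit(avg_cpu_percent):
    """Return the smallest single-vCPU T2 size whose baseline covers the load."""
    for name, baseline in SINGLE_VCPU_BASELINES:
        if avg_cpu_percent <= baseline:
            return name
    return None  # hungrier than a t2.small baseline; look at larger sizes


print(smallest_fit(4))   # t2.nano
print(smallest_fit(15))  # t2.small
print(smallest_fit(35))  # None
```

The average you feed in should be measured over a representative stretch of real load, not guessed from a quiet afternoon.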

The t2.nano through the t2.small are 1 vCPU instances. The t2.medium and t2.large are 2 vCPU instances. The trick to walking the line with the medium and large T2s is to understand that the baseline is an aggregate of both vCPUs. For example, if you use 100% of one vCPU and 5% of the other, your CloudWatch metrics will show about 52.5% utilization.
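A quick way to sanity-check the aggregate figure, assuming the instance-level number is the plain average of the per-vCPU figures as described above. The function name is hypothetical.

```python
def aggregate_utilization(per_vcpu_percent):
    """Average the per-vCPU utilization figures into one instance-level number."""
    return sum(per_vcpu_percent) / len(per_vcpu_percent)


print(aggregate_utilization([100, 5]))  # 52.5
print(aggregate_utilization([40, 40]))  # 40.0
```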

With either one or two processors, the expectation is clear — stay under the baseline and you’ll get constant performance. Well, what happens when your workload gets hungry for more?

In most cases, AWS expects that you are going to soar past the baseline occasionally. Really, asking a system to boot using less than an average of 5% of its CPU is unrealistic. (It’s like the idea that people want more hotdog bun than hotdog, what rocket scientist came up with that??)

To address the inevitable over-use of the processor, AWS implements CPU credits. Single-processor T2 instance sizes start you off with 30 credits; dual-processor sizes start you with 60.

One credit equates to one minute of full CPU performance on one vCPU core. AWS is agonizingly literal about this: they track the utilization of the vCPUs at the millisecond level. So you can and do consume fractions of a credit as your system works the processor, but at least you are charged an exact amount as opposed to getting rounded up on. You can trust but verify the state of your systems and pools using CloudWatch metrics and alarms. AWS stays very much above board on this.
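The credit definition turns into simple arithmetic. This is a minimal sketch with a hypothetical helper name, where the utilization figure is the average across the vCPUs being counted.

```python
def credits_used(utilization_percent, minutes, vcpus=1):
    """One credit is one vCPU at 100% for one minute, so usage scales linearly."""
    return utilization_percent * minutes * vcpus / 100.0


print(credits_used(100, 1))     # 1.0  (one vCPU flat out for a minute)
print(credits_used(50, 2))      # 1.0  (half speed for two minutes)
print(credits_used(25, 60, 2))  # 30.0 (two vCPUs at 25% for an hour)
```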

CPU Credits are the foundation of the T2 instance operation. When you are running your system you are constantly consuming and in some cases banking credits. This is where it gets a wee bit complicated to plan to use T2 instances in a steady state / production environment, and why a lot of people just use them for development or sandboxing use cases.

Basically, AWS provisions each T2 instance size with a large starting pool of credits, and then feeds it enough over the course of each hour to run at its baseline level. If you consume more than the credits you are given in an hour, the pool goes down. If you consume less, you get to bank the credits that you didn’t use. Each credit you earn is good for 24 hours in the bank. If you don’t use that particular credit, it expires and is taken off that instance’s pool ledger.
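To get a feel for the ebb and flow, here is a deliberately simplified hour-by-hour sketch for a single-vCPU instance. It assumes the hourly credit feed equals the baseline percentage of 60 minutes, and it ignores both the 24-hour expiry and the maximum pool size, so treat it as a planning toy rather than a model of what AWS actually does. The function name and the sample load are made up.

```python
def simulate_balance(start_credits, baseline_percent, hourly_cpu_percent):
    """Track the credit pool hour by hour for a single-vCPU instance.

    Simplified on purpose: no 24-hour expiry, no maximum pool size.
    """
    earned_per_hour = baseline_percent * 60 / 100.0  # enough to run at baseline
    balance = float(start_credits)
    history = []
    for cpu in hourly_cpu_percent:
        used = cpu * 60 / 100.0
        balance = max(0.0, balance + earned_per_hour - used)
        history.append(balance)
    return history


# 30 starting credits and a 20% baseline: two quiet hours, then three busy ones.
print(simulate_balance(30, 20, [5, 5, 50, 50, 50]))
# [39.0, 48.0, 30.0, 12.0, 0.0]
```

Two quiet hours bank 18 credits, and three hours at 50% burn through the whole pool.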

Each T2 instance type has a maximum pool size to take into account as well — but the ebb and flow of credits is more important to understand as it is your planning metric. The pool is your safety margin.
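One way to put a number on that safety margin is to ask how long a given pool lasts under a sustained load above baseline. This sketch assumes a single-vCPU instance, an hourly feed equal to the baseline percentage of 60 minutes, and no credit expiry. The function name is hypothetical.

```python
def hours_until_empty(balance, baseline_percent, sustained_cpu_percent):
    """Hours a single-vCPU instance can hold a load before the pool hits zero."""
    earned_per_hour = baseline_percent * 60 / 100.0
    used_per_hour = sustained_cpu_percent * 60 / 100.0
    net_burn = used_per_hour - earned_per_hour
    if net_burn <= 0:
        return float("inf")  # at or under baseline, the pool never drains
    return balance / net_burn


print(hours_until_empty(30, 20, 70))  # 1.0
print(hours_until_empty(30, 20, 15))  # inf
```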

Honestly, one of the best explanations of Credit expenditure and banking is straight from the AWS on-line manual entry on T2 instances:

[I]f a t2.small instance had a CPU utilization of 5% for the hour, it would have used 3 CPU credits (5% of 60 minutes), but it would have earned 12 CPU credits during the hour, so the difference of 9 CPU credits would be added to the CPU credit balance. Any CPU credits in the balance that reached their 24-hour expiration date during that time (which could be as many as 12 credits if the instance was completely idle 24 hours ago) would also be removed from the balance. If the amount of credits expired is greater than those earned, the credit balance will go down; conversely, if the amount of credits expired is fewer than those earned, the credit balance will go up.
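The arithmetic in that passage can be replayed in a few lines. The helper name is made up; the 12 credits per hour and the 5% utilization are the figures from the quote.

```python
def hourly_net_change(earned, cpu_percent, expired=0.0):
    """Credits earned, minus credits used, minus credits that aged out, for one hour."""
    used = cpu_percent * 60 / 100.0
    return earned - used - expired


print(hourly_net_change(12, 5))              # 9.0  (12 earned, 3 used)
print(hourly_net_change(12, 5, expired=12))  # -3.0 (12 old credits also aged out)
```

The second call is the quoted worst case: an instance that sat idle 24 hours ago loses 12 banked credits in the same hour, so the balance dips even though the hour itself was frugal.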

AWS has a pretty “fair” system for helping you utilize T2 instances. However, if you undersize too far, you will exhaust your pool. Once you reach a zero pool balance, AWS will start lowering your access to the vCPU core over 15 minutes until you are throttled down to the baseline expectation of the instance type you provisioned. While a fair and balanced response to using too many resources, it will feel like your application is bogging down, and your CPU KPIs will look as though the system is not stressed. Don’t be fooled: you’re being starved for your gluttony.
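Since the CPU numbers alone won't tell you that you are being throttled, it can help to alarm on the credit balance itself. The sketch below only builds the arguments for a CloudWatch alarm on the EC2 `CPUCreditBalance` metric and does not talk to AWS. The alarm name, the threshold of 10 credits, and the instance ID are placeholders, not recommendations.

```python
def credit_balance_alarm(instance_id, threshold=10.0):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"low-cpu-credits-{instance_id}",  # placeholder naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,   # placeholder: alarm below this many credits
        "ComparisonOperator": "LessThanThreshold",
    }


params = credit_balance_alarm("i-0123456789abcdef0")  # made-up instance ID
print(params["MetricName"], params["ComparisonOperator"])

# With AWS credentials configured, something like this would create the alarm:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

As written the alarm only changes state. To get notified, you would also pass `AlarmActions` with a notification target of your own.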

Admittedly, there are a lot more nuances to working with T2s than with the other instance types in AWS, but there is a financial benefit for sure, as T2 instances tend to be 40%-50% cheaper than their closest M class cousins.

The T2 class excels at single-server workloads, or in other words, single uniform tasks. Think file servers, file transfer servers, and other “You’ve got one job” roles that you can predict an average workload for. Picking up jobs from a queue and processing tasks is another place where they find a happy purpose. I love using them to support Continuous Integration (CI) and Continuous Delivery (CD) tools, mostly because of the predictable nature of those workloads’ resource needs, and because I like to keep those types of systems up in a steady state to handle the needs of Dev and QA teams. Devies and QA Cops have a hair trigger on the panic button in my experience as an Ops Monkey, and I’d rather not move their cheese on tools they see as critical.

(Some things, like team-critical toolsets, I am still bearish on blinking in and out of existence, even if the load on them isn’t constant. T2s let me avoid compromising on cost vs consistency in these cases.)

Using T2s as web origin servers for a website fronted by a CDN, or as a read-replica feeding an ElastiCache node, is another interesting use of the instance type. Sometimes when I suggest people try this out, their eyes bug out. However, the network throughput of a T2 instance is on par with the other t-shirt sized instances in its class, and oddly in some cases there is a quick 1G burst of throughput before they settle into a steady state comparable with their like-sized cousins. At the very least, you should consider load testing such a scenario if you are looking for some pennies to squeeze out of a budget.

Where they don’t fit in very well at all is load-balanced workloads running as a close-knit group of servers. T2s aren’t really team players. Putting T2s in an auto scaling group is also a lesson I can share my scars on, so you can avoid it. They don’t play nice under such scrutiny, and tend to make for some very grumpy load balancers.

Theoretical spreadsheet jockeys often come to the beer-napkin conclusion that acquiring T2s on the Spot market would support a fairly powerful queue tasking workload. In theory, yes. However, AWS would like to stay in business, so you won’t find the uber-frugal T2s on the Spot market.

Finally, don’t expect the T2 instances to give you stellar disk performance, as none of them support the EBS-optimized instance feature. However, a nice steady, under-baseline use case might not need that feature as much, and it’s easy enough to uncover in the KPIs.

So, coming full circle back to our starting analogy (and oh do I love analogies…) how hungry your application/use case is determines if the T2 instance family is appropriate for you. Hopefully you’ll consider these little guys for a job other than a sandbox sometime soon.