The AWS spend of a SaaS small business

Shane Harter
Published in Crafting Cronitor
Jun 9, 2017
[Chart: Cronitor AWS spend for the last 12 months]

In the first 30 days after moving Cronitor to AWS in January 2015, we collected $535 in MRR and paid $64.47 for hosting, data transfer, and a domain name. Since then we've continued to increase our footprint, level up instances, and add more managed services. Despite AWS's reputation as an expensive foot-gun, we've improved availability while keeping our bill consistently close to 12.5% of revenue. Here's a look.

The bumps and scrapes of AWS on the cheap

When it became clear that our idea had a little traction, we knew we needed to raise the bar from side project to small business. Our first goal was not high availability, only higher availability than our previous setup on a single 2GB Linode. We really just wanted to be able to restart our database without losing incoming telemetry data. Our initial build-out was simple:

  • ELB
  • SQS
  • A pair of t2.small instances running our web app and data collection, both in us-west-2
  • A single m3.medium running MySQL and our daemon that evaluates failures and sends alerts

We completed the migration in two hours with essentially no downtime and we were really pleased with ourselves. Beers were raised. Celebratory tweets were tweeted.

The joy was brief.

Issue 1: Breaking the ELB

Our users send telemetry pings when their jobs, tasks and daemons run. With NTP, the average server today has a very accurate clock and we see emergent traffic spikes at the top of every second, minute, hour, and day — as much as 100x our baseline traffic.

Immediately after the migration, users reported intermittent timeouts, and we started double-checking server configs and tailing ELB logs. With traffic spikes still under 100 requests per second we resisted blaming the ELB and looked for mistakes in our own configuration. Finally, we set up a test to ping continuously from a few moments before 00:00 UTC to a few moments after, and we saw requests failing that were never recorded in the ELB logs. Individual instances could be reached and the ELB request queue never backed up. It was clear that connections were being dropped at the load balancer, probably because our traffic spikes were too large relative to our baseline and too short-lived for it to warm up to more capacity. With an expensive AWS support plan we could've asked them to manually increase the size of our ELB, but instead we chose to round-robin requests using DNS and remove the need for the load balancer altogether.
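For what it's worth, the DNS round-robin itself is nothing exotic: a single A record with multiple values, answered in rotating order by resolvers. A minimal sketch with boto3, using a hypothetical hosted zone ID, record name, and instance IPs:

```python
import boto3

# Hypothetical values -- substitute your own hosted zone, record name, and instance IPs.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "ping.example.com."
INSTANCE_IPS = ["203.0.113.10", "203.0.113.11"]

route53 = boto3.client("route53")

# One A record with several values: resolvers rotate the order of answers,
# spreading requests across instances with no load balancer in the path.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,  # short TTL so a bad host can be rotated out quickly
                    "ResourceRecords": [{"Value": ip} for ip in INSTANCE_IPS],
                },
            }
        ]
    },
)
```

The obvious trade-off is that plain DNS round-robin does no health checking of its own, so you need some other way to pull an unhealthy host out of rotation.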

Lesson learned:

  • Cloud solutions like elastic load balancing are generally engineered for the average use case. Think about the ways you are not like the average.

Issue 2: Getting familiar with the CPU credit balance

The T2 family of burstable instances provides cost-effective performance for occasional workloads, or so the website says. What I wish it said was: once you start running this instance at a consistent 25% of CPU you will drain your CPU credit balance, and once it's empty you essentially have the computing power of a Raspberry Pi. No alarms will sound when this happens and your CPU% will not reflect the diminished capacity. The first time we exhausted the credit balance we mistakenly thought connections were being dropped due to an upstream issue.

Lessons learned:

  • If something is a lot cheaper there’s a good reason, so understand it.
  • You should only use T2 instances within an auto-scaling group.
  • Just in case, create a CloudWatch alarm to warn you when the credit balance drops below 100 (sketched just after this list).
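A minimal sketch of that alarm with boto3; the instance ID and SNS topic are placeholders, not our exact configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Placeholder identifiers -- substitute your own instance and notification topic.
INSTANCE_ID = "i-0123456789abcdef0"
SNS_TOPIC_ARN = "arn:aws:sns:us-west-2:123456789012:ops-alerts"

# Fire when the T2 CPU credit balance averages below 100 over five minutes.
cloudwatch.put_metric_alarm(
    AlarmName="cpu-credit-balance-low-" + INSTANCE_ID,
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[SNS_TOPIC_ARN],
)
```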

Issue 3: Reading the fine print

Last year at re:Invent Amazon updated their Reserved Instance offering, probably in response to more generous terms from Google Cloud. The press release said Reserved Instances would be cheaper and could now be moved between availability zones. I’ll drink to that!

When it came time to retire our final T2 instances in October we rolled out new M3s with these cheaper, more flexible twelve-month reservations. After onboarding several large users in Q1 we decided in April to level up instance types again to m4.large. We had six months left on our October reservations, so I planned to sell them like I always had before. Then I learned the expensive truth that the trade-off to these cheaper, more flexible reservations is… you can't resell them.

Lessons Learned:

  • If something is a lot cheaper there’s a good reason, so understand it.
  • Always read the fine print twice. AWS billing is incredibly complicated.

A peek at real-world AWS costs

Our infrastructure today remains fairly straightforward:

  • m4.large cluster handling incoming telemetry collection
  • m3.medium cluster serving our web app and API
  • m4.large worker running our monitoring daemon and alerting system
  • m4.xlarge for MySQL and Redis

We continue to use a number of managed services including SQS, S3, Route53, Lambda and SNS.

Elastic Compute

We use partial upfront reservations for all of our instances.

You can see that on our monthly bill we pay 2/3 as much for provisioned IOPS as we do for instances. In contrast to most cloud metrics, guaranteed IOPS (input/output operations per second) are a concrete SLA, something that has a real cost to provide. They're also an essential part of your EC2 budget for any host where disk performance matters. If you don't pay for IOPS, your workloads will wait in line for remnant capacity.
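Provisioned IOPS are something you opt into when you create a volume. A rough sketch with boto3 and placeholder sizes, assuming io1 as the provisioned-IOPS volume type:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# io1 volumes guarantee a throughput floor; you pay per provisioned IOPS-month
# on top of the per-GB storage charge.
volume = ec2.create_volume(
    AvailabilityZone="us-west-2a",
    Size=200,          # GiB -- placeholder
    VolumeType="io1",
    Iops=2000,         # guaranteed IOPS -- placeholder
)
print("Created volume:", volume["VolumeId"])
```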

Please don’t ask me what an “alarm-month” is

SQS

We use SQS extensively to queue incoming telemetry pings and results from our healthchecks service. One optimization we added a few months after the migration is reading messages in max-size batches. You pay by the number of requests, not the number of messages, so batching reduces costs and makes it significantly faster to gulp down messages.
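A rough sketch of what batched reads look like with boto3; the queue URL is a placeholder and the processing step is a stand-in:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-west-2")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/telemetry-pings"  # placeholder

# One request returns up to 10 messages; long polling avoids billing for empty receives.
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)

messages = resp.get("Messages", [])
for msg in messages:
    print(msg["Body"])  # stand-in for the real processing step

# Deletes can be batched too: one request instead of ten.
if messages:
    sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[
            {"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
            for m in messages
        ],
    )
```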

During the migration we were concerned about SQS as a single point of failure for our data collection pipeline. To mitigate the risk we deployed a small daemon on each host to buffer and retry SQS writes on failure. It’s had to buffer messages only once in 2.5 years so 1) it was 100% worth building and 2) SQS has proven incredibly reliable in us-west-2.
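A hypothetical sketch of that buffer-and-retry pattern (not the actual daemon; the queue URL and spool path are placeholders):

```python
import json
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

sqs = boto3.client("sqs", region_name="us-west-2")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/telemetry-pings"  # placeholder
SPOOL_FILE = "/var/spool/telemetry-buffer.jsonl"  # placeholder local buffer


def send_or_buffer(payload: dict) -> None:
    """Try the SQS write; on failure, append the payload to a local spool file."""
    try:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
    except (BotoCoreError, ClientError):
        with open(SPOOL_FILE, "a") as spool:
            spool.write(json.dumps(payload) + "\n")


def replay_spool() -> None:
    """Periodically retry anything buffered while SQS was unreachable."""
    try:
        with open(SPOOL_FILE) as spool:
            pending = [json.loads(line) for line in spool if line.strip()]
    except FileNotFoundError:
        return
    remaining = []
    for payload in pending:
        try:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
        except (BotoCoreError, ClientError):
            remaining.append(payload)
            time.sleep(1)  # brief back-off before the next attempt
    with open(SPOOL_FILE, "w") as spool:
        for payload in remaining:
            spool.write(json.dumps(payload) + "\n")
```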

Lambda

Our Healthchecks service is built in part on Lambda workers deployed across several regions. It's worth calling out that Lambda has a generous free tier that applies to each region independently. Currently the free tier is advertised as "indefinite".

S3

We backup database snapshots and log files to S3 with replication to us-east-1 for disaster recovery.

An AWS pro-tip: backups and EBS snapshots are vital for disaster recovery, but serious failures within a region are not usually contained to a single service. If you can't launch an instance, there's a good chance you won't be able to copy your snapshots to another region. Do that ahead of time!
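Copying a snapshot across regions is a single API call, which makes it easy to do ahead of time. A minimal sketch with boto3 and a placeholder snapshot ID:

```python
import boto3

# The copy is initiated *from* the destination region, pulling the snapshot out of us-west-2.
ec2_east = boto3.client("ec2", region_name="us-east-1")

resp = ec2_east.copy_snapshot(
    SourceRegion="us-west-2",
    SourceSnapshotId="snap-0123456789abcdef0",  # placeholder
    Description="DR copy of nightly database snapshot",
)
print("Started cross-region copy:", resp["SnapshotId"])
```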

Wrapping it up

Having worked on large corporate and investor-funded AWS deployments, I can personally vouch that you can run up a big tab. They've created a wonderland of tools and shiny objects that all snap together, and you're charged for everything you touch. Not only that, you're also charged for the data flowing in and out of each of those systems. This can be a real gotcha if you're not careful. Our friends at Cloudforecast.io helped us get a better understanding of this, and also wrote a handy guide to reducing AWS data transfer costs.

AWS can be dangerous. You have to be frugal and restrain your appetites, but you’re rewarded with the ability to grow a small business into a bigger one with on-demand access to any resources you need. It’s good to pause for a moment and appreciate how awesome that is — then get back to work.

Cronitor is your missing monitoring tool. Try it free.
