Replacing Pingdom with Lambda

I want to periodically test the availability, error rate, and response time of one of my HTTP services.

Generally I would use Pingdom to measure this, but there are a few things I don’t love about it:

  • It’s another vendor / API / config to add to my stack
  • It starts at $14/mo for 10 checks with a 1 minute interval
  • It has a bunch of features for websites which don’t apply to my use case
  • It’s hard or impossible to customize properties of the requests and report properties of the responses
  • It’s hard to integrate its data into my other operational dashboards

So I set out to build a simple HTTP service health checker on AWS Lambda starting from the “Using AWS Lambda with Scheduled Events” guide.

Lambda Setup

For “Step 1: Select blueprint” I selected the lambda-canary blueprint. This blueprint “performs a periodic check of the given site, erroring out on test failure.” Perfect!

For “Step 2: Configure event sources”, I configured the highest frequency schedule expression: rate(5 minutes).

Lambda offers “cron” like scheduling

For “Step 3: Configure function” I threw away the Python in favor of some simple timeout / interval / callback Node.js code and a 5 minute timeout:

For “Step 4: Review” it looks like this:

No servers

Constantly Check

My Lambda function is configured to be invoked every 5 minutes, and to run for up to 5 minutes. My function code also calls context.succeed() at around 5 minutes. I first tried a setTimeout of exactly 300 seconds and observed errors:

2016-03-05T19:57:58.415Z dc218df7-e30b-11e5-8597-67f2c3eb0614 Task timed out after 300.00 seconds

So I dialed it back to 290 seconds, which prevented the error.

Then I run the HTTP check code in a setInterval of 1 second and log its observations.

The hope and expectation is that I’ll almost always have a Lambda function running and checking my URL every second.

Observations

The goal is to visualize the uptime of my service. Since I’m also building the monitor I’d like to visualize its health too. A bit more setup of CloudWatch Metrics and Dashboards and I can see it all working:

Service and Checker Operational Dashboard

I first thought my Lambda function would also need to put custom CloudWatch Metrics about its observations, but it turns out ELB and Lambda report all the important stuff out of the box.

Here I can see that my service is seeing continual traffic over the past hour. The minimum is 900 requests and the maximum is 2700 requests in a 5 minute window.

3 req/sec to 9 req/sec matches my expectations for this experiment. I expect 1 request per second from Lambda, 1 request per second from the ELB health check and various Internet traffic.

I also intentionally scaled my service down to 0 processes for a few minutes and can clearly see the service interruption and error count.

Equally important, I see my Lambda function has a constant 1 invocation that takes 290 seconds.

Cool!

Room For Improvement

This took an afternoon to set up so there is room for improvement:

  • It was set up manually
  • The URL is hard coded
  • The Node.js HTTP requests are naive. They should report a custom user agent, make sure not to use keep-alive, support HTTPS, etc.
  • The test metrics aren’t broken out from normal Internet requests
  • There is a 10 second gap at the end of every 5 minute invocation during which no checks run. How close can I get to 300 seconds without hitting the timeout?
  • Lambda is in a single region and will certainly fail partially or totally from time to time
  • It’s not hard to imagine the worst case scenario where both the checker and my service fail at the same time due to a correlated outage at AWS
  • There are no notifications on service downtime or checker downtime
  • There are no reports available about failures

But all this can be addressed. CloudFormation or Terraform can almost certainly set this up automatically. The JavaScript is easy to modify. The same function can be configured in another region. CloudWatch Alarms can monitor the metrics and send a notification through SNS.
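As one example of how the request code could be hardened, here is a rough sketch that picks http or https from the URL, sends a custom User-Agent, and avoids connection reuse. The function names, the User-Agent string, and the timeout handling are all assumptions made for illustration, not code from the deployed function.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Build request options for one check. The User-Agent value is a made-up example.
function buildRequestOptions(target, timeoutMs) {
  const u = new URL(target);
  return {
    protocol: u.protocol,
    hostname: u.hostname,
    port: Number(u.port) || (u.protocol === 'https:' ? 443 : 80),
    path: u.pathname + u.search,
    method: 'GET',
    agent: false,       // one-off agent, so no pooled connection is reused
    timeout: timeoutMs, // socket timeout in milliseconds
    headers: {
      'User-Agent': 'lambda-checker/0.1 (sketch)',
      'Connection': 'close',
    },
  };
}

// Perform one check and call done(err, { status, ms }).
function check(target, timeoutMs, done) {
  const options = buildRequestOptions(target, timeoutMs);
  const client = options.protocol === 'https:' ? https : http;
  const started = Date.now();
  const req = client.request(options, (res) => {
    res.resume(); // drain the body so the 'end' event fires
    res.on('end', () => done(null, { status: res.statusCode, ms: Date.now() - started }));
  });
  // The 'timeout' event does not abort the request by itself, so destroy it here.
  req.on('timeout', () => req.destroy(new Error('timed out')));
  req.on('error', (err) => done(err, { ms: Date.now() - started }));
  req.end();
}
```

Taking the URL as an argument also deals with the hard coded URL, since it could come from the scheduled event payload or from configuration instead.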

Costs and Conclusions

Finally the cost, in dollars and total cost of ownership, can’t be ignored.

With 2,592,000 seconds in a month, this single 128 MB Lambda function will consume most of the 3,200,000 seconds of free tier. Without the free tier it will cost $5.39 per month:

0.000000208 dollars/100ms * 10 * 2,592,000 seconds/month = $5.39/mo
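The same arithmetic as a small script, in case the price or the run time changes. The constants are the figures quoted above; the per-request charge is ignored here, as it is in the formula, and the function name is just for illustration.

```javascript
// Figures from this post: 128 MB price per 100 ms and free tier seconds at 128 MB.
const PRICE_PER_100MS = 0.000000208;
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60; // 2,592,000
const FREE_TIER_SECONDS = 3200000;

// Dollars per month for a given number of billed seconds (10 billing units per second).
function monthlyCost(billedSeconds) {
  return PRICE_PER_100MS * 10 * billedSeconds;
}

const cost = monthlyCost(SECONDS_PER_MONTH);
const freeTierUsed = (SECONDS_PER_MONTH / FREE_TIER_SECONDS) * 100;

console.log('$' + cost.toFixed(2) + '/mo');                    // $5.39/mo
console.log(freeTierUsed.toFixed(0) + '% of free tier used');  // 81% of free tier used
```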

$6/mo really isn’t bad for any business service. But I could get an entire 512 MB DigitalOcean Droplet for less. And running this in two or three regions could end up costing close to the Pingdom $14/mo anyway.

However, I am still very intrigued by this concept and will continue to play with it. Since I control the check logic, I can:

  • Check multiple URLs from the single Lambda function. I don’t see why it couldn’t be checking 100s of things.
  • Check more or less frequently than once per second
  • Perform deep checks like authenticated URLs and API transactions
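To give an idea of the first item, here is a rough sketch of fanning one tick out to several URLs. The checkAll helper and the stand-in check function are hypothetical. A real check function would make an HTTP request and follow the same (url, callback) shape.

```javascript
// Run checkFn(url, cb) for every URL in parallel; call done(results) once all finish.
function checkAll(urls, checkFn, done) {
  const results = {};
  let pending = urls.length;
  if (pending === 0) return done(results);
  urls.forEach((url) => {
    checkFn(url, (err, result) => {
      results[url] = err ? { ok: false, error: err.message } : { ok: true, result: result };
      pending -= 1;
      if (pending === 0) done(results);
    });
  });
}

// Stand-in check used only for this example: URLs containing "down" fail.
const fakeCheck = (url, cb) =>
  url.indexOf('down') >= 0 ? cb(new Error('connection refused')) : cb(null, { status: 200 });

checkAll(['http://a.example/', 'http://down.example/'], fakeCheck, (results) => {
  console.log(JSON.stringify(results, null, 2));
});
```

One failing URL does not stop the others from being checked, and each result is logged under its own URL.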

With very interesting side effects:

  • Eliminate a SaaS vendor
  • Eliminate monitoring VMs and software
  • Collect data and observations about the total cost of ownership — in dollars, maintenance and operations — of such a system implemented in Lambda

I work full time on open source infrastructure automation at Convox (website, GitHub).

Please send feedback and/or questions via Medium or Twitter to @nzoschke or email to noah@convox.com.