Thin Status Monitoring

Kev Jackson
Mar 25

For a traditional Software-as-a-Service offering, the need for a dedicated status page for your application is something that engineering teams run up against soon after they get their first external customers. However, there is a great deal of value in having a status page or dashboard even for internal services.

The next logical step from internal systems monitoring is to provide an externally visible status portal to clients. This transparency is useful for commercial applications, Platform-as-a-Service providers, open source distribution platforms, and even internal systems providing a service to other developers.

This transparency allows clients to define an SLA for the service and provides a strong incentive to the service provider to build robust software and systems.

The idea is motivated, from an engineer’s perspective at least, by laziness: “I don’t want to handle calls asking if the service is up, I’ll build a simple status page so that users can check the current status themselves”. This quick project soon morphs into a much larger one when you consider what a good status page entails:

  • Cannot be hosted on the same infrastructure as your main application
  • Should be hosted in a different data centre (ideally on a different continent)
  • Should automatically update when the application or service is unavailable or is suffering from some form of service disruption
  • Should automatically update to show whether an engineer has been assigned to investigate an issue and, if possible, an ETA for a fix
  • Should automatically update to show when an issue is fixed
  • Should provide a historic view of incidents so that clients can tie their service disruption back to your upstream service
  • Should be suitable for integration with 3rd-party software (at the very least allow web scraping with little effort)

Suddenly a simple off-site hosted page with “All Good/Something is broken” is no longer a weekend project.

James decided to investigate the status page landscape and see whether it is as complex as it first seems.

What does good look like?

For an initial requirements gathering exercise, we investigated a variety of status page formats:

Google Cloud
AWS status dashboard

These first two status pages have a mainly technical audience, so breaking the service down into individual APIs makes sense in this context.

The Google Cloud dashboard breaks the information down by API or service. Its view is a calendar that shows service events on a timeline, which is visually more interesting (and more informative) than the descending list that the AWS status dashboard provides.

AWS provides a personalised view (if you are logged in with your AWS credentials), which limits the view to the services that your account is using. They also provide an RSS feed of service outages that gives clients a more computer-friendly view of the same information (which would allow AWS service status to be integrated into an aggregated service status dashboard, for example).
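As a rough illustration of what "computer-friendly" means here, the sketch below parses a feed in RSS 2.0 shape into a list of status events using only the Python standard library. The sample feed, its titles and its dates are invented for the example and are not copied from any real provider; an aggregating dashboard would fetch the real feed instead.

```python
import xml.etree.ElementTree as ET

# Invented sample data in RSS 2.0 shape; not a real provider feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example service status</title>
    <item>
      <title>[RESOLVED] Increased API error rates</title>
      <pubDate>Tue, 01 Jan 2019 10:15:00 GMT</pubDate>
      <description>Error rates have returned to normal.</description>
    </item>
    <item>
      <title>Increased API error rates</title>
      <pubDate>Tue, 01 Jan 2019 09:40:00 GMT</pubDate>
      <description>We are investigating elevated error rates.</description>
    </item>
  </channel>
</rss>"""


def parse_status_feed(xml_text):
    """Return one dict per <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


events = parse_status_feed(SAMPLE_FEED)
for event in events:
    print(event["published"], "|", event["title"])
```

Because each event is a plain dictionary, merging several upstream feeds into one aggregated view is mostly a matter of concatenating and sorting these lists.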

Both of these technical dashboards are great for their audiences. Our users, however, are not technical and need a more product-focused dashboard.

Live service games such as Fortnite and League of Legends require a different kind of status page: one that hides technical details while still breaking down the live service into component products. In the first example, we can see the current status along with a historic timeline of any past incidents.

Fortnite’s status
League of Legends’ status

The second example shows how service disruption (or planned maintenance) can be displayed within the status page. Importantly, both of these examples break the services down in terms that an end-user would naturally understand.

Buy vs Build?

Having decided that we needed a more product-focused status page, do we simply write a cheque to a commercial provider and get integrating? That depends not only on your budget, but also on your willingness to investigate alternatives and on how invested you are in other DevOps tooling that could make building your own version possible.

Open source solutions

We looked at two open source solutions to the status page problem. The first is a Ruby on Rails project and the second a PHP project. Both require a full web application stack to be installed, configured and managed (which poses its own question).

Do we then need to monitor the status stack?

The Warehouse Management Systems team at THG already uses AWS, which led us to look for a more AWS-centric approach to status pages. We found one that fit the bill: LambStatus.

A Serverless status dashboard

The benefits of a serverless approach are huge, especially for a status dashboard: the last thing you need when handling service disruption is for your status dashboard to have a scaling issue!

Outcome of users refreshing a typical status page during an outage…

Integrations

We already use CloudWatch, AppDynamics and VictorOps to handle our host and service monitoring, application monitoring, and incident and on-call management respectively.

We route all CloudWatch alarms and AppDynamics alerts through VictorOps to ensure that the correct on-call engineer is alerted. We also push alerts into dedicated Slack channels to keep the entire team informed.

A simplified view of THG WMS alert routing and status page mechanism

As VictorOps acts as an aggregation platform for our various alerts, we can also use it to push our alerts into LambStatus and keep the status dashboard automatically updated.

Serverless Application Model (SAM)

As the LambStatus code is entirely serverless, it makes sense to deploy it to AWS using the AWS Serverless Application Model (SAM). This allows us to associate the core LambStatus code with additional AWS resources (AWS DynamoDB tables and custom AWS Lambda code), which we can use to route incident information from VictorOps to our status page. The SAM configuration is contained in a YAML file.

In DynamoDB, we keep a mapping between VictorOps incident types and their equivalent product area or service in LambStatus. The function that converts a VictorOps incident into a valid LambStatus incident is also defined in this SAM configuration.
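To make the idea of the mapping concrete, here is a minimal sketch of the conversion step, with an in-memory dictionary standing in for the DynamoDB table. Every name in it is an assumption made for illustration: the incident types, component IDs, alert fields (`incident_type`, `phase`, `summary`), the phase names and the status strings are hypothetical and are not taken from our real table or from the VictorOps or LambStatus documentation.

```python
# Hypothetical rows of the mapping table, keyed by incident type.
# All keys, IDs and names below are invented for illustration.
MAPPING_ROWS = {
    "wms-api-latency": {"component_id": "cmp-api", "component_name": "Warehouse API"},
    "wms-label-printing": {"component_id": "cmp-labels", "component_name": "Label printing"},
}

# Assumed translation from an alert phase to (incident status, component status).
PHASE_TO_STATUS = {
    "UNACKED": ("Investigating", "Partial Outage"),
    "ACKED": ("Identified", "Partial Outage"),
    "RESOLVED": ("Resolved", "Operational"),
}


def to_status_incident(alert, rows=MAPPING_ROWS):
    """Convert an alert dict into a status page incident dict.

    Returns None when the incident type has no mapping, so that purely
    internal alerts never surface on the public page.
    """
    row = rows.get(alert["incident_type"])
    if row is None:
        return None
    incident_status, component_status = PHASE_TO_STATUS[alert["phase"]]
    return {
        "name": f"{row['component_name']} disruption",
        "status": incident_status,
        "message": alert.get("summary", ""),
        "components": [
            {"componentID": row["component_id"], "status": component_status}
        ],
    }


example = to_status_incident(
    {"incident_type": "wms-api-latency", "phase": "ACKED", "summary": "Slow responses"}
)
print(example)
```

Keeping the mapping as data rather than code means a new product area can be exposed on the status page by adding a row, without redeploying the function.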

The core of the Lambda defined in the SAM config file is its handler. The handler code is invoked by the AWS Lambda infrastructure, looks up incident and status mapping information in DynamoDB, and then posts a correctly formatted ‘incident’ to the LambStatus endpoint.
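The flow described above can be sketched roughly as follows. This is not our production handler: the event shape (a JSON string in `event["body"]`), the table key `incident_type`, the item attribute `component_id` and the status strings are all assumptions. The table is any object with a `get_item(Key=...)` method in the style of a boto3 DynamoDB Table, and the posting step is passed in as a callable, so the sketch runs locally without AWS.

```python
import json


def make_handler(table, post_incident):
    """Build a Lambda-style handler from its two collaborators.

    table         -- object exposing get_item(Key=...) like a boto3 DynamoDB Table
    post_incident -- callable that sends one incident dict to the status page
    """

    def handler(event, context):
        # Assumption: the alert arrives as a JSON string in event["body"].
        alert = json.loads(event["body"])

        # Look up which status page component this incident type maps to.
        found = table.get_item(Key={"incident_type": alert["incident_type"]})
        item = found.get("Item")
        if item is None:
            # Unmapped alerts stay internal and never reach the status page.
            return {"statusCode": 204, "body": ""}

        resolved = alert.get("phase") == "RESOLVED"
        incident = {
            "name": alert.get("summary", "Service disruption"),
            "status": "Resolved" if resolved else "Investigating",
            "components": [
                {
                    "componentID": item["component_id"],
                    "status": "Operational" if resolved else "Partial Outage",
                }
            ],
        }
        post_incident(incident)
        return {"statusCode": 200, "body": json.dumps({"posted": incident["name"]})}

    return handler


class FakeTable:
    """In-memory stand-in for the mapping table, for local experiments only."""

    def __init__(self, rows):
        self._rows = rows

    def get_item(self, Key):
        row = self._rows.get(Key["incident_type"])
        return {"Item": row} if row is not None else {}


# Local usage: collect "posted" incidents in a list instead of calling an API.
sent = []
handler = make_handler(FakeTable({"wms-api-latency": {"component_id": "cmp-api"}}), sent.append)
alert = {"incident_type": "wms-api-latency", "phase": "UNACKED", "summary": "Slow API responses"}
response = handler({"body": json.dumps(alert)}, None)
print(response["statusCode"], sent)
```

Passing the table and the poster in as arguments keeps the handler free of network calls at import time, which makes it easy to exercise before deploying.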

James also wrapped the interactions with LambStatus in a thin API to provide meaningful names to the REST calls we need to make.
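A thin wrapper of this kind might look something like the sketch below. It is not James's code: the class and method names are hypothetical, and the endpoint paths (`/v0/incidents`), the `x-api-key` header and the payload fields are assumptions that should be checked against the LambStatus API documentation. The transport is injectable so that the wrapper can be tried out without a live endpoint.

```python
import json
import urllib.request


class StatusPageClient:
    """Gives intention-revealing names to the REST calls (hypothetical wrapper)."""

    def __init__(self, base_url, api_key, send=None):
        self._base_url = base_url.rstrip("/")
        self._api_key = api_key
        # `send` lets callers swap the HTTP transport, e.g. for a fake in tests.
        self._send = send if send is not None else self._send_http

    @staticmethod
    def _send_http(method, url, headers, body):
        request = urllib.request.Request(url, data=body, headers=headers, method=method)
        with urllib.request.urlopen(request, timeout=10) as response:
            raw = response.read()
        return json.loads(raw) if raw else None

    def _call(self, method, path, payload):
        body = json.dumps(payload).encode("utf-8")
        # Assumed authentication header; verify against the real API.
        headers = {"x-api-key": self._api_key, "Content-Type": "application/json"}
        return self._send(method, self._base_url + path, headers, body)

    def report_outage(self, component_id, title, message):
        """Open a new incident against one component."""
        return self._call("POST", "/v0/incidents", {
            "name": title,
            "status": "Investigating",
            "message": message,
            "components": [{"componentID": component_id, "status": "Partial Outage"}],
        })

    def mark_resolved(self, incident_id, component_id, title, message):
        """Close an existing incident and return the component to normal."""
        return self._call("PATCH", f"/v0/incidents/{incident_id}", {
            "name": title,
            "status": "Resolved",
            "message": message,
            "components": [{"componentID": component_id, "status": "Operational"}],
        })
```

With a wrapper like this, the handler reads as `client.report_outage(...)` rather than as a hand-built HTTP request, and any change to the underlying API is absorbed in one place.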

Results

After hooking up this custom code to LambStatus and deploying the SAM application, all that is left is to pick a snappy domain name and register it:

Our LambStatus powered status page

THG Tech Blog

THG is one of the world’s fastest growing and largest online retailers. With a world-class business, a proprietary technology platform, and disruptive business model, our ambition is to be the global digital leader.

Written by Kev Jackson, Principal Software Engineer @ THG. We’re recruiting: thg.com/careers