Thin Status Monitoring
For a traditional Software-as-a-Service offering, the need for a dedicated status page is something that engineering teams run up against soon after they get their first external customers. However, there is also a great deal of value in having a status page or dashboard for internal services.
The next logical step from internal systems monitoring is to provide an externally visible status portal to clients. This transparency is useful for commercial applications, Platform-as-a-Service providers, open source distribution platforms, even internal systems providing a service for other developers.
This transparency allows clients to define an SLA for the service and provides a strong incentive to the service provider to build robust software and systems.
The idea is motivated, from an engineer's perspective at least, by laziness: “I don’t want to handle calls asking if the service is up, I’ll build a simple status page so that users can check the current status themselves”. This “I can build it in a weekend” project quickly morphs into something much larger when you consider what a good status page entails:
- Cannot be hosted on the same infrastructure as your main application
- Should ideally be hosted in a different data centre (preferably on a different continent)
- Should automatically update when the application or service is unavailable or is suffering from some form of service disruption
- Should automatically update to show whether an engineer has been assigned to investigate an issue and, if possible, an ETA for a fix
- Should automatically update to show when an issue is fixed
- Should provide a historic view of incidents so that clients can tie their service disruption back to your upstream service
- Should be suitable for integration with 3rd-party software (at the very least allow web scraping with little effort)
Suddenly a simple off-site hosted page with “All Good/Something is broken” is no longer a weekend project.
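To make the scope concrete, the requirements in the list can be captured in a small data model. This is a minimal sketch for illustration only, not anything the team built: the `Incident` and `Update` classes, the state names and the JSON layout are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical incident states, ordered from first report to fix.
STATES = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class Update:
    timestamp: str  # ISO 8601 string, e.g. "2020-01-01T09:00:00Z"
    state: str      # one of STATES
    message: str    # human-readable note, e.g. assigned engineer or ETA


@dataclass
class Incident:
    component: str
    updates: List[Update] = field(default_factory=list)

    def add_update(self, timestamp: str, state: str, message: str) -> None:
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        self.updates.append(Update(timestamp, state, message))

    @property
    def is_open(self) -> bool:
        # An incident with no updates, or whose latest update is not
        # "resolved", still counts as open.
        return not self.updates or self.updates[-1].state != "resolved"


def render_status(incidents: List[Incident]) -> str:
    """Return a machine-readable status document, including history."""
    open_components = sorted({i.component for i in incidents if i.is_open})
    document = {
        "overall": "Something is broken" if open_components else "All Good",
        "affected_components": open_components,
        "history": [asdict(i) for i in incidents],
    }
    return json.dumps(document, indent=2)
```

Serving this JSON alongside the human-readable page would cover the third-party integration point without forcing clients to scrape HTML.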
Our Chief Engineer, James, decided to investigate the status page landscape and see whether it is as complex as it first seems.
What does good look like?
For an initial requirements gathering exercise, we investigated a variety of status page formats:
The first two status pages, the Google Cloud and AWS dashboards, have a mainly technical audience, so breaking the service down into individual APIs makes sense in this context.
The Google Cloud dashboard breaks down the information by API or service. Its view, however, is a calendar that shows service events on a timeline, which is visually more interesting (and more informative) than the descending list that the AWS status dashboard provides.
AWS provides a personalised view (if logged in with your AWS credentials), which limits the view to the services that your account is using. They also provide an RSS feed of service outages that gives clients a more computer-friendly view of the same information (which would allow AWS service status to be integrated into an aggregated service status dashboard, for example).
Both of these technical dashboards are great for their audiences. Our audience (or users), however, is not technical and needs a more product-focused dashboard.
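As a rough illustration of what a computer-friendly view makes possible, the sketch below parses an RSS 2.0 feed into a list of events that an aggregated dashboard could consume. The sample feed is invented for the example; a real feed would be fetched over HTTP and its titles and dates would differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample data in standard RSS 2.0 layout (not a real status feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Service Status</title>
    <item>
      <title>[RESOLVED] Increased API error rates</title>
      <pubDate>Mon, 06 Jan 2020 10:30:00 GMT</pubDate>
      <description>The issue has been resolved.</description>
    </item>
    <item>
      <title>Increased API error rates</title>
      <pubDate>Mon, 06 Jan 2020 09:45:00 GMT</pubDate>
      <description>We are investigating increased error rates.</description>
    </item>
  </channel>
</rss>
"""


def parse_status_feed(feed_xml: str) -> List[Dict[str, str]]:
    """Extract title, publication date and description from each feed item."""
    root = ET.fromstring(feed_xml)
    events = []
    for item in root.findall("./channel/item"):
        events.append({
            "title": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return events


if __name__ == "__main__":
    for event in parse_status_feed(SAMPLE_FEED):
        print(event["published"], "-", event["title"])
```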
Live service games such as Fortnite and League of Legends require a different kind of status page — one that hides technical details while still breaking down the live service into component products. In the first example, we can see the current status along with a historic timeline of any past incidents.
The second example shows how service disruption (or planned maintenance) can be displayed within the status page. Importantly, both of these examples break the services down in terms that an end-user would naturally understand.
Buy vs Build?
Having decided that we needed a more product-focused status page, do we simply write a cheque to Atlassian and get integrating? That depends not only on your budget, but also on your willingness to investigate alternatives and on how invested you are in other DevOps tooling that could make building your own version feasible.
Open source solutions
Two open source solutions to the status page problem are Staytus and Cachet. The first is a Ruby on Rails project and the second is a PHP project. Both require the equivalent of a LAMP stack to be installed, configured and managed, which raises its own question.
Do we then need to monitor the status stack?
The Warehouse Management Systems team at THG already uses AWS, which led us to look for a more AWS-centric approach to status pages: we found one that fit the bill, LambStatus.
A Serverless status dashboard
The benefits of a serverless approach are significant, especially for a status dashboard: the last thing you need when handling service disruption is for your status dashboard to have a scaling issue!
We route all CloudWatch alarms and AppDynamics alerts through VictorOps to ensure that the correct on-call engineer is alerted. We also push alerts into dedicated Slack channels to keep the entire team informed.
As VictorOps acts as an aggregation platform for our various alerts, we can also use it to push alerts into LambStatus and keep the status dashboard automatically updated.
Serverless Application Model (SAM)
As the LambStatus code is entirely serverless, it makes sense to deploy it to AWS using the AWS Serverless Application Model. This allows us to associate the core LambStatus code with additional AWS resources (DynamoDB tables and custom AWS Lambda code), which we can use to route incident information from VictorOps to our status page. The SAM configuration is contained in a YAML file.
We keep a mapping in DynamoDB from VictorOps incident types to their equivalent product area or service in LambStatus. The function that converts a VictorOps incident into a valid LambStatus incident is also defined in this SAM configuration.
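A mapping lookup of this kind might look like the following sketch. The attribute names `incident_type` and `component_id` are hypothetical, and `InMemoryTable` is a stand-in that imitates only the `get_item` call of a boto3 DynamoDB `Table`, so that the example runs without AWS credentials.

```python
from typing import Any, Dict, Optional


class InMemoryTable:
    """Stand-in for a boto3 DynamoDB Table; only get_item is imitated."""

    def __init__(self, items: Dict[str, Dict[str, Any]]) -> None:
        self._items = items

    def get_item(self, Key: Dict[str, Any]) -> Dict[str, Any]:
        # Like DynamoDB, leave "Item" out of the response when nothing matches.
        item = self._items.get(Key["incident_type"])
        return {"Item": item} if item is not None else {}


def lookup_component(table: Any, incident_type: str) -> Optional[str]:
    """Return the status-page component ID for an incident type, if mapped."""
    response = table.get_item(Key={"incident_type": incident_type})
    item = response.get("Item")
    return item["component_id"] if item else None


if __name__ == "__main__":
    table = InMemoryTable({
        "warehouse-api-latency": {
            "incident_type": "warehouse-api-latency",
            "component_id": "component-1",  # hypothetical component ID
        },
    })
    print(lookup_component(table, "warehouse-api-latency"))
```

In a deployed function, the same `lookup_component` could be handed a real table object obtained from `boto3.resource("dynamodb").Table(...)`, with whatever table name the SAM template defines.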
The core of the Lambda defined in the SAM config file is its handler. The handler code is invoked by the AWS Lambda infrastructure, looks up incident and status mapping information from DynamoDB, and then posts a correctly formatted ‘incident’ to the LambStatus endpoint.
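The flow described here (receive an alert, look up its mapping, post an incident) can be sketched as follows. This is not James's actual handler. The alert fields (`incident_type`, `phase`, `title`, `message`), the phase names, the status strings and the incident payload shape are all assumptions for illustration, and the lookup and posting steps are injected as plain callables so the example runs without AWS or a status page.

```python
import json
from typing import Any, Callable, Dict, Optional

# Assumed translation from alert phase to (incident status, component status).
# These names are illustrative and should be checked against both services.
PHASE_TO_STATUS = {
    "UNACKED": ("Investigating", "Major Outage"),
    "ACKED": ("Identified", "Major Outage"),
    "RESOLVED": ("Resolved", "Operational"),
}


def build_incident(alert: Dict[str, Any], component_id: str) -> Dict[str, Any]:
    """Convert an incoming alert into an incident payload (assumed shape)."""
    incident_status, component_status = PHASE_TO_STATUS[alert["phase"]]
    return {
        "name": alert["title"],
        "status": incident_status,
        "message": alert.get("message", ""),
        "components": [
            {"componentID": component_id, "status": component_status},
        ],
    }


def make_handler(
    lookup_component: Callable[[str], Optional[str]],
    post_incident: Callable[[Dict[str, Any]], Any],
) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Lambda-style handler around the two injected dependencies."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # Assumes an API Gateway proxy event carrying the alert as a JSON body.
        alert = json.loads(event["body"])
        if alert.get("phase") not in PHASE_TO_STATUS:
            return {"statusCode": 400, "body": "unknown alert phase"}
        component_id = lookup_component(alert["incident_type"])
        if component_id is None:
            return {"statusCode": 404, "body": "no mapping for incident type"}
        post_incident(build_incident(alert, component_id))
        return {"statusCode": 200, "body": "ok"}

    return handler
```

Keeping the conversion in a pure function such as `build_incident` means the mapping logic can be unit tested without invoking Lambda at all.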
James also wrapped the interactions with LambStatus in a thin API to provide meaningful names to the REST calls we need to make.
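A thin wrapper of this kind could look something like the sketch below. The class name `StatusPageClient`, its method names, the `/incidents` paths, the `x-api-key` header and the payload fields are assumptions rather than James's code, and should be checked against the LambStatus API documentation before use. The transport is injectable so that the wrapper can be exercised without a network.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List


def _send(request: urllib.request.Request) -> Dict[str, Any]:
    """Default transport: perform the HTTP call and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


class StatusPageClient:
    """Gives meaningful names to the REST calls the status page needs."""

    def __init__(self, base_url: str, api_key: str,
                 send: Callable[[urllib.request.Request], Dict[str, Any]] = _send) -> None:
        self._base_url = base_url.rstrip("/")
        self._api_key = api_key
        self._send = send

    def _request(self, method: str, path: str,
                 payload: Dict[str, Any]) -> urllib.request.Request:
        return urllib.request.Request(
            url=f"{self._base_url}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "x-api-key": self._api_key,  # assumed auth header
            },
            method=method,
        )

    def open_incident(self, name: str, message: str,
                      components: List[Dict[str, str]]) -> Dict[str, Any]:
        payload = {"name": name, "status": "Investigating",
                   "message": message, "components": components}
        return self._send(self._request("POST", "/incidents", payload))

    def resolve_incident(self, incident_id: str, name: str, message: str,
                         components: List[Dict[str, str]]) -> Dict[str, Any]:
        payload = {"name": name, "status": "Resolved",
                   "message": message, "components": components}
        return self._send(
            self._request("PATCH", f"/incidents/{incident_id}", payload))
```

Callers then read as `client.open_incident(...)` and `client.resolve_incident(...)` rather than as hand-built HTTP requests, which is the point of the wrapper.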
After hooking up this custom code to LambStatus and deploying the SAM application, all that is left is to pick a snappy domain name and register it, and Bob is your uncle.