3 Easy Steps to Cloud Operational Excellence

Bruce Wang
Published in Runscope
May 30, 2017

There are a lot of tools out there, and sometimes it’s hard to sift through them all. Here’s a simple guide to combining 3 tools (Runscope, PagerDuty and StatusPage) into a powerful cloud operational workflow that gives you peace of mind and gives your customers and internal teams clear visibility into your application.

In case you’re not familiar with the tools, here’s a quick rundown:

  • Runscope — highly flexible API testing and monitoring service
  • PagerDuty — incident management system
  • StatusPage — Customer-facing API health status page

The workflow

It’s important to implement tools for specific purposes, and we wanted to integrate these 3 tools to help manage our operational process better.

In the following examples, we’re going to show you how we added a new feature to our product (Live Streaming APIs), and added operational visibility to it.

First let’s walk you through our workflow:

  • A Runscope test monitors the service on a schedule and, depending on how it has been configured, runs from several geographic locations
  • If a Runscope test fails, PagerDuty creates an incident, alerts our Slack channel, and alerts the appropriate engineers
  • PagerDuty also updates the service status on StatusPage to alert our customers that the service is having problems
  • Once the problem is resolved and the failing Runscope test starts to pass again, the incident on PagerDuty resolves itself and the service on StatusPage automatically reverts to operational status
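The workflow above can be sketched as a small state machine. This is only a toy model for illustration: the `Workflow` class and its field names are hypothetical, and in the real setup the three services do this work through their integrations, not through code we wrote.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Toy model of the Runscope -> PagerDuty -> StatusPage chain (hypothetical)."""

    incident_open: bool = False
    component_status: str = "Operational"
    log: list = field(default_factory=list)

    def on_test_result(self, passed: bool) -> None:
        """React to one scheduled test run."""
        if not passed and not self.incident_open:
            # First failure: open an incident and flag the public component.
            self.incident_open = True
            self.component_status = "Degraded"
            self.log.append("incident triggered; Slack and on-call engineers alerted")
        elif passed and self.incident_open:
            # First passing run after a failure: everything resolves on its own.
            self.incident_open = False
            self.component_status = "Operational"
            self.log.append("incident resolved; status page back to Operational")
        # Repeated failures or repeated passes change nothing.


wf = Workflow()
for result in [True, False, False, True]:
    wf.on_test_result(result)
print(wf.component_status)
for entry in wf.log:
    print(entry)
```

Note that only a change in the test result causes an action; a test that keeps failing does not open a second incident.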

Step 1: Create a Runscope test for the Live Streaming API

Runscope provides an easy way to make POST requests to an API and then make assertions on the response.

Here is what our POST request to our live streaming service looks like in Runscope:

Note: {{xxx}} is a variable that can be set from previous tests or configured via “environment”-specific settings. You may hard-code values in the beginning, but variables are invaluable for creating richer tests across your various service environments.
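To make the idea of {{xxx}} variables concrete, here is a minimal sketch of placeholder substitution. It is not Runscope's implementation; the `render` function, the variable names, and the example URLs are all made up for illustration.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value configured for variable {name!r}")
        return str(variables[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical per-environment settings; real values would live in Runscope.
environments = {
    "staging": {"base_url": "https://staging.example.com", "api_key": "staging-key"},
    "production": {"base_url": "https://api.example.com", "api_key": "prod-key"},
}

url_template = "{{base_url}}/v1/stream?api_key={{api_key}}"
for env_name, settings in environments.items():
    print(env_name, render(url_template, settings))
```

The same request definition is reused for every environment; only the settings change.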

When our live stream API is called, the JSON response should include a playback URL and a stream URL, so we just need to add some simple assertions in Runscope.

We check that the HTTP response status is 200, and then we check that playback_url and stream_url are not empty. We also save the values of playback_url and stream_url.

The reason for saving the values is that we will then call our video details API and assert that the values stream_url and playback_url are present.

We then make the assertion on the details API that the playback_url and stream_url are the values we expect.
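Expressed as code, the two sets of assertions might look like the sketch below. We configured them in the Runscope UI, so this is only an illustration: the function names and the example payloads are hypothetical, and only the field names playback_url and stream_url come from our API.

```python
def check_create_response(status_code: int, body: dict) -> dict:
    """Assert on the create-stream response and return the values to save."""
    assert status_code == 200, f"expected HTTP 200, got {status_code}"
    saved = {}
    for key in ("playback_url", "stream_url"):
        assert body.get(key), f"{key} is missing or empty"
        saved[key] = body[key]
    return saved


def check_details_response(body: dict, saved: dict) -> None:
    """Assert that the details API returns the values saved earlier."""
    for key, expected in saved.items():
        assert body.get(key) == expected, f"{key}: expected {expected!r}, got {body.get(key)!r}"


# Made-up payloads standing in for the two API responses.
create_body = {
    "playback_url": "https://example.com/play/abc",
    "stream_url": "rtmp://example.com/live/abc",
}
saved = check_create_response(200, create_body)
check_details_response({"video_id": "abc", **create_body}, saved)
print("all assertions passed")
```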

After we built this test, we put it on a schedule using the ‘Schedules’ menu in Runscope and we were ready to add a PagerDuty alert so that we could be notified if the test for the live streaming API fails.

Step 2: Setting up PagerDuty with Runscope

Luckily, Runscope and PagerDuty have a pre-built integration, so all we had to do was go to PagerDuty and create a new service under the ‘Configuration’ menu. When adding the service, we specified ‘Runscope’ for ‘Integration Type’.

Then we configured the ‘Incident Settings’ and ‘Incident Behavior’ and simply clicked ‘Add Service’. Once the service was added, we were able to see it under our ‘Services’ in PagerDuty.

To connect our live stream test in Runscope to PagerDuty, we went into Runscope under ‘Connected Services’ and clicked the button that said ‘Connect PagerDuty’.

Then Runscope asked us to authorize our PagerDuty account, so we entered our PagerDuty credentials and clicked ‘Authorize Integration’. Finally, we chose the PagerDuty service that we wanted to integrate with Runscope and clicked ‘Finish Integration’.

Once we did that, inside of ‘Connected Services’ in Runscope we could see our PagerDuty integration:

As you can see from the screenshot, our PagerDuty service called ‘SYNQ Live Stream Check’ is now integrated into Runscope. The last step was connecting the PagerDuty service to our Runscope test for the live streaming service. To do that, we went to the live stream Runscope test, opened the ‘Editor’, and modified the integrations for the environment we were using. Then we just flipped the integration to ‘ON’.

Note: Again, this notification is a per-environment setting. As you can see, this environment is “Production”.

We now had the live stream test from Runscope connected to PagerDuty, so we would get alerted by text message or phone call if the Runscope test failed. In addition, we connected PagerDuty to our Slack channel by following this guide, so that if a PagerDuty incident is triggered by Runscope, we get alerted on our Slack channel. The last piece left was to connect PagerDuty to StatusPage, so that our clients could be alerted if the live streaming service fails.
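The pre-built integration means we never had to write this ourselves, but it can help to see roughly what a trigger and its matching resolve look like. The sketch below builds event bodies using the field names of PagerDuty's public Events API v2; whether the Runscope integration uses that API internally is an assumption on our part, and the routing key, the `build_event` helper, and the dedup key format are hypothetical. Nothing is sent over the network.

```python
import json

# Public Events API v2 endpoint; shown for reference only, not called here.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key: str, test_name: str, passed: bool) -> dict:
    """Build a trigger event for a failed test or a resolve event for a passing one."""
    event = {
        "routing_key": routing_key,
        "event_action": "resolve" if passed else "trigger",
        # Using the same dedup_key for both events ties the resolve to the
        # incident that the earlier trigger opened.
        "dedup_key": f"runscope-{test_name}",
    }
    if not passed:
        event["payload"] = {
            "summary": f"Runscope test failed: {test_name}",
            "source": "runscope",
            "severity": "critical",
        }
    return event


# Placeholder key; a real one comes from the PagerDuty service's integration.
trigger = build_event("YOUR_ROUTING_KEY", "SYNQ Live Stream Check", passed=False)
resolve = build_event("YOUR_ROUTING_KEY", "SYNQ Live Stream Check", passed=True)
print(json.dumps(trigger, indent=2))
print(json.dumps(resolve, indent=2))
```

The shared dedup key is what lets an incident close on its own once the test starts passing again.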

Step 3: Adding the Live Streaming Service to StatusPage

Now that we have a way to monitor and alert on our live streaming service, we need to expose this to our clients. We do this with our public-facing StatusPage (having a transparent operational status is very important, and you can read more about that here).

To connect PagerDuty and StatusPage, we followed this PagerDuty guide. Once we had both of the accounts connected, the rest of the setup occurred on StatusPage. Inside of our StatusPage configuration, we now had a section for PagerDuty. Inside that section, to connect a component to a PagerDuty service, we needed to add a rule.

Under ‘SYNQ Live Stream Check’, we clicked ‘Add Rules’, which brought us to another page where we were able to connect the ‘Live Stream’ component on our StatusPage to the PagerDuty ‘SYNQ Live Stream Check’ service.

We clicked on ‘Save Rules’ and we were done. On StatusPage under ‘PagerDuty Setup’ and ‘Active Services’ we could now see our ‘SYNQ Live Stream Check’ present:

Now our public facing StatusPage shows our ‘Live Stream’ status!

If our live stream service test fails on Runscope, the ‘Live Stream’ component on our status page goes from ‘Operational’ to ‘Degraded’.

Mayday! Mayday! We have a Problem

Although our live stream service was still in alpha, we had no issues and our Runscope tests for the service were all green. Then one day, we got a text message from PagerDuty alerting us that our Runscope test for our live stream service was failing. At the same time, we were alerted on Slack and our ‘Live Stream’ component on StatusPage went from ‘Operational’ to ‘Degraded’.

We immediately went into our live stream Runscope test and noticed that we were not getting the expected HTTP response code from our live stream API. At this point we knew that our live stream service was having an actual failure. We then checked the logs for our streaming servers in Amazon CloudWatch and noticed that they were not taking any requests to create new streams. We eventually traced this to a backend service we depended on that had run out of resources.

We discovered two issues. First, we were not deleting old and unused streams, so streams piled up until resources ran out. Second, our Runscope tests were running too often, which exacerbated the problem by creating 288 unused streams a day. We learned that running a Runscope test too often is not always ideal, and that building a test and monitoring model around new features can help you find bugs in your platform.
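As a quick sanity check on that number: 288 streams a day is what you would get from one test run every 5 minutes, assuming each run creates exactly one stream and the test runs from a single location. Those assumptions are an inference from the figure, not settings we have listed here. The small sketch below shows how the count changes with the schedule interval.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes: int) -> int:
    """Number of scheduled runs in a day for a given interval in minutes."""
    return MINUTES_PER_DAY // interval_minutes


# One leaked stream per run is assumed here.
for interval in (1, 5, 15, 60):
    print(f"every {interval:>2} min: {runs_per_day(interval)} unused streams per day")
```

Lengthening the interval, or deleting the stream at the end of each test run, keeps a monitoring test from becoming a load source of its own.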

Conclusion

Thanks for sticking with us for the whole article. Hopefully you got a lot of value from it. Feel free to ask us any questions about our process or the individual services we use in the comments below. Happy Service Building!

Bruce Wang
Runscope

Dir of Eng @ Netflix, Co-founder, CTO at Large @Synq.fm, foodie, techie, and startup advisor, based in SF