Monitoring on the BBC Sounds Mobile application. A story of parenting and friendship

Diego Pinedo Escribano
BBC Product & Technology
Jul 29, 2020 · 7 min read

Chapter I: The Beginning

We love the audience, and we love Sounds. Like parents, we want to make sure our baby, the Sounds mobile application, is safe and happy. We want to know when something is wrong so we can help. Sometimes you have to trust your kids and give them freedom; sometimes you ask them to call you every hour so you know they are OK and not talking to strangers.

That is why on that evening, The Council called for a meeting to address this issue. During that meeting, a wise man said:

“Currently we lack a client-side monitoring solution that allows rapid alerting and diagnosis of issues in the Sounds mobile app. We would like to make client-side issues visible, measure their impact, and respond rapidly to problems.”

As a consequence of that meeting, a ticket was created. One epic ticket to rule them all.

Chapter II: Sharing is caring

Our lovely neighbours, the iPlayer team, were already using a monitoring solution built on AWS CloudWatch, so instead of reinventing the wheel, we decided to reuse their system first and then iterate and build on top of it.

An important part of our monitoring solution is not what it does or how it does it (I will get to that a bit later). As many important people say, it is all about the journey.

We had several meetings with the iPlayer team and they helped us to copy their solution. We created a new AWS account and forked their GitHub repo. After several find-and-replaces and adding some new lambdas, we started reporting our first JSON parsing errors (fake ones, of course; our app is perfect and does not produce any errors).

We proved that the solution worked, but Winter was coming and we had to prioritise other pieces of work like DARK MODE.

Chapter III: Let’s do this

In 2020, Thanos (a.k.a. the Olympics, a.k.a. Glastonbury) was coming. It was time to move to Monitoring and Alerting phase 2.

Avengers, assemble!

Under the supervision of Nick Fury (a.k.a. the Cloud Engineering team), iPlayer and Sounds joined forces and created a common solution we could reuse, following the best architecture patterns and ensuring it was flexible enough to accommodate both teams’ aspirations.

Several meetings, some mobbing sessions and three or four diagrams later, we were ready. Allow me to introduce you to: Sounds Mobile Monitoring on AWS.

TLDR:

Tell me more:

When something particular happens in the app (something we are interested in monitoring), the application sends a request to a specific AWS endpoint with useful information like the version of the app, the version of the operating system, etc. The request also contains what is called a metric body: a JSON payload that helps us know more about events like “A certain error has occurred when playing a particular episode” or “The JSON parsing has failed for this particular endpoint”.
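
For illustration, a metric body might look something like this (the field names and values here are made up for the example, not our exact schema):

{
  "metricName": "PlaybackError",
  "appVersion": "2.1.0",
  "osVersion": "Android 10",
  "detail": {
    "episodeId": "m000xxxx",
    "errorCode": "MEDIA_SOURCE_ERROR"
  }
}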

These metrics are parsed by a lambda function and stored in AWS CloudWatch as logs. These logs are then fed into a Grafana dashboard where we can monitor the state of the application.

I am an Engineer, show me the real stuff:

The application:
The application:
We have a MonitoringService that can be used anywhere in the application. This service translates the input information into a JSON payload that is then sent inside an HTTP request using the BBC HTTP Client library.
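
A rough sketch of the idea in Kotlin, with hypothetical names and the HTTP client reduced to a simple function so the example stays self-contained (this is not the production code):

import android.os.Build
import org.json.JSONObject

// Hypothetical sketch; the real service sends the request via the BBC HTTP Client library.
data class MonitoringConfig(val enabled: Boolean, val endpoint: String)

class MonitoringService(
    private val config: MonitoringConfig,
    private val postJson: (url: String, body: String) -> Unit // stand-in for the HTTP client
) {
    fun report(metricName: String, detail: Map<String, String> = emptyMap()) {
        if (!config.enabled) return // reporting can be switched off via remote configuration

        // Translate the input into the JSON metric body described earlier
        val body = JSONObject()
            .put("metricName", metricName)
            .put("appVersion", "2.1.0") // illustrative; normally read from the build config
            .put("osVersion", "Android ${Build.VERSION.RELEASE}")
            .put("detail", JSONObject(detail))

        postJson(config.endpoint, body.toString())
    }
}

// Usage, e.g. from the player:
// monitoringService.report("PlaybackError", mapOf("episodeId" to "m000xxxx"))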

We can enable and disable sending these payloads using a remote configuration file that also includes the AWS endpoint we want to target. This allows us to easily swap between production and testing environments.
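
The shape of that configuration is roughly like this (the keys and values are illustrative, not the real file):

{
  "monitoring": {
    "enabled": true,
    "endpoint": "https://monitoring.example.invalid/sounds"
  }
}

Because the file is remote, switching environments or turning reporting off does not need an app release.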

Disclaimer: All the information we send is anonymous. We don’t send anything that can be used to track the user. Our solution has the InfoSec Seal of Approval.

AWS:

When a request hits AWS, it first goes through our shared load balancer. From there, a different lambda function is triggered depending on the domain (iPlayer or Sounds).

Our lambda works slightly differently from the iPlayer one. We extract the information we need from the payload and from the User-Agent, and we create our own metrics with our own dimensions (application version and OS version). These metrics are then stored as AWS CloudWatch logs.

The reason behind this design is that we wanted our metrics to be generated dynamically, instead of relying on a predefined set of metrics and metric filters to generate them, as the iPlayer version does. The advantage of this approach is that we are more flexible when our client apps need to report new metrics, because we don’t need to modify and redeploy our lambdas.
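
To make the idea concrete, here is a simplified sketch, not our real lambda. It assumes the metric body shown earlier, and it uses CloudWatch’s embedded metric format as one possible way of creating metrics dynamically from a log line (that last part is an assumption for the example rather than a description of the production code):

import org.json.JSONArray
import org.json.JSONObject

// Simplified sketch: parse the metric body and the User-Agent, then emit a log line
// in CloudWatch's embedded metric format so a metric is created dynamically,
// without predefined metric filters.
fun handleMetric(requestBody: String, userAgent: String) {
    val metric = JSONObject(requestBody)
    val metricName = metric.getString("metricName")
    val appVersion = metric.optString("appVersion", "unknown")
    // Illustrative User-Agent handling; the real parsing is more careful
    val os = if (userAgent.contains("Android")) "Android" else "iOS"

    val cloudWatchMetrics = JSONArray().put(
        JSONObject()
            .put("Namespace", "SoundsMobile")
            .put("Dimensions", JSONArray().put(JSONArray().put("appVersion").put("os")))
            .put("Metrics", JSONArray().put(JSONObject().put("Name", metricName).put("Unit", "Count")))
    )

    val logLine = JSONObject()
        .put("_aws", JSONObject()
            .put("Timestamp", System.currentTimeMillis())
            .put("CloudWatchMetrics", cloudWatchMetrics))
        .put("appVersion", appVersion)
        .put("os", os)
        .put(metricName, 1)

    // Anything a lambda writes to stdout ends up in CloudWatch Logs
    println(logLine)
}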

Another consequence is that, by creating one metric log per metric body and per dimension, we can create richer Grafana graphs. A side effect is that the number of metric logs we generate is much larger than in the iPlayer version.

Grafana:
We use Grafana to display graphs that allow us to monitor the state of the live application.

Grafana is configured to read from the CloudWatch logs. We have created one dashboard that we share between the iOS and Android applications.

The dashboard is divided into different panels. Each panel can get its data from one or more metrics, and each metric can have multiple dimensions, like the application version.

Here are some examples of what we are monitoring right now on the Sounds mobile application:

Example: Playback errors per thousand plays

1000 * SUM(REMOVE_EMPTY(SEARCH('Namespace="SoundsMobile" MetricName="PlaybackError" os="Android"', 'SampleCount', 300))) / SUM(REMOVE_EMPTY(SEARCH('Namespace="SoundsMobile" MetricName="PlaybackStart" os="Android"', 'SampleCount', 300)))

These panels allow us to monitor unusual spikes in playback errors. If we see anything abnormal happening here, we can go to the CloudWatch logs to get more insight into the individual events, like the ID of the episode that is producing the spike.

Example: Runtime errors

These panels help us identify runtime errors that can affect the user experience.

For example, if we are receiving too many JSON parsing errors, it could mean that the Sounds content provider is giving us some content we are not ready to display, which means the user could be missing some interesting content like the new ‘The Digital Human’ podcast episode.
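
A panel for those could use a query in the same style as the playback example above (the metric name here is a placeholder, not necessarily what we call it):

SUM(REMOVE_EMPTY(SEARCH('Namespace="SoundsMobile" MetricName="JsonParsingError" os="Android"', 'SampleCount', 300)))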

If we get too many HTTP Unauthorised Errors, it could mean there is something wrong with our BBC Authtoolkit library and we should investigate ASAP.

Chapter IV: The future

Reusing the parenting metaphor, we wouldn’t be good parents if we stopped worrying about our children when they finally get a job and move out. We still want to know if they are eating properly and when they are going to have a baby.

To get better at monitoring, we are setting up alerts so we don’t have to constantly watch the Grafana dashboard. We also keep improving the dashboard by adding new panels and refining the existing ones.

Epilogue

What I am about to share has never been released to the public, but I think people should know about it. Please allow me to share with you part of the original script of Monty Python’s Life of Brian. I don’t know what happened in the end and why they decided to change it in favour of talking about the Romans…

Tony

Monitoring has taken resources and time from us, and what has it ever given us in return?

Diego

We saw the Android app was making HTTP requests to a non-existent endpoint

Tony

Ah yeah, that is true, we found that issue

Dave

We also saw some JSON parsing errors on the iOS app

Jamie

And we can now check if the DRM library is not initialising properly

Polly

We can also now check if there are spikes in play errors on particular programmes

Tony

Ok, but apart from the HTTP errors, the JSON parsing errors, the DRM error and the play errors, what has the monitoring ever done for us?

Stu

It improved the quality of the application

Tony

Oh shut up!
