A practical guide to client-side software monitoring for Quality Assurance

Corentin Godeau
Lumen Engineering Blog
8 min read · May 31, 2023

At Lumen, we assist video streaming platforms in enhancing user experience and resource utilization through our client-side products. Two of these products, Lumen® Mesh Delivery and Lumen® CDN Load Balancer, are provided as a native SDK written in C++ and wrapped in platform-specific libraries.

To ensure a certain level of quality, we developed sample apps that are meant to reproduce our customers' production environments. We let them run for extended periods of time (sometimes multiple days) on multiple devices in parallel. This article documents how we developed our custom solution to monitor these devices, using Prometheus and Grafana while also leveraging the specificities of our products.

If you are working on a product that needs to run on many devices interacting with each other and want to monitor them, you might find this article useful.

Introduction

As mentioned at the beginning of this article, our products are provided as native SDKs written in C++. They act as a proxy that intercepts HTTP requests from a video player.

These SDKs are integrated in sample apps that we run for long periods of time on multiple devices. We call these runs “long runs” internally, and that’s how I will refer to them in the rest of this article.

This is where monitoring starts to become interesting. Since it's impossible for a human being to watch these long runs continuously, we needed a way to quickly tell whether something went wrong and, more importantly, which device encountered issues.

Given that, we came up with the following list of requirements:

  • On the client side, integration should only require a few adjustments. We didn't want to spend too much time integrating a heavy framework and needed quick results.
  • The tool we choose should be compatible with all the platforms we currently support (Android, iOS, Windows/Linux/macOS, UWP) and those we will eventually support (Roku, set-top boxes, …).
  • Device discovery should not require any manual intervention except setting the URL of our monitoring system.

Backend

First of all, we needed something to act as a backend where all monitoring data would be collected, processed, and/or displayed.

Prometheus logo (source: https://www.vectorlogo.zone/logos/prometheusio/prometheusio-ar21.png)

After a quick analysis of the available tools, Prometheus (https://prometheus.io/) seemed to be an ideal choice. Quoting from their website, Prometheus is “an open-source systems monitoring and alerting toolkit”.

In other words, it lets us gather vitals from our SDK and potentially alert us in case of unexpected behaviors. Furthermore, Prometheus can be used as a “Data Source” for Grafana.

An example of a Grafana dashboard (source: https://grafana.com/media/grafana/images/grafana-dashboard-english.png)

At that point, we started to see what our long run monitoring system would look like: each device would appear on a Grafana dashboard, and a quick look at the vital graphs could inform us of a potential problem in the tested version of our SDK.

Prometheus is built on top of time series, which means it stores streams of timestamped values. Data can either be pushed by clients or pulled by Prometheus; we use the latter mode, as it's the most commonly used one. In pull mode, Prometheus sends a scrape request at regular intervals to each of its registered targets. This request is directed at the /metrics HTTP endpoint exposed by the target.

Example: If your software exposes an HTTP server on port 9090 and its IP address is 10.145.3.45, Prometheus will regularly send requests to http://10.145.3.45:9090/metrics.

Now one problem remained to be solved. If you remember our requirements, I said that target discovery should not require any specific manual action. If Prometheus expects to be given a list of targets, and we expect targets to automatically register with our monitoring system, how should we do it?

This is where Service Discovery is useful. Prometheus offers a way to specify an HTTP endpoint that can be used to retrieve the list of current targets to be scraped for metrics. We took advantage of that feature to develop a very basic discovery service.

Here is a quick explanation of how it works:

  • When starting, the SDK tries to contact this Discovery Service with a request on a /welcome endpoint (the URL is configurable, as stated in the requirements).
  • When stopping, it sends a request to the /bye endpoint.

The /welcome request has another purpose: this is how the IP address and port of each target are retrieved. Since the SDK HTTP proxy is listening on that port, Prometheus can use it to scrape metrics.

Example: If we keep the same IP address and port, the Discovery Service would receive a /welcome request containing the 10.145.3.45 IP address and the 9090 port, and expose it as a target so that Prometheus can send scrape requests to http://10.145.3.45:9090/metrics.
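
To make this flow more concrete, here is a minimal sketch of the client side of that registration, assuming the discovery service derives the device's IP address from the incoming connection and receives the proxy port as a query parameter. The endpoint names match the ones above, but the URL, the parameter name, and the use of libcurl are purely illustrative; they are not how our SDK is actually implemented.

```cpp
// Sketch of registering/unregistering with a discovery service.
// Assumptions: the service URL and the "port" query parameter are hypothetical;
// libcurl is used only to keep the example short and self-contained.
#include <curl/curl.h>
#include <string>

// Send a fire-and-forget GET request and report whether it succeeded.
static bool sendGet(const std::string& url) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 5L);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);

    // Hypothetical discovery service URL, normally read from the SDK configuration.
    const std::string discoveryUrl = "http://discovery.internal:8080";
    const int proxyPort = 9090;  // port the SDK's HTTP proxy listens on

    // On SDK start: register this device as a scrape target.
    sendGet(discoveryUrl + "/welcome?port=" + std::to_string(proxyPort));

    // ... the long run happens here ...

    // On SDK stop: unregister so the target list stays clean.
    sendGet(discoveryUrl + "/bye?port=" + std::to_string(proxyPort));

    curl_global_cleanup();
    return 0;
}
```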

As a safety measure, we also added a way to purge the list of targets in the discovery service. Since this monitoring system was going to be used on development builds, bugs or crashes could prevent the /bye request from being sent. In that situation, stale targets would pile up, and we wanted to avoid that.
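
For illustration, here is a rough sketch of the state such a discovery service has to maintain: a set of ip:port targets updated by /welcome, /bye, and the purge operation, plus a serializer producing the JSON document that Prometheus' HTTP service discovery mechanism expects. Our actual service is not shown here; the class and method names below are hypothetical.

```cpp
// Illustrative in-memory target registry for a minimal discovery service.
#include <iostream>
#include <mutex>
#include <set>
#include <sstream>
#include <string>

class TargetRegistry {
public:
    void welcome(const std::string& ipAndPort) {   // handle a /welcome request
        std::lock_guard<std::mutex> lock(mutex_);
        targets_.insert(ipAndPort);
    }
    void bye(const std::string& ipAndPort) {        // handle a /bye request
        std::lock_guard<std::mutex> lock(mutex_);
        targets_.erase(ipAndPort);
    }
    void purge() {                                  // safety valve for stale targets
        std::lock_guard<std::mutex> lock(mutex_);
        targets_.clear();
    }
    // Format the answer to Prometheus' HTTP service discovery request,
    // e.g. [{"targets":["10.145.3.45:9090"],"labels":{}}]
    std::string toHttpSdJson() const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::ostringstream out;
        out << "[";
        bool first = true;
        for (const auto& target : targets_) {
            out << (first ? "" : ",")
                << "{\"targets\":[\"" << target << "\"],\"labels\":{}}";
            first = false;
        }
        out << "]";
        return out.str();
    }
private:
    mutable std::mutex mutex_;
    std::set<std::string> targets_;
};

int main() {
    TargetRegistry registry;
    registry.welcome("10.145.3.45:9090");
    std::cout << registry.toHttpSdJson() << std::endl;
    registry.purge();
    return 0;
}
```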

Client side

Prometheus being a popular solution, there are client libraries for all common languages. Some of them are officially supported by the Prometheus project and others are unofficial but still considered the go-to solution.

As our SDK is written in C++, we started looking at the prometheus-cpp client library. After some time spent investigating how to integrate this library into our existing codebase, it quickly appeared that it wouldn't be possible. prometheus-cpp requires zlib and libcurl to work properly, and they are not provided with the library. We already had a dependency on zlib but, due to the way the library is supposed to be integrated into an existing CMake project, there was no way to properly tell CMake to use our existing zlib dependency.

After some thinking, we came to the conclusion that this was actually not required at all. prometheus-cpp essentially offers two conveniences: exposing an HTTP server and declaring Prometheus metrics in a C++-friendly way. Let's analyze these two problems separately.

HTTP server for Prometheus scraping

prometheus-cpp can take care of creating and running the HTTP server that Prometheus will use to scrape metrics. But, in our case, we already have something very similar. The proxy we use to intercept requests from the video player is already an HTTP server. Why would we need a third-party library to create and run something we already have internally?

Furthermore, if we take a step back and look at the big picture, why stop there? Exposing an HTTP server from our SDK opens new possibilities. We could expose more than just monitoring metrics, such as real-time stats or events. We could also think of it as a two-way communication channel: if it's possible to query for data, it's also possible to control how our SDK behaves by changing the values of internal variables. As soon as there is an entry point into the SDK, a lot of possibilities open up, and designing it only for monitoring with Prometheus felt a bit narrow-minded.

Metrics declaration

prometheus-cpp also offers a simple, C++-friendly way to declare metrics and automatically format them so that everything works automagically. But if we look at the Prometheus documentation, we can see that it supports multiple formats for the HTTP response to scrape requests; they are described in the exposition formats documentation. These formats are actually pretty simple, especially the text-based one. There is no difficulty in formatting metrics so that Prometheus can understand them.
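
As a rough illustration, here is a minimal sketch of how little code is needed to produce that text-based format. The metric names and values are made up; a real handler would read live values from the SDK.

```cpp
// Minimal sketch of the Prometheus text-based exposition format.
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Format a single gauge in the text format:
//   # HELP <name> <help>
//   # TYPE <name> gauge
//   <name> <value>
std::string formatGauge(const std::string& name, const std::string& help, double value) {
    std::ostringstream out;
    out << std::setprecision(15);
    out << "# HELP " << name << " " << help << "\n";
    out << "# TYPE " << name << " gauge\n";
    out << name << " " << value << "\n";
    return out.str();
}

int main() {
    std::string body;
    body += formatGauge("sdk_cpu_usage_percent", "CPU usage of the sample app.", 12.5);
    body += formatGauge("sdk_memory_usage_bytes", "Resident memory of the sample app.", 52428800);
    body += formatGauge("sdk_battery_level_percent", "Device battery level.", 87);
    // This string is exactly what would be sent back in the body of the /metrics response.
    std::cout << body;
    return 0;
}
```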

Given these two observations, we decided that using the prometheus-cpp library was not required. Instead, we planned to slightly modify our HTTP proxy by adding a request-handling mechanism. We wrote a very basic HTTP router that associates request handlers with specific routes. I won't dive into the details, as that's not the subject of this article, but with this mechanism, answering Prometheus scrape requests was just a matter of registering a handler on the /metrics route. This handler gathers all the metrics we want to expose, formats them according to the text-based format mentioned earlier, and sends the result back as an HTTP response.
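
To give an idea of what that looks like, here is a stripped-down sketch of such a router and the /metrics registration, under the assumption that a handler is simply a function producing a response body. Our actual proxy and router are more involved; the names below are illustrative.

```cpp
// Sketch of a tiny path-based router with a /metrics handler registered on it.
#include <functional>
#include <iostream>
#include <map>
#include <string>

using Handler = std::function<std::string()>;

class Router {
public:
    void addRoute(const std::string& path, Handler handler) {
        routes_[path] = std::move(handler);
    }
    // Dispatch a request path to its handler; a real implementation would
    // build a proper 404 response instead of returning an empty body.
    std::string handle(const std::string& path) const {
        auto it = routes_.find(path);
        return it != routes_.end() ? it->second() : "";
    }
private:
    std::map<std::string, Handler> routes_;
};

// Placeholder for the metric-gathering code sketched in the previous snippet.
std::string collectMetricsAsText() {
    return "# TYPE sdk_cpu_usage_percent gauge\nsdk_cpu_usage_percent 12.5\n";
}

int main() {
    Router router;
    // Answering Prometheus scrapes boils down to registering this one route.
    router.addRoute("/metrics", collectMetricsAsText);
    std::cout << router.handle("/metrics");
    return 0;
}
```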

Demo

For our first version, we decided to start small and only monitor very basic vitals such as CPU, memory, and battery usage. The rationale behind this choice was:

  • By comparing CPU usage between versions, we can easily catch obvious performance regressions
  • By looking at a graph of memory usage, we can quickly catch any memory leak
  • By looking at a graph of battery usage, we can see the impact our software has on power consumption (we expected this to be strongly correlated with CPU usage but still wanted to gather the metric)

Here are some screenshots of various Long Runs we did using this monitoring tool:

Grafana dashboard showing a long run with a 1 hour window
Grafana dashboard showing a long run with a 12 hour window

Conclusion

What can we conclude from this project? Mainly that entry-level monitoring is not hard to set up. Tools such as Prometheus and Grafana are very easy to grasp and can quickly give you interesting insights into your software. Since they are very popular, you are almost certain to find client libraries that can be integrated into an existing codebase.

However, we would suggest not blindly going for an off-the-shelf solution. As with any tool, you first need to understand how it works and see whether it really suits your needs. For us, this meant first understanding how Prometheus gathers metrics from its registered targets, and then noticing that all the required pieces were essentially already present in our codebase.

As a rule of thumb, I'd suggest rolling your own implementation if you happen to have an HTTP server running somewhere in your software. Prometheus' text-based format is extremely easy to understand, and you won't spend any significant time exporting your metrics to that format.

So, if, like us, you are in a situation that requires monitoring devices for extended periods of time, don't hesitate to give this solution a try. You'll get quick observability benefits without spending too much time.

This document is provided for informational purposes only and may require additional research and substantiation by the end user. In addition, the information is provided “as is” without any warranty or condition of any kind, either express or implied. Use of this information is at the end user’s own risk. Lumen does not warrant that the information will meet the end user’s requirements or that the implementation or usage of this information will result in the desired outcome of the end user. All third-party company and product or service names referenced in this article are for identification purposes only and do not imply endorsement or affiliation with Lumen. © 2023 Lumen Technologies
