Why user-centric monitoring is important for complex systems

Cloud and distributed systems provide great resiliency and high availability, but they also call for a change in what you monitor. Monitoring from the end-user perspective is critical for complex systems: this is how we learned it.

Andrea Bisello
THRON tech blog
7 min read · Oct 19, 2018

Software systems are becoming more and more complex: cloud architectures (autoscaling, microservices, programmatic infrastructure) limit developers’ ability to evaluate how local issues propagate through the system. This calls for an improvement in the way we monitor things. Too often we see traditional monitoring tools used just to track system parameters (CPU, disk quota, threads, memory, etc.) or application behaviour to debug faulty calls. As systems grow more complex, the relationship between “system has issues” and “customer has issues” weakens: you might have faults at the system level while the service is still reliable for the customer, and that might be fine.

Why a change of perspective is useful

We see it all the time: when a small company starts thinking about monitoring, it is usually because the engineering team needs to measure system health, so infrastructure and application monitoring become hot topics, and several tools are evaluated along with the processes involving the on-call engineers.

Monitoring system components is not enough to know whether the system is working or not.

The assumption that system monitoring + application monitoring = service health holds only when your system is simple (not fully committed to microservice architectures, or not fully leveraging cloud vendor services) and you have one team (or very few, well-communicating ones) covering the whole product. Things become more complex when your product is composed of several independent or loosely coupled elements, and different teams or suppliers are in charge of the different components. The traditional monitoring approach cannot tell you whether the service is working for end users, because each team lacks the broad vision of how the complex system behaves: even if all components are working, the service for the end user might still be disrupted or slow.

There’s also a prioritisation issue: how would you prioritise an alert coming from local component monitoring? Does it really mean it’s urgent? This becomes clear when you think about the distributed, redundant architectures that are increasingly common with cloud computing: a system failure might not cause any issue to customers at all. How can you prioritise issues if you can’t easily and quickly foresee the end-user impact? We understood that, in order to prioritise correctly, we needed to complement our monitoring practices with elements that measure service availability and quality from the end-user perspective. This led us to User-Centric Monitoring: monitoring what the user can see, in addition to what you send to the user.

Example of the lack of “user-centric monitoring”: the kitchen (system) was working, but users were starving :)

We kept the existing application and system monitoring tools, because the different engineering teams need them to diagnose local problems and ensure they do not escalate into bigger issues, and we started working on our user-centric monitoring dashboard.

What should it measure?

Since we wanted to understand how customers perceive the product, our first stakeholder was the technical support team rather than product management, so that we could get a first-hand view of how users actually use the product/service. Identifying the main use cases and the frictions users encounter with your product/service should be an easy task for your tech support team. We suggest starting with the metrics you can extract right away from this first interaction, before moving on to other stakeholders. This discussion taught us a lot about what customers need as opposed to what our engineers need (both have to be satisfied). Product management was then able to complement the most common usage patterns with the unusual ones, and to provide insights about future ones that upcoming product/service updates might create.

How we make it

What we realised is that, despite a market full of monitoring tools, the metrics we wanted to measure could be unusual and hard to integrate into existing tools. We also wanted a very lean process and light infrastructure, to keep the project evolving quickly and easily; this is why we started working on bespoke monitoring probes and a dashboard built on modern cloud components and serverless computing. Probes have been the most challenging part, because they have to simulate real user behaviour and thus interact with the product through the browser or mobile app.

Languages and tools

This is not an engineering team project but a QA one, so at first we thought of using Python, because it’s a powerful yet easy-to-master language with a good learning curve for new QA members. It’s important that probes and monitoring are not developed by product engineers: this kind of test is better made by people other than those who build the product. Python is also easy to port across platforms, but in the end we started with Scala to reduce development time, since our product is built with Scala and we had a very useful SDK with probe code already developed in past projects that shortened our initial steps. We still plan to port the probes to Python in future evolutions.

To check real user usage, our probes must simulate user interactions, so we needed to adopt new tools. Selenium is the state of the art in automated browser control and provides WebDrivers for the different browser types. To leverage Selenium we chose Robot Framework, an open-source framework designed to let non-developers design tests. We mixed Robot Framework with direct API calls based on the need. Python/Scala also allow us to script even complex interactions, and Robot Framework has a Python SDK, so it can easily be extended with bespoke features.
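As an illustration, a minimal Python probe in this spirit might drive Selenium to load a page as a user would, then apply a pure, unit-testable check to what the browser actually rendered. This is our sketch, not THRON’s actual probe code: the URL, the element id and the latency threshold below are hypothetical, while the Selenium calls themselves (webdriver.Chrome, find_elements, By.ID) are the standard API.

```python
# Sketch of a user-centric probe: load a page through a real browser,
# then judge the outcome with a pure check on what the user would see.
# URL, element id and threshold are hypothetical examples.

import time

def check_page(title: str, login_button_present: bool, elapsed_s: float,
               max_elapsed_s: float = 5.0) -> str:
    """Pure verdict on the rendered page: 'ok', 'degraded' or 'ko'."""
    if not title or not login_button_present:
        return "ko"        # the user cannot even start the main flow
    if elapsed_s > max_elapsed_s:
        return "degraded"  # working, but slower than the user expects
    return "ok"

def run_browser_probe(url: str = "https://example.com/login") -> str:
    # Selenium is imported lazily so the pure check above stays
    # testable without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        start = time.monotonic()
        driver.get(url)
        elapsed = time.monotonic() - start
        # "login-button" is a hypothetical element id for this sketch.
        present = len(driver.find_elements(By.ID, "login-button")) > 0
        return check_page(driver.title, present, elapsed)
    finally:
        driver.quit()
```

The point of the split is that the verdict is computed from what the browser rendered, not from server-side metrics, and the decision logic can be tested in isolation.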

Regression test made by an automated browser: it finds a difference between the baseline image and the actual image (a button has disappeared).

Geographical distribution of probes

Thanks to Robot Framework and our own API calls we could start collecting data about the different use cases, but for a worldwide distributed system, performance analysis must be performed from the different geographical areas your customers access the product from (the end-user perspective). We were already using a very powerful tool (Catchpoint, www.catchpoint.com), but we wanted more flexibility and the ability to develop custom logic to evaluate our product. To distribute our probes geographically, we created different testing sites using AWS Lambda and AWS API Gateway, which let us execute code without having to manage (and pay for) underlying servers. This choice is interesting because it brings a “pay per use” approach: we can easily tune monitoring costs by changing the frequency of probe execution.
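A probe packaged as a Lambda function might look like the sketch below: deploying the same handler to several AWS regions yields per-region end-user measurements, and API Gateway (or a scheduled trigger) controls the probe frequency, and hence the cost. The endpoint URL and the latency budget are hypothetical assumptions, not values from the article.

```python
# Sketch of a geographically distributed probe as an AWS Lambda handler.
# The endpoint and SLO budget below are hypothetical examples.

import json
import time
import urllib.request

SLO_SECONDS = 2.0  # hypothetical latency budget for this use case

def classify(status_code: int, elapsed_s: float, slo_s: float = SLO_SECONDS) -> str:
    """Turn a raw measurement into a user-centric verdict."""
    if status_code != 200:
        return "ko"
    return "degraded" if elapsed_s > slo_s else "ok"

def handler(event, context):
    """Lambda entry point: probe the endpoint, report latency and verdict."""
    url = event.get("url", "https://example.com/api/health")
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except Exception:
        status = 0  # unreachable counts as a hard failure for the user
    elapsed = time.monotonic() - start
    return {
        "statusCode": 200,
        "body": json.dumps({
            "target": url,
            "elapsed_s": round(elapsed, 3),
            "verdict": classify(status, elapsed),
        }),
    }
```

Because Lambda bills per invocation, changing the schedule of this handler directly tunes the monitoring bill, which is the “pay per use” property described above.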

Dashboarding and alerting

We need to satisfy both push communication (real-time notifications) and pull communication of issues. Communicating an issue requires providing as many details as possible, targeting the right recipient, and being instant. As the push channel we chose Slack, a collaboration tool that, thanks to its API, can easily be integrated with monitoring messages. Each probe writes to a Slack channel, and each channel is seen by QA members; each channel refers to a different component and is visible to the engineers who oversee that component.
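The push side can be sketched as a probe posting its verdict to a Slack incoming webhook, one webhook per component channel. The webhook URL and the message format are hypothetical; Slack’s incoming-webhook API does accept a JSON body with a "text" field, which is all this sketch relies on.

```python
# Sketch: pushing a probe result to Slack via an incoming webhook
# (one webhook/channel per component routes alerts to its engineers).
# The message format is our hypothetical example.

import json
import urllib.request

def format_alert(probe: str, verdict: str, detail: str) -> dict:
    """Build a Slack incoming-webhook payload for a probe result."""
    icon = {"ok": ":white_check_mark:",
            "degraded": ":warning:",
            "ko": ":rotating_light:"}
    return {"text": f"{icon.get(verdict, ':grey_question:')} "
                    f"[{probe}] {verdict.upper()} - {detail}"}

def push_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON; the webhook URL comes from Slack's app config."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # network call; fire-and-forget
```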

A dashboard (the pull approach) is also needed, so that we can see trends and understand whether something is wrong even without receiving an alarm, or reconstruct the state of the system before an alarm. This leads to the need for a real-time dashboard that also allows going back in time to see past states.

We chose to develop our dashboard on Google Firebase because it offers a realtime database service that is well suited to real-time dashboard needs. The Realtime Database’s Javascript client generates events each time data changes in the database (so each time a probe writes something into it), making it very easy to trigger a visualisation update when new data arrives. With Firebase we obtained a real-time dashboard with very limited effort.
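On the write side, a probe can feed the dashboard through the Realtime Database REST API, where any node is addressable as `<path>.json`; the Javascript client then receives a change event and redraws. The database name, data path and record shape below are hypothetical, and authentication is omitted for brevity.

```python
# Sketch: a probe publishing its result to Firebase Realtime Database
# over the REST API. Database name, path and record fields are
# hypothetical examples; auth (e.g. an access token) is omitted.

import json
import time
import urllib.request

def probe_record(probe: str, verdict: str, elapsed_s: float) -> dict:
    """Record shape for one probe run; the timestamp lets the
    dashboard go back in time to past states."""
    return {"probe": probe, "verdict": verdict,
            "elapsed_s": elapsed_s, "ts": int(time.time())}

def firebase_url(db: str, path: str) -> str:
    """Realtime Database REST endpoint: any node is reachable as <path>.json."""
    return f"https://{db}.firebaseio.com/{path}.json"

def publish(db: str, path: str, record: dict) -> None:
    req = urllib.request.Request(
        firebase_url(db, path),
        data=json.dumps(record).encode("utf-8"),
        method="POST",  # POST appends under an auto-generated key
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # network call; not executed here
```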

Realtime UCM Dashboard

What we like

Adopting UCM gave us immediate improvements in defining issue priority, improving our efficiency:

  • reduced intervention time to solve issues, often identifying them before users experience them
  • detection of new bugs and issue patterns caused by the integration of components, not by single components
  • automation of controls previously performed by humans because they required complex navigation patterns or interactions
  • Python is proving to be a good choice: it’s easy to read and easy to learn, letting us focus on results instead of code (which is less relevant in probes)
  • Google Firebase: building a scalable real-time dashboard was really a matter of hours
  • Selenium for browser control: it’s widely available, well documented, and well integrated with most development languages
  • Robot Framework: a good starting point, well suited for small automation projects, providing quick results with low effort. It’s well integrated with Jenkins too.

What we don’t like

  • Robot Framework: it’s easy to use, but it can quickly become a limit when integrating with big projects, due to the lack of an object-oriented paradigm for reusing code and of a dedicated development IDE. This led us to use Python and Scala more than we wanted, interfacing directly with the Selenium driver
  • Firebase Realtime Database is not designed to support queries, and it is also hard to set up access control levels for different user groups

What to do now

Monitoring from the end-user perspective proved to be an invaluable change, quickly becoming the main way to assess issue priority in our complex architecture. We plan to integrate Real User Monitoring data (monitoring data captured from real user sessions) and to migrate to Google Cloud Firestore, an evolution of the Firebase Realtime Database, to be able to perform complex queries.
