Catching bugs on the client-side: how we developed our error tracking system
Our team develops Badoo and Bumble, two of the world’s largest dating and connection applications. For both, we have a web version (desktop and mobile) and mobile applications (Android and iOS). With millions of users, it’s important for us to gather client-side errors, and for this we use a system of our own, code-named Gelato. For the last two years, I have been involved in developing its server side, and throughout this time I have discovered a lot of new things about the world of error tracking system development that I would like to share with you in this article.
What we will cover:
- How we use error information
- The tools we used previously, and why we developed our own system rather than adopting a ready-made solution
- A brief overview of our system, its architecture, and technology stack.
How we use error information
Firstly, and most obviously, we track errors in production. Nobody is immune to these errors, hence the importance of tracking them, finding out how many users have been affected, and promptly fixing the most critical ones.
Secondly — we conduct error analysis.
At Bumble we release new versions of applications quite often:
- Web applications: 1–2 times a day, including the backend
- Native applications: once a week (dependent on how quickly the build gets accepted by App Store and Google Play).
Error analysis is always one of the steps in releasing any new version of the application. For this, the release manager needs a summary report listing errors in that version. This enables them to decide whether it is safe to deploy the build to production or whether the build contains a bug that eluded our QA, in which case the report will make it clear that the broken feature needs to be removed from the release.
Thirdly — having all the error information available in one place simplifies the work of developers and QA engineers.
What we used previously
Historically, we used two systems to collect client errors: HockeyApp for collecting crash reports from native applications, and our own system for collecting JS errors (written in PHP).
HockeyApp met our needs perfectly until it was acquired by Microsoft in 2014. Microsoft changed HockeyApp’s usage policy and began encouraging people to switch to their new system, AppCenter. At that time, AppCenter did not meet our requirements: it was still in active development and some of the functionality we needed was missing, in particular the deobfuscation of Android application stack traces using DexGuard mapping files, without which error grouping is impossible.
I’ll look at deobfuscation in detail later, so if this is the first time you’ve come across it, this article will hopefully prove useful to you.
A deadline was set: by October 16th, 2019, all HockeyApp users had to migrate to AppCenter. However, support for DexGuard mapping files would only be added to AppCenter at the end of December 2019, a few months after the official termination of HockeyApp.
In addition to this, we found that HockeyApp calculated the total number of errors incorrectly. And since no further development was to be done on HockeyApp, we had to start duplicating this information into our internal analytics system to see the real number of errors.
Internal tool for collecting JS errors
As for the system for collecting JS errors that we developed in-house, for many years it worked flawlessly despite having only basic functionality.
The architecture was quite simple:
- the data was stored in MySQL (we stored information on the last 10–20 releases)
- there was a separate table for each version of the application
- search involved a limited set of fields.
In 2017, our frontend development team approximately doubled in size. The system began to be used more actively and the developers soon became increasingly aware of its limitations.
Having collected and analysed all the requirements our team would ideally like to have, we realised that it was going to require more than just a little hard work to improve the current solution. Developed back in 2014, the system was now obsolete, and the cost of refactoring would exceed the cost of implementing a new solution.
So, the decision was made to gradually switch to a new system that would cover all the existing functionality and meet all our requirements.
Requirements of the new system
- Store all client errors in one place
- Store all the errors without sampling for at least six months. The ability to locate and analyse errors for a particular user at a certain moment in time can simplify the investigation of a phantom bug (one that appears periodically and cannot be reproduced in a development environment) and make the life of developers easier. Thus, the more information we have, the better
- The system has to be able to scale easily so that it can handle errors that produce millions of events per minute. Luckily for us, such errors don’t occur very often, but we have to be prepared
- Fault tolerance — to minimise the risk of data loss
- The system has to be able to answer not only the question “How many events occurred?”, but also “How many users have been affected?”
- Grouping of similar errors while preserving meta-information (time of the first and last error in the group, the release in which the error appeared, error status, etc.)
- Flexible search by any combination of fields with the support of full-text search
- Various statistics: the number of events for the time period, the number of errors by release, by browser, by operating system, and so on
- Integration with Jira. We need to be able to create tickets in Jira for certain errors in one click
- Self-hosted solution. We want the system to work on our hardware. This will provide us with full control over the data and the ability to change cluster configuration at any time.
Why not use a ready-made solution?
Of course, before writing our solution we analysed existing systems on the market.
There are a lot of SaaS solutions out there for tracking and monitoring errors, and this is not surprising: fast detection and fixing of errors is a key aspect of modern development. Among the most popular services are Bugsnag, TrackJS, Raygun, Rollbar and Airbrake. All of them have rich functionality and generally meet our requirements, but we did not consider cloud solutions. Migration to a new solution is a rather complicated and lengthy procedure, and we were concerned that the pricing and usage policies could well change over time, as happened with HockeyApp.
With open-source systems, things were not so rosy. Most of them either stopped developing or never emerged from the development stage and were not recommended for use in production.
In fact, only Sentry continued to evolve and had most of the functionality we needed. But at that time (early 2018), the eighth version of the service did not suit us for the following reasons:
- event sampling
- PostgreSQL was used as the main storage, with which we had no experience
- some of the functionality that we needed was lacking (for example, integration with Jira) and there would be difficulties with implementing it on our own, since the system is written in Python, and our main languages are PHP and Go.
In July 2018, the ninth version of Sentry was released. It introduced integration with issue trackers and laid the foundation for what is, in my opinion, a key improvement: the transition to ClickHouse for storing events (I recommend this series of articles on this). But unfortunately, at the time of our research, none of this was yet in the plans. Therefore, we decided that the best option in our case would be the implementation of our own system, customised for our processes and therefore easy to integrate with other in-house tools.
So, a system codenamed Gelato (General Error Logs And The Others) was born, the development of which is discussed further below.
Brief system overview
As they say, it is better to see once than hear a hundred times, so first I will show what our system can do now so that it becomes clear how we work with errors. This is important for understanding the architecture of the system: how data is used determines how it should be stored.
The main page contains a list of applications and general error statistics for a given criterion.
By clicking on a particular application, we are taken to a page with its release statistics.
By clicking on a particular version, we are taken to a page listing error groups.
Here we can see what errors occurred, how many there were, how many users were affected, when the error first occurred, and its most recent occurrence. Also, we can sort the data by most of the fields and create a ticket in Jira for any error.
This is how this page looks for native applications:
By clicking on a particular error, we are taken to a page giving detailed error information.
Here you can see general information about the error (1), a graph of the total number of events (2), and various analytics (3).
It also contains information about specific events, which is mainly used to analyse the problem.
Another interesting feature that I have never seen in similar systems is release comparison. This makes it quite easy to detect errors that have appeared in the new version, and those that were fixed in previous releases but later began to appear again (regressions).
Select releases to compare:
And we get to a page with a list of errors that are in one version but not in the other:
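Under the hood, a comparison like this boils down to set operations on grouping keys. Here is a minimal sketch of the idea, not our actual implementation: the function names are made up for illustration, and I assume each release is represented by the set of grouping keys of the errors seen in it.

```python
from __future__ import annotations


def compare_releases(old_groups: set[str], new_groups: set[str]) -> dict[str, set[str]]:
    """Split error groups by the release they occur in (inputs are sets of grouping keys)."""
    return {
        "only_in_new": new_groups - old_groups,  # candidates for newly introduced errors
        "only_in_old": old_groups - new_groups,  # errors that no longer occur
        "in_both": old_groups & new_groups,      # errors present in both releases
    }


def find_regressions(fixed_earlier: set[str], new_groups: set[str]) -> set[str]:
    """Groups that were marked as fixed in an earlier release but show up again."""
    return fixed_earlier & new_groups
```

For example, comparing {"a", "b"} with {"b", "c"} reports "c" as present only in the new release.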
As you may have noticed, we have implemented a basic set of functions that cover most of the use cases. We do not intend to stop here, though: shortly we will be adding many useful features that expand the capabilities of the system, including:
- integration with our A/B framework, to track errors that appear in a particular split test;
- advanced analytics (more graphs and charts);
- email digests with application statistics.
The system architecture
Now let’s go under the hood to see how everything works. The scheme is pretty standard and consists of three stages:
- Data collection.
- Data processing.
- Data storage.
This can be depicted schematically as follows:
Let’s get started with data collection.
We proceed from the assumption that the developers of the client application have already taken care of error handling on their side, and all that is required from our service is to provide an API for sending error information in a certain format.
What does the API have to do?
- Read data
- Check the data for compliance with the required format in order to immediately cut off the “noise”
- Save everything to the intermediate queue.
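The three steps above can be sketched roughly as follows. This is an illustration only: the required fields are a hypothetical schema, the status codes are my assumption, and a standard in-process queue stands in for the real intermediate queue.

```python
import json
import queue

# Hypothetical schema: field name -> expected type.
REQUIRED_FIELDS = {"app": str, "release": str, "timestamp": int, "message": str}

# Stand-in for the real intermediate queue.
events_queue: "queue.Queue[dict]" = queue.Queue()


def handle_request(raw_body: bytes) -> int:
    """Return an HTTP-like status: 202 if the event was queued, 400 if it was rejected."""
    # 1. Read data.
    try:
        event = json.loads(raw_body)
    except (ValueError, UnicodeDecodeError):
        return 400
    # 2. Check the format to cut off the "noise" immediately.
    if not isinstance(event, dict):
        return 400
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(event.get(name), expected_type):
            return 400
    # 3. Save to the intermediate queue; all heavy processing happens later.
    events_queue.put(event)
    return 202
```

The handler does no processing of its own, so the client gets a response as soon as the event has been validated and queued.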
Why do we need an intermediate queue?
If we know we have a fairly low EPS (errors per second), and that all parts of our system will work in a stable fashion all the time, then we can significantly simplify the system and make the whole process synchronous.
But you and I know that the real world is not like this: at any stage, at the most inopportune moment, something unexpected can happen, and our system has to be ready for it. For example, an error in one of the application’s external dependencies can cause the application to start crashing, which leads to an increase in EPS (as was the case with the iOS Facebook SDK on July 10, 2020). As a result, the load on the entire system increases significantly, and with it the processing time for a single request.
Or, for example, the database might become temporarily unavailable, so the system will simply not be able to save the data. There can be many reasons for this: problems with network equipment, a data centre employee accidentally touching a wire so that the server switches off, or the disk space running out.
Therefore, to reduce the risk of data loss and make data collection as fast as possible (so that the client does not have to wait a long time for a response), we save all incoming data to an intermediate queue, which is processed by a separate script in our cloud.
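To illustrate why the queue helps when storage is temporarily unavailable, here is a simplified sketch of such a consumer. It is not our actual script: the drain function, the save callback and the retry count are hypothetical, and a real worker would also wait between attempts.

```python
import queue
from typing import Callable


def drain(events: "queue.Queue[dict]", save: Callable[[dict], None], max_attempts: int = 3) -> int:
    """Move events from the queue to storage and return how many were saved."""
    saved = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return saved  # nothing left to process
        for _ in range(max_attempts):
            try:
                save(event)
                saved += 1
                break
            except ConnectionError:
                continue  # storage hiccup: try the same event again
        else:
            # Storage is still down: return the event to the queue (at its tail)
            # and stop, so that nothing is lost and the next run can pick it up.
            events.put(event)
            return saved
```

The important property is that a failure on the storage side never reaches the client: events simply wait in the queue until the consumer can save them.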
What can be used as a queue?
- The first option that comes to mind is a popular message broker like Redis or RabbitMQ
- You can also use Apache Kafka, which is well suited for cases where you need to store the tail of incoming data for a certain period (for example, for some kind of internal analytics). For instance, Kafka is used in the latest (tenth) version of Sentry
- We settled on LSD (Live Streaming Daemon). The system has been in use at Bumble for a long time and has worked well, plus we already have all the necessary bindings in our code to work with it.
When it comes to data storage, there are two questions we need to answer: “Where to store?” (database) and “How to store?” (data model).
When implementing a prototype of the system, we settled on two options: Elasticsearch and ClickHouse.
Among the main pros of Elasticsearch, I would highlight the following:
- horizontal scaling and replication out of the box
- a large set of aggregations, which is convenient when implementing the analytical part of our system
- full-text search
- support for UPDATE by condition (we have an asynchronous data processing process, and we need the ability to perform some steps in the pipeline repeatedly, which requires being able to update certain fields for specific events)
- support for DELETE by condition (we store data for six months, which means we need the ability to delete outdated data)
- flexible configuration via API, which allows developers to change the index settings according to the tasks.
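To make the UPDATE, DELETE and aggregation points more concrete, here is a sketch of the request bodies for Elasticsearch’s _update_by_query, _delete_by_query and _search endpoints. The field names (timestamp, event_id, group_key, user_id) are hypothetical and do not reflect our real mapping.

```python
def delete_older_than(months: int) -> dict:
    """Body for POST /<index>/_delete_by_query: drop events older than N months."""
    # "now-6M" is Elasticsearch date math for "six months ago".
    return {"query": {"range": {"timestamp": {"lt": f"now-{months}M"}}}}


def set_group_key(event_id: str, group_key: str) -> dict:
    """Body for POST /<index>/_update_by_query: rewrite one field of the matching events."""
    return {
        "query": {"term": {"event_id": event_id}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.group_key = params.group_key",
            "params": {"group_key": group_key},
        },
    }


def users_affected(group_key: str) -> dict:
    """Body for POST /<index>/_search: count distinct users in an error group."""
    return {
        "size": 0,  # we only need the aggregation, not the events themselves
        "query": {"term": {"group_key": group_key}},
        # The cardinality aggregation returns an approximate distinct count.
        "aggs": {"users": {"cardinality": {"field": "user_id"}}},
    }
```

The last query is one way to answer “How many users have been affected?” rather than only “How many events occurred?”.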
Of course, like any system, Elasticsearch also has cons:
- complex query language, so the documentation needs to always be at hand (in the latest versions, support for SQL syntax appeared, but this is available only in the paid version (X-Pack) or when using Open Distro from Amazon)
- the JVM, which requires good expertise to keep everything under control, while our main languages are PHP and Go (for example, optimising the garbage collector for a specific load profile requires an in-depth understanding of how things work under the hood; we ran into this issue when upgrading from version 6.8 to 7.5, but fortunately the topic is not new and there are quite a few articles on the Internet, for example, here and here)
- poor string compression; we plan to store quite a lot of data and, although hardware gets cheaper every year, we want to use resources as efficiently as possible (of course, you can use deflate compression instead of LZ4, but this will increase CPU utilisation, which can negatively affect the performance of the entire cluster).
The pros of ClickHouse are:
- excellent write performance
- good data compression, especially for long strings
- MySQL-compatible query syntax, which eliminates the need to learn a new query language, as is the case with Elasticsearch
- horizontal scaling and replication out of the box, although it requires more effort than Elasticsearch.
But at the beginning of 2018, ClickHouse was missing some of the functions we needed:
- support for DELETE by condition (we planned to store data for six months, so we needed the ability to delete outdated data; ClickHouse did not provide deletion of data by an arbitrary criterion, and partitioning by an arbitrary field (in our case, by date) was at that time an experimental feature and was not recommended for use in production)
- support for UPDATE by condition: ClickHouse is geared towards immutable data, so the implementation of updating arbitrary records is not an easy task (this issue was raised on GitHub more than once, and at the end of 2018 the function was finally implemented, but it is unsuitable for frequent updates)
- full-text search (there was an option to search by RegEx, but it requires a full scan, which is rather a slow operation).
Actually, we could have circumvented all the above restrictions (there were even several articles on this topic on the Web, for example), but we wanted to implement a prototype of our system at a minimal cost. For this, we needed a more flexible database, so we opted for Elasticsearch.
Of course, in terms of write performance, Elasticsearch is inferior to ClickHouse, but for us, this was not critical. Much more important was the support of the functionality we needed and scalability out of the box. The fact that we already had an Elasticsearch cluster, which we were using to collect logs from daemons, was also significant — this meant there was no need for us to set up the infrastructure.
Now let’s talk a little about how we store events.
All our data is divided into several groups and stored in separate indices:
- Meta information
- Raw events.
Data is isolated for a specific application (separate index) — this allows us to customise the index settings depending on the load profile. For example, we can keep data of unpopular applications on warm nodes in a cluster (we use a hot-warm-cold architecture).
In order to store both JS errors and crash reports of native applications in the same system, we moved everything that is used to compute general statistics (error occurrence time, the release in which it occurred, user information, grouping key) to the top level, while whatever is unique to each type of error is stored in the nested field attributes with its own mapping.
The idea itself was borrowed from Sentry and slightly modified in the course of use. In Sentry, an event has base fields, a tags field for data that needs to be searchable, and an extra field for all other specific data.
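A rough sketch of what such an event might look like is shown below. Only the split into shared top-level fields and a type-specific attributes field comes from the description above; the concrete field names and values are invented for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Event:
    """One stored event; all field names here are illustrative."""

    # Top-level fields: shared by every error type and used for general statistics.
    app: str
    release: str
    timestamp: int  # error occurrence time, Unix seconds
    user_id: str
    group_key: str
    # Everything specific to one error type lives in the nested attributes field.
    attributes: dict[str, Any] = field(default_factory=dict)


js_event = Event(
    app="web", release="1.2.3", timestamp=1600000000, user_id="u1", group_key="k1",
    attributes={"message": "x is undefined", "type": "TypeError", "origin": "app.js"},
)
ios_event = Event(
    app="ios", release="5.0.0", timestamp=1600000100, user_id="u2", group_key="k2",
    attributes={"crashed_thread": 0, "frames": ["Badoo 0x0000000105c9c6f4"]},
)

# Both documents share the same top-level shape, so statistics work across types.
documents = [asdict(js_event), asdict(ios_event)]
```

Because the top-level shape is identical, statistics by release or by user can be computed without knowing which kind of error an event describes.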
So, now we come to what I consider to be the most interesting thing in developing a system for collecting client errors — data processing. Without it, the information that we collected in the previous step will be useless and we will be unable to receive anything except a signal that something went wrong in our application. But our goal is to be able to track and fix the most critical errors as quickly as possible.
Let’s start with a simpler case.
Processing crash reports from Android apps
To reduce the size of the application as much as possible, it is customary in the Android world to use special utilities during the build process, which:
- remove all unused code (code shrinking)
- optimise everything that remains following the first stage (optimisation)
- rename classes, methods, and properties using a special format, which reduces the size of the codebase and makes the application harder to reverse-engineer (obfuscation).
You can learn more about this from the official documentation.
There are several popular utilities today:
- ProGuard (free version);
- DexGuard based on ProGuard (paid version with extended functionality);
- R8 from Google.
If the application is built using obfuscation mode, then the stack trace will look something like this:
o.imc: Error loading resources: Security check required
at o.jij$c.apply(Unknown Source:0)
at java.lang.reflect.Method.invoke(Native Method)
Not much can be understood from it, except for the error message. To extract useful information from such a stack trace, the obfuscated names first need to be mapped back to the original ones. This process is called deobfuscation; it requires a special file called mapping.txt, which is generated at the time of building the application. Here is a snippet of such a file:
AllGoalsDialogFragment -> o.a:
java.util.LinkedHashMap goals -> c
kotlin.jvm.functions.Function1 onGoalSelected -> e
java.lang.String selectedId -> d
AllGoalsDialogFragment$Companion Companion -> a
54:73:android.view.View onCreateView(android.view.LayoutInflater,android.view.ViewGroup,android.os.Bundle) -> onCreateView
76:76:int getTheme() -> getTheme
79:85:android.app.Dialog onCreateDialog(android.os.Bundle) -> onCreateDialog
93:97:void onDestroyView() -> onDestroyView
Therefore, we need a service to which we can feed the obfuscated stack trace and the mapping file, and get the original stack trace as output.
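To give a feel for what such a service has to do, here is a deliberately simplified sketch that restores class names only. It is not ReTrace and not our service: real deobfuscation also has to resolve methods, overloads and line numbers, and the mapping entries in the usage note are invented.

```python
import re

# A stack frame such as "    at o.jij$c.apply(Unknown Source:0)".
FRAME_RE = re.compile(r"^(\s*at\s+)([\w.$]+)\.([\w$<>]+)(\(.*\))$")


def parse_class_mapping(mapping_text: str) -> dict:
    """Return {obfuscated class name: original class name} from mapping.txt contents."""
    classes = {}
    for line in mapping_text.splitlines():
        line = line.strip()
        # Class lines look like "original.Name -> o.a:"; member lines have no trailing colon.
        if line.endswith(":") and " -> " in line:
            original, obfuscated = line[:-1].split(" -> ", 1)
            classes[obfuscated] = original
    return classes


def deobfuscate(stack_trace: str, classes: dict) -> str:
    """Replace obfuscated class names in the header and frames of a stack trace."""
    result = []
    for line in stack_trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            prefix, cls, method, location = match.groups()
            line = f"{prefix}{classes.get(cls, cls)}.{method}{location}"
        else:
            # Header line: "<exception class>: <message>".
            head, sep, rest = line.partition(": ")
            if sep and head in classes:
                line = f"{classes[head]}: {rest}"
        result.append(line)
    return "\n".join(result)
```

Given a mapping that contains a hypothetical line such as "com.example.Loader$Callback -> o.jij$c:", the frame "at o.jij$c.apply(Unknown Source:0)" becomes "at com.example.Loader$Callback.apply(Unknown Source:0)", while frames of classes that are not in the mapping are left untouched.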
We were not able to find suitable ready-made solutions in the public arena (maybe we did not look very hard), but fortunately for us, ProGuard engineers (and we use DexGuard for obfuscation) were looking out for developers and made the ReTrace utility publicly available, which implements all the necessary functionality for deobfuscation.
Using this, our Android developers wrote a simple service in Kotlin which:
- accepts a stack trace and an application version as input
- downloads the necessary mapping file from Ceph (mappings are uploaded automatically when building a release in TeamCity)
- deobfuscates the stack trace.
Processing crash reports from iOS apps
Crash reports from the iOS application contain quite a lot of useful information, including the stack traces of all threads launched at the time of the crash (read more about the crash report format here and here). But there’s a catch: stack traces contain only information on the memory addresses where classes and methods are located.
0 libsystem_kernel.dylib 0x00000001bf3468b8 0x1bf321000 + 153784
1 libobjc.A.dylib 0x00000001bf289de0 0x1bf270000 + 105952
2 Badoo 0x0000000105c9c6f4 0x1047ec000 + 21694196
3 Badoo 0x000000010657660c 0x1047ec000 + 30975500
4 Badoo 0x0000000106524e04 0x1047ec000 + 30641668
5 Badoo 0x000000010652b0f8 0x1047ec000 + 30667000
6 Badoo 0x0000000105dce27c 0x1047ec000 + 22946428
7 Badoo 0x0000000105dce3b4 0x1047ec000 + 22946740
8 Badoo 0x0000000104d41340 0x1047ec000 + 5591872
The process of mapping a memory address to a function name is called symbolication. To symbolicate a crash report you need special archives with debug symbols (dSYM), generated at the time of building the application, along with software that can work with these archives.
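The arithmetic behind a single frame can be sketched as follows. Each frame line holds the frame index, the binary image name, the instruction address, the image load address and the offset, so that address minus load address equals offset. The symbol table in the lookup function is a hypothetical stand-in: real symbolication reads debug information from the dSYM and has more details to take care of.

```python
import bisect


def parse_frame(line: str) -> tuple:
    """Parse '2 Badoo 0x0000000105c9c6f4 0x1047ec000 + 21694196' into (image, offset)."""
    # Assumes the image name contains no spaces.
    _index, image, address, load_address, _plus, offset = line.split()
    computed = int(address, 16) - int(load_address, 16)
    if computed != int(offset):
        raise ValueError(f"inconsistent frame: {line!r}")
    return image, computed


def lookup(symbols: list, offset: int) -> str:
    """Find the symbol covering an offset; symbols is a sorted list of (start_offset, name)."""
    starts = [start for start, _name in symbols]
    # The covering symbol is the last one that starts at or before the offset.
    i = bisect.bisect_right(starts, offset) - 1
    return symbols[i][1] if i >= 0 else "<unknown>"
```

For frame 2 from the report above, 0x105c9c6f4 minus 0x1047ec000 gives 21694196, which matches the offset printed at the end of the line.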
What can be used for symbolication?
- You can write your own service based on console utilities, but these utilities are better suited to manual symbolication and are available only on macOS.
- You can take the service from Sentry called Symbolicator (I recommend reading this article on how it was developed). Having experimented with it for some time we came to the conclusion that “as is” this service would be difficult to integrate into our scheme: we would have to modify it to fit our needs, and we had no experience of using Rust.
- You can write your own service based on the Symbolic library from Sentry, which, although written in Rust, exposes a C ABI, so it can be used from any language with FFI support.
We opted for the last option and wrote a service in Go, which interacts with Symbolic under the hood via cgo.
Another aspect that we need to look at is error grouping because the better errors are grouped, the more quickly you can detect the most critical errors among all other events.
Someone unfamiliar with how error handling systems work might imagine they use some kind of complex algorithm to determine string similarity. But, in reality, all popular systems use a fingerprint for grouping because it is easy to implement and covers most cases. In the most basic case, it can be a hash of the error message and stack trace. But this is not suitable for all types of errors, so some systems allow you to explicitly specify which fields you want to use to calculate the grouping key (or you can pass the key explicitly).
We decided not to complicate our system and settled on grouping by hash:
- JS errors are grouped by message, type, and origin
- Android crash reports are grouped by the first three lines of stack traces
- iOS crash reports are grouped by the first non-system frame from the crashed thread (the thread marked as crashed in the crash report).
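The three rules above can be sketched like this. The choice of SHA-1, the function names and the input shapes are my assumptions for the example; the only thing taken from the rules is which parts of an event go into the hash.

```python
import hashlib


def fingerprint(*parts: str) -> str:
    """Hash the given parts into a grouping key."""
    digest = hashlib.sha1()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator, so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


def js_group_key(message: str, error_type: str, origin: str) -> str:
    return fingerprint(message, error_type, origin)


def android_group_key(stack_trace: str) -> str:
    # Only the first three lines of the stack trace take part in grouping.
    return fingerprint(*stack_trace.splitlines()[:3])


def ios_group_key(crashed_thread_frames: list, app_images: set) -> str:
    """Frames are (image, symbol) pairs; app_images holds the non-system image names."""
    for image, symbol in crashed_thread_frames:
        if image in app_images:
            return fingerprint(image, symbol)
    # Assumption for the sketch: with no application frame, fall back to the top frame.
    if crashed_thread_frames:
        return fingerprint(*crashed_thread_frames[0])
    return fingerprint("unknown")
```

With this scheme, two Android crashes that differ only below the third line of the stack trace end up in the same group.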
The journey from an idea to a fully-fledged transition to a new system took us almost two years, but we are pleased with the result and already have plans to improve the system and integrate it with our other internal products.
If you are planning to start collecting and processing client errors and don’t know which tool to use, then I highly recommend taking a closer look at Sentry, since this service is actively developing and is among the market leaders.
But if you decide to follow our example and develop your own system, then here are the main points you need to bear in mind:
- store as much data as possible: this can save you the hassle and precious hours which would otherwise have to be spent on debugging
- be prepared for a sharp growth in the number of errors
- the more analytics you have, the better
- iOS crash reporting is difficult, so take a look at Symbolicator
- alerting is a crucial part of the system because it gives you more control and the ability to respond quickly if something goes wrong
- be patient and get ready for an exciting journey into the world of error tracking systems development.