Catching bugs on the client-side: how we developed our error tracking system

Eugene Tupikov
Dec 3, 2020 · 18 min read

Our team develops several products: Badoo and Bumble, two of the world’s largest dating and connection applications. For both, we have a web version (desktop and mobile) and mobile applications (Android and iOS). With many millions of users, it is important for us to gather client-side errors, and for this we use a system of our own, code-named Gelato. For the last two years I have been involved in developing its server side, and throughout this time I have discovered a lot of new things about the world of error tracking systems that I would like to share with you in this article.

What we will cover:

  • How we use error information
  • The tools we used previously, and why we developed our own system instead of using a ready-made solution
  • A brief overview of our system, its architecture, and technology stack.

How we use error information

Secondly — we conduct error analysis.

At Bumble we release new versions of applications quite often:

  • Web applications: 1–2 times a day, including the backend
  • Native applications: once a week (depending on how quickly the build gets accepted by the App Store and Google Play).

Error analysis is always one of the steps in releasing any new version of the application. For this, the release manager needs a summary report listing the errors in that version. This enables them to decide whether it is safe to deploy the build to production, or to see whether the build contains a bug that eluded our QA, in which case the report makes it clear that the broken feature needs removing from the release.

Thirdly — having all the error information available in one place simplifies the work of developers and QA engineers.

What we used previously


I’ll look at deobfuscation in detail later, but if this is the first time you’ve come across it, I hope this article proves useful to you.

A deadline was set for October 16th, 2019, by which date all HockeyApp users were supposed to have migrated to AppCenter. However, support for DexGuard mapping files was only added to AppCenter at the end of December 2019, a few months after the official retirement of HockeyApp.

In addition to this, we encountered a problem with the incorrect calculation of the total number of errors in HockeyApp. And since no further development was to be done on HockeyApp, we had to start duplicating this information into our internal analytics system to see the real number of errors.

Internal tool for collecting JS errors

The architecture was quite simple:

  • The data was stored in MySQL (we kept information on the last 10–20 releases)
  • There was a separate table for each version of the application
  • Search involved a limited set of fields.

In 2017, our frontend development team approximately doubled in size. The system began to be used more actively and the developers soon became increasingly aware of its limitations.

Having collected and analysed all the requirements our team would ideally like to have, we realised that it was going to require more than just a little hard work to improve the current solution. Developed back in 2014, the system was now obsolete, and the cost of refactoring would exceed the cost of implementing a new solution.

So, the decision was made to gradually switch to a new system that would cover all the existing functionality and meet all our requirements.

Requirements of the new system

  • Store all the errors without sampling for at least six months. The ability to locate and analyse errors for a particular user at a certain moment in time can simplify the investigation of a phantom bug (one that appears periodically and cannot be reproduced in a development environment) and make the life of developers easier. Thus, the more information we have, the better
  • The system has to be able to scale easily so that it can handle errors that produce millions of events per minute. Luckily for us, such errors don’t occur very often, but we have to be prepared
  • Fault tolerance — to minimise the risk of data loss
  • The system has to be able to answer not only the question “How many events occurred?” but also “How many users have been affected?”
  • Grouping of similar errors while preserving meta-information (time of the first and last error in the group, the release in which the error appeared, error status, etc.)
  • Flexible search by any combination of fields with the support of full-text search
  • Various statistics: the number of events for a time period, the number of errors by release, by browser, by operating system, and so on
  • Integration with Jira. We need to be able to create tickets in Jira for certain errors in one click
  • Self-hosted solution. We want the system to work on our hardware. This will provide us with full control over the data and the ability to change cluster configuration at any time.

Why not use a ready-made solution?

Cloud services

Open-source systems

In fact, only Sentry continued to evolve and had most of the functionality we needed. But at that time (early 2018), the eighth version of the service did not suit us for the following reasons:

  • sampling of events
  • PostgreSQL was used as the main storage, and we had no experience with it
  • some of the functionality we needed was lacking (for example, integration with Jira), and implementing it ourselves would have been difficult, since the system is written in Python while our main languages are PHP and Go.

In July 2018, the ninth version of Sentry was released. It introduced integration with issue trackers and laid the foundation for key improvements (in my opinion) — the transition to ClickHouse for storing events (I recommend this series of articles on this). But unfortunately, at the time of our research, none of this even figured in the plans. Therefore, we decided that the best option in our case would be the implementation of our own system, customised for our processes and therefore easy to integrate with other in-house tools.

So, a system codenamed Gelato (General Error Logs And The Others) was born, the development of which is discussed further below.

Brief system overview

The main page contains a list of applications and general error statistics for a given criterion.

By clicking on a particular application, we are taken to a page with its release statistics.

By clicking on a particular version, we are taken to a page listing error groups.

Here we can see what errors occurred, how many there were, how many users were affected, when the error first occurred, and its most recent occurrence. Also, we can sort the data by most of the fields and create a ticket in Jira for any error.

This is how this page looks for native applications:

By clicking on a particular error, we are taken to a page giving detailed error information.

Here you can see general information about the error (1), a graph of the total number of events (2), and various analytics (3).

It also contains information about specific events, which is mainly used to analyse the problem.

Another interesting feature, one that I have never seen in similar systems, is release comparison. This makes it quite easy to detect errors that have appeared in the new version, as well as those that were fixed in previous releases but then, later on, began to appear again (regression).

Select releases to compare:

And we get to a page with a list of errors that are in one version but not in the other:

As you may have noticed, we have implemented a basic set of functions that cover most use cases. But we do not intend to stop here: soon we will be adding many useful features that expand the capabilities of the system, including:

  • integration with our A/B framework — to track errors that appeared in a particular split test;
  • alerting;
  • advanced analytics (more graphs and charts);
  • email digests with application statistics.

The system architecture

  1. Data collection.
  2. Data processing.
  3. Data storing.

This can be depicted schematically as follows:

Let’s get started with data collection.

Data collection

What does the API have to do?

  • Read data
  • Check the data for compliance with the required format in order to immediately cut off the “noise”
  • Save everything to the intermediate queue.

Why do we need an intermediate queue?

If we knew we had a fairly low EPS (errors per second), and that all parts of our system would always work stably, we could significantly simplify the system and make the whole process synchronous.

But you and I know that this is not the real world: at any stage, at the most inopportune moment, something unexpected can happen, and our system has to be ready for it. For example, an error in one of the application’s external dependencies can make it start crashing, which leads to a surge in EPS (as was the case with the iOS Facebook SDK on July 10, 2020). As a result, the load on the entire system increases significantly, and with it the processing time for a single request.

Or, for example, the database might become temporarily unavailable, in which case the system simply cannot save the data. There can be many reasons for this: problems with network equipment, a data centre employee accidentally touching a wire so that the server switches off, or the disk space running out.

Therefore, to reduce the risk of data loss and make data collection as fast as possible (so that the client does not have to wait a long time for a response), we save all incoming data to an intermediate queue, which is processed by a separate script in our cloud.

What can be used as a queue?

  • The first option that comes to mind is a popular message broker like Redis or RabbitMQ
  • You can also use Apache Kafka, which is well suited for cases where you need to store the tail of incoming data for a certain period (for example, for some kind of internal analytics). For instance, Kafka is used in the latest (tenth) version of Sentry
  • We settled on LSD (Live Streaming Daemon). The system has been in use at Bumble for a long time and has worked well; plus, we already have all the necessary bindings in our code to work with it.

Data storing

As the main storage for events, we considered two candidates: Elasticsearch and ClickHouse.

The pros of Elasticsearch are:
  • horizontal scaling and replication out of the box
  • a large set of aggregations, which is convenient when implementing the analytical part of our system
  • full-text search
  • support for UPDATE by condition (we have an asynchronous data processing process, and we need the ability to perform some steps in the pipeline repeatedly, which requires being able to update certain fields for specific events)
  • support for DELETE by condition (we store data for six months, which means we need the ability to delete outdated data)
  • flexible configuration via API, which allows developers to change the index settings according to the tasks.

Of course, like any system, Elasticsearch also has cons:

  • a complex query language, meaning the documentation always needs to be at hand (support for SQL syntax appeared in the latest versions, but it is only available in the paid version (X-Pack) or when using Open Distro from Amazon)
  • the JVM, which requires solid expertise to keep everything under control, while our main languages are PHP and Go (for example, optimising the garbage collector for a specific load profile requires an in-depth understanding of how things work under the hood; we ran into this when upgrading from version 6.8 to 7.5, although the topic is not new and there are quite a few articles about it on the Internet)
  • poor string compression; we plan to store quite a lot of data and, although hardware gets cheaper every year, we want to use resources as efficiently as possible (of course, you can use deflate compression instead of LZ4, but this will increase CPU utilisation, which can negatively affect the performance of the entire cluster).


The pros of ClickHouse are:

  • excellent write performance
  • good data compression, especially for long strings
  • MySQL-compatible query syntax, which eliminates the need to learn a new query language, as is the case with Elasticsearch
  • horizontal scaling and replication out of the box, although it requires more effort than Elasticsearch.

But at the beginning of 2018, ClickHouse was missing some of the functions we needed:

  • Support for DELETE by condition: we planned to store data for six months, so we needed the ability to delete outdated data. In ClickHouse, deletion of data by an arbitrary criterion was not provided, and partitioning by an arbitrary field (in our case, by date) was at that time an experimental feature not recommended for production use
  • Support for UPDATE by condition: ClickHouse is geared towards immutable data, so the implementation of updating arbitrary records is not an easy task (this issue was raised on GitHub more than once — and at the end of 2018 the function was finally implemented, but it is unsuitable for frequent updates)
  • Full-text search (there was an option to search by regular expressions, but it requires a full scan, which is a rather slow operation).

In fact, we could have circumvented all the above restrictions (there were even several articles on this topic on the Web), but we wanted to implement a prototype of our system at minimal cost, and for that we needed a more flexible database. So we opted for Elasticsearch.

Of course, in terms of write performance, Elasticsearch is inferior to ClickHouse, but for us, this was not critical. Much more important was the support of the functionality we needed and scalability out of the box. The fact that we already had an Elasticsearch cluster, which we were using to collect logs from daemons, was also significant — this meant there was no need for us to set up the infrastructure.

Data model

All our data is divided into several groups and stored in separate indices:

  • Meta information
  • Raw events.

Data is isolated for a specific application (separate index) — this allows us to customise the index settings depending on the load profile. For example, we can keep data of unpopular applications on warm nodes in a cluster (we use a hot-warm-cold architecture).

In order to store both JS errors and crash reports from native applications in the same system, we moved everything used to compute general statistics (error occurrence time, the release in which the error occurred, user information, grouping key) to the top level, while everything unique to each type of error is stored in a nested attributes field with its own mapping.

This idea was actually borrowed from Sentry and slightly modified during operation. In Sentry, an event has base fields, a tags field for data that needs to be searchable, and an extra field for all other specific data.

Data processing

Let’s start with a simpler case.

Processing crash reports from Android apps

  • remove all unused code (code shrinking)
  • optimise everything that remains following the first stage (optimisation)
  • rename classes, methods, and properties using a special format, which allows the size of the codebase to be reduced, and the process of reverse engineering of the application to be complicated (obfuscation).

You can learn more about this from the official documentation.

There are several popular utilities today:

  • ProGuard (free version);
  • DexGuard based on ProGuard (paid version with extended functionality);
  • R8 from Google.

If the application is built using obfuscation mode, then the stack trace will look something like this:

o.imc: Error loading resources: Security check required
at o.mef.b(:77)
at o.mef.e(:23)
at o.mef$a.d(:61)
at o.mef$a.invoke(:23)
at o.jij$c.a(:42)
at o.jij$c.apply(Unknown Source:0)
at o.wgv$c.a_(:81)
at o.whb$e.a_(:64)
at o.wgs$b$a.a_(:111)
at o.wgy$
at o.vxu$
at android.os.Handler.handleCallback(
at android.os.Handler.dispatchMessage(
at android.os.Looper.loop(
at java.lang.reflect.Method.invoke(Native Method)

Not much can be understood from it, except for the error message. To extract useful information from such a stack trace, it first needs to be decrypted. The process of decrypting obfuscated classes and methods is called deobfuscation; this requires a special file called mapping.txt, which is generated at the time of building the application. Here is a snippet of such a file:

AllGoalsDialogFragment -> o.a:
java.util.LinkedHashMap goals -> c
kotlin.jvm.functions.Function1 onGoalSelected -> e
java.lang.String selectedId -> d
AllGoalsDialogFragment$Companion Companion -> a
54:73:android.view.View onCreateView(android.view.LayoutInflater,android.view.ViewGroup,android.os.Bundle) -> onCreateView
76:76:int getTheme() -> getTheme
onCreateDialog(android.os.Bundle) -> onCreateDialog
93:97:void onDestroyView() -> onDestroyView

Therefore, we need a service to which we could feed the obfuscated stack trace and mapping file — and get the original stack trace at the output.

We were unable to find a suitable ready-made solution in the public domain (maybe we did not look very hard) but, fortunately for us, the ProGuard engineers (we use DexGuard for obfuscation) looked out for developers and made the ReTrace utility publicly available, which implements all the functionality necessary for deobfuscation.

Using this, our Android developers wrote a simple service in Kotlin which:

  • accepts a stack trace and an application version as input
  • downloads the necessary mapping file from Ceph (mappings are filled in automatically when building a release in TeamCity)
  • deobfuscates the stack trace.

Processing crash reports from iOS apps

An unprocessed crash report contains memory addresses instead of function names:

Thread 0:
0 libsystem_kernel.dylib 0x00000001bf3468b8 0x1bf321000 + 153784
1 libobjc.A.dylib 0x00000001bf289de0 0x1bf270000 + 105952
2 Badoo 0x0000000105c9c6f4 0x1047ec000 + 21694196
3 Badoo 0x000000010657660c 0x1047ec000 + 30975500
4 Badoo 0x0000000106524e04 0x1047ec000 + 30641668
5 Badoo 0x000000010652b0f8 0x1047ec000 + 30667000
6 Badoo 0x0000000105dce27c 0x1047ec000 + 22946428
7 Badoo 0x0000000105dce3b4 0x1047ec000 + 22946740
8 Badoo 0x0000000104d41340 0x1047ec000 + 5591872

The process of mapping a memory address to a function name is called symbolication. To symbolicate a crash report you need special archives with debug symbols (dSYM), generated at the time of building the application, along with software that can work with these archives.
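The arithmetic behind each frame is simple: the value looked up in the dSYM is the offset of the instruction inside the binary image, i.e. the runtime address minus the image load address. A small sketch that parses one frame of the report above and recovers that offset:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// frameOffset parses a crash-report frame like
//   "2 Badoo 0x0000000105c9c6f4 0x1047ec000 + 21694196"
// and returns the offset of the instruction inside the binary image:
// the runtime address minus the image load address. It is this offset
// that is resolved against the dSYM during symbolication.
func frameOffset(line string) (uint64, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 {
		return 0, fmt.Errorf("unexpected frame format: %q", line)
	}
	addr, err := strconv.ParseUint(strings.TrimPrefix(fields[2], "0x"), 16, 64)
	if err != nil {
		return 0, err
	}
	load, err := strconv.ParseUint(strings.TrimPrefix(fields[3], "0x"), 16, 64)
	if err != nil {
		return 0, err
	}
	return addr - load, nil
}
```

For the frame above, 0x105c9c6f4 − 0x1047ec000 gives exactly the 21694196 shown after the plus sign.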

What can be used for symbolication?

  • You can write your own service based on console utilities, but these are better suited to manual symbolication and are only available on macOS.
  • You can take the service from Sentry called Symbolicator (I recommend reading this article on how it was developed). Having experimented with it for some time we came to the conclusion that “as is” this service would be difficult to integrate into our scheme: we would have to modify it to fit our needs, and we had no experience of using Rust.
  • You can write your own service based on the Symbolic library from Sentry, which, although written in Rust, provides C-ABI — it can be used in a language with FFI support.

We opted for the last option and wrote a service in Golang which, under the hood, interacts with Symbolic via cgo.

Error grouping

Someone unfamiliar with how error tracking systems work might imagine that they use some kind of complex algorithm to determine string similarity. In reality, all the popular systems use fingerprinting for grouping, because it is easy to implement and covers most cases. In the most basic case, the fingerprint can be a hash of the error message and stack trace. But this is not suitable for all types of errors, so some systems allow you to specify explicitly which fields should be used to calculate the grouping key (or let you pass the key explicitly).

We decided not to complicate our system and settled on grouping by hash:

  • JS errors are grouped by message, type, and origin
  • Android crash reports are grouped by the first three lines of the stack trace
  • iOS crash reports are grouped by the first non-system frame from the crashed thread (the thread marked as crashed in the crash report).
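A minimal sketch of this kind of grouping (illustrative, not our exact implementation): the fingerprint is simply a hash over the chosen parts of the error.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"strings"
)

// fingerprint computes a grouping key as a hash over the chosen parts
// of an error, mirroring the rules above: for a JS error the parts would
// be message, type and origin; for an Android crash report, the first
// three lines of the stack trace.
func fingerprint(parts ...string) string {
	h := sha1.Sum([]byte(strings.Join(parts, "\n")))
	return hex.EncodeToString(h[:])
}

// androidKey groups a crash report by the first three stack-trace lines.
func androidKey(stackTrace string) string {
	lines := strings.Split(stackTrace, "\n")
	if len(lines) > 3 {
		lines = lines[:3]
	}
	return fingerprint(lines...)
}
```

The same inputs always yield the same key, so events of one error land in one group without any similarity search.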

In conclusion

If you are planning to start collecting and processing client errors and don’t know which tool to use, then I highly recommend taking a closer look at Sentry, since this service is actively developing and is among the market leaders.

But if you decide to follow our example and develop your own system, then this article gives you the main points you need to bear in mind.

To summarise:

  • store as much data as possible: this can save you the hassle and precious hours which would otherwise have to be spent on debugging
  • be prepared for sharp spikes in the number of errors
  • the more analytics you have, the better
  • iOS crash reporting is difficult, take a look at Symbolicator
  • alerting is a crucial part of the system because it gives you more control and the ability to respond quickly if something goes wrong
  • be patient and get ready for an exciting journey into the world of error tracking systems development.
