Dealing with bugs on mobile apps

How we handle issues and edge cases at Azimo

What would you do if a new shopping app crashed while you were paying with your credit card? My guess is that you’d delete the app and try a competitor instead. Ten years ago, it was fine to “move fast and break things”. In today’s competitive, standardised world, customers don’t tolerate services that crash.

Stability is a core KPI for the mobile engineering team at Azimo. In 2016 our stability was “good enough”. In 2017, 99% of our users had a crash-free experience. In 2018 we raised that figure to 99.5%. By the end of this year, 99.9% of our users will be crash-free. Being crash-free, however, is not enough.

Crash-free is not enough

In Azimo’s early days, there was a clear correlation between customer complaints and our crash-free ratio. In 2018, however, we noticed that the trend had changed. Our app was at record stability, but customers still had problems with it.

We were mainly focused on catching and fixing technical issues (NullPointerException, IOException, OutOfMemoryError). But because our customers are humans, not machines, they perceive the app at an entirely different level of abstraction. They don’t care whether the code throws errors or not; they care about their experience.

What does this mean?
Here’s a simple example. When calling a remote API, you might see many different exceptions:

  • 4xx/5xx errors from the server
  • IOException when the internet connection is broken
  • JsonParseException when you get a response that wasn’t expected

Assuming that you use an Rx framework for asynchronous calls, to make sure the app doesn’t crash too often, you will start with a base class overriding the onError() method.

In many cases this class is extended, so IOException shows “no internet connection” and the API error 401 asks for authentication. But it’s pretty likely there are other issues that you didn’t predict. Fortunately, your smart solution catches all of them by calling super.onError(exception) and showing a general error popup. All the technical problems are solved, aren’t they?
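The pattern can be sketched like this; ApiException and the observer names below are stand-ins for the real RxJava and Retrofit types, not our actual classes:

```java
import java.io.IOException;

// Stand-in for a Retrofit-style HTTP error; in a real app this would be
// something like retrofit2.HttpException.
class ApiException extends RuntimeException {
    final int code;
    ApiException(int code) {
        super("HTTP " + code);
        this.code = code;
    }
}

// Base observer: anything that reaches onError() ends up in a generic
// "something went wrong" popup, so the app never crashes on an API call.
abstract class BaseObserver {
    void onError(Throwable e) {
        showGenericErrorPopup();
    }
    abstract void showGenericErrorPopup();
}

// Typical extension: recognise the errors we predicted and delegate the
// rest to the catch-all in the base class.
abstract class ApiObserver extends BaseObserver {
    @Override
    void onError(Throwable e) {
        if (e instanceof IOException) {
            showNoInternetConnection();
        } else if (e instanceof ApiException && ((ApiException) e).code == 401) {
            askForAuthentication();
        } else {
            super.onError(e); // everything we didn't predict: generic popup
        }
    }
    abstract void showNoInternetConnection();
    abstract void askForAuthentication();
}
```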

But, lo and behold, your customers still complain. And why? Because a pop-up with the message “Whoops, something went wrong” is about as helpful as a crash. To the customer, your app is broken, no matter what your code says.

You won’t see it in your crash analytics tools, but your users still have problems.

This situation is grimly familiar. Your crash analytics tool reports 99%+ stability, but your app isn’t working and users are sending you screenshots that tell you absolutely nothing about what’s wrong. No stack trace, no technical details.

To solve this problem, we invested a lot of time and effort into error handling, tracking and reporting.

Error handling

The Azimo app relies heavily on user input being passed to our backend. That means we have dozens of error variants to handle under the 4xx status codes, and writing code for all of them quickly became tedious.

To improve efficiency, we created an open-source Android project that uses annotation processing to generate the error-handling boilerplate for us. You can find it on our GitHub account: https://github.com/AzimoLabs/Api-Error-Handler. Soon we’ll also publish a detailed description and the story behind this solution; for now, please take a look at the self-explanatory code examples in the repository.

In our production code, the view responsible for showing the proper error message implements a small error-handling interface, and everything else is generated for us.
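To illustrate the idea only — the names below are hypothetical and are not the library’s actual API — a hand-written equivalent of what the generated code does might look like:

```java
// Hypothetical sketch of the idea behind Api-Error-Handler: the view
// implements a small interface per error family, and generated code (here
// written by hand) maps backend error codes onto its methods.
interface CreateTransferErrorView {
    void showInvalidIban();
    void showRecipientNotFound();
    void showGenericError(String code);
}

final class CreateTransferErrorDispatcher {
    // In the real project a mapping like this is produced by the
    // annotation processor rather than written by hand.
    static void dispatch(String code, CreateTransferErrorView view) {
        switch (code) {
            case "INVALID_IBAN": view.showInvalidIban(); break;
            case "RECIPIENT_NOT_FOUND": view.showRecipientNotFound(); break;
            default: view.showGenericError(code);
        }
    }
}
```

With generation in place, only the interface implementation has to be written by hand for each new error family.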

Error tracking and reporting

We also introduced a couple of techniques that help us measure and track more unusual kinds of errors. The most basic is non-fatal error reporting. For example, our observer class from above now also does something similar to this:
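A self-contained sketch of that addition, with LoggingManager standing in for our real class:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for our real LoggingManager: errors that fall through to the
// generic popup are also reported as non-fatals.
final class LoggingManager {
    static final List<Throwable> reported = new ArrayList<>();
    static void logNonFatal(Throwable e) {
        reported.add(e); // in production: Crashlytics.logException(e);
    }
}

// The observer's onError() gains one extra step: report anything that was
// not recognised, then show the generic popup as before.
abstract class ReportingObserver {
    void onError(Throwable e) {
        if (!handle(e)) {
            LoggingManager.logNonFatal(e); // only errors we didn't predict
            showGenericErrorPopup();
        }
    }
    // Returns true when the error was recognised and shown properly.
    abstract boolean handle(Throwable e);
    abstract void showGenericErrorPopup();
}
```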

Then our LoggingManager dispatches all exceptions that aren’t yet properly handled to tools like Crashlytics by calling Crashlytics.logException(e);.

Why do we only dispatch exceptions that aren’t handled correctly? If we receive a “No space left” error, which triggers a pop-up asking users to free up some of their device’s storage, there is no need to log it as a non-fatal. This matters because, over time, errors that are already properly handled would pile up and drown out the ones that aren’t handled yet. And if you stay consistent with this approach, you can rely on notifications to tell you when something is truly wrong with your app.

At Azimo we receive error notifications:

  • By email from Firebase
  • Through our live monitoring channel on Slack
  • Via SMS/phone call through our PagerDuty integration

Error analytics

While handled errors aren’t sent to our crash reporting tools, we still log them in tools like Mixpanel or Firebase Analytics. That kind of information can be used to make UI and UX improvements. For instance, do you see errors being triggered by a bank details validation UI? Maybe you should improve the bank details form with a clearer design or more instructions for your user.
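As a sketch of what such an analytics event might carry — the event and property names here are invented for illustration, not our actual schema:

```java
import java.util.Map;

// Illustrative only: handled errors still become analytics events (Mixpanel
// or Firebase Analytics in production) with enough properties to spot UX
// problems, e.g. a bank details form that keeps rejecting input.
final class ApiErrorEvent {
    final String name = "api_error";
    final Map<String, String> properties;

    ApiErrorEvent(String endpoint, int httpCode, String errorCode) {
        properties = Map.of(
            "endpoint", endpoint,
            "http_code", Integer.toString(httpCode),
            "error_code", errorCode
        );
    }
}
```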

These analytics can also provide information about technical issues. If, for example, you observe an unusually high number of payment rejections, maybe your payment integration is failing.

Mixpanel shows us the most common API errors

Correlation Id

We’ve also learned that it’s critical to be able to track an issue through the entire platform. In crash reports, we usually get a stack trace. In most cases, that’s enough to diagnose the problem.

But what about API errors 4xx and 5xx? In our engineering stack (frontend + backend) we implemented a correlation ID solution. Because our backend is partially monolithic, partially split into microservices, with many asynchronous calls, we need to be sure that communication between the client and the various domains of our backend is fully traceable.

To make this possible, each of our clients (Android, iOS, Web) generates a unique ID that is sent as an X-Correlation-Id header, passed to all services in the call chain and returned as a response header in unchanged form. This allows us to trace the error in our backend logs and accurately find the service or entire domain that is responsible for it. Thanks to tools like Kibana, for a high-level diagnosis you don’t even have to be a backend engineer to understand what was wrong.
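The client-side half of the scheme can be sketched like this; in the Android app it would typically live in an OkHttp interceptor, while the plain-map version below just illustrates the idea:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Every outgoing request gets a fresh X-Correlation-Id header; the
// backend echoes it back unchanged so the whole call chain is traceable.
final class CorrelationId {
    static Map<String, String> withCorrelationId(Map<String, String> headers) {
        Map<String, String> result = new HashMap<>(headers);
        result.put("X-Correlation-Id", UUID.randomUUID().toString());
        return result;
    }
}
```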

For our apps, all API errors 4xx and 5xx tracked in our analytics tools have a correlation ID property, so it is much easier for us to track down causes, based on samples. Additionally, all “Something went wrong” popups display a correlation ID. That way, whenever a user sends us a screenshot of the problem, we can quickly check it in our logs.

Finally, for users who contact us via support chat directly from the app, under the hood we send a set of correlation IDs from errors that the app encountered in the last 24 hours. As a result, communication between customer support agents and software engineers is much more effective. Some of our senior CS agents have already started digging into Kibana logs without our knowledge 😎.
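A minimal sketch of such a rolling log — the class name and storage choice are assumptions, not our actual implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// The app remembers the correlation IDs of recent errors and prunes
// entries older than 24 hours before attaching them to a support chat.
final class RecentErrorLog {
    private static final long WINDOW_MILLIS = 24L * 60 * 60 * 1000;

    private static final class Entry {
        final long timestamp;
        final String correlationId;
        Entry(long timestamp, String correlationId) {
            this.timestamp = timestamp;
            this.correlationId = correlationId;
        }
    }

    private final ArrayDeque<Entry> entries = new ArrayDeque<>();

    void record(String correlationId, long nowMillis) {
        entries.addLast(new Entry(nowMillis, correlationId));
    }

    // Correlation IDs of errors seen inside the window, oldest first.
    List<String> recentIds(long nowMillis) {
        while (!entries.isEmpty()
                && nowMillis - entries.peekFirst().timestamp > WINDOW_MILLIS) {
            entries.removeFirst();
        }
        List<String> ids = new ArrayList<>();
        for (Entry e : entries) ids.add(e.correlationId);
        return ids;
    }
}
```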

Each support ticket has a list of Correlation IDs from the last 24 hours

While our solutions for error handling and tracking look complex, we believe this is just the beginning. We want to find even more effective ways to deal with unusual errors and build a better experience for our users. Our engineering team is testing solutions like automated production tests that literally click through client-side interfaces 24/7 to make sure that we know about problems before users do.

We have made significant progress on building an active monitoring system for our backend infrastructure that is now connected with fully automated on-call processes. These tools, and many more, help us to make sure that our service is always available, with the shortest resolution time for all edge cases.
Soon we’ll share more of our experiences in this field. Let us know what you think in the comments below.