99.9%+ Levels Crash-Free Rate of Trendyol Mobile Applications

Published in Trendyol Tech, Jun 15, 2021 (7 min read)

“Incidents happen in life, let’s take preventive measures.”

A crash-free rate of 99.9% and above is a pretty bold claim, isn’t it? Depending on the version, the rate for the Trendyol mobile applications is sometimes 99.92%, sometimes 99.98%, and sometimes 99.99%.

*crash-free rate is great, good job team 🤘😎🤘(we didn’t get the apps inside “try catch block”, did we?😛)*
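To make these percentages concrete, here is a minimal sketch of how a crash-free rate can be computed. It assumes the common definition (the share of users in a period who did not experience a crash); the function name and the user counts are hypothetical and are not Trendyol's real data.

```python
def crash_free_rate(total_users: int, crashed_users: int) -> float:
    """Return the percentage of users who did not experience a crash."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    if not 0 <= crashed_users <= total_users:
        raise ValueError("crashed_users must be between 0 and total_users")
    return 100.0 * (total_users - crashed_users) / total_users


# Hypothetical numbers: 1,000,000 active users, 800 of whom hit a crash.
rate = crash_free_rate(1_000_000, 800)
print(f"{rate:.2f}%")  # prints "99.92%"
```

At this scale, even a 99.92% rate still means hundreds of affected users, which is why the remaining fraction of a percent matters.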

In this article, on behalf of the Trendyol Mobile Team, I will explain how we manage crashes: the actions we take and the processes we follow to keep our crash rates so low.

Before I start, I would like to share some current numbers. So far, the Trendyol Android app is active on 20M+ devices and the Trendyol iOS app is active on 19M+ devices. With an installed base this large, managing crashes well is critical.

People hate it when apps crash, or even when they slow down or freeze for a few seconds, and we are aware of this.

According to a survey by Dimensional Research, 61 percent of users expect mobile apps to start within four seconds, while 49 percent want responses to inputs in two seconds. If an app crashes, freezes, or has errors, 53 percent of users will uninstall it.

So what steps do we follow to stay at these levels?
What can cause crashes, and what processes do we follow to prevent them?
How do we maintain this rate, and what do we conclude?

Crash tracking

We track our published apps with Firebase Crashlytics, together with a rollout strategy.

Image on flutter-community by Promise Amadi

Crashes need to be monitored regularly. Otherwise, they may increase and upset users. While monitoring, it is also important to group crashes by urgency and importance. For example, if only 5 out of 750k users face a particular crash, it may not take first priority. The crashes that affect the most users are prioritized, and issues are opened for them.
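The prioritization idea can be sketched in a few lines. This is only an illustration under assumptions: the `CrashIssue` class, the issue titles, the second issue's numbers, and the 0.01% threshold are all hypothetical, and only the "5 out of 750k" figure comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class CrashIssue:
    title: str           # hypothetical issue name
    affected_users: int  # users who hit this crash
    active_users: int    # users on the affected version

    @property
    def impact(self) -> float:
        """Share of active users affected by this crash."""
        return self.affected_users / self.active_users


def prioritize(issues, threshold=0.0001):
    """Sort by impact (highest first) and flag issues at or above the threshold."""
    ranked = sorted(issues, key=lambda issue: issue.impact, reverse=True)
    return [(issue.title, issue.impact >= threshold) for issue in ranked]


issues = [
    CrashIssue("rare crash on product detail", 5, 750_000),
    CrashIssue("crash on checkout", 1_200, 750_000),
]
for title, urgent in prioritize(issues):
    print(title, "-> urgent" if urgent else "-> monitor")
```

With these made-up numbers, the checkout crash is ranked first and flagged as urgent, while the crash seen by 5 users is only monitored.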

At Trendyol, we work through crashes in order of importance and priority. We discuss them as a team through our communication channels and take the necessary actions.

good morning☀️we have specific crashes at the **5.11.0 version**, they’re not many but can you still check it out?

Even when only a few users encounter a crash, we discuss it and keep following it to prevent bigger trouble.

We also receive crash notifications for both Android and iOS via Slack (and via Jenkins), so they are easy to follow.

Memory management

Memory management is an important topic in applications, and it is extremely important for avoiding crashes. For example, users can open many product detail pages one after another in the Trendyol apps. If we do not manage memory well, the app may crash after the Xth product detail page.

Garbage Collection on the Android side and Automatic Reference Counting on the iOS side help us keep this under control.

Illustration created for “A Journey With Go”, made from the original Go Gopher, created by Renee French.

ARC: Automatic Reference Counting (ARC) is a memory-management solution that makes sure the memory for the objects you create is properly allocated and deallocated, so the device on which your code runs doesn’t run out of memory. ARC does most of this automatically, although strong reference cycles between objects still have to be avoided by the developer.

Garbage Collection: Memory utilization in a mobile app has a significant impact on customer experience. If your app creates a lot of objects, the Android Runtime (ART) will trigger garbage collection (GC) frequently. Android garbage collection is an automatic process that removes unused objects from memory. However, frequent garbage collection consumes a lot of CPU and also pauses the app, and frequent pauses make the app janky.

LeakCanary is a memory leak detection library for Android. LeakCanary’s knowledge of the internals of the Android Framework gives it a unique ability to narrow down the cause of each leak, helping developers dramatically reduce OutOfMemoryError crashes.
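As a language-neutral illustration of why leaks still happen under automatic memory management, the sketch below uses Python rather than Swift or Kotlin. CPython also counts references, so it can show how a back-reference keeps an object alive and how a weak reference avoids that. The `Page` and listener classes are hypothetical stand-ins for a screen and its callback, and the behavior shown assumes CPython with the cycle collector temporarily disabled.

```python
import gc
import weakref


class Page:
    """Hypothetical stand-in for a product detail screen."""

    def __init__(self):
        self.listener = None


class StrongListener:
    def __init__(self, page):
        self.page = page  # strong back-reference: forms a reference cycle


class WeakListener:
    def __init__(self, page):
        self.page = weakref.ref(page)  # weak back-reference: no cycle


def freed_by_refcount(listener_cls) -> bool:
    """Return True if the page is freed as soon as its last name is dropped."""
    page = Page()
    page.listener = listener_cls(page)
    probe = weakref.ref(page)  # lets us observe whether the page still exists
    del page
    return probe() is None


gc.disable()  # isolate plain reference counting from the cycle collector
try:
    print(freed_by_refcount(StrongListener))  # prints False
    print(freed_by_refcount(WeakListener))    # prints True
finally:
    gc.enable()
    gc.collect()  # clean up the cycle created by the first call
```

The same pattern applies on mobile: on iOS the usual fix is a `weak` reference, and on Android a leak typically comes from a long-lived object holding on to a screen that should have been released.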

“A small leak will sink a great ship.” — Benjamin Franklin

By managing memory carefully, we can prevent crashes both directly and indirectly.

Handling a large codebase

Trendyol is a big application. For instance, the iOS app has about 400k lines of code, and there are many projects to be developed. As such, the codebase also needs to be kept under control. For this, the Trendyol Mobile Team works on many technical issues, and we have a roadmap for all platforms.

These are technical issues such as modularization, refactoring, unit testing, and rewrites, and they indirectly reduce crashes. Sometimes, if necessary, we take only technical issues into a sprint, because improving the code protects us from future bugs and crashes.

Some examples from our roadmap;

Identify unexpected changes

What do I mean by unexpected changes? As a mobile team, we work with many teams: checkout, payment, discovery, and others, as well as the teams that write our mobile services, which we call Zeus (+Apollo+Atlas). Zeus has its own Product Owner, Developers, and QA, and it does the necessary backend development. After each change (they already do API testing on their side), if the change concerns us, they inform us: “we made changes here, it needs to be checked on the frontend too”.

On Jira, our issues are linked to Zeus’ issues, so we can follow them easily and there are no unexpected changes.

Thanks to this communication, the necessary pages and cases are retested even if there is no change in the frontend (that is, no change from the user’s point of view). This follows our normal workflow: the product owner takes the required task into the sprint, and sometimes this task is only a test task. Tests are run, an issue is opened if a bug is found, and finally the task is done.

This process indirectly prevents an increase in crashes and gives us a safe test environment.

Test with lower internet speed

During development, we usually run our tests at a normal internet speed. However, an app can crash when the connection drops to a lower speed while the user continues to browse pages.

Maybe you have encountered it: applications may slow down and crash while you ride an elevator or move to a place with poor connectivity. To prevent this, we also reduce the internet speed from the phone settings and test the applications under those conditions, so we can see directly whether there is a crash.
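The manual check described above can also be complemented by a small automated one. The sketch below is an assumption, not our actual test code: it simulates a slow network with a hypothetical fetcher that times out a given number of times, and checks that the page loader ends in an error state instead of raising.

```python
def load_page(fetch, retries=2):
    """Call fetch(); on timeout, retry, then fall back to an error state."""
    for _ in range(retries + 1):
        try:
            return {"status": "ok", "data": fetch()}
        except TimeoutError:
            continue  # slow network: try again instead of crashing
    return {"status": "error", "data": None}


def make_flaky_fetch(failures):
    """Hypothetical fetcher that times out `failures` times, then succeeds."""
    state = {"left": failures}

    def fetch():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("simulated slow network")
        return "product list"

    return fetch


print(load_page(make_flaky_fetch(1))["status"])  # prints "ok"
print(load_page(make_flaky_fetch(5))["status"])  # prints "error"
```

The point of the design is that a timeout becomes a state the screen can render (for example, a retry button), not an unhandled exception.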

Variability testing

A crash that happens on one device may not happen on another, and a crash in one version may not happen in another. Unfortunately, encountering such situations is part of the software life cycle. While developing our application, we try to handle this as much as possible. What are these variables?

1. Device types
2. Screen dimensions
3. Operating systems
4. Network connections

We test each of these variables. We test our application on different models, different operating systems, and different resolutions, as well as on devices of different sizes. If we do not have a device physically, we run the tests on an emulator instead. As I mentioned in the previous section, we also test with different internet speeds.

In this way, we cover as many differences as we can and try to catch and fix crashes before our users encounter them.
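One way to keep track of these combinations is to generate a test matrix from the four variables listed above. This is a minimal sketch; the concrete devices, resolutions, OS versions, and network types are hypothetical examples, not our real device lab.

```python
from itertools import product

# Hypothetical values for each variable; a real matrix would use the
# devices and OS versions that matter most to the app's users.
devices = ["phone", "tablet"]
screens = ["720x1280", "1080x1920"]
os_versions = ["Android 10", "Android 11"]
networks = ["wifi", "4g", "slow 3g"]

# Every combination of the four variables is one test configuration.
matrix = list(product(devices, screens, os_versions, networks))

print(len(matrix))  # prints 24
print(matrix[0])    # prints ('phone', '720x1280', 'Android 10', 'wifi')
```

The matrix grows multiplicatively, so in practice it helps to run the full set only for critical flows and a reduced set for everything else.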

Testing at every stage of development

As well as functional tests, I would like to list the checks we constantly make. The healthy functioning of mobile applications depends on many conditions.

Does our app work properly when multiple apps are open on our phone? Is our application as it should be in the background state? Does our application continue to operate from where it left off after it has been updated?

We consider and test many situations, and the list can grow with questions such as:

How does our app behave when it is interrupted by incoming calls, SMS, or battery notifications?
Are there bugs with zooming in and out or with multi-touch?
Do the location permission flows work as expected?

Testing all these situations helps us catch crashes before release.

Conclusion

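As we have seen, maintaining the crash-free rate has its own difficulties. In the most basic sense, it helps to track crashes regularly, manage memory carefully, keep a large codebase under control, identify unexpected changes, and test under varied conditions at every stage of development.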

For more information about the rollout process, you can refer to my previous article, “Progressive Rollout Strategy on Trendyol Android App”.

If you want the app you are testing to be close to crash-free, following the processes above will help. If you have any questions, please do not hesitate to ask.

Crash-free days!

Thanks to all my colleagues in the Trendyol Mobile Team who help maintain our crash-free rate.