A Saga of Improvement in Android App Performance — Part 3
Introduction
In Part 1 and Part 2, we explained the four-step improvement cycle and described the monitoring process at Tokopedia.
In this part of the blog series, we explain how we profile and diagnose performance issues in Android applications at Tokopedia.
Why is profiling an important and challenging step?
When we detect a regression or want to improve the PLT (Page Load Time) of a particular page, we need to know which functions account for most of the overall page load time.
For profiling, Android provides tools such as systrace and the Android Studio Profiler.
The challenge with these tools is that they are difficult to use and need a lot of manual effort from a developer to execute the journeys and analyze the traces to find bottlenecks. In the case of a regression, identifying the diff from the previous trace is also difficult and time-consuming, which adds to the cost.
To speed up the diagnosis of performance bottlenecks, we built an internal tool named “SHERLOCK”.
What is Sherlock? 🕵🏼
Sherlock is an internal tool that reports a list of hung methods (methods taking more than 32 ms) and creates flamegraphs for scenarios.
Sherlock was created to give us actionable insights that we can use to fix performance issues. We run Sherlock every day on our release branches, and we also provide an option to run it on any particular branch.
It helps us find regressions, track performance history, and find root causes.
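To make the idea of finding regressions between runs concrete, here is a minimal sketch of comparing per-method timings from two runs. It is an illustration only, not Sherlock's actual code: the class `RegressionDiff`, its `find` method, and the `minDeltaMs` parameter are hypothetical names introduced for this example, and the input maps (method name to milliseconds) are an assumed shape for the stored data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Sherlock, names chosen for illustration.
final class RegressionDiff {

    /** One method whose measured time grew between two runs. */
    static final class Regression {
        final String method;
        final long beforeMs;
        final long afterMs;

        Regression(String method, long beforeMs, long afterMs) {
            this.method = method;
            this.beforeMs = beforeMs;
            this.afterMs = afterMs;
        }

        long deltaMs() {
            return afterMs - beforeMs;
        }
    }

    /**
     * Returns the methods that got slower by at least minDeltaMs,
     * largest increase first. Methods missing from the previous run
     * are treated as having taken 0 ms.
     */
    static List<Regression> find(Map<String, Long> previous,
                                 Map<String, Long> current,
                                 long minDeltaMs) {
        List<Regression> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long before = previous.getOrDefault(e.getKey(), 0L);
            long after = e.getValue();
            if (after - before >= minDeltaMs) {
                out.add(new Regression(e.getKey(), before, after));
            }
        }
        out.sort((a, b) -> Long.compare(b.deltaMs(), a.deltaMs()));
        return out;
    }
}
```

Calling `RegressionDiff.find(yesterday, today, 10)` would list every method that became at least 10 ms slower; the 10 ms cutoff is an arbitrary value for the example, not a figure from our setup.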
What do the profiling reports look like?
We built the profiling reports dashboard using Google Data Studio. It shows the list of hung methods along with the time each one took and its percentage of the page load time.
Each function is also linked to its flamegraph, so developers can investigate the function call trace further to find the root cause of a slowdown.
Sherlock also sends a summary report to a Slack channel after each run. Please refer to the sample report below:
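The report rows described above (method, time taken, share of page load time) can be sketched as follows. This is a simplified illustration under stated assumptions, not the dashboard's real query: `HungMethodReport`, `Row`, and `build` are hypothetical names, and the input is assumed to be a map from method name to duration in milliseconds. Only the 32 ms threshold comes from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names chosen for illustration only.
final class HungMethodReport {

    // Threshold from the article: a method is "hung" above 32 ms.
    static final long HUNG_THRESHOLD_MS = 32;

    /** One dashboard row: method, duration, and share of page load time. */
    static final class Row {
        final String method;
        final long durationMs;
        final double percentOfPlt;

        Row(String method, long durationMs, double percentOfPlt) {
            this.method = method;
            this.durationMs = durationMs;
            this.percentOfPlt = percentOfPlt;
        }
    }

    /** Keeps methods above the threshold and sorts them slowest first. */
    static List<Row> build(Map<String, Long> methodDurationsMs, long pageLoadTimeMs) {
        if (pageLoadTimeMs <= 0) {
            throw new IllegalArgumentException("pageLoadTimeMs must be positive");
        }
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Long> e : methodDurationsMs.entrySet()) {
            long ms = e.getValue();
            if (ms > HUNG_THRESHOLD_MS) {
                rows.add(new Row(e.getKey(), ms, 100.0 * ms / pageLoadTimeMs));
            }
        }
        rows.sort((a, b) -> Long.compare(b.durationMs, a.durationMs));
        return rows;
    }
}
```

For example, with a page load time of 800 ms, a method that took 120 ms would appear with a share of 15%, while a method that took exactly 32 ms would not be listed because the threshold is strict.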
How did we build Sherlock?
We built Sherlock as a Jenkins pipeline that includes the following steps:
- Create the APK for the Tokopedia main application from the targeted branch after disabling ProGuard
- Download the test APK
- Run the tests on the targeted APK using Firebase Test Lab
- Download the trace files using gsutil
- Convert the trace files to flamegraphs using aflame
- Parse the flamegraphs using a shell script and push the data into MySQL for all methods taking more than 32 ms
- Send notifications and alerts with a report
- Create the reporting dashboard using Google Data Studio
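To give a feel for the parsing step, here is a rough sketch of turning sampled stacks into per-method inclusive time. Our pipeline does this with a shell script on the aflame output; the Java below is a swapped-in illustration, and it assumes the widely used "folded stack" text format (`frame1;frame2;frame3 count`), which may not match what aflame actually emits. `FoldedStackParser`, `inclusiveMillis`, and `sampleIntervalUs` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: assumes folded-stack input, not the real aflame format.
final class FoldedStackParser {

    /**
     * Sums the sample counts of every stack a method appears in (inclusive
     * time) and converts samples to milliseconds using the sampling interval.
     * The conversion uses integer division, so results are rounded down.
     */
    static Map<String, Long> inclusiveMillis(List<String> foldedLines, long sampleIntervalUs) {
        Map<String, Long> samples = new HashMap<>();
        for (String line : foldedLines) {
            String trimmed = line.trim();
            int split = trimmed.lastIndexOf(' ');
            if (split <= 0) {
                continue; // blank or malformed line
            }
            long count;
            try {
                count = Long.parseLong(trimmed.substring(split + 1));
            } catch (NumberFormatException e) {
                continue; // no numeric sample count at the end of the line
            }
            // Count a recursive frame only once per stack.
            Set<String> seen = new HashSet<>();
            for (String frame : trimmed.substring(0, split).split(";")) {
                if (!frame.isEmpty() && seen.add(frame)) {
                    samples.merge(frame, count, Long::sum);
                }
            }
        }
        Map<String, Long> millis = new HashMap<>();
        for (Map.Entry<String, Long> e : samples.entrySet()) {
            millis.put(e.getKey(), e.getValue() * sampleIntervalUs / 1000);
        }
        return millis;
    }
}
```

The resulting map can then be filtered against the 32 ms threshold before the rows are written to the database. Tracking the frames already seen in a stack matters for recursive calls, which would otherwise be counted more than once for the same sample.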
To build Sherlock, we wrote instrumentation tests for all the pages we wanted to monitor. Each test creates a trace file using the Debug.startMethodTracingSampling() method.
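The general shape of such a test is "start tracing, run the journey, stop tracing". Below is a minimal sketch of that wrapper, written against a small interface so that it compiles and runs off-device. `TracedScenario` and `Tracer` are hypothetical names for this example, not our test code. On a device, `start` would delegate to `android.os.Debug.startMethodTracingSampling(tracePath, bufferSize, intervalUs)` and `stop` to `android.os.Debug.stopMethodTracing()`.

```java
// Hypothetical wrapper: illustrates the start/run/stop pattern only.
final class TracedScenario {

    /**
     * Abstraction over the platform tracing calls. An on-device
     * implementation would call Debug.startMethodTracingSampling(...)
     * in start() and Debug.stopMethodTracing() in stop().
     */
    interface Tracer {
        void start(String traceName);

        void stop();
    }

    /** Runs one user journey with method tracing enabled around it. */
    static void run(Tracer tracer, String traceName, Runnable journey) {
        tracer.start(traceName);
        try {
            journey.run();
        } finally {
            // Always stop, so tracing ends even if the journey throws.
            tracer.stop();
        }
    }
}
```

Stopping in a `finally` block is a deliberate choice in this sketch: a failing journey should still end the trace, so that the run leaves a usable trace file behind for diagnosis.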
Closing thoughts
In this part, we described profiling, its importance, its challenges, and how we do it at Tokopedia.
In this blog series, we have explained the monitoring and profiling of performance issues, which are critical parts of the performance improvement cycle. By now you hopefully have some understanding of our philosophy on performance and of how we built an internal ecosystem to support it.