Spotting Latency Regressions Ahead of Time at Teams Mobile

Saumye Srivastava
Microsoft Mobile Engineering
7 min read · Feb 2, 2024

In the realm of mobile applications, subpar performance not only results in user frustration but also contributes to high user dissatisfaction scores (DSAT), a lower app Net Promoter Score (NPS), and increased uninstallations. Developers continually implement numerous code optimizations in each release, bringing about substantial improvements in App Vitals and critical user scenarios such as application launch, chat loading time, and channel loading time.

Our Android codebase sees contributions from over 350 developers on a monthly basis, with a staggering 50+ commits merged into the mainline every day. The pace of innovation remains high, as we introduce more than 20 new features daily to our internal users across the organization and our partners.

In this dynamic environment, the need to maintain optimal app performance is paramount, emphasizing the importance of identifying and mitigating latency issues before they have a chance to impact user experience.

All code is guilty until proven innocent.

— Uncle Bob

Scenario Measurement & Monitoring

We employ a method called scenario telemetry: we mark the start and end of each critical usage scenario in the app and measure the execution time between them. The latency telemetry for all usage scenarios is recorded in our database along with user details. A ScenarioContext handle carries context for a user scenario throughout the app, including execution time, metadata, device information, status, steps, and a correlation ID.

fun onCreate() {
    val scenarioContext = scenarioManager.startScenario(AppScenarioNames.App.APP_START_WARM_FIRST_DRAW)
    // do critical work, load screen
    scenarioManager.stopScenario(scenarioContext)
}
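To make the description above concrete, here is a minimal sketch of what such a scenario-tracking API could look like; the class and field names are illustrative assumptions based on the description, not the actual Teams implementation.

import java.util.UUID

// Hypothetical sketch: a handle carrying a scenario's context through the app.
data class ScenarioContext(
    val scenarioName: String,
    val correlationId: String = UUID.randomUUID().toString(),
    val startTimeMs: Long = System.currentTimeMillis(),
    val metadata: MutableMap<String, String> = mutableMapOf(),
    val steps: MutableList<String> = mutableListOf()
)

// Hypothetical sketch: starts/stops scenarios and emits one latency record per run.
class ScenarioManager(private val telemetrySink: (Map<String, Any>) -> Unit) {

    fun startScenario(scenarioName: String): ScenarioContext = ScenarioContext(scenarioName)

    fun stopScenario(context: ScenarioContext, status: String = "OK") {
        val durationMs = System.currentTimeMillis() - context.startTimeMs
        telemetrySink(
            mapOf(
                "scenario" to context.scenarioName,
                "correlationId" to context.correlationId,
                "durationMs" to durationMs,
                "status" to status,
                "steps" to context.steps,
                "metadata" to context.metadata
            )
        )
    }
}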

Creating dashboards at Microsoft is straightforward: we craft a query on our internal portal, and our internal framework executes it every minute to chart the results into graphs. You can send the usage telemetry to your internal or third-party database and host a Grafana instance for building charts and monitors.

The data is fed by app usage from internal teams, testers, canary users, beta users, and early adopters to detect early signs of regressions across any of our usage scenarios. However, relying solely on latency-based graphs for observation has proven less effective:

  • Regressions were discovered late during rollouts, which in some cases delayed releases and fixes because alerts arrived extremely late.
  • Once a latency increase was noticed in a specific scenario, pinpointing its cause became challenging, since each release carries thousands of commits with many feature rollouts also running in parallel.

Bouncer

Before<>After

While our previous methods effectively identified substantial performance changes, given these downsides we decided to build another system to shift detection as far left as possible, i.e. to just after a regression merges into our mainline branch.

Given Microsoft Teams' expansive scale, this entailed scrutinising hundreds of commits, change by change, to identify regressions as small as 15%, which we set as our benchmark.

We did that by building a performance testing system that runs on our continuous integration servers: a performance test suite of instrumentation tests that executes each scenario before and after every commit merged to our main branch.

We run these tests after building our commits on our agents in Azure DevOps pipelines, but GitHub Actions can also be used to run the tests via am instrument on the built before-change/after-change APKs. For example:

Before: { "featureFlag": false }, After: { "featureFlag": true }

Before: <HEAD>~1, After: <HEAD>

private fun testScenarioLatency(
    scenarioName: String,
    setup: () -> Unit = {},
    test: (iterationCount: Int) -> Unit
) {
    setup()
    for (i in 0 until numWarmup)
        test(i) // Warmup runs, excluded from measurement
    for (i in 0 until numIterations)
        test(i) // Measured test runs
    writeAllLatencyDataToFile(scenarioName)
}

@Test
@LongTimeOut
fun testWarmLaunchByLauncher() {
    testScenarioLatency(
        AppScenarioNames.App.APP_START_WARM_FIRST_DRAW,
        setup = {
            MainPage().apply {
                navigate()
                validate()
            }
        },
        test = {
            device.pressBack()
            // .. execute the relevant scenario
        }
    )
}

Microsoft Hydra Lab, an open-source project built at Microsoft, allows you to run Espresso-based instrumentation tests on a self-hosted mobile lab with physical devices plugged in. Our lab uniformly consists of Pixel 4A devices connected to a Hydra Lab host machine.

We integrate our Android app with its Gradle plugin, passing it the relevant APKs from our CI pipelines; the run supports multiple test result formats and can attach files from the devices to the CI artifacts for analysis in the pipelines. After each test run, we extract and upload the test logs, the latency/memory data recorded per scenario, the test recordings, and the system traces from our devices, to be analysed by our scripts in the final step.

Our final step does a median comparison of the latency datasets (9 iterations each) from the test runs of the before and after builds, failing the build when a scenario exceeds its set threshold of 10–15%.
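A minimal sketch of that final comparison gate, assuming the latency samples have already been parsed from the uploaded per-scenario files; the default threshold and function names here are illustrative, not the production script.

// Sketch: compare the before/after medians and flag a regression beyond the threshold.
fun median(samples: List<Double>): Double {
    val sorted = samples.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

fun hasRegressed(
    beforeLatenciesMs: List<Double>, // e.g. 9 iterations from the before build
    afterLatenciesMs: List<Double>,  // e.g. 9 iterations from the after build
    threshold: Double = 0.15         // 10–15% depending on the scenario
): Boolean {
    val beforeMedian = median(beforeLatenciesMs)
    val afterMedian = median(afterLatenciesMs)
    return (afterMedian - beforeMedian) / beforeMedian > threshold
}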

Variance

At the core of our pursuit of stability lies the ability to quantify and control the variance of the system. We kickstarted our journey by capturing latency values across 10 iterations, i.e. 10 runs of the scenario under test, as a distribution, and then using the coefficient of variation (CoV) as our guiding metric.

CoV = stdev([..distribution])/mean([..distribution])

We only onboard a feature team's usage scenario metric onto the framework if its CoV < 0.15, and we are striving to go even lower.
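In code, the gating check is simple; this is a sketch that uses the 0.15 cutoff from the criterion above, with illustrative function names.

import kotlin.math.sqrt

// Coefficient of variation of a latency distribution (e.g. 10 runs of one scenario).
fun coefficientOfVariation(latenciesMs: List<Double>): Double {
    val mean = latenciesMs.average()
    val variance = latenciesMs.map { (it - mean) * (it - mean) }.average()
    return sqrt(variance) / mean
}

// A scenario metric is onboarded onto the framework only if its CoV is below the cutoff.
fun isStableEnoughToOnboard(latenciesMs: List<Double>, cutoff: Double = 0.15): Boolean =
    coefficientOfVariation(latenciesMs) < cutoff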

The key to success lies in minimising variance during test runs, ensuring that performance metrics remain stable across different iterations. In this article, we delve into the strategies employed by our team to reduce variance in test runs on Android devices, providing insights into the challenges faced and the innovative solutions implemented.

On-Device Response Mocking

Recognising the impact of varying API response latencies on test run stability, we opted for on-device response mocking. By redirecting all API calls relevant to the test scenarios to mocked responses from a dedicated mock server, we eliminated the variance caused by live backend services.

Leveraging Termux, a terminal emulator app for Android with Linux package support, we start the mock server directly on the device itself during tests and ensure our network layer hits localhost, leading to controlled and consistent results.

Enforce all API calls during the test to be mocked and fail tests if there are any live service calls.
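One way to enforce this, assuming the app's network layer is built on OkHttp, is a test-only interceptor that rewrites requests to the on-device mock server and fails fast on anything else; the class name, port, and host list below are illustrative assumptions, not the actual Teams implementation.

import okhttp3.Interceptor
import okhttp3.Response
import java.io.IOException

// Test-only interceptor: route every mocked API call to the Termux-hosted mock
// server on localhost, and fail the test if a request would hit a live service.
class MockServerInterceptor(
    private val mockPort: Int = 8080,                                // assumed mock server port
    private val mockedHosts: Set<String> = setOf("api.example.com")  // hypothetical mocked hosts
) : Interceptor {

    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        if (request.url.host != "127.0.0.1" && request.url.host !in mockedHosts) {
            // Any unmocked live service call makes the performance test fail immediately.
            throw IOException("Unmocked live service call during perf test: ${request.url}")
        }
        val mockedUrl = request.url.newBuilder()
            .scheme("http")
            .host("127.0.0.1")
            .port(mockPort)
            .build()
        return chain.proceed(request.newBuilder().url(mockedUrl).build())
    }
}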

Device Stability

Android adjusts CPU frequency dynamically to enhance battery life and manage phone temperature. Consequently, the phone’s speed fluctuated as the device heated up.

To maintain performance consistency across our runs, we obtain root access on all our devices via Magisk while onboarding them to the system. That lets us perform the optimisations below (very simplified for explanation) before every test run.

echo 940000 > /sys/devices/cpu/cpu(1..8)/cpufreq/scaling_max_freq # Lowest Supported frequency
echo 940000 > /sys/devices/cpu/cpu(1..8)/cpufreq/scaling_min_freq # Lowest Supported frequency
echo performance > /sys/devices/cpu/cpu(1..8)/cpufreq/scaling_governor

echo 0 > /sys/devices/cpu/cpu(1..8)/hotplug/state # CPU Hotplugging
setprop ctl.stop mpdecision # MPDecision Service
stop thermal-engine || true # AOSP thermal controller
stop perfd || true # AOSP profiler,tracer
stop vendor.thermal-engine || true # thermal controller by vendor
stop vendor.perfd || true # profiler,tracer by vendor

settings put system accelerometer_rotation 0 # No rotation!
am broadcast -a android.intent.action.CLOSE_SYSTEM_DIALOGS # close dialogs before run

# Animation related changes
settings put global window_animation_scale 0.0
settings put global transition_animation_scale 0.0
settings put global animator_duration_scale 0.0

Google's lockClock script was a good starting point, but we have made a lot of changes on top of it, such as our choice of scaling_governor.

Additionally, a device health dashboard is critical for tracking health metrics such as CPU usage, temperature, storage, and memory usage during tests, as well as the median deviation across tests run on each device.

Test Run Stability

Variance stability in test runs extends not just to the execution environment but also to the code under test.

Ensure the APK under test is R8-optimised and has debugging disabled. We observed up to a 50% improvement in latency stability from these two changes alone.

We execute full AOT (Ahead of Time) compilation using the command below before each test run, plus 3 warmup runs before measuring the whole test suite.

We also enable Sustained Performance Mode, used by game developers for consistent, reliable performance across diverse scenarios. Though many device manufacturers implement it as a no-op, it's good to enable it during test runs (see the sketch after the commands below).

# AOT Compile test app
cmd package compile -f -m speed <package-name>

// Enable R8, disable debuggable (build.gradle build type under test)
debug {
    minifyEnabled true
    debuggable false
}
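For the Sustained Performance Mode mentioned above, here is a minimal sketch of enabling it from the activity under test (API 24+); whether it has any effect depends on the device, so we check PowerManager support first.

import android.app.Activity
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Sketch: request Sustained Performance Mode for the activity under test (API 24+).
// On devices where the manufacturer implements it as a no-op, this changes nothing.
fun enableSustainedPerformance(activity: Activity) {
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.N) {
        val powerManager = activity.getSystemService(Context.POWER_SERVICE) as PowerManager
        if (powerManager.isSustainedPerformanceModeSupported) {
            activity.window.setSustainedPerformanceMode(true)
        }
    }
}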

Outcome

We now have stable latency data commit by commit and feature rollout by feature rollout, as every request merged to our mainline goes through the trials of these rigorous performance gates. We also receive alerts for potential performance regressions far less frequently, and when we are alerted, it is more likely to signify a genuine regression.

This has streamlined our workload, eliminating the need for manual adjustments to static performance thresholds after each false positive; we now total roughly 100k test runs over the past few quarters.

Performance tests against pull requests, which used to be consistently red because at least one scenario regularly breached its threshold as a false positive, are now predominantly green. When performance tests do turn red, we have increased confidence that a genuine performance regression exists while we rerun the tests.

Multiple such regressions caught and reverted in a day

Conclusion

Our hope is that by sharing these insights, others can navigate similar paths to triumph. The signal quality enhancements brought about by the design and methodologies above have been a game-changer, allowing us to unearth regressions that once lurked in the shadows.

Bouncer has emerged as a crucial player in the performance team’s playbook at Microsoft Teams Mobile, rescuing teams from potential production-bound regressions.

Now, engineers can channel their efforts towards speeding up enhancements rather than grappling with regressions.
