Flank: Past, Present, and Future

Running automated UI tests in the Firebase cloud.

Flank speed is a nautical term referring to a ship’s true maximum speed.

Intro

If you’ve never used Flank, please read this blog post by Walmart Labs. In short, Flank is a massively parallel Android and iOS test runner for Firebase Test Lab.

The Past

Before Flank, to run tests inside the Firebase infrastructure, we used a Python command-line tool called gcloud. Gcloud would upload your apk and your test apk, and then run the tests on a single device. Gcloud works fine, but it doesn’t currently support test sharding. Once you get to 30–50 UI tests, the process gets slow and you need parallelization. This is the reason Flank was created.

Flank started as a Java application. The first time you run Flank, it runs every test on a separate device to measure test durations. Flank saves those durations in a file called flank.tests, which looks like this:

com.package.MyClass#testOne 18
com.package.MyClass#testTwo 64
com.other.package.MyOtherClass#myTest 24

Inside the configuration file, you can set a maximum duration per device with the shard-duration parameter. With flank.tests plus the shard-duration parameter, Flank decides how many shards to use and which tests to run on each device. Then, it gathers results and returns zero if everything passes.
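As a rough illustration of how a duration file and a per-device limit can drive sharding, here is a minimal Python sketch. It parses the flank.tests format shown above and fills shards in file order, starting a new shard whenever the next test would push the current one past the limit. The packing strategy and the function names are assumptions made for illustration; this post does not describe how the Java version actually packed its shards.

```python
def parse_flank_tests(text):
    """Parse 'Class#method seconds' lines into (name, seconds) pairs."""
    tests = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, seconds = line.rsplit(" ", 1)
        tests.append((name, int(seconds)))
    return tests


def shard_by_duration(tests, shard_duration):
    """Hypothetical packing: start a new shard when the next test
    would push the current shard past shard_duration seconds."""
    shards = []
    current, current_time = [], 0
    for name, seconds in tests:
        if current and current_time + seconds > shard_duration:
            shards.append(current)
            current, current_time = [], 0
        current.append(name)
        current_time += seconds
    if current:
        shards.append(current)
    return shards


FLANK_TESTS = """\
com.package.MyClass#testOne 18
com.package.MyClass#testTwo 64
com.other.package.MyOtherClass#myTest 24
"""

shards = shard_by_duration(parse_flank_tests(FLANK_TESTS), shard_duration=90)
for index, shard in enumerate(shards, start=1):
    print(f"shard {index}: {shard}")
```

With a limit of 90 seconds, the first two tests (18 + 64 = 82 seconds) share a device and the third test gets its own.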

The last release of the Java version was 2.0.3.

Problems with the Java version

The first big problem was that Flank depended on the gcloud command-line tool. If you had too many shards, calling gcloud from multiple threads could lead to race conditions, and the whole run would fail. Gcloud was not made to run in parallel, and once your test suite grows, the number of Python processes used to call gcloud might reach OS limits.

The second big problem was recording test durations: they could only be measured when tests ran individually. The first time we ran Flank, it spawned 400 emulators. With the Java version, a test’s duration was measured only when the test was introduced and never updated afterwards. Every time a developer added a test, they had to run it in isolation, update flank.tests with its time, and push the updated file to the git repo.

Let’s fast-forward to the present and see how these issues were addressed.

The Present

Today Flank is mainly maintained by bootstraponline, who introduced key changes to the project. At the time of writing this, the latest release is Flank version 4.1. These are the major improvements from version 2.0.3:

  • Completely rewritten in Kotlin
  • Replaced .properties config file with YML
  • Replaced gcloud with Firebase Test Lab APIs
  • iOS support
  • Aggregated JUnit XML reports
  • Smart Flank sharding algorithm
  • Test coverage

During an Expedia Group hackathon, we chose to contribute to the Flank open source project. Expedia Group mobile teams couldn’t use Flank 4.x because it was missing functionality present in version 2.0.3. After syncing with bootstraponline, we agreed to work on the missing functionality.

Our most significant contribution was the sharding algorithm. Before Flank 4.1, the algorithm distributed tests into shards by count instead of by test duration. We weren’t pleased with the algorithm from the Java version either. At Expedia Group, we have a very large number of UI tests, and sometimes Firebase couldn’t provide us with enough devices to satisfy the number of shards Flank wanted to use. We wanted two things out of a new algorithm: a more informed distribution based on previous test times, and a fixed number of shards. The feature is called Smart Flank.

Smart Flank Algorithm

To better understand how it works, let’s walk through two runs. The first time you run Flank, it will use the testShards parameter to determine how many devices to use, and tests will be distributed across the shards with a default time of ten seconds per test. Once Firebase runs the tests, Flank will download the JUnit XML output from each shard, merge the results into a single XML, and upload the file to the smartFlankGcsPath. The second time Flank runs, it will use the XML from the previous run to determine test execution times and use that information to distribute tests across shards more evenly.

Example

Let’s say we have three tests: test1, test2, test3, and our testShards value is two. Flank will create two shards, putting test1 and test2 in the first shard and test3 in the second shard. The final XML uploaded to smartFlankGcsPath will look like this:

<?xml version='1.0' encoding='UTF-8' ?>
<testsuites>
<testsuite name="MyClass" tests="3" failures="0" errors="0" skipped="0" time="18.00" hostname="localhost">
<testcase name="test1()" classname="MyClass" time="5.00"/>
<testcase name="test2()" classname="MyClass" time="5.00"/>
<testcase name="test3()" classname="MyClass" time="8.00"/>
</testsuite>
</testsuites>
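To make the role of this file concrete, here is a small Python sketch that reads per-test times out of an XML like the one above and falls back to the ten-second default for tests with no history. The function names and the `classname#name` key format are assumptions made for illustration, not Flank’s actual internals.

```python
import xml.etree.ElementTree as ET

DEFAULT_TEST_TIME = 10.0  # seconds; the default used for tests with no previous timing

MERGED_XML = """<?xml version='1.0' encoding='UTF-8' ?>
<testsuites>
<testsuite name="MyClass" tests="3" failures="0" errors="0" skipped="0" time="18.00" hostname="localhost">
<testcase name="test1()" classname="MyClass" time="5.00"/>
<testcase name="test2()" classname="MyClass" time="5.00"/>
<testcase name="test3()" classname="MyClass" time="8.00"/>
</testsuite>
</testsuites>"""


def read_test_times(xml_text):
    """Map 'classname#name' to the recorded time in seconds."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    times = {}
    for case in root.iter("testcase"):
        key = f"{case.get('classname')}#{case.get('name')}"
        times[key] = float(case.get("time"))
    return times


def estimate(test, known_times):
    """Use the previous time when there is one, else the default."""
    return known_times.get(test, DEFAULT_TEST_TIME)


times = read_test_times(MERGED_XML)
print(times)
print(estimate("MyClass#brandNewTest()", times))
```

A test that appears in the previous run gets its measured time, while a newly added test is estimated at ten seconds until it has run once.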

The second time Flank runs, it will create a list of tests sorted by their time, longest first. In our example, that would be test3, test2, test1. Once the list is generated, we iterate over the tests, assigning each one to the shard with the lowest total execution time so far. Test3 will go to shard1 and test2 will go to shard2. Test1 will also go to shard2, since at that point shard1’s duration is 8 seconds and shard2’s is 5 seconds. After Firebase finishes running all tests, Flank will download all the XMLs and replace the old merged XML at smartFlankGcsPath with the new one.
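The distribution step described above can be sketched in a few lines of Python. This is a simplified illustration of the idea (longest test first, always into the currently lightest shard), not Flank’s Kotlin implementation; the function name and data shapes are assumptions.

```python
import heapq


def smart_shard(test_times, shard_count):
    """Assign tests to a fixed number of shards.

    test_times: dict of test name -> seconds from the previous run.
    Returns (shards, totals): the test names per shard and each shard's total time.
    """
    # Longest tests first, so the big ones are placed before the small ones.
    ordered = sorted(test_times.items(), key=lambda item: item[1], reverse=True)

    # Min-heap of (total seconds so far, shard index): the lightest shard is on top.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]

    for name, seconds in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + seconds, index))

    totals = [sum(test_times[name] for name in shard) for shard in shards]
    return shards, totals


shards, totals = smart_shard({"test1": 5.0, "test2": 5.0, "test3": 8.0}, shard_count=2)
print(shards)
print(totals)
```

For the three tests in the example, test3 ends up alone on the first shard (8 seconds) and the two five-second tests share the second shard (10 seconds).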

Here are some real-world examples of the sharding output:

Matthew Runo from American Express shared his numbers after some tweaking:

Smart Flank cache hit: 100% (1462 / 1462)
Shard times: 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 116s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s, 117s
1462 tests / 58 shards

With 58 shards, each shard runs for just under two minutes. This is great since Firebase charges by rounding up to the minute.

Matt Plotner from Egencia shared a nice tweak he made to improve the execution time of his iOS Flank runs. His second run output the following:

Smart Flank cache hit: 100% (27 / 27)
Shard times: 235, 241

In this case, shard1 would be charged for four minutes and shard2 for five minutes. The total charge would be nine minutes, and the total execution time would be around four minutes. Taking advantage of this information, he decided to add a shard. This was his output:

Smart Flank cache hit: 100% (27 / 27)
Shard times: 174s, 174s, 174s

Each shard would be charged for three minutes, so the total charge is still nine minutes, but the run finishes in about three minutes instead of four.
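The arithmetic behind that comparison is easy to check with a short Python sketch. It assumes, as described above, that each shard is billed separately and rounded up to the next whole minute; the function names are made up for illustration, and actual Firebase billing rules may differ.

```python
import math


def billed_minutes(shard_seconds):
    """Round each shard's run time up to the next whole minute."""
    return [math.ceil(seconds / 60) for seconds in shard_seconds]


def summarize(shard_seconds):
    """Total billed minutes, and wall-clock time set by the slowest shard."""
    return {
        "billed_total": sum(billed_minutes(shard_seconds)),
        "wall_clock_seconds": max(shard_seconds),
    }


two_shards = summarize([235, 241])
three_shards = summarize([174, 174, 174])
print(two_shards)
print(three_shards)
```

Both configurations bill nine minutes in total, but the slowest shard drops from 241 seconds to 174 seconds, so the extra shard buys speed at no extra cost in this case.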

As bootstraponline says, “shards trade budget for performance.” If budget is not an issue, try increasing the shard number and see what happens.

The Future

Expedia Group will continue to contribute to the project to keep making Flank better. You can go to the issue tracker and see what’s coming. We are interested in improving how Flank estimates test execution times. Version 4.1 uses only the last run, when it could be using an average of the previous n runs. We are also interested in seeing more log information so we can keep making informed decisions on how to shard our tests.
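To give a sense of what averaging over previous runs could look like, here is a hypothetical Python sketch. It is only one possible approach, not a planned Flank design: each test’s estimate is the mean of its times across the last n runs in which it appeared.

```python
def average_times(runs, n):
    """Average each test's time over the last n runs.

    runs: list of dicts (test name -> seconds), oldest run first.
    A test missing from some of those runs is averaged over the runs it appears in.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    sums, counts = {}, {}
    for run in runs[-n:]:
        for name, seconds in run.items():
            sums[name] = sums.get(name, 0.0) + seconds
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


history = [
    {"test1": 10.0, "test2": 4.0},
    {"test1": 6.0, "test2": 4.0},
    {"test1": 8.0},
]
last_two = average_times(history, n=2)
last_three = average_times(history, n=3)
print(last_two)
print(last_three)
```

Averaging smooths out a single unusually slow or fast run, at the cost of reacting more slowly when a test’s duration genuinely changes.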

The Flank community is very active. Join us on #flank in the Firebase Slack. We want your input to improve Flank!