Android Device Farm at Mercari
Hello everyone! Vishal from Mercari’s SET team.
With offices in several regions, including Tokyo, San Francisco, and London, we thought it would make sense to share our devices across regions by setting up our own device farm. So we did, using OpenSTF on GCP. In this blog post, I will share our experience and the hacks we used to set up STF at Mercari.
Why Device Farm?
Delivering high-quality apps across all of the different device and OS combinations in use today is a major challenge for all mobile app developers. The only way to confirm app quality is by testing across various devices. This requires buying many devices, which creates another big problem: device management. It is very difficult to manage a fleet of devices across teams and projects. If we could set up a central system where all devices are connected remotely, and these devices could be accessed on demand for development and QA purposes, that would be the ideal solution. And OpenSTF is the only open-source tool currently available that allows us to do so.
STF was developed by @sorccu & @gunta at CyberAgent with the objective to remotely control devices from within a browser, closely emulating using a physical device in hand, and the way it works is pretty impressive. There is a long list of features STF provides, such as:
- Remote Control physical devices
- Keyboard and mouse input with Multi-Touch
- Drag and Drop APK install
- Take screenshots and share them within the team using web links
- Display adb logs in real time with custom filters
- Run shell commands
- Android Studio integration
Before moving forward, it’s worth looking into the STF architecture, since we will be deploying it. STF is built from microservices: various independent services communicate with each other via ZeroMQ and Protocol Buffers. Unlike common web services, which have only a Client Side and a Server Side, STF has one more side where code runs: the Device Side. So overall, the architecture has three components:
- Device Side
- Server Side
- Client Side
To capture screenshots and trigger multitouch events on Android devices, STF uses minicap and minitouch, respectively, which run on the device and provide an open socket that transfers data between Server Side and Device Side through adb. More details can be found in their READMEs.
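To make the device-side socket a bit more concrete, here is a minimal Python sketch of how a client could split a minicap stream into frames. It assumes the wire format as I understand it from the minicap README (a 24-byte little-endian banner sent once, then JPEG frames each prefixed with a 4-byte little-endian length), so please verify it against the README before relying on it. The function names are hypothetical, and this is not STF's own code, which does this in Node.js.

```python
import io
import struct

# Assumed banner layout (24 bytes, little-endian): version, banner length,
# pid, real width/height, virtual width/height, orientation, quirks.
BANNER = struct.Struct("<BBIIIIIBB")
BANNER_FIELDS = ("version", "length", "pid", "real_width", "real_height",
                 "virtual_width", "virtual_height", "orientation", "quirks")


def read_exact(stream, size):
    """Read exactly `size` bytes from a file-like object or raise EOFError."""
    data = b""
    while len(data) < size:
        chunk = stream.read(size - len(data))
        if not chunk:
            raise EOFError("stream closed before the message was complete")
        data += chunk
    return data


def read_banner(stream):
    """Parse the one-time banner into a dict keyed by field name."""
    values = BANNER.unpack(read_exact(stream, BANNER.size))
    return dict(zip(BANNER_FIELDS, values))


def read_frames(stream):
    """Yield JPEG frames until the stream ends at a frame boundary."""
    while True:
        try:
            header = read_exact(stream, 4)
        except EOFError:
            return  # no further frame header: treat as end of stream
        (size,) = struct.unpack("<I", header)
        yield read_exact(stream, size)


if __name__ == "__main__":
    # Synthetic bytes standing in for a real minicap socket.
    fake = BANNER.pack(1, 24, 1234, 1080, 1920, 540, 960, 0, 2)
    fake += struct.pack("<I", 4) + b"\xff\xd8\xff\xd9"
    stream = io.BytesIO(fake)
    print(read_banner(stream)["real_width"], len(list(read_frames(stream))))
```

With a real device, `stream` would be a socket file object (for example `sock.makefile("rb")`) connected to a local port that adb forwards to the minicap socket on the device.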
Along with the aforementioned native binaries, STF also deploys STFService.apk on the device which runs in the background as an Android Service. This service provides a socket API for monitoring and performing various actions on the device. Again, Server Side talks with STFService.apk through adb using Protocol Buffer.
Server Side consists of various independent Node.js-based microservices. These services communicate with each other via ZeroMQ. Server Side can be further divided into two categories:
- Provider Layer
- Application Layer
Provider Layer consists of the microservices that are responsible for direct communication with the devices. For this, STF has the stf-provider service. All communication with devices is done through adb. The stf-provider service tracks devices using adb and fires a notification whenever a device is connected or disconnected. When a new device is connected, the stf-provider service forks a new Node.js process called stf-device, which is responsible for all communication with that particular device. Overall, Provider Layer consists of two services: stf-provider and adb. These services should run on all the physical machines where devices are connected.
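The device-tracking idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how stf-provider is implemented: it parses two snapshots in the tab-separated "serial, state" format that adb uses for device lists and reports what changed between them. The function names, event names, and serials are hypothetical.

```python
def parse_device_list(payload):
    """Parse 'SERIAL<TAB>STATE' lines into a {serial: state} dict."""
    devices = {}
    for line in payload.splitlines():
        if "\t" not in line:
            continue  # skip blank lines or headers
        serial, state = line.split("\t", 1)
        devices[serial] = state.strip()
    return devices


def diff_devices(previous, current):
    """Return (event, serial) tuples describing changes between snapshots."""
    events = []
    for serial in sorted(current.keys() - previous.keys()):
        events.append(("add", serial))      # newly connected device
    for serial in sorted(previous.keys() - current.keys()):
        events.append(("remove", serial))   # device went away
    for serial in sorted(current.keys() & previous.keys()):
        if current[serial] != previous[serial]:
            events.append(("change", serial))  # e.g. offline -> device
    return events


before = parse_device_list("emulator-5554\tdevice\nABC123\toffline\n")
after = parse_device_list("ABC123\tdevice\nXYZ789\tdevice\n")
for event, serial in diff_devices(before, after):
    print(event, serial)
```

In a provider-like process, an "add" event is the point at which a per-device worker (stf-device in STF's case) would be started, and a "remove" event is where it would be torn down.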
Application Layer consists of all the other microservices (such as stf-api, stf-app, stf-auth, etc.) which complete STF. Explaining each of them would be outside the scope of this blog. From the deployment point of view, these services can run anywhere. The only requirement is that they can communicate with the provider over the network, so they must be able to reach it on the same network.
STF Client Side is implemented using AngularJS. Most communication between Client Side and Server Side goes through WebSocket. STF also has a few REST APIs, for example to list available devices.
The official deployment guide uses a combination of Docker, systemd, Fleet, and CoreOS to deploy STF in production, but users are free to choose their own deployment environment and tools. Deployment requirements within the official guidelines are as follows:
- Physical machines to which the devices are connected should have CoreOS or any Linux based OS.
- All machines should have static IP addresses.
- Port Range 15000 ~ 25000 for all machines should be open to all users.
- Docker & systemd are available on each machine.
Interested readers can check out this tutorial which uses Vagrant to create a virtual CoreOS cluster on a local machine. You can find all necessary configuration files, scripts, and commands to deploy STF in this tutorial.
Limitations of the official deployment
- Setup requires a CoreOS cluster and static IPs, which makes the initial setup cost high.
- Access is only available within a local network. Though it is possible to access it from outside by using a VPN, it is very difficult to use it from cloud CI services (such as CircleCI).
- Maintenance cost is high, as all infra is on-premise.
STF Deployment at Mercari
The reasons we wanted to set up a device farm were:
- the ability to use devices across regions
- the ability to use devices from cloud CI services for test automation
Because of the limitations of the official STF setup, we had to redesign the deployment architecture.
We did not want to host anything locally, as that would increase operation costs too much. At the same time, STF cannot simply be moved to a cloud platform, because physical devices are involved and we cannot plug them into cloud-based virtual machines. To host STF in the cloud, we had to figure out a way to connect local devices to cloud machines; the solution lies in the adb architecture.
As I explained above in the STF architecture overview, all communication with physical devices is done through adb. The adb tool was created so that developers can debug devices from their machines. Let’s understand how adb works. ADB has three components:
- ADB Daemon
- ADB Server
- ADB Client
ADB Daemon runs inside the device. Whenever a user turns on the developer debug option on an Android device, this daemon starts. ADB Server and ADB Client exist within the same binary and run on the development machine. ADB Server listens on TCP port 5037 by default. All client queries (such as adb devices) go to this port and are handled by ADB Server. This is the key point for our deployment architecture.
If we can somehow forward all ADB Client queries (running on a cloud platform) to ADB Server (running locally, where the devices are connected), we can do the complete setup in the cloud. However, forwarding ADB Client requests from the cloud to a local machine is not an easy job, since the local machines do not have public IP addresses. We solved this problem by creating a reverse SSH tunnel. From the local machine (where the devices are connected), we open a reverse SSH tunnel to the cloud machine (where stf-provider is running), so that all ADB Client requests on the cloud machine are forwarded to the local ADB Server. This way, the cloud-based stf-provider behaves as if the devices were connected to the cloud VM.
This is the magic command:
ssh -f -N -T -R :5037:127.0.0.1:5037 user@cloud-host
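To see why a plain TCP forward of port 5037 is enough, it helps to look at what an ADB Client actually sends to ADB Server. The following is a minimal Python sketch of that exchange for simple host queries, based on the adb protocol as I understand it: the request is the service string prefixed with its length as four hex digits, and the reply starts with OKAY or FAIL followed by a four-hex-digit length and the body. The helper names are hypothetical, and real clients such as the adb binary handle many more cases.

```python
import socket


def encode_request(service):
    """Prefix the service string with its length as four hex digits."""
    payload = service.encode("utf-8")
    return "{:04x}".format(len(payload)).encode("ascii") + payload


def parse_header(header):
    """Split the 8-byte reply header into (status, body_length)."""
    return header[:4].decode("ascii"), int(header[4:8], 16)


def read_exact(sock, size):
    """Receive exactly `size` bytes or raise ConnectionError."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("adb server closed the connection")
        data += chunk
    return data


def query(service, host="127.0.0.1", port=5037):
    """Send one host query to an adb server and return the reply text."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(encode_request(service))
        status, length = parse_header(read_exact(sock, 8))
        body = read_exact(sock, length).decode("utf-8")
        if status != "OKAY":
            raise RuntimeError("adb server replied {}: {}".format(status, body))
        return body


print(encode_request("host:devices"))
```

On the cloud VM, calling `query("host:devices")` would connect to local port 5037, which the tunnel forwards to the ADB Server on the machine where the devices are plugged in. The client cannot tell the difference, which is exactly what the setup relies on.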
Overall STF Setup Architecture
Now that the device connection problem is solved, we can design the setup in any way we want. We host STF on GCP. The setup uses a GCP load balancer, which proxies all traffic depending on the region. Each region has its own master with all STF microservices running. Services for which only a single instance is required (such as stf-triproxy) run in one region only. Devices are connected to Mac minis on which only adb is running; we call these local providers. Each local provider is connected to its cloud provider through a reverse SSH tunnel.
About Latency & Stability
At this point, you may be doubting the stability of this setup, since the entire system depends on an SSH tunnel. The setup was completed almost 5 months ago, and I waited this long to write this post so that I could collect real data on its stability. We use autossh to keep the SSH tunnel alive at all times and systemd to manage all microservices. Whenever a disconnect occurs, autossh restarts the tunnel and systemd restarts the cloud provider, restoring all the devices. For the past few months, we haven’t seen any major disconnects. We run automated tests every night on these devices from CircleCI and, as of yet, no test has failed due to connection issues.
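As a rough illustration of the keep-alive side, here is a small Python sketch that builds an autossh command line for the same reverse tunnel and checks whether the forwarded port is reachable. The specific options and values (keepalive interval, ExitOnForwardFailure, and so on) are assumptions of mine rather than the exact configuration described in this post, and the function names are hypothetical.

```python
import socket


def autossh_command(remote, adb_port=5037, alive_interval=30, alive_count=3):
    """Build an autossh argument list for a reverse tunnel to `remote`."""
    forward = ":{0}:127.0.0.1:{0}".format(adb_port)
    return [
        "autossh",
        "-M", "0",    # disable autossh's monitor port, rely on ssh keepalives
        "-N", "-T",   # no remote command, no pseudo-terminal
        "-o", "ServerAliveInterval={}".format(alive_interval),
        "-o", "ServerAliveCountMax={}".format(alive_count),
        "-o", "ExitOnForwardFailure=yes",  # exit if the port cannot be bound
        "-R", forward,
        remote,
    ]


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(" ".join(autossh_command("user@cloud-host")))
```

The argument list can be passed to `subprocess.run` on the local provider or written into a service definition. On the cloud side, `port_is_open("127.0.0.1", 5037)` is a cheap check that the tunnel is up before the provider is restarted.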
Admittedly, latency is a problem, but not a major one; it does not make STF so slow that the device becomes impossible to use. To mitigate it, we always try to connect to devices in our own region first. Sometimes STF becomes very slow if the client has a poor Wi-Fi connection or limited throughput, but this is a user-side problem rather than a deployment issue. One way to reduce its impact is to set stf-provider’s screen-jpeg-quality option to 25. This will reduce the image size by 75%, while keeping the visual quality largely the same.
This device farm has empowered us to use any cloud CI service to run automated tests. Before this, we only had two options: either use cloud device farms (such as AWS Device Farm), which are pretty expensive, or use Jenkins locally. Maintaining local Jenkins agents and plugins in each region is not something I really want to do, since it doesn’t scale well for us with our current resources. With this setup, we can connect to STF devices from any CI service using the STF API. To understand how to run automated tests on STF devices, you can read more here: stf-appium-example.
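To give an idea of what a CI job does, here is a minimal Python sketch of the request flow against the STF REST API: list devices, reserve one, ask for a remote connect URL, and release the device afterwards. The endpoint paths and the present/ready/using fields follow the STF API documentation as I understand it; the base URL, token, serial, and helper names are placeholders. The sketch only builds the requests, so it runs without a live STF instance.

```python
import json
import urllib.request


def api_request(base_url, token, path, method="GET", body=None):
    """Build (but do not send) an authenticated STF API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1" + path, data=data, method=method)
    request.add_header("Authorization", "Bearer " + token)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def list_devices(base_url, token):
    return api_request(base_url, token, "/devices")


def reserve_device(base_url, token, serial):
    return api_request(base_url, token, "/user/devices", "POST",
                       {"serial": serial})


def remote_connect(base_url, token, serial):
    path = "/user/devices/" + serial + "/remoteConnect"
    return api_request(base_url, token, path, "POST")


def release_device(base_url, token, serial):
    return api_request(base_url, token, "/user/devices/" + serial, "DELETE")


def available_serials(devices_payload):
    """Pick devices that are present, ready, and not currently in use."""
    return [device["serial"]
            for device in devices_payload.get("devices", [])
            if device.get("present") and device.get("ready")
            and not device.get("using")]


# Placeholder values; a real job would read these from CI secrets.
request = reserve_device("https://stf.example.com", "my-token", "ABC123")
print(request.get_method(), request.full_url)
```

Each request would be sent with `urllib.request.urlopen(request)`. The remote connect response contains a URL that the CI job passes to `adb connect`, after which the test runner sees the device like any locally attached one.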
Maintaining device infrastructure is one of the biggest hurdles in the mobile test automation journey. This setup has helped us tremendously in solving this problem. I hope our learnings will help you too!