Migrating Automation Lab to the Cloud (Part 2)
This is part 2 of a two-part series; check out part 1 if you haven’t already.
As part of the migration, we wanted to rewrite our test runner and architect it to improve overall speed and efficiency.
The Test-Runner is the backbone of our automation cluster. It takes the list of tests and the list of available test machines, and runs the tests in parallel across the cluster.
We use Jenkins as our automation server. Cluster management and part of the test running are delegated to Jenkins, and the Test-Runner stays in sync with it while automation is running.
We started by creating a Jenkins job that runs a single test. It takes the test machine, the test to run, and the TestRail ID to report to as parameters, and runs the test on an AWS Linux spot instance pointed at the test machine’s address. As soon as the test is done, the Test-Runner sends the result to the TestRail server.
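To make the parameter passing concrete, here is a minimal Python sketch of how a parameterized job like this could be addressed through Jenkins' remote access API, which exposes a `buildWithParameters` endpoint per job. The host, the job name, and the parameter names (`TEST_MACHINE`, `TEST_NAME`, `TESTRAIL_ID`) are illustrative assumptions, not the actual configuration.

```python
from urllib.parse import quote, urlencode


def build_trigger_url(jenkins_url, job_name, test_machine, test_name, testrail_id):
    """Build the URL that triggers one parameterized Jenkins build.

    The parameter names below are hypothetical; a real job defines its own.
    """
    params = urlencode({
        "TEST_MACHINE": test_machine,   # address of the machine under test
        "TEST_NAME": test_name,         # which test to run
        "TESTRAIL_ID": testrail_id,     # where the result should be reported
    })
    # Job names may contain spaces, so they are percent-encoded in the path.
    return f"{jenkins_url.rstrip('/')}/job/{quote(job_name)}/buildWithParameters?{params}"


if __name__ == "__main__":
    print(build_trigger_url(
        "https://jenkins.example.com",
        "Run Single Test",
        "qa-01.example.com",
        "test_login",
        1234,
    ))
```

Sending an authenticated POST request to the resulting URL queues the build; the sketch only constructs the URL so it can be run without a Jenkins server.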
Step 1 (Test-Runner):
- Compiles a list of tests to run
- Creates the list of test machines in the available cluster
- Creates a new test plan for reporting/monitoring and gets the plan ID
- Starts the job that creates an AWS Linux spot instance cluster
- Starts polling for “Run Single Test” job availability
Step 2 (Test-Runner):
- Triggers a “Run Single Test” job with parameters as soon as the poll finds availability and the test machine is free to run the test
- Runs several tests on each machine and polls for the status of each test
- As soon as a test completes, reports the result back to TestRail
- Reruns a test once if it hits an environment issue
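The single-rerun rule in the last bullet can be sketched roughly as follows. This is a minimal Python illustration, assuming a hypothetical `run_test` callable that returns one of three outcomes; the real Test-Runner may classify results differently.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"
    ENV_ISSUE = "env_issue"  # hypothetical label for an environment problem


def run_with_single_rerun(run_test, test_name):
    """Run a test, and rerun it exactly once if the first attempt hit an
    environment issue. Genuine failures are not retried.

    Returns the final outcome and the number of attempts made.
    """
    outcome = run_test(test_name)
    attempts = 1
    if outcome is Outcome.ENV_ISSUE:
        outcome = run_test(test_name)
        attempts = 2
    return outcome, attempts
```

Capping the retry at one attempt keeps a broken environment from stalling the whole run while still absorbing one-off glitches.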
Step 3 (Cleanup):
- Stops the build if any of the tests are stuck
- Stops the Jenkins poll thread
- Deletes any files created in the process
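As a rough illustration of the cleanup step, the Python sketch below flags tests that have been running longer than a timeout, signals a poll thread to stop, and removes a working directory. The timeout-based definition of "stuck", the `threading.Event` used as a stop signal, and all names here are assumptions made for the example.

```python
import shutil
import threading


def find_stuck_tests(started_at, now, timeout_seconds):
    """Return the names of tests that have been running longer than
    timeout_seconds. started_at maps test name -> start timestamp."""
    return sorted(
        name for name, started in started_at.items()
        if now - started > timeout_seconds
    )


def cleanup(stop_polling, work_dir):
    """Signal the poll thread to exit and delete files created by the run."""
    stop_polling.set()                           # poll loop should check this event
    shutil.rmtree(work_dir, ignore_errors=True)  # no error if already removed
```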
The major change with this new architecture is that the Test-Runner starts a thread to poll for “Run Single Test” job availability and starts a new test as soon as something is available.
This small change had a huge impact. Previously, we would start a batch of jobs, recursively check whether they had started, and only then proceed with the next batch. If there was an issue, or if a single job took longer than normal to start, the Test-Runner was stuck waiting for it.
In a similar fashion, the Test-Runner polls for the status of each single test run, reports back to TestRail as soon as it is done, and closes that particular thread.
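The difference from batching can be sketched in a few lines of Python. This is a simplified illustration, not the actual Test-Runner: a semaphore stands in for “Run Single Test” job availability (which in practice would be queried from Jenkins), and `run_single_test` and `report` are hypothetical callables.

```python
import queue
import threading
import time


def run_all(tests, free_slots, run_single_test, report, poll_interval=0.01):
    """Start each test as soon as a slot is free, instead of waiting for a
    whole batch to finish. Each test reports its own result and then its
    thread ends."""
    pending = queue.Queue()
    for test in tests:
        pending.put(test)

    slots = threading.Semaphore(free_slots)  # stand-in for job availability
    workers = []

    def worker(test):
        try:
            report(test, run_single_test(test))  # report as soon as it is done
        finally:
            slots.release()                      # free the slot for the next test

    def poll():
        # Only this thread takes tests off the queue, so empty() then get() is safe.
        while not pending.empty():
            if slots.acquire(blocking=False):
                thread = threading.Thread(target=worker, args=(pending.get(),))
                thread.start()
                workers.append(thread)
            else:
                time.sleep(poll_interval)        # nothing free yet, poll again

    poller = threading.Thread(target=poll)
    poller.start()
    poller.join()
    for thread in workers:
        thread.join()
```

With this shape, one slow job only occupies its own slot; the remaining slots keep picking up new tests.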
- With the new architecture, we cut the time taken to complete a test run from 4.5 hours to 1.5 hours, a 3x speedup
- No more re-runs. Part of the old 4.5-hour run time came from several re-runs to rule out platform issues, which we no longer need.
- Completely platform agnostic: we can finally run the code on Mac, Linux and Windows. (We actually use all three platforms: Linux for automatic runs, Windows and Mac for development.)
- Removed the legacy code and upgraded the third-party libraries to their latest versions
- Updated our open source TestRail client library and also merged a PR contribution.
It was a fun project to work on, and a good way to understand the underlying architecture of the automation infrastructure. Due to tight time constraints and an office move, we had to finish the project quickly.
Now that we are done with this major milestone, here are a few things we want to focus on next:
- Some of the functional tests are still brittle, since they try to test all the possible scenarios. We are working towards making them reliable.
- There are a few edge cases with the new Test-Runner that we are trying to address
- With this knowledge, we want to experiment with building an isolated test environment with the VM (running the tests in a container, like a service, rather than on a separate VM)
- Invest in building more tooling around automation (like a Slack bot) and reduce the test run time to ~10 minutes, so that we can achieve continuous deployment.
Are you creative and want to work on fun projects like this? If yes, come work with us at Zoosk. Check out our openings here