Migrating Automation Lab to the cloud (Part 1)

This is part 1 of a two-part series on migrating Automation Lab to the cloud.

TL;DR: Our journey migrating from .NET Framework to .NET Core, and how we used Jenkins to create an on-demand cloud cluster for automation runs, saving money on hardware and time on maintenance.

At Zoosk, we have a huge set of end-to-end tests that run a few times a day in our homegrown Automation Lab cluster. The whole infrastructure is based on Windows VMs running on Hyper-V (on a monster Windows machine with 100 GB of RAM), with around 100 VMs running at any given point in time.

The goal was to move the infrastructure to the cloud and make it cross-platform (able to run on macOS, Linux, and Windows).

There were three challenges associated with this:

  1. Port to .NET Core
  2. Fix the broken pieces
  3. Build infrastructure

Port to .NET Core:

Microsoft has good documentation and tools to help you migrate smoothly from .NET Framework to .NET Core (check out the simple guide here).

Visual Studio 2017 or later is needed to work with cross-platform .NET Core, so our first step was to port our project from Visual Studio 2015.

Visual Studio uses project files in the .csproj format, and the fun part is that the .csproj format changed between VS2015 and VS2017. We started with the one-way upgrade that VS2017 provides, but none of the projects compiled and we were left with a few hundred errors.

It’s a pain to hunt down the errors and fix them, but the new format got us excited because it removes a lot of clutter from the project file and makes it easier to read and understand.

For example, the explicit compile and embedded-resource entries inside a project can be removed, because the new format includes them by default, like the ones below:

<!-- the defaults -->
<Compile Include="**\*.cs" />
<EmbeddedResource Include="**\*.resx" />

Project dependency definitions are simpler too. Verbose references like this one:

<ProjectReference Include="..\Payment\Payment.csproj">
  <Project>{1A2VA732-BN78-09OD-OPD9-5698AHD9OP02}</Project>
  <Name>Payment</Name>
</ProjectReference>

are replaced by:

<ProjectReference Include="..\Payment\Payment.csproj" />
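
Putting the two simplifications together, a complete project file in the new format can be very short. The sketch below is illustrative only: the target framework (netcoreapp2.0) is an assumption, not necessarily the version we targeted, and the file is not copied from our actual project.

```xml
<!-- Hypothetical minimal SDK-style project file (illustrative sketch) -->
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <!-- Assumed target; use the framework version your project needs -->
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <!-- No Compile or EmbeddedResource items: *.cs and *.resx files are included by default -->
    <ProjectReference Include="..\Payment\Payment.csproj" />
  </ItemGroup>

</Project>
```

The Sdk attribute on the Project element is what opts the project into the default includes, which is why the explicit file entries are no longer needed.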

Fix the broken pieces:

After fixing the .csproj files, there were a few more errors left before we could actually compile and run on VS 2017 (e.g. unsupported third-party libraries).

With .NET Core written from scratch with a focus on cross-platform support, some of the libraries we depended on didn’t support .NET Core yet. Because of this, we had to find alternatives and rewrite our code to use the new libraries’ patterns.

Using the API Portability Analyzer tool provided by Microsoft, we generated a report on the project (see image below) to see how much effort was needed.

Report from the API Portability Analyzer tool

The changes were broken into two steps:

  1. Replace the third-party libraries with alternatives that support .NET Core and rewrite the code that uses them
  2. Strip the Windows API dependencies from the automation framework and rewrite that code to be cross-platform

A sneak peek of the third-party library analysis

Finding a replacement for our existing XMPP implementation was the toughest part. We couldn’t find any well-written documentation for this, and the only resource we ended up relying on was the RFC from the Internet Engineering Task Force (IETF).

We had also added our own custom packets on top of the XMPP standard, so rewriting that part took longer than we had budgeted for. The mail library was another piece that needed a heavy rewrite.

Once all the changes and rewrites needed to finally compile and run were done, about 4k lines had been added. The cool part was that the migration gave us a chance to remove old and unused code, which reduced the footprint by about 8k lines.

Final pull request after the code changes.

Build infrastructure:

And the final piece of the puzzle: setting up jobs to kick off automation on our nightly and master branches daily.

We use Jenkins as our automation server, and now that we could run the automation test suite on Linux, it was time to retire the Hyper-V hypervisor and use AWS spot instances.

Spot instances were a perfect fit for us because they were up to 90% cheaper than regular instances and could be provisioned for a few hours and then destroyed.

We started by creating an AMI for the spot instances and launched them on demand using the EC2 plugin for Jenkins.

Installing Google Chrome on CentOS is not straightforward, but I found a script in a blog post by Intoli that installs Google Chrome on CentOS so it can run headless. We used that script while creating the AMI.

Once the AWS details were configured in Jenkins, it was a piece of cake to start instances with a pipeline script.

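// Scripted pipeline: declare the optional 'count' job parameter.
properties([
    parameters([
        string(name: 'count',
               defaultValue: '10',
               description: 'count of the instances to start')
    ])
])

node {
    // String parameters arrive as text, so convert before looping.
    int count = params.count.toInteger()
    for (int i = 1; i <= count; i++) {
        // One stage per instance; the EC2 plugin provisions it from the template.
        stage("Creating cloud instance ${i}") {
            ec2 cloud: 'EC2', template: 'your cloud template'
        }
    }
}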

This pipeline script takes an optional count parameter and creates that many instances on demand on AWS. The slave.jar file is then copied to each instance, which is launched as a Jenkins slave computer.

By using spot instances, we no longer needed blade runner with a cluster of VMs.

That freed us from a lot of pressure points and improved our overall processes. A few of them are:

  • No more VMs occasionally going rogue and tests failing due to unknown config issues
  • No more cleanup or maintenance of the VMs
  • Updating the AMI applies the changes to every instance launched afterward
  • Scalable: now that we have this infrastructure, we can let developers run automation for their changes as needed
  • Massive time improvements (more about this in part 2)

Shout-out to Conor Callahan for helping me out with a lot of the DevOps work needed to make this possible.

Part 2 goes into the details of the test runner architecture.