Performance at Scale — Taking Your On-premise Application to the Limit (Quickly)

Adam S
Published in Quali-TechBlog
5 min read · Sep 22, 2019

Improving application performance is like staying in shape — most programmers won’t bother with it most of the time, but when you suddenly must run to catch a bus, your lungs start to burn and your legs turn to jelly.

Obviously, you’d want to be prepared for that future: the time when your application is used by thousands of users simultaneously. But how do you prepare with an on-premise distributed system like CloudShell Pro?

This article will describe a testing framework we at Quali created to test end-to-end user scenarios at scale during a performance spike we had in the company.

Every few months, we shift away from a feature-focused sprint, to improve our product quality; we call it a quality-spike sprint — a 3–4-week sprint with clear KPIs (e.g. making our product faster by X%).

Sprints that involve performance-spike enhancements are among the most challenging, as the KPIs from our product teams are usually “make it better!” and are very hard to measure.

With such a vague challenge, we figured that we needed to be prepared for anything — and we decided to invest in new infrastructure.

We decided to build a continuous testing framework that would be flexible enough to run multiple user flows in our system (CloudShell Pro), using a web interface or APIs, and to provide results that are easy to compare. Our goal was to create a framework that would gather inputs from our product team and then provide a readable result for comparison once we improve the application’s performance, preferably with as few pre-test steps as possible. We aimed for one-click results.

We identified three major aspects of the project:

1. Orchestrating the environment

2. Executing the load test

3. Providing an end-to-end (E2E) automation — easy click and run functionality

Step 1: Rise of the Machines — Orchestrating the Environment

Our product, CloudShell Pro, is an on-premise Environment as a Service tool. Most of our customers install it in private data centers, while others install it in the cloud. We decided to test it in the cloud, as we needed high-spec machines and easy deployment.

The heart of the project is the configuration of the CloudShell Pro machines, for example:

  1. CloudShell Pro server (Backend) + CloudShell Pro portal (frontend)
  2. Execution Server (CloudShell Pro execution agent)
  3. SQL Server

Repeatedly orchestrating this environment in a quick and consistent way is complex. So, we looked for a tool that would provide maintainability, flexibility, and easy integration with our CI tool.

Building such a complex orchestration ourselves seemed like a high-effort task, so we decided to use Quali’s own SaaS solution, CloudShell Colony, for the orchestration.

CloudShell Colony is a web-based Environment as a Service (EaaS) platform that creates on-demand environments containing predefined virtual machines and applications.

CloudShell Colony provides solid integration with CI tools (through a fast API), fast migration between cloud providers (for example, AWS to Azure), consistent deployment, and easy maintenance.

We designed a blueprint in CloudShell Colony that defines each of the CloudShell Pro components, installs them on-demand, and configures them to use one another. Out of the box, we got all aspects of connectivity and debugging capabilities from CloudShell Colony, so no extra work was needed.

One of the CloudShell Pro blueprints we used (in this configuration, two machines only)
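
To make the blueprint idea more concrete, here is a minimal Python sketch of the three-component environment listed earlier and the order in which its components would need to come up. This is not CloudShell Colony's blueprint format. The `Component` class, the component names, and the dependency edges are all assumptions made for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Component:
    """One machine or service in the environment (hypothetical model)."""
    name: str
    depends_on: tuple = ()


# Assumed dependency edges: the server needs the database, and the
# execution server needs the CloudShell Pro server to connect to.
COMPONENTS = [
    Component("execution-server", depends_on=("cloudshell-server-and-portal",)),
    Component("cloudshell-server-and-portal", depends_on=("sql-server",)),
    Component("sql-server"),
]


def install_order(components):
    """Return component names so that every dependency comes first."""
    graph = {c.name: set(c.depends_on) for c in components}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(install_order(COMPONENTS))
```

Declaring the dependencies as data, rather than scripting the installation steps by hand, is what makes repeated deployments consistent: the order is derived from the declaration every time.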

Step 2: Easy run around the block — Executing the load test:

Once we had an application that we could deploy in minutes and debug easily, we needed to start writing and executing the actual tests. We needed a framework that would simulate many concurrent real users, through both our web interface and our APIs.

The tools for the job:

  1. JMeter
  2. BlazeMeter

JMeter is a well-known performance-testing tool. It’s an easy “plug and play” utility that our QA team uses to record some of our clients’ most-used user flows, covering both UI testing and APIs (XML-RPC, REST, etc.).

JMeter supports distributed execution out of the box, so many users can access our application at once. However, orchestrating, configuring, and collecting the results from JMeter is not easy, and we needed a tool to do those tasks on our behalf. For that, we use BlazeMeter, a performance-testing SaaS tool that can execute our JMeter scripts without any need to change them. Why is it better than simply running the scripts locally?

  1. BlazeMeter runs in parallel, on a selected cloud provider, and scales easily.
  2. BlazeMeter has a reporting system: it aggregates the results and creates graphs for analysis.

Early reports in BlazeMeter
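
BlazeMeter did this aggregation for us, but it helps to see what it involves. The sketch below summarizes a JMeter results file saved as CSV, grouping samples by label. The `label`, `elapsed`, and `success` columns are part of JMeter's default CSV output; the sample rows and flow names are invented for illustration and are not results from our tests.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_jtl(csv_text):
    """Aggregate JMeter CSV results per sampler label."""
    elapsed = defaultdict(list)
    errors = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["label"]
        elapsed[label].append(int(row["elapsed"]))  # response time in ms
        if row["success"].strip().lower() != "true":
            errors[label] += 1
    return {
        label: {
            "samples": len(times),
            "avg_ms": statistics.mean(times),
            "max_ms": max(times),
            "error_rate": errors[label] / len(times),
        }
        for label, times in elapsed.items()
    }


# Invented sample data in JMeter's CSV layout (subset of columns).
SAMPLE = """timeStamp,elapsed,label,responseCode,success
1569139200000,120,Login,200,true
1569139200500,180,Login,200,true
1569139201000,900,Reserve blueprint,500,false
1569139201500,300,Reserve blueprint,200,true
"""

if __name__ == "__main__":
    for label, stats in summarize_jtl(SAMPLE).items():
        print(label, stats)
```

Doing this by hand across many distributed load generators is exactly the collection work we wanted to avoid.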

Step 3: Assembling the pieces — E2E automation:

Building a framework is not always enough; you want to make sure it will be easy to use, and others will embrace it. We wanted to create a one-click E2E that would allow any user — product, QA, or developer — to provide a set of inputs and get the execution results from the new automation.

Our “Initial Click” that starts the process is a TeamCity build. TeamCity, the ultimate CI tool, provides tracking of changes to our code, a results history, and an easy way to provide and retrieve execution inputs. We use the results history later for comparison and analysis.

Initiated by TeamCity, a sandbox is created in CloudShell Colony and tested by BlazeMeter. The report is then sent to TeamCity for later analysis.
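
The flow can be sketched in a few lines of Python. This is an outline of the control flow only, not our actual build script: the `start`, `stop`, `run_load_test`, and `publish` callables are hypothetical stand-ins for the CloudShell Colony, BlazeMeter, and TeamCity integrations, whose real APIs are not shown here.

```python
from contextlib import contextmanager


@contextmanager
def sandbox(start, stop):
    """Start an environment and always tear it down afterwards."""
    env = start()
    try:
        yield env
    finally:
        stop(env)  # runs even if the load test raises


def run_pipeline(start, stop, run_load_test, publish):
    """Deploy, test, publish the report, then clean up."""
    with sandbox(start, stop) as env:
        report = run_load_test(env)
        publish(report)
        return report


if __name__ == "__main__":
    # Stand-ins for the real integrations (all hypothetical).
    log = []
    report = run_pipeline(
        start=lambda: log.append("start") or "sandbox-1",
        stop=lambda env: log.append(f"stop {env}"),
        run_load_test=lambda env: {"env": env, "avg_ms": 150},
        publish=lambda r: log.append("publish"),
    )
    print(log, report)
```

The important design point is the guaranteed teardown: high-spec cloud machines are expensive, so the sandbox should be released even when a test run fails.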

Now, we have a system that gives us a feedback loop — we can identify an issue, fix it, and run the E2E automation process to get results regarding the latest performance improvement, with comparisons to previous runs.

One-click E2E that allows any user to execute a high-scale performance test
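
As an illustration of the comparison step in that feedback loop, the following sketch computes the percent change in average response time per user flow between a baseline run and the latest run. The function name, the flow names, and the numbers are hypothetical and not taken from our results.

```python
def compare_runs(baseline, current):
    """Percent change in average response time per flow (negative = faster)."""
    changes = {}
    for flow, base_ms in baseline.items():
        # Skip flows missing from the current run or with an unusable baseline.
        if flow in current and base_ms > 0:
            changes[flow] = round((current[flow] - base_ms) / base_ms * 100, 1)
    return changes


if __name__ == "__main__":
    baseline = {"Login": 200.0, "Reserve blueprint": 1200.0}
    current = {"Login": 150.0, "Reserve blueprint": 1260.0}
    print(compare_runs(baseline, current))
```

A per-flow view like this shows when a fix speeds up one scenario while slowing down another, which a single overall average would hide.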

Work in progress:

This project is still in progress, and many improvements can be added. For example, we could configure TeamCity to pass or fail a build according to a condition based on the results. If 90% of our virtual users experience a 20-second delay, that’s probably unacceptable, so the test should fail.
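
A condition like this could look roughly as follows. This is a sketch of something we have not built yet: the function names, the statistic key, and the default thresholds are assumptions. The `buildStatisticValue` and `buildProblem` lines use TeamCity's service-message syntax, which TeamCity picks up when a build step prints them to standard output.

```python
def slow_fraction(elapsed_ms, delay_ms):
    """Fraction of samples that took delay_ms or longer."""
    if not elapsed_ms:
        raise ValueError("no samples to evaluate")
    return sum(1 for ms in elapsed_ms if ms >= delay_ms) / len(elapsed_ms)


def gate(elapsed_ms, delay_ms=20_000, max_slow_fraction=0.9):
    """Return (passed, service_messages) for one test run.

    The run fails when at least max_slow_fraction of the samples
    experienced a delay of delay_ms or more (assumed thresholds).
    """
    fraction = slow_fraction(elapsed_ms, delay_ms)
    passed = fraction < max_slow_fraction
    messages = [
        f"##teamcity[buildStatisticValue key='slowFraction' value='{fraction:.3f}']"
    ]
    if not passed:
        messages.append(
            "##teamcity[buildProblem description="
            f"'{fraction:.0%} of samples took {delay_ms} ms or longer']"
        )
    return passed, messages


if __name__ == "__main__":
    samples = [21_000] * 9 + [800]  # invented data: 9 of 10 users are slow
    passed, messages = gate(samples)
    for line in messages:
        print(line)
```

Reporting the value as a build statistic on every run, not only on failure, keeps it available in the results history for later comparison.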

Also, we can pull and collect the logs from the VMs to understand what happened inside CloudShell Pro during the test.

Summary:

Complex on-premise applications, like CloudShell Pro, are naturally harder to test. Orchestrating the resources, configuring the services, and providing a stable way to execute tests repeatedly are usually too much work for a small team in a short time. Using CloudShell Colony, BlazeMeter, TeamCity, and JMeter enabled us to build a pipeline that can be used both in our daily CI pipeline and for any future performance spike, whether we are testing future versions of CloudShell Pro or other applications with a similar structure.
