Include sensitivity analysis in your SLAs: How many servers do I need in practice?
In a nutshell, the software we build enables people to make investment decisions using advanced mathematical models. Our products range from stand-alone APIs to complete platforms, and our users range from individuals to large institutional investors such as pension funds, insurance companies and sovereign wealth funds. Performance-wise, we face the challenge of offering individuals the same level of accuracy in financial modeling as we provide to our institutional clients. However, because our services are increasingly used via third-party apps (e.g. online banking environments, robo-advice platforms, or financial planning tools), we need to deliver calculations within several seconds. On top of that, our systems have to support a certain level of user concurrency. In this post we explain how we monitor and manage the performance requirements of our services within our development process. Since our core business is designing and deploying mathematical models, it should come as no surprise that we added some math to this methodology.
We develop our software in biweekly sprints and are now able to deliver new software versions at that pace. However, during the transition at the beginning of this decade, we introduced ‘end’ sprints prior to client deliveries. This waterfall inheritance acted as a buffer for focusing on non-functionals such as documentation and final functional and performance testing. I believe this was business as usual before the DevOps revolution. Since finding performance flaws after several months of coding is like finding a needle in a haystack, we soon automated our performance tests and questioned the need for this ‘end’ sprint. With performance tests incorporated in our continuous deployment pipeline, teams were quite confident in the application’s performance, and any flaws were tracked down almost immediately after a commit. As our products grew, the combination of more functionality and our growing user groups’ need for speed soon made application performance an interesting challenge. Our computations, typically simulations executed in parallel, can consume all of a server’s CPUs quite efficiently. Hence, not only do we need to secure calculation times on idle systems, we also have to take system waiting times into account. To meet an SLA of the form ‘99% of all service requests should respond within X seconds’, our systems needed to scale horizontally, meaning adding more machines. So now the interesting question became:
How many machines do you need to meet a certain SLA?
Intermezzo: Dynamic scaling with Cloud Native Technologies
One can argue that dynamic scaling with cloud native technologies, with containers and an orchestration layer under the hood, should solve this challenge as well. Yes, I agree. However, the services described in this post have an ‘always-on’ property, and our scale puts us past the break-even point in the on-premise vs. cloud trade-off, in favor of on-premise. Regardless of whether deployments are container or VM based, when hosting on-premise it is vital to allocate sufficient resources to your application cluster.
The Queuing-view of deployment systems design
Finding a solution to the aforementioned question has similarities with the challenges in the design of service-oriented systems like call centers. Here the incoming calls correspond to the application’s requests and the call center employees represent the servers in the system. Much literature is available on optimizing the service levels of call centers with queuing theory. The literature classifies queuing systems using Kendall’s notation, in which a queuing system is described by three factors A/S/c: A denotes the arrival process to the queue (e.g. Markovian, Deterministic or a General distribution), S the service time distribution and c the number of servers in the system. Depending on the assumed distributions for A and S, one can derive several statistical properties of the system, such as the average waiting time, the average time in the system and the average number of customers in the system (Little’s Law; see Little and Graves, 2008). Most interesting is the PASTA (Poisson Arrivals See Time Averages) property, which applies when arrivals to the system are Poisson. This well-researched characteristic states that the probability of the state as seen by an outside random observer is the same as the probability of the state seen by an arriving customer (Wolff, 1982). It follows that under the condition of λ < μ (arrival rate < service rate) the users in the system are Poisson (λ) distributed.
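As a reminder of what Little’s Law states, here is the relation in its usual form with a small worked example. The numbers in the example are purely illustrative assumptions and are not measurements from our systems.

```latex
% L: average number of requests in the system
% \lambda: average arrival rate, W: average time a request spends in the system
L = \lambda W
\qquad\text{e.g.}\qquad
\lambda = 0.5\ \text{requests/s},\quad W = 2\ \text{s}
\;\Longrightarrow\;
L = 0.5 \times 2 = 1\ \text{request in the system on average}
```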
We found that the requests to our APIs serving large user groups (more than 50,000 users) do follow such a Poisson process. Hence, if we extract the parameter λ from our access logs, we get a glimpse of user concurrency. For example, for the data set used in the graphs below, we found that inter-arrival times are exponentially distributed with mean μ = 4.18, meaning that on average we receive a request every 4.18 seconds. So if we assume our system is stable, that is, it has enough resources (c) given the average time it takes to handle a request (1/μ), the distribution of the number of users in the system is given by:
From the above graph we can already conclude that, for this load, deploying just two servers would be sufficient, given that P(X ≤ 2 | μ) is almost 1.
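The quick scan above can be reproduced with a few lines of Python. This is a minimal sketch, not our production tooling: the function names `arrival_rate` and `poisson_cdf` are hypothetical, and it assumes that the Poisson parameter is the arrival rate per second (1/4.18) with requests taking roughly one second each.

```python
import math


def arrival_rate(timestamps):
    """Estimate arrivals per second from sorted request timestamps (in seconds)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return len(gaps) / sum(gaps)


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam**i / math.factorial(i) for i in range(k + 1))


# Assumption: a mean inter-arrival time of 4.18 s gives a rate of 1/4.18 per second.
lam = 1.0 / 4.18

for k in range(4):
    print(f"P(X <= {k}) = {poisson_cdf(k, lam):.4f}")
```

Under these assumptions P(X ≤ 2) comes out at about 0.998, which matches the "almost 1" reading. In practice `arrival_rate` would be fed with timestamps parsed from the access logs.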
So using the PASTA property as a quick scan for the right configuration of your deployment is very helpful. However, to make cost-effective decisions we also need to know how different configurations behave with respect to the SLA. In the above example, it could very well be that a single server also satisfies a given SLA on response times. In order to do further analysis we first have to define a model:
M/M/c Queues
Suppose, for example, that we have an API for which request arrivals follow a Poisson process with an arrival rate (λ) of 1 request per second, and that the computation time is exponentially distributed with an average duration of, say, 1 second (so a service rate μ of 1 request per second as well). Without going into the mathematical details, one can analytically derive many statistics for this type of model (M/M/c); see the wiki for details. There are online calculators available which let you sandbox your problem. The results below are from supositorio and show the derived statistics for the aforementioned setting with two servers, hence an M/M/2 model:
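For readers who prefer code over an online calculator, the standard M/M/c results (the Erlang C formula) can be evaluated directly. The sketch below is an illustration under textbook assumptions, not the calculator's implementation; the function name `mmc_metrics` and the dictionary keys are hypothetical.

```python
import math


def mmc_metrics(lam, mu, c):
    """Steady-state statistics of an M/M/c queue (arrival rate lam, service rate mu)."""
    offered_load = lam / mu          # a, in Erlangs
    utilisation = offered_load / c   # rho, per server
    if utilisation >= 1:
        raise ValueError("unstable system: need lam < c * mu")

    # Erlang C: probability that an arriving request has to wait.
    tail = offered_load**c / math.factorial(c)
    head = sum(offered_load**k / math.factorial(k) for k in range(c))
    p_wait = tail / ((1 - utilisation) * head + tail)

    wait_in_queue = p_wait / (c * mu - lam)    # Wq
    time_in_system = wait_in_queue + 1 / mu    # W
    return {
        "utilisation": utilisation,
        "p_wait": p_wait,
        "Wq": wait_in_queue,
        "W": time_in_system,
        "Lq": lam * wait_in_queue,   # Little's Law applied to the queue
        "L": lam * time_in_system,   # Little's Law applied to the whole system
    }


# The setting from the text: lam = 1 request/s, mu = 1 request/s, two servers.
for name, value in mmc_metrics(1.0, 1.0, 2).items():
    print(f"{name}: {value:.4f}")
```

For the M/M/2 setting above this gives a utilisation of 0.5 per server, a probability of waiting of 1/3 and an average time in the system of 4/3 seconds.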
This simplified model is very insightful, but its limits for practical use are twofold. First, an exponentially distributed runtime on the server is not always realistic. If the application load is homogeneous, one could assume a deterministic service time (D in Kendall’s notation), resulting in an M/D/c system. However, retrieving system insights analytically for systems other than M/M/c is challenging, since the derivations exploit the memoryless property of the Markovian distributions.
Second, the analytical approach does not give insight into every KPI of interest. SLAs in which certain behavior is guaranteed for a large percentage (e.g. 95% or 99%) of the load require a numerical approach instead. In order to extract the right figures we built a Discrete Event Simulator (DES), in which the system is repeatedly emulated by means of Monte Carlo in order to gain insight into the tail events. With it, we monitor our applications’ behavior nightly under several server setups. This also allows us to set alarms if previously agreed SLAs are in danger due to new functionality. Below is a screenshot of our daily reporting. Here the system performance (in seconds, on the y-axis) is displayed for the past 12 months. One can observe that the 95% SLA (red dotted line) for this specific use case can only be met with 2 servers up to an arrival rate of 400 requests per hour. One even needs a third server to support 500 requests per hour or more.
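To give a feel for the numerical approach, here is a minimal sketch of a Monte Carlo simulation of a first-come-first-served queue with c servers. It is not our actual simulator: the names `simulate_response_times` and `percentile` are hypothetical, and the arrival rate and service time distribution are illustrative assumptions.

```python
import heapq
import math
import random


def simulate_response_times(arrival_rate, service_time, servers, n_requests, seed=0):
    """Simulate an FCFS queue with Poisson arrivals; return each request's response time."""
    rng = random.Random(seed)
    free_at = [0.0] * servers   # min-heap of times at which each server becomes free
    now = 0.0
    response_times = []
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)       # next Poisson arrival
        duration = service_time(rng)               # sampled service time
        start = max(now, heapq.heappop(free_at))   # wait for the earliest free server
        finish = start + duration
        heapq.heappush(free_at, finish)
        response_times.append(finish - now)        # waiting time plus service time
    return response_times


def percentile(values, q):
    """Nearest-rank percentile, e.g. q=0.95 for the 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Illustrative run: 1 request/s, exponential service with a mean of 1 s (M/M/c).
for servers in (2, 3):
    times = simulate_response_times(1.0, lambda rng: rng.expovariate(1.0), servers, 20_000)
    print(f"{servers} servers: 95th percentile = {percentile(times, 0.95):.2f} s")
```

Swapping the service time function for a constant, for example `lambda rng: 1.0`, turns the same sketch into an M/D/c simulation, which is exactly the case that is hard to treat analytically.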
In Part II of this blog series we dive deeper into our toolset, which we plan to open-source upon that blog post’s publication. In addition, we show how these tools can be integrated into your CI pipeline.
Using queuing theory for SLA management is not uncommon. We described our approach and showed that performance tests on idle servers are helpful but insufficient when dealing with large number-crunching applications. We also showed that analytical derivations of queuing models can be used to generate ballpark estimates, but that solving the resource allocation question requires additional, numerically generated results.
Little, J. D. C., & Graves, S. C. (2008). Little’s law. In Building Intuition (pp. 81–100). Springer, Boston, MA.
Wolff, R. W. (1982). Poisson arrivals see time averages. Operations Research, 30(2), 223–231.