On the Latency Distribution of WSO2 APIM

Introduction

When we talk about metrics used for evaluating the performance of a system/application, two metrics come to mind: 1) throughput and 2) latency. The throughput is the number of requests/tasks/messages a system/application can process in a unit of time. In a typical client-server (e.g. HTTP server/client) model, the latency (of a request) is the total round-trip time, i.e. the difference between the time at which the response is received and the time at which the request was started.
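As a concrete illustration, here is a minimal Python sketch of measuring the round-trip latency of a single HTTP request; the endpoint URL is a placeholder rather than anything used in this blog.

```python
# A minimal sketch of measuring round-trip latency for one HTTP request.
# The URL below is a placeholder, not the service used in this blog.
import time
import urllib.request

def measure_latency(url: str) -> float:
    """Return the round-trip time (seconds) for one GET request."""
    start = time.perf_counter()          # time at which the request starts
    with urllib.request.urlopen(url) as response:
        response.read()                  # wait until the full response is received
    return time.perf_counter() - start   # time response received - time request started

if __name__ == "__main__":
    print(f"latency: {measure_latency('http://localhost:8080/echo') * 1000:.1f} ms")
```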

There are two factors that contribute to the server-side latency: 1) the waiting time and 2) the service time. The waiting time of a given request is the sum of all individual waiting times. Here the individual waiting times refer to the time the request may wait in a queue to get access to a shared resource, the time the request is delayed due to garbage collection pauses, and so on. It is important to note that the waiting time is (highly) dependent on the arrival rate of requests into the system. For example, higher arrival rates result in higher contention, which leads to higher latency. When we test the performance of a system, we increase the arrival rate by increasing the number of concurrent users accessing the system.

The service time is the time it takes to process a request. It is a constant (for a given request) on a given system and does not depend on the arrival rate of tasks into the system. (Note that it is possible to improve the service time by writing more efficient code. I will discuss this topic in a separate blog.)

The service times follow a distribution. For example, the service times of web requests closely follow long-tailed distributions. This means that a large number of requests have very small service times while a small number of requests have very long service times.
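To make the long-tail idea concrete, the following small Python sketch samples service times from a Pareto distribution (the parameters are arbitrary and purely for illustration) and shows how the high percentiles dwarf the median.

```python
# A small sketch illustrating a long-tailed service-time distribution.
# The Pareto parameters and scale are arbitrary, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(42)
service_times = (rng.pareto(a=2.0, size=100_000) + 1) * 5.0  # milliseconds

print(f"median service time : {np.median(service_times):.1f} ms")
print(f"99th percentile     : {np.percentile(service_times, 99):.1f} ms")
print(f"maximum             : {service_times.max():.1f} ms")
# Most requests are fast, but a small fraction is far slower than the median.
```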

The service time distribution of requests processed by a middleware server (such as WSO2 APIM), however, may not be long-tailed, although this may be the case for the service times of back-end servers.

In this blog I will (mainly) focus on the behaviour of the latency (of WSO2 APIM) under different scenarios (i.e. different numbers of concurrent users etc.) and have a brief look at the behaviour of the throughput.

WSO2 API Manager (WSO2 APIM)

WSO2 API Manager is a complete solution for designing and publishing APIs, creating and managing a developer community, and securing and routing API traffic in a scalable way. WSO2 API Manager has 4 main components: 1) API Gateway, 2) API Key Manager, 3) API Publisher and 4) API Store. The API Gateway secures, protects, manages, and scales API calls. It intercepts API requests, applies policies such as throttling and security using handlers, and manages API statistics. The API Key Manager handles the client, security and access-token-related operations. The Gateway communicates with the Key Manager to check the validity of tokens, subscriptions and API invocations. The development and management of APIs are done using the API Publisher (which is a web interface). The API Store provides a place for API publishers to host and advertise their APIs and for API consumers to self-register, discover, evaluate, subscribe to and use secured, protected, authenticated APIs. A detailed description of these components can be found here.

Deployment Architecture

The performance results that I present in this blog have been obtained by running APIM 2.0 on an 8GB/4-core VM. This means that we run the API Manager, the workload generator and the back-end service on the same VM. The results have been obtained with the Gateway cache enabled, the Key Manager cache disabled and JWT generation enabled. This is one of the most commonly used scenarios in APIM. For more information about the caching and configurations available, refer to here and here. Note that we use a multiple-JVM deployment model where we deploy the Key Manager and the Gateway on separate JVMs. The initial and maximum heap memory sizes for the JVMs are set at 1GB and 4GB respectively. The back-end service is simply an echo service hosted on a Tomcat server (this means that the service times of the back-end server are very small).

Workload Generation

We use JMeter to generate HTTP requests. Each performance test is run for a period of 20 min with an initial warm-up period of 10 min. We populate the Key Manager back-end database with a certain number of tokens and configure JMeter to use these keys/tokens.
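For readers who prefer code to a JMeter test plan, the following Python sketch approximates what the workload generator does: a fixed pool of concurrent users repeatedly invoking the API, each reusing a pre-generated access token. The gateway URL, token values and counts here are placeholders, not the actual test configuration.

```python
# A rough sketch of the JMeter workload: N concurrent users repeatedly calling
# the API, each with a pre-generated access token. URL and tokens are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

GATEWAY_URL = "https://localhost:8243/echo/1.0"   # placeholder API endpoint
TOKENS = [f"token-{i}" for i in range(5000)]      # placeholder pre-populated tokens
DURATION_S = 20 * 60                              # 20-minute measurement period
CONCURRENCY = 50

def user(user_id: int) -> list[float]:
    """Simulate one concurrent user; return the latencies it observed."""
    latencies = []
    token = TOKENS[user_id % len(TOKENS)]
    end = time.time() + DURATION_S
    while time.time() < end:
        req = urllib.request.Request(
            GATEWAY_URL, headers={"Authorization": f"Bearer {token}"})
        start = time.perf_counter()
        with urllib.request.urlopen(req) as resp:
            resp.read()
        latencies.append(time.perf_counter() - start)
    return latencies

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(user, range(CONCURRENCY)))
```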

The Effect of Concurrency on the Performance

Let’s now have a look at the performance under different concurrency levels. The results presented in this blog have been obtained with 5000 tokens. Note that APIM maintains a cache which stores the validity of tokens. The size of this cache is 10000. If we use 5000 tokens, when the system is in steady state there will be no cache evictions (and therefore no back-end DB calls) within a predefined period of time. I will present the results for the case where the number of tokens > cache registry size later in a separate blog.
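The no-eviction argument can be illustrated with a toy LRU cache simulation (the real APIM key cache also has time-based expiry, which this sketch ignores; the sizes match the scenario above but the simulation itself is purely illustrative):

```python
# A toy simulation of a key-validation cache: an LRU cache of size 10000
# receiving lookups for 5000 distinct tokens. Because the number of distinct
# tokens is below the cache size, there are no evictions once the cache is warm.
from collections import OrderedDict
import random

CACHE_SIZE, NUM_TOKENS = 10_000, 5_000
cache, evictions = OrderedDict(), 0

for _ in range(1_000_000):
    token = f"token-{random.randrange(NUM_TOKENS)}"
    if token in cache:
        cache.move_to_end(token)          # cache hit
    else:
        cache[token] = True               # cache miss -> (simulated) key manager DB call
        if len(cache) > CACHE_SIZE:
            cache.popitem(last=False)     # evict the least recently used entry
            evictions += 1

print(f"evictions: {evictions}")          # 0 whenever NUM_TOKENS <= CACHE_SIZE
```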

Latency

Let’s now have a look at how the latency varies under different concurrency levels. The following figure shows the behaviour of the latency over time when the concurrency = 50.

The following graph shows the behaviour of the average latency and latency percentiles under different concurrency levels.

Clearly, we see an increase in the average latency and latency percentile values with increasing concurrency. When we fit a linear regression model to the average and percentile curves, we get a coefficient of determination (r-squared) > 0.99. This indicates that the average and latency percentiles increase linearly with increasing concurrency. For example, the r-squared value for the 99th percentile is 0.9984. The regression curve for this case is illustrated below.
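The following Python sketch shows the kind of regression fit described above; the latency values are placeholders, not the measured data.

```python
# A sketch of fitting a straight line to 99th-percentile latency vs concurrency
# and computing r-squared. The latency numbers are illustrative placeholders.
import numpy as np

concurrency = np.array([50, 100, 150, 200, 250, 300])
p99_latency = np.array([120.0, 245.0, 370.0, 490.0, 615.0, 740.0])  # ms (illustrative)

slope, intercept = np.polyfit(concurrency, p99_latency, deg=1)
predicted = slope * concurrency + intercept

ss_res = np.sum((p99_latency - predicted) ** 2)   # residual sum of squares
ss_tot = np.sum((p99_latency - p99_latency.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"p99 latency ~= {slope:.2f} * concurrency + {intercept:.2f}, r-squared = {r_squared:.4f}")
```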

We can get a better understanding of the variations in latency percentiles by plotting the percentile values against the percentiles. The following figure shows the behaviour of the latency percentile values when the concurrency = 100.

As the percentile approaches 100%, we note that there is a rapid increase in the percentile value. This implies that the majority of latency values are very small while there are very few large latency values.
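A plot like this can be produced directly from the raw latency samples; the sketch below uses synthetic right-skewed data in place of the actual measurements.

```python
# A sketch of producing the percentile-vs-percentile-value plot from a list of
# measured latencies (here replaced by synthetic right-skewed data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=4.0, sigma=0.5, size=50_000)  # synthetic, in ms

percentiles = np.arange(1, 100)
values = np.percentile(latencies, percentiles)

plt.plot(percentiles, values)
plt.xlabel("percentile (%)")
plt.ylabel("latency (ms)")
plt.title("Latency percentile values")
plt.show()
# The curve stays nearly flat and then rises sharply as the percentile
# approaches 100%, mirroring the behaviour described above.
```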

Latency distribution

Let’s now have a look at how the latency distribution behaves under different concurrency levels. The following plots show the histograms of latencies under different concurrency levels.

We note from the above histograms that under low concurrency levels the latencies are relatively small. As we increase the concurrency, there is an increase in the latency values (note how the bins of the histograms change as the concurrency increases). Under high concurrency levels, the processor cores are shared among a larger number of (processing) threads. This means that under high concurrency levels each thread (serving a request) receives relatively less CPU time within a given fixed period of time, which causes the request latency to increase. In addition, under high concurrency levels the number of context switches is higher, which also has an impact on the latency. Other reasons for the increase in latency include higher contention and an increase in the GC overhead.

The following figure shows the probability density functions (PDF) of latency under different concurrency levels.

The behaviour we see here is similar to the behaviour that we saw before in the histograms. We can make the following observations:

  1. The probability density functions of latency are right-skewed
  2. As the concurrency increases, the skewness of the PDF decreases
  3. As the concurrency increases, the standard deviation of the latency values increases. This means that as the number of concurrent users increases, the variability in the latency values increases. The following figure shows the behaviour of the standard deviation (a short sketch of how these statistics can be computed appears after this list).
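The statistics referred to in the list above (standard deviation and skewness) can be computed as in the following sketch, again using synthetic right-skewed data in place of the actual measurements.

```python
# A short sketch computing the standard deviation and skewness of a latency
# sample. The data is synthetic and right-skewed, standing in for real measurements.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=4.0, sigma=0.6, size=50_000)  # synthetic, in ms

print(f"mean               : {latencies.mean():.1f} ms")
print(f"standard deviation : {latencies.std():.1f} ms")
print(f"skewness           : {skew(latencies):.2f}")  # > 0 means right-skewed
```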

Throughput

Let’s now have a brief look at how the throughput behaves. The following figure shows how the throughput varies with the number of concurrent users.

We note that throughput increases with the number of concurrent users. The rate at which the throughput increases, however, decreases with the increasing concurrency. Once the system reaches its maximum capacity the throughput remains constant up to a certain concurrency level.
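As a small aside, each point on the throughput curve is simply the number of completed requests divided by the measurement period; the sketch below uses placeholder request counts rather than the measured results.

```python
# A sketch of deriving the throughput-vs-concurrency curve from per-run request
# counts. The counts per concurrency level are illustrative placeholders.
MEASURE_S = 20 * 60  # 20-minute measurement period (warm-up excluded)

completed = {50: 1_200_000, 100: 2_100_000, 200: 3_000_000, 300: 3_100_000}

for concurrency, count in completed.items():
    print(f"concurrency {concurrency:>3}: {count / MEASURE_S:,.0f} requests/s")
# Throughput rises with concurrency, but the increase flattens out as the
# server approaches its maximum capacity.
```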

Conclusion

In this blog I investigated the effect of the number of concurrent users on the latency distribution of a middleware server. We noted that the average latency and latency percentiles increase linearly with the number of concurrent users. We also noted that the latency distributions (i.e. PDFs) are right-skewed (i.e. positively skewed). The skewness of the latency distribution decreases with increasing concurrency. The variability of the latency values, on the other hand, increases as the number of concurrent users increases.

The throughput increases with the number of concurrent users, while the rate at which the throughput increases decreases with the increasing number of concurrent users.