Evaluating Critical Performance Needs for Microservices and Cloud-Native Applications

Microservices are coming on strong, but reality bites

Let’s start with the good news: it’s now widely accepted that microservices are the way to build truly cloud-native applications. That brings us to the less good news: communicating between distributed microservices over the network introduces inherent complexities that simply did not exist in monolithic applications.

For example, the article “Containers and Microservices: Five Key Truths” looks at microservices, often deployed in containers, as leading technologies that enable greater efficiency in cloud computing. It states that the number one issue is that “Complexity can become a problem,” citing latency, scalability, and reliability as factors.

And that is consistent with this report, “Workload Characterization for Microservices” by IBM researchers, which concludes: “We observed a significant overhead due to the microservice architecture; the performance of the microservice model can be 79.1% lower than the monolithic model on the same hardware configuration. The microservice model spent much more time in runtime libraries to process one client request than the monolithic model by 4.22x on a Node.js application server and by 2.69x on a Java EE application server.”

Yes, you read that right, 79.1% slower! At Netifi, our mission is to tackle this problem. We’re working on a next-generation platform for reactive microservices and cloud-native applications that delivers high performance and, at the same time, automates away the complexity of building distributed systems for the developer. Our platform is based on the open source RSocket network protocol, along with our software broker, which are specifically designed for microservices and cloud-native applications.
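
For readers who have not worked with RSocket, here is a minimal request/response sketch using the open source rsocket-java library (class names assume rsocket-java 1.x; the host, port, and echo handler are illustrative, and the Netifi broker is not shown):

    import io.rsocket.Payload;
    import io.rsocket.RSocket;
    import io.rsocket.SocketAcceptor;
    import io.rsocket.core.RSocketConnector;
    import io.rsocket.core.RSocketServer;
    import io.rsocket.transport.netty.client.TcpClientTransport;
    import io.rsocket.transport.netty.server.TcpServerTransport;
    import io.rsocket.util.DefaultPayload;
    import reactor.core.publisher.Mono;

    public class RSocketEcho {
        public static void main(String[] args) {
            // Server side: answer each request/response call by echoing the data back.
            RSocketServer.create(SocketAcceptor.forRequestResponse(payload ->
                        Mono.just(DefaultPayload.create("echo: " + payload.getDataUtf8()))))
                    .bind(TcpServerTransport.create("localhost", 7000)) // port is illustrative
                    .block();

            // Client side: open a single multiplexed connection and issue one request.
            RSocket client = RSocketConnector
                    .connectWith(TcpClientTransport.create("localhost", 7000))
                    .block();

            String response = client.requestResponse(DefaultPayload.create("ping"))
                    .map(Payload::getDataUtf8)
                    .block();
            System.out.println(response); // prints "echo: ping"

            client.dispose();
        }
    }

Because RSocket is a protocol rather than a proxy, there is no per-request sidecar hop; that design difference is what the comparison below exercises.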

Naturally, we wanted to know how well the Netifi platform would stand up to the real-world communication needs of a microservices architecture generating significant network traffic. To validate Netifi’s performance, we used the same testing methodology (Acme Air) cited in the IBM paper; it provides an objective, standardized test that can be used as a benchmark for comparisons. As shown in the first graph, IBM has published test results for Istio, an increasingly popular technology designed to manage communications between microservices, which gave us a good basis for comparing Netifi’s performance.

The Results

For our testing, we used the same 60 users as the published Istio test, and the results were impressive and surprised even us. Netifi averaged over 16,000 requests per second during the test, nearly four times the throughput of Istio, while maintaining one-third the latency.

[NOTE: We used the same testing methodology. Specifics on our testing environment are at the end of this post.]

Istio Results

Istio’s published test shows 4,380 transactions per second and an average response time of 12 milliseconds.

Netifi Results

In preliminary testing, Netifi averaged 16,300 transactions per second (nearly four times higher than Istio) with an average response time of 4 milliseconds, one-third the latency of Istio.

In the top chart, you can see the 4,380 transactions per second measured for Istio. It’s important to note that this is published testing, not ours. Compare that with the chart above, which shows 16,300 transactions per second for Netifi in our preliminary testing, nearly four times higher. And take a look at the average response times: 12 milliseconds for Istio versus 4 milliseconds for Netifi, one-third the latency.

Comparing the published Istio results with Netifi’s preliminary results shows Netifi delivering roughly 3.7 times the throughput (16,300 vs. 4,380 transactions per second) at one-third the latency (4 ms vs. 12 ms): dramatically higher performance.

Ramping Things Up

Next, we cranked things up to 300 users using the same Acme Air testing methodology to see how our Netifi technology would scale, and we were pleased to see terrific results. Naturally, we kept going, first to 1,000 users and then all the way up to 1,500 users (25 times more users than the Istio test). I think you’ll agree that the results shown below are very impressive: at 1,500 users, Netifi sustained 27,000 transactions per second with an average response time of 53 milliseconds, which most of us would still find acceptable.

Shown here are our preliminary test results for Netifi running with 300 users, then 1,000 users and finally 1,500 users.

Conclusion

These test results confirmed the inefficiency of taking an infrastructure approach (an API proxy sidecar alongside every application) to communication between microservices versus our protocol approach with open source RSocket. Netifi and RSocket were designed to solve the fundamental communication problem in microservices, and these tests demonstrate the merits of our approach.

We know that performance is one of many criteria that customers consider, along with ease of deployment and integration with existing infrastructure, and of course the non-trivial issue of cost. Look for more blog posts from us soon as we discuss these topics.

Test Environment Details

The test consists of the same four services as the Istio test. CPU was restricted on each container because Prometheus could be running on the same box. In addition to traffic from the test, metrics were streamed over Netifi as well. Because there was only one instance of each service, all calls resulted in a network hop.
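
To illustrate what a per-container CPU restriction looks like (the names and values here are hypothetical; our actual manifests are not published), a Kubernetes limit is declared like this:

    # Hypothetical pod spec fragment; actual per-container values were not published.
    apiVersion: v1
    kind: Pod
    metadata:
      name: acmeair-service
    spec:
      containers:
      - name: acmeair-service
        image: acmeair/service:latest   # placeholder image
        resources:
          requests:
            cpu: "500m"   # half a vCPU reserved for scheduling
          limits:
            cpu: "1"      # hard cap at one vCPU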

JMeter Test Driver

A jumpbox that runs the JMeter script to generate load on the servers (a representative invocation is sketched after the specs below).

  • n1-highcpu-16 (16 vCPUs, 14.4 GB memory)
  • Intel Haswell
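
The exact Acme Air test plan is not reproduced here, but a representative non-GUI JMeter run (the script name and the user-count property are hypothetical) looks like this:

    # Run the test plan headless and write results to a log file.
    # -n: non-GUI mode, -t: test plan, -J: set a user-defined property, -l: results log
    jmeter -n -t acmeair.jmx -Jusers=60 -l results.jtl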

Kubernetes Cluster

  • 4 nodes (a standard GCP Kubernetes cluster)
  • n1-standard-4 (4 vCPUs, 15 GB memory)
  • Intel Broadwell

Cloud SQL Database

  • 32 vCPUs and 208 GB memory
  • The larger database instance was necessary because the number of supported connections is directly related to the size of the database, so we needed a larger instance to handle the larger number of connections we were able to sustain.