RPC Thunder Dome

Part II

Published in

Netifi

7 min read · Dec 12, 2017

The natural habitat of microservices is the cloud. This is more than just a trend. Because they are designed to scale independently, microservices take advantage of how cloud computing works. Cloud providers let you deploy microservice architectures quickly and easily. With the click of a button, you can spin up 1000 instances. The problem is that all thousand instances you just spun up need a way to communicate.

In Part I of this post, I examined three different ways microservices can communicate: Proteus RPC, gRPC, and HTTP (represented by Ratpack). The test was a simple ping/pong style test. But, the test was run on a laptop — hardly a realistic location for real-world deployments. This week, we move the test to where microservices live. Off to the cloud…

The Test

I was lucky enough to run this test using the greatest ever cloud provider, Azure. And by luck, I mean I had free credits. I used F16 instances — they’re tailored toward compute and have, you guessed it, 16 cores. The operating system was Ubuntu 16. More information can be found here. I kept the scenario simple again — we’ll save the fancy tests for later.

The test was still a simple ping/pong test, except this time the client and server were on separate VM instances. Each client created at least 16 connections to the server. Netty assigns each incoming connection to one of its event loops, spreading them roughly evenly. Since Netty creates one event loop per core, 16 connections should be enough to fully utilize a 16-core CPU.
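To make the connection-count reasoning concrete, here is a minimal sketch that assumes a plain round-robin policy for handing connections to event loops. It is not Netty code; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Illustrative only: mimics a round-robin hand-off of connections to event loops.
// This is NOT Netty code; the names here are hypothetical.
final class EventLoopAssignment {
    /** Returns how many connections land on each of `loops` event loops. */
    static int[] connectionsPerLoop(int connections, int loops) {
        int[] counts = new int[loops];
        for (int c = 0; c < connections; c++) {
            counts[c % loops]++; // the next connection goes to the next loop in turn
        }
        return counts;
    }

    public static void main(String[] args) {
        // 16 connections over 16 loops: every loop gets exactly one connection.
        System.out.println(Arrays.toString(connectionsPerLoop(16, 16)));
        // 8 connections over 16 loops: half of the loops never receive work.
        System.out.println(Arrays.toString(connectionsPerLoop(8, 16)));
    }
}
```

Under that assumption, fewer connections than event loops leaves some cores idle, which is why the client opens at least 16.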

On each run, the client sent 100,000,000 messages to the server. Two JVM memory settings were used: 2 gigs and 128 megs. Running with 2 gigs is typical of microservice instances, whereas running with 128 megs demonstrates performance in a memory-constrained environment such as a mobile phone. Both the client and server had the same memory settings. The test was run 5 times with each setting, and the best result for each setting was kept. HdrHistogram was used to record the results.
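The latency figures below are quoted as p995, which I am assuming means the 99.5th percentile. As a rough illustration of what that number represents, here is a nearest-rank percentile over raw samples. HdrHistogram works differently (it records into buckets rather than sorting), so treat this as a sketch of the concept, with hypothetical names, not as the code used in the test.

```java
import java.util.Arrays;

// Conceptual sketch only; HdrHistogram does not compute percentiles this way.
final class PercentileSketch {
    /**
     * Nearest-rank percentile. `perMille` is the percentile in tenths of a percent,
     * so 995 means the 99.5th percentile (assumed meaning of "p995").
     */
    static long percentile(long[] samples, int perMille) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Integer ceiling of perMille * n / 1000 avoids floating-point rounding surprises.
        long rank = ((long) perMille * sorted.length + 999) / 1000;
        return sorted[(int) Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] latenciesMicros = new long[1000];
        for (int i = 0; i < latenciesMicros.length; i++) {
            latenciesMicros[i] = i + 1; // synthetic samples: 1..1000 microseconds
        }
        System.out.println(percentile(latenciesMicros, 995)); // prints 995
    }
}
```

In other words, a p995 of 11.8ms would mean that 99.5% of requests completed in 11.8ms or less.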

Ratpack

Like last time, we’ll get a baseline using the HTTP/1.1 REST approach. HTTP was represented by the Ratpack server and Reactor-Netty client. HTTP/1.1 cannot multiplex messages. This is a fancy way of saying you can only send one message per connection at a time, so more than 16 connections are required for good performance. Reactor-Netty does this by default. I didn’t need to make any changes to either the client or server from the last test. With 2gigs RAM, Ratpack got 90,754 RPS and 11.8ms p995 latency. With 128megs, Ratpack got 76,910 RPS and 16.3ms p995 latency. Restricting memory didn’t dramatically affect throughput or latency.
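A back-of-the-envelope way to see why HTTP/1.1 needs extra connections: if each connection carries one request at a time, throughput is capped at the number of connections divided by the round-trip time. The sketch below works that out. The 1 ms round trip is an assumed number for illustration, not something measured in this test, and the class name is hypothetical.

```java
// Back-of-the-envelope sketch; names and inputs are hypothetical, not measurements.
final class Http1ConnectionMath {
    /** Upper bound on requests/second when each connection carries one request at a time. */
    static long maxRequestsPerSecond(int connections, long roundTripMicros) {
        return connections * 1_000_000L / roundTripMicros;
    }

    /** Minimum number of connections needed to sustain targetRps, rounded up. */
    static long connectionsNeeded(long targetRps, long roundTripMicros) {
        return (targetRps * roundTripMicros + 999_999L) / 1_000_000L;
    }

    public static void main(String[] args) {
        // With an assumed 1 ms round trip, 16 connections cap out at 16,000 requests/second.
        System.out.println(maxRequestsPerSecond(16, 1_000));
        // Sustaining 90,000 requests/second at that round trip needs at least 90 connections.
        System.out.println(connectionsNeeded(90_000, 1_000));
    }
}
```

A multiplexed protocol is not bound by this limit, because many requests can be in flight on the same connection.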

gRPC

Next up was gRPC. gRPC only needed 16 connections to the server. This is because HTTP/2 can multiplex messages. I improved on the test from last time and used their StreamObserver API rather than the Futures-based API. It’s a shame Google chose not to support Reactive Streams. I still had to wrap their object to use the standard Reactive Streams interface. Additionally, gRPC doesn’t automatically detect the Linux epoll API, which generates less garbage and generally improves performance compared to the NIO-based transport, so I had to add that as well. To do this, I had to dig through gRPC’s benchmark code. Why epoll detection isn’t automatic is beyond me! With 2gigs, gRPC got 699,903 RPS and 15.4ms p995 latency. This is a big throughput improvement over HTTP/1.1. When you lower the memory to 128megs, things go downhill quickly. gRPC creates lots of garbage. With 128megs, it only got 128k RPS with a p995 of 62.5ms: throughput is about 5 times worse, and latency at p995 is about 4 times worse.

Proteus RPC

Finally, I tested Proteus RPC. Proteus RPC runs on top of RSocket. I used TCP RSocket again. RSocket is also multiplexed, requiring only 16 connections. Proteus RPC automatically detects epoll. I updated the client to create a connection per CPU. There is a load balancer with connection pooling, but that will be used in a later test. Proteus was able to achieve over 1,000,000 RPS with 2 gigs of RAM. It got 1,134,246 RPS with a 3.5ms p995 latency. That is about 62% more throughput than gRPC, with p995 latency more than 4 times lower. Proteus with 128 megs performed better than gRPC did with 2 gigs of RAM. It got 751,150 RPS and 11.8ms p995 latency.

With 16 times less memory, Proteus RPC still had about 7% more throughput and about 23% lower p995 latency than gRPC.
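The comparisons can be recomputed directly from the reported figures. The snippet below is just arithmetic on the numbers quoted in this post; the class name is hypothetical.

```java
// Plain arithmetic on the figures reported in this post; nothing here is measured anew.
final class ReportedNumbers {
    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        // Both libraries with a 2 gig JVM.
        System.out.printf("Proteus vs gRPC throughput (2 gigs): %.2fx%n", ratio(1_134_246, 699_903));
        System.out.printf("gRPC vs Proteus p995 latency (2 gigs): %.1fx%n", ratio(15.4, 3.5));

        // Proteus with 128 megs against gRPC with 2 gigs.
        System.out.printf("Memory: %.0fx less%n", ratio(2048, 128));
        System.out.printf("Proteus (128 megs) vs gRPC (2 gigs) throughput: %.2fx%n", ratio(751_150, 699_903));
        System.out.printf("gRPC (2 gigs) vs Proteus (128 megs) p995 latency: %.2fx%n", ratio(15.4, 11.8));
    }
}
```

With both at 2 gigs, that works out to roughly 1.62 times the throughput and 4.4 times lower p995 latency. With Proteus at 128 megs against gRPC at 2 gigs, it is roughly 1.07 times the throughput and 1.3 times lower p995 latency.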

Again, the protocol that was created specifically for microservices, rather than borrowed from the browser, performed better. It outperformed all other competitors with less RAM. This brings us to an interesting point. There is a misconception that Java has problems with garbage collection. Java doesn’t have a GC problem; it has a problem with developers making too much garbage. The people who wrote Netty knew this. Netty provides a great byte buffer library that recycles objects to prevent waste. Proteus and RSocket go the extra mile to use this. They use zero-copy techniques to avoid creating extra garbage.

To drive the point home, another test was run with Proteus RPC using a 92 megabyte JVM. Think about it: a JVM test with less than 100 megabytes. I tried running the same test with gRPC, but it crashed with OutOfMemoryError. With 92 megabytes, Proteus got over 548,000 RPS and 15ms p995 latency. Bells and whistles can always be added, but the core of a library is hard to change. Proteus RPC has a demonstrably superior core.
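The idea of recycling buffers instead of allocating a new one per message can be shown with a toy pool. This is a deliberately simplified, single-threaded sketch with hypothetical names. Netty’s pooled allocator is far more sophisticated, and this is not how Proteus or RSocket are implemented.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Toy, single-threaded buffer pool for illustration; not Netty's allocator.
final class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private int allocations; // counts buffers that were actually allocated

    BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Hands out a recycled buffer when one is available, otherwise allocates. */
    ByteBuffer acquire() {
        ByteBuffer buf = free.poll();
        if (buf == null) {
            allocations++;
            return ByteBuffer.allocateDirect(bufferSize);
        }
        buf.clear(); // reset position and limit so the buffer can be reused
        return buf;
    }

    /** Returns a buffer to the pool; the caller must not touch it afterwards. */
    void release(ByteBuffer buf) {
        free.push(buf);
    }

    int allocations() {
        return allocations;
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool(1024);
        for (int i = 0; i < 100_000; i++) {
            ByteBuffer buf = pool.acquire();
            buf.putLong(i);    // stand-in for encoding a message
            pool.release(buf); // recycle instead of leaving it for the GC
        }
        System.out.println(pool.allocations()); // prints 1
    }
}
```

A hundred thousand messages pass through a single allocation, which is the kind of behavior that keeps garbage collection pressure low in a small heap.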

Conclusion

As awesome as 1 million RPS is, that isn’t the whole story. So, what is? Well, a couple of things. First, when you build your application, the libraries you choose incur overhead. This includes your RPC. You don’t want an RPC library greedily hogging all the memory and sucking up precious CPU. The more efficient it is, the more resources you have for what matters to you. Secondly, superior performance results create a safety net for your application. Taking up all the CPU and memory leads to dangerous situations where your production systems will crash. If you’re running a resource hog, you don’t have room to deal with dangerous spikes in traffic and latency. Proteus RPC is efficient. It leaves room for what you care about. With Proteus RPC’s superior performance, latency and traffic spikes don’t affect you anymore. You have room to breathe. Everything just got a lot easier.

If you’ve been programming long enough, someone’s quoted Donald Knuth to you:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

What they never mention is the second part — the important part — what you optimize:

“Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.”

The first sentence bears repeating: “Yet we should not pass up our opportunities in that critical 3%”. What is the critical 3%? In a microservice architecture, where services live and die by communication, your RPC is in that 3%. If you care about your microservices, try Proteus RPC.

The tests from Part I and Part II produced interesting results. They were a lot like drag racing: they showed how fast each RPC went. Now we know the speed. The final results are included below. In Part III, we’ll see how they handle more realistic scenarios with multiple clients and servers. Next time we’ll include stats about CPU utilization, network utilization, and maybe a little chaos testing…

Appendix

The source code for the tests can be found here.

Requests per second based on JVM memory settings

The requests-per-second results from each test compared side-by-side. gRPC and Ratpack were unable to complete the 92 megabyte JVM test. They failed with out-of-memory errors.

Latency in microseconds with 2 gigabyte JVM

Latency distribution comparison of Proteus RPC, gRPC, and Ratpack with a 2 gigabyte JVM. The graph is in microseconds, and captures up to the 99.99th percentile.

Latency in microseconds with 128 megabyte JVM

Latency distribution comparison of Proteus RPC, gRPC, and Ratpack with a 128 megabyte JVM. The graph is in microseconds, and captures up to the 99.99th percentile.