Testing the gRPC API: Can it replace CQL drivers?

Author: Tomasz Lelek

DataStax
Building Real-World, Real-Time AI
Dec 30, 2021

We’re constantly working to make it easy to build applications on Apache Cassandra®. After releasing the new Stargate gRPC API, we put it through several performance tests to make sure it is a good replacement for CQL drivers. This is what we found.

In our earlier post, we described how the new Stargate gRPC (gRPC Remote Procedure Calls) API lets you use Java to call any microservice in any cloud. Now you can stop wasting time searching for (sometimes in vain), installing and updating drivers for your purpose. But what good would that do if you ended up spending the same amount of time waiting for the API to respond? To find out whether that would be the case, we put the gRPC API to the test.

Our performance tests focus on proving that the DataStax Astra DB solution works well end-to-end with gRPC. So we’re not testing the Stargate gRPC API in isolation, but rather in the context of Astra DB. This way we’re sure that the end solution provides our users with high performance and low latency.

Until now, the most efficient API provided by Stargate and Astra DB has been the Cassandra Query Language (CQL) API. Our goal was to validate that the gRPC API can perform just as well. To generate the traffic for our tests, we used our benchmarking suite NoSQLBench, which is easy to use and can run the same workload against different Stargate APIs. We picked the Cassandra-optimized key-value workload. This is the table schema for it:

We want both workloads (CQL and gRPC) to run against the same Astra DB setup. So for all of these tests we use the default Astra DB setup, which is what you get when you create a new database. This covers the baseline workload for our tests.

Once we established the baseline performance, we began testing the gRPC API at scale. As of this writing, we’re still testing these higher throughput scenarios. So, we’ll cover those in a future blog post.

Our current test setup involves three client nodes on Amazon Elastic Compute Cloud (Amazon EC2) that generate traffic to our test database using NoSQLBench. Each node generates a third of the total traffic we want to validate. Three nodes can reliably generate the target load, whereas a single client node might not be able to.

Figure 1. The test setup consists of three client nodes on Amazon EC2 running one third of our desired traffic each.

We experimented with different request rates and found that the basic free-tier Astra DB cluster can serve around 5,500 operations per second through the CQL API without any problems. We use this as the baseline for testing whether the gRPC API can perform as well as the CQL API.

Figure 2. The baseline for the CQL API was 5,500 operations per second from a basic Astra DB cluster.

Our first performance tests with the gRPC API returned some worrisome results

We mirrored the CQL performance test setup for the gRPC API. From this point on, all the requests go through the Astra DB gRPC API. As you will see, the first run resulted in a lot of errors related to the HTTP/2 protocol that gRPC uses.

It also gave us a huge number of errors per second:

Figure 3. Operations per second, latency and number of errors for gRPC API.

The latency was disturbingly high. After some investigation, we increased http2_max_concurrent_streams to 512 (from the default of 128) and keepalive_requests to 4000 (from the default of 1000). Both are NGINX settings that affect our gRPC processing. These configuration changes reduced the number of errors substantially, to around four per second.

Figure 4. While the number of errors decreased drastically, the performance was still far from stable.

The remaining errors had the gRPC status UNAVAILABLE or DEADLINE_EXCEEDED. Since UNAVAILABLE errors are safe to retry on the client side, we improved our NoSQLBench gRPC driver to retry them.
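To make the retry rule concrete, here is a minimal sketch of a client-side wrapper that retries only UNAVAILABLE failures. It is not the NoSQLBench driver's code: `RpcFailure`, `Status`, `call_with_retry`, `make_flaky_operation`, and the `max_tries` and `backoff_seconds` parameters are hypothetical names invented for this illustration, and the "request" is a fake function rather than a real gRPC call. DEADLINE_EXCEEDED is deliberately not retried here, because only UNAVAILABLE is described above as safe to retry.

```python
import time
from enum import Enum


class Status(Enum):
    """Stand-in for the two gRPC status codes mentioned in the text."""
    UNAVAILABLE = "UNAVAILABLE"
    DEADLINE_EXCEEDED = "DEADLINE_EXCEEDED"


class RpcFailure(Exception):
    """Hypothetical exception carrying a status code (not a real gRPC class)."""

    def __init__(self, status):
        super().__init__(status.value)
        self.status = status


def call_with_retry(operation, max_tries=3, backoff_seconds=0.0):
    """Run operation(), retrying only when it fails with UNAVAILABLE.

    Returns (result, tries). Any other failure, or UNAVAILABLE on the
    last allowed try, is re-raised to the caller.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return operation(), attempt
        except RpcFailure as failure:
            retryable = failure.status is Status.UNAVAILABLE
            if not retryable or attempt == max_tries:
                raise
            # Linear backoff between tries (an assumption for this sketch).
            time.sleep(backoff_seconds * attempt)


def make_flaky_operation(failures_before_success):
    """Build a fake request that fails with UNAVAILABLE a fixed number of times."""
    remaining = {"count": failures_before_success}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RpcFailure(Status.UNAVAILABLE)
        return "row-written"

    return operation


result, tries = call_with_retry(make_flaky_operation(2))
print(result, tries)  # row-written 3
```

Returning the number of tries alongside the result makes it easy to count total tries separately from successful operations, which is the kind of comparison the benchmark relies on.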

After this change, the gRPC API was able to process all traffic generated by the benchmarking tool. However, the number of retries was non-negligible:

Figure 5. Number of total tries and number of successful operations — we can use it to calculate retries.

We retried around 10,000 operations (28,999,994 − 28,990,485 = 9,509) in an example run. As a result, the graph for the number of operations is not smooth, and the performance is not stable. The number of successful operations per second varies from 3,000 to around 5,000, and on average it is only around 4,000.
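The retry count follows directly from the two totals in Figure 5. The short calculation below spells it out, assuming every retried operation eventually succeeded, so that total tries equals successful operations plus retries. The variable names are ours.

```python
# Totals reported for the example run (Figure 5).
total_tries = 28_999_994
successful_operations = 28_990_485

# Every try beyond one per successful operation counts as a retry.
retries = total_tries - successful_operations
retry_share = retries / total_tries

print(retries)               # 9509
print(f"{retry_share:.4%}")  # 0.0328%
```

The share of retried requests is small in relative terms, but the retries are frequent enough in absolute terms to disturb the throughput.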

If we compare these successful operations per second to the baseline benchmark, they fall well short of the stable 5,500 operations per second achieved by the CQL workload.

Surpassing our expectations with NGINX persistent connections

We found that the grpc_pass directive causes NGINX to stop sharing connections between gRPC requests. In order to share connections when using grpc_pass, we also have to use the upstream directive.

The gRPC API uses the HTTP/2 protocol for better performance. One of the most important benefits of HTTP/2 is the ability to share connections between requests rather than closing them after each completed request. For that reason, we implemented HTTP/2 connection sharing for gRPC with the upstream directive.

When we re-ran the performance tests, the results surpassed our expectations. The version of gRPC with connection sharing was able to handle a stable 5,500 operations per second. So we were able to achieve performance comparable to the CQL API.

Figure 6. In the re-run the gRPC API is handling a stable load of 5,500 operations per second.

The number of client-side retries dropped to 21 for the whole run (Figure 7). As we saw, each retry is caused by a gRPC exception. In the previous version, without connection sharing, an example run needed around 10,000 retries, so we reduced the number of errors by more than 99%.

Figure 7. Result success vs number of tries. Only 21 retries executed.
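As a rough check on the size of that improvement, the calculation below compares the 21 retries from Figure 7 with the 9,509 retries computed from the earlier example run (28,999,994 − 28,990,485). It assumes the two runs are comparable in length, which the text implies but does not state.

```python
# Retries in the earlier run without connection sharing (from Figure 5 totals).
retries_without_sharing = 28_999_994 - 28_990_485
# Retries in the re-run with connection sharing (Figure 7).
retries_with_sharing = 21

reduction = 1 - retries_with_sharing / retries_without_sharing
print(f"{reduction:.2%}")  # 99.78%
```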

Clearly, optimization is an iterative process. With this test we’ve shown that the gRPC API can be used as a replacement for CQL drivers. This is a huge benefit to developers who want to use Cassandra or Astra DB, even if their language of choice doesn’t have a supported or well-maintained driver. We will cover our planned testing of more demanding scenarios in future blog posts.

Follow the DataStax Tech Blog for more developer stories. Check out our YouTube channel for tutorials, and follow DataStax Developers on Twitter for the latest news about our developer community.

Resources

  1. Native driver alternatives using Stargate gRPC API in Java
  2. DataStax Astra DB
  3. Stargate — the open source data gateway
  4. Cassandra Query Language API (CQL)
  5. NoSQLBench from DataStax
  6. Stargate APIs
  7. NGINX
  8. Status codes and their use in gRPC
  9. Join our Discord: Fellowship of the (Cassandra) Rings
