EXPEDIA GROUP TECHNOLOGY — SOFTWARE
Improving First Input Delay by Leveraging gRPC
Improving core web vitals at Vrbo
For Vrbo landing pages, performance is always a priority. Whether it’s a real user finding a destination through a web search or a bot crawling web content, we need good response times to improve engagement and SEO.
Among the many calls needed to fetch the whole content of a landing page, the first one links a path to its destination identifier in the system, so its response blocks all the calls that depend on that identifier. How this first call performs is crucial for metrics like First Input Delay (part of Core Web Vitals), which measures the time from a user’s first interaction with a page to the moment the browser is able to process event handlers in response to that interaction. In the context of landing pages, improving the performance of this first call could directly improve First Input Delay.
With that in mind, we started thinking about adopting gRPC in our platform, beginning with that service. We built a gRPC server replicating our HTTP service, and we wanted to compare their performance to understand whether pushing for gRPC in our platform and sending more traffic to the gRPC version could really improve our performance.
Our metrics and Datadog dashboards indicated that gRPC performance was quite promising, so we wanted to run a load test stressing both options, HTTP and gRPC, with the same configuration and datasets to check their limits and differences.
Hypothesis
gRPC is designed for low-latency, high-throughput communication, which makes our service a perfect candidate to benefit from it.
We’ve implemented a gRPC server in our service, added metrics, and started serving production traffic controlled by an A/B test; now we want to compare HTTP and gRPC performance under the same configuration and input data.
We expect the test reports to show an improvement in both latency and throughput.
Application under test
The application under test has two operations: a lookup that returns destination data (identifiers and attributes) for a given path, and a reverse lookup that returns a path for a given identifier.
It gets its data from two sources:
- The primary one is RocksDB, populated with our paths Kafka topic.
- As a fallback if the request is not found in RocksDB, our Cassandra database.
Our HTTP service responds with a JSON payload, while gRPC returns its response as a binary object, which is usually smaller than a JSON document containing the same data. Considering the size of our JSON payload, we don’t think we’ll gain much from this alone, but operations with bigger payloads than ours could benefit more from it.
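To make the size difference concrete, here is a small standalone sketch comparing a JSON encoding of a hypothetical lookup response (the `destinationId`/`path` fields are illustrative, not our real schema) against a plain binary encoding, using `DataOutputStream` as a stand-in for Protocol Buffers:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PayloadSize {
    // JSON repeats field names and quoting in every message...
    static int jsonBytes(long id, String path) {
        String json = "{\"destinationId\":" + id + ",\"path\":\"" + path + "\"}";
        return json.getBytes(StandardCharsets.UTF_8).length;
    }

    // ...while a binary encoding (sketched here with DataOutputStream as a
    // stand-in for protobuf) carries only the values themselves.
    static int binaryBytes(long id, String path) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeLong(id);   // fixed 8 bytes, no field name
            data.writeUTF(path);  // 2-byte length prefix + UTF-8 bytes
            return out.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(jsonBytes(42L, "/spain/barcelona"));   // 46
        System.out.println(binaryBytes(42L, "/spain/barcelona")); // 26
    }
}
```

The gap stays modest for a payload this small, which is why we don’t expect serialization alone to move our numbers much.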
Application performance
For measuring latencies we use the p99 and p95 metrics. These are percentile values that indicate the upper threshold for the given percentage of calls; e.g. a p99 of 35 ms indicates that 99% of the calls take 35 ms or less. Both lookups were originally built as HTTP endpoints with quite good performance:
The gRPC server added to our application replicates the same endpoints as the HTTP version and receives 10% of the traffic originating from our frontend client, also with good latencies:
Although these latencies improve on the HTTP version, before simply increasing its traffic percentage we want to run a load test for both options using the same configuration and dataset.
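As a quick illustration of how these percentile figures are computed, here is a standalone nearest-rank sketch (not our actual metrics pipeline, and the sample values are made up):

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // 100 hypothetical response times: 95 fast calls, 5 slow outliers.
        long[] samples = new long[100];
        for (int i = 0; i < 95; i++) samples[i] = 10;
        for (int i = 95; i < 100; i++) samples[i] = 35;
        System.out.println(percentile(samples, 95)); // 10
        System.out.println(percentile(samples, 99)); // 35
    }
}
```

Note how a handful of slow outliers leaves p95 untouched but dominates p99, which is why we track both.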
Performance testing
We have a load test for our application integrated into our CI pipeline. This test runs two scenarios (one for each lookup) defined with a Taurus YAML template, and these templates are used in our pipeline to run the test in BlazeMeter.
Adding gRPC to our performance test
We want to duplicate our current scenarios in the existing load test so we can run the same load against the gRPC and HTTP versions at the same time. Our HTTP scenarios are defined using Taurus syntax, which allows using different engines (executors), including JMeter, the default one and the one we currently use for HTTP.
Using just Taurus syntax, you can define a scenario and the behavior to call HTTP endpoints.
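An HTTP scenario of this kind might look roughly like the following sketch (the endpoint URL, scenario name, and load figures are illustrative, not our real configuration):

```yaml
# Hypothetical Taurus scenario for the path-lookup HTTP endpoint
execution:
  - scenario: path-lookup
    concurrency: 50
    ramp-up: 1m
    hold-for: 10m

scenarios:
  path-lookup:
    requests:
      - url: http://localhost:8080/lookup?path=/spain/barcelona
        method: GET
```

Taurus translates this into a JMeter test plan behind the scenes, with no Java code required.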
So our first approach was to do something similar for gRPC. However, gRPC is not natively supported by JMeter or Taurus, so we can’t define a gRPC scenario using only a YAML template.
JMeter doesn’t support gRPC natively, but it allows creating custom plugins or samplers to add new functionality. We’ve implemented one of JMeter’s abstract sampler definitions (AbstractJavaSamplerClient in this case) to define how a call should work; in our case, for gRPC services, it lets us use a client based on our service definition. The resulting sampler class is referenced from a JMeter script (a JMX file using XML syntax) where the test behavior is defined. These JMX files can in turn be referenced from a Taurus template, so they can be executed in BlazeMeter.
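A sampler of this kind might be sketched roughly as follows. `LookupServiceGrpc` and `LookupRequest` are hypothetical stand-ins for the classes generated from our protobuf service definition; this version builds the channel in `setupTest`, which JMeter calls once per thread, a detail that turns out to matter below:

```java
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Sketch of a custom JMeter sampler wrapping a gRPC lookup call.
public class GrpcLookupSampler extends AbstractJavaSamplerClient {

    private ManagedChannel channel;
    private LookupServiceGrpc.LookupServiceBlockingStub stub;

    @Override
    public void setupTest(JavaSamplerContext context) {
        // Runs once per JMeter thread: build the channel and stub here
        // so every sample reuses the same HTTP/2 connection.
        channel = ManagedChannelBuilder
                .forAddress(context.getParameter("host"),
                            context.getIntParameter("port"))
                .usePlaintext()
                .build();
        stub = LookupServiceGrpc.newBlockingStub(channel);
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            stub.lookup(LookupRequest.newBuilder()
                    .setPath(context.getParameter("path"))
                    .build());
            result.sampleEnd();
            result.setSuccessful(true);
        } catch (Exception e) {
            result.sampleEnd();
            result.setSuccessful(false);
        }
        return result;
    }

    @Override
    public void teardownTest(JavaSamplerContext context) {
        channel.shutdownNow();
    }
}
```

This is a dependency-bound sketch (it needs the JMeter and grpc-java artifacts plus the generated stubs on the classpath), not a drop-in implementation.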
First iteration
Once we’d created these new samplers (one for each lookup operation), we included them in our existing Taurus template to run the test with the four scenarios and compare results.
This first test shows no difference between the gRPC and HTTP endpoints: they have almost the same p95 and p99 response-time latencies (both quite high considering our server latencies) and a similar hits-per-second distribution. Obviously, these were not the results we were expecting:
Both the p99 and p95 graphs show similar latencies for the HTTP and gRPC scenarios, moving around 7500 ms in the p99 graph (with peaks of 15000 ms) and around 1000 ms in the p95 graph (with peaks of 3000 ms).
The hits-per-second graph shows the four scenarios moving between 300 and 500 hits per second, without much difference between them.
The reason for these results is that we were creating a new gRPC client every time we called the service, so we weren’t getting the benefits of reusing the channel and of multiplexing multiple HTTP/2 calls over a single TCP connection. So we decided to change the sampler in the next iteration so that each thread reuses its client.
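The reuse pattern can be sketched with a thread-local client. For a runnable, self-contained example we use the JDK’s built-in `HttpClient` here; in the real sampler the reused objects are the gRPC `ManagedChannel` and its stub, but the shape is the same:

```java
import java.net.http.HttpClient;

public class PerThreadClient {
    // One client per worker thread, created lazily on first use and
    // then reused for every subsequent call on that thread.
    private static final ThreadLocal<HttpClient> CLIENT =
            ThreadLocal.withInitial(HttpClient::newHttpClient);

    public static HttpClient client() {
        return CLIENT.get();
    }

    public static void main(String[] args) {
        // The same thread always gets the same instance back, so the
        // underlying connections can be pooled and reused across calls.
        System.out.println(client() == client()); // true
    }
}
```

With this pattern the expensive connection setup happens once per thread instead of once per sample.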
Second iteration
After modifying our gRPC samplers to reuse the gRPC client per thread, we obtained a new BlazeMeter report; comparing response times and hits, we observe a huge improvement in the gRPC scenarios:
The gRPC response-time latencies show a p95 line moving between 10 and 12 ms, and p99 between 20 and 30 ms, closer to our expectations and a logical result considering our server latencies.
But the HTTP scenarios still show the same numbers as in the first iteration:
HTTP response times move around 1000 ms for p95 and 7500 ms for p99.
If we compare hits, the gRPC scenarios reach the top threshold set for the test (1500) while the HTTP scenarios move around 400.
This huge difference led us to think that the HTTP scenarios were creating a new client on each call, so comparing the two cases wasn’t fair and didn’t give us a real idea of their performance difference. For the next iteration we decided to add a new HTTP sampler that creates the client the same way we prepare it for gRPC, reused per thread. In theory, this lets us compare both types with their clients used the way we expect them to be used, and should show the performance improvement gRPC can add through how it manages connections and channels.
Third iteration
In this third iteration we use our gRPC samplers alongside new HTTP samplers that use http4s as the HTTP client; both types initialise their client once so each thread can reuse it. We’ve set a high top threshold that we don’t expect to reach (3000 hits per second) so we can check their limits, and as in the other iterations we’ve run the four scenarios together.
While gRPC still performs better, these results make more sense than the previous ones, and they give us a good idea of how much our performance could improve by using gRPC.
We observe that the p99 response time is still considerably better in the gRPC scenarios, moving around 50 ms for the whole test, while the HTTP scenarios reached 150 ms with peaks of 350–400 ms.
In the p95 graph we can see the gRPC scenarios moving around 35 ms and the HTTP scenarios around 50 ms.
Finally, comparing hits, we can see that gRPC was able to handle more traffic using the same configuration as HTTP.
Results
Conclusions and next steps
- gRPC equals HTTP in the worst scenario (not reusing the client per thread).
- gRPC was approximately 2.5 times faster than HTTP for p99 and approximately 1.5 times faster for p95.
- The gRPC scenarios achieved approximately 1.5 times higher throughput than the HTTP scenarios.
- This performance improvement could exceed what we gain from caching, so it offers a different way to improve our latency.
- One of Google’s Core Web Vitals metrics (adopted into their search algorithm) is First Input Delay, which measures the time from when a user first interacts with a page to the time when the browser is actually able to begin processing event handlers in response to that interaction. This application’s response is one of the first steps in loading our frontend client content, because it returns whether a path is valid and the identifier associated with it, so improving its response time could help our score.
- gRPC’s performance improvements come from its use of HTTP/2 and channel reuse, so they depend not only on how the server is built but also on the client. Situations where a client needs to call the server frequently benefit most, so a push for gRPC should be part of a shared decision to build applications in a way that lets them take advantage of it.