Benchmarking QUIC
Introduction
QUIC is an emerging transport protocol built on top of UDP to improve security and performance over TCP/HTTP2. Officially introduced to the public in 2013 in a blog post from Google, QUIC is currently used in production at Google, Facebook, and Uber.
The biggest difference between QUIC and TCP is that QUIC is implemented in user space whereas TCP is implemented in the kernel. The advantages of a user-space implementation are that developers can write or choose their own implementation and can quickly release fixes and improvements to their QUIC stack via normal application updates (e.g., iOS/Android app updates on the client side and package updates on the server side). This is possible in large part because most companies will only use their QUIC stack when communicating with their own services, in which case they own both the client and server implementations.
Side note. Interoperability between QUIC stacks is one of the most important tasks being worked on right now before QUIC becomes an RFC. Check out Robin Marx’s work to learn more.
On the other hand, one of the largest downsides to QUIC is the performance penalty of processing network packets in user space rather than in the kernel. Implementing a network stack in user space involves copying extra packet data from kernel memory to user memory, performing syscalls to write non-data packets such as ACKs and flow-control updates to a socket, and frequent context switching. Determining whether QUIC, through its careful design, can overcome these user-mode penalties is an important task for researchers, developers, and hobbyists.
Side note. Technically, there are ways to avoid kernel bottlenecks for processing UDP packets using XDP kernel-bypass or smart NICs, but this is really only applicable for internet giants such as Facebook and Google. I really don’t expect cloud providers such as AWS to be doing all these optimizations for your QUIC server running on some VM.
Benchmarking QUIC Clients
Preface
I actually attempted to benchmark QUIC servers before I moved the goalposts of my project to benchmarking QUIC clients. To make a long story short, the performance of Facebook, Chromium, and Cloudflare’s open source QUIC servers was abysmal compared to that of Apache HTTP2. This is because these QUIC server implementations are highly unoptimized for hosting web pages, since they only offer ‘toy’ HTTP server code in their open source repos. All was not lost, however, since I learned that there were production QUIC endpoints being hosted on Facebook’s CDN for testing purposes. This meant that I could more accurately benchmark the client side instead.
Setup
Given that I was using public endpoints, I had a limited set of URLs available to test. These included various web pages that displayed random text with sizes ranging from 0 Bytes to 10 MB. This meant that each webpage would only have 1 object which translates to a single QUIC stream.
The QUIC clients I used were (H2 = HTTP2 = TCP, H3 = HTTP3 = QUIC):
- Google Chrome Canary (H2 + H3)
- Curl (H2)
- Ngtcp2 (H3)
- Facebook Proxygen (H3)
I used Puppeteer to automate Chrome. Curl, Ngtcp2, and Proxygen are command line clients so I simply used Python to measure the time elapsed after triggering a subprocess running these respective clients.
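The command-line timing approach described above can be sketched roughly as follows. This is a minimal illustration rather than the script actually used for these benchmarks: the helper name `time_command` and the timeout value are assumptions, and the demo runs a trivial Python subprocess as a stand-in for a real curl, ngtcp2, or Proxygen invocation.

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s=60.0):
    """Run argv once and return (elapsed_seconds, return_code).

    Hypothetical helper: measures wall-clock time from just before the
    subprocess is spawned until it exits.
    """
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,  # discard the downloaded body
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,          # raises TimeoutExpired if exceeded
        check=False,                # report failures via the return code
    )
    return time.perf_counter() - start, completed.returncode


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the real
    # client invocation when benchmarking.
    elapsed, code = time_command([sys.executable, "-c", "pass"])
    print(f"exit={code} elapsed={elapsed:.3f}s")
```

Note that a measurement taken this way includes process startup and teardown, so it slightly overstates the pure network time, by roughly the same amount for every command-line client.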
As for Firefox, I ended up removing it from the client list after I experienced various issues with it. I am considering writing an article in the future detailing the various experiences I’ve had with Firefox HTTP3 this summer.
A Chrome idiosyncrasy. During benchmarking, I discovered that Chrome would take a very long time to load webpages with data > 5 MB. After some tinkering, I found out that this performance issue only occurred when loading large webpages in the foreground rather than in the background (another tab).
So what exactly is causing this behavior that is unique to Chrome? Observing TCP packet traffic in both scenarios suggests that the root cause is the change in the flow-control window size during the request.
What’s most likely happening is that when loading a page in the foreground, Chrome reads a large chunk of data from the underlying socket and proceeds to render that data. While rendering, Chrome is no longer reading from the socket, so the window size decreases as the socket buffer fills up with data. Only when Chrome has finished rendering a section of the page will it read from the socket again, which causes the window size to increase. This phenomenon occurs with both HTTP2 (TCP) and HTTP3 (QUIC), so it seems Chrome has ported this ‘partial read-render’ behavior to its QUIC stack. Remember that UDP has no built-in flow control, so the application’s QUIC stack is responsible for flow control and thus for the type of behavior described above.
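The read-then-render pattern described above can be illustrated with a toy model of a receive buffer. This is only a sketch under simplifying assumptions: the function name, the buffer size, and the per-step arrival and read amounts are all made up for illustration, and it does not reflect Chrome’s actual implementation.

```python
def advertised_windows(buffer_bytes, arrivals, reads):
    """Toy receive-window model (hypothetical, not Chrome's code).

    For each time step, `arrivals` bytes reach the receiver and the
    application reads `reads` bytes. The advertised window is the free
    space left in the receive buffer after the step.
    """
    buffered = 0
    windows = []
    for arrived, read in zip(arrivals, reads):
        # The sender cannot exceed the free space, so cap at the buffer size.
        buffered = min(buffer_bytes, buffered + arrived)
        # The application drains at most what is currently buffered.
        buffered -= min(buffered, read)
        windows.append(buffer_bytes - buffered)
    return windows


# The application stops reading for three steps (busy rendering),
# then drains the whole buffer in the fourth step.
print(advertised_windows(100, arrivals=[40, 40, 40, 40], reads=[0, 0, 0, 100]))
```

In this model the window shrinks while the application is busy and jumps back to the full buffer size once it reads again, which matches the sawtooth pattern described in the text.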
Network Simulation
Users will not always have high-speed, reliable internet, so it’s important to benchmark scenarios with varying levels of bandwidth, packet loss, and delay. On Linux, the most frequently used network simulation tool is tc-netem. There are some OSX equivalents such as pfctl+dnctl, but the easiest network simulation tool to use on a Mac is by far Network Link Conditioner. While Network Link Conditioner does not provide the granularity of control that tc-netem or pfctl+dnctl do, as it applies its rules to all IP packets that go through the Wi-Fi interface, it’s simple to use and provides the features I needed for benchmarking. If one wanted to throttle bandwidth or incur loss on packets with specific protocols, hosts, ports, etc., then it would be necessary to use a lower-level tool.
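For readers on Linux, a tc-netem rule covering the same three knobs (delay, loss, and bandwidth) could be assembled roughly like this. I used Network Link Conditioner for the results in this post, so this is only a sketch: the helper name `netem_command` and the interface name `eth0` are assumptions, and the resulting command needs root privileges to run.

```python
def netem_command(interface, delay_ms=0, loss_pct=0, rate_mbit=None):
    """Build an argv list for a tc-netem rule (hypothetical helper).

    The rule is attached to the root qdisc of `interface`, so it only
    shapes traffic leaving that interface.
    """
    argv = ["tc", "qdisc", "replace", "dev", interface, "root", "netem"]
    if delay_ms:
        argv += ["delay", f"{delay_ms}ms"]
    if loss_pct:
        argv += ["loss", f"{loss_pct}%"]
    if rate_mbit is not None:
        argv += ["rate", f"{rate_mbit}mbit"]
    return argv


# 100 ms one-way delay, 5% random loss, 10 Mbit/s rate limit.
# "eth0" is an assumed interface name; check yours with `ip link`.
print(" ".join(netem_command("eth0", delay_ms=100, loss_pct=5, rate_mbit=10)))
```

Because a root netem qdisc only affects egress, shaping the download direction on the client machine itself requires extra setup (for example redirecting ingress traffic through an ifb device) or applying the rule on a router in between.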
The network scenarios I tested were:
I actually tested more scenarios than the ones above but I found these scenarios to be generally representative of the overall results.
Results
Given the nature of benchmarking page-load times on the open internet with a 15 Mbps home internet connection, one is bound to encounter variance in network performance between iterations. One can offset this variability by running numerous iterations. As a result, I ran 10 to 20 iterations for each client. When testing probabilistic network conditions, such as random packet loss, I increased the number of iterations.
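Summarizing the iterations for one client and endpoint might look something like the sketch below. The helper name `summarize` is an assumption, and the sample values in the demo are made up purely to show the shape of the output; they are not measurements from this benchmark.

```python
import statistics


def summarize(samples_s):
    """Return basic statistics for a list of page-load times in seconds."""
    return {
        "n": len(samples_s),
        "mean": statistics.mean(samples_s),
        "median": statistics.median(samples_s),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples_s) if len(samples_s) > 1 else 0.0,
    }


# Made-up load times, for illustration only.
print(summarize([1.0, 2.0, 3.0]))
```

Reporting the median alongside the mean is a hedge against the occasional outlier iteration, which becomes more likely once random loss is introduced.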
Bandwidth
When limiting bandwidth to 10 Mbps, the performance of the various QUIC clients was generally equal to the performance of the TCP clients. Below are graphs showing results for 10 Mbps bandwidth:
At a glance, it is clear that the discrepancy in performance between QUIC and TCP for webpages ≥ 1 MB is negligible. Even in the small endpoints graph, Chrome’s QUIC and TCP performance are practically the same. What’s interesting, however, is Chrome’s clearly better performance for small endpoints.
This discrepancy is caused by the browser’s tendency to reuse the same TCP or QUIC connection even when ‘refreshing’ the page. As a result, on non-initial page requests, Chrome does not have to undergo a network handshake which saves it ~50ms when averaging the data out.
Overall, when bandwidth is limited to 10 Mbps, the performance of QUIC is equal to that of TCP under ‘ideal’ network conditions where there is negligible loss and delay.
It’s important to note, however, that this equality in performance may not hold true under high-bandwidth scenarios (e.g., intra-datacenter bandwidth where > 1 Gbps is common). 10 Mbps can be easily handled by today’s CPUs in terms of processing packet data and copying data to user space. As a result, we may not actually be pushing these TCP or QUIC stacks to their capacity when benchmarking at relatively low network bandwidth. When network bandwidth is at intra-datacenter levels, the kernel plays a much larger role in maintaining high throughput and low latency since it must handle 100–1000x more packets per second.
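To put rough numbers on the packet-rate argument above, here is a back-of-the-envelope calculation. It assumes full-size 1500-byte packets (a typical Ethernet MTU), which is my assumption and not something measured in these benchmarks; real traffic mixes packet sizes, so treat the figures as orders of magnitude.

```python
def packets_per_second(bandwidth_bps, packet_bytes=1500):
    """Approximate packet rate needed to saturate a link.

    Assumes every packet is `packet_bytes` long (1500 bytes is an
    assumed, typical Ethernet MTU).
    """
    return bandwidth_bps / (packet_bytes * 8)


for label, bps in [("10 Mbps", 10e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    print(f"{label:>8}: about {packets_per_second(bps):,.0f} packets/s")
```

Under that assumption, 10 Mbps works out to fewer than a thousand packets per second, while 1 to 10 Gbps links require 100 to 1000 times that rate, which is where per-packet costs such as syscalls and copies start to matter.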
Delay
Below are graphs showing results for 0% loss, 200 ms RTT delay (100 ms on downlink and uplink), and 10 Mbps bandwidth. I ran 20 iterations for each case.
The graphs above show that adding delay does not introduce any new discrepancies between QUIC and TCP.
Loss
Given the nature of discarding packets using a uniform distribution, the results will be more variable and thus harder to interpret. To slightly offset this, I ran 50 iterations for each endpoint. Below are graphs for 5% loss, 0 ms added delay, and 10 Mbps bandwidth.
When examining the individual data points for ngtcp2, there were a couple of times where it took 30 seconds to finish loading speedtest-0B (0 bytes). The 30-second value is caused by ngtcp2’s default connection timeout. What’s most likely happening is a sequence of ‘lost’ packets that prevents the connection from progressing. To confirm this, I examined the qlog of a connection that hit the 30-second timeout limit. This is what it showed:
It seems that ngtcp2’s large variation is mainly caused by the client’s exponential backoff when resending handshake packets.
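To see why exponential backoff can stretch a stalled handshake so far, here is a small sketch of a doubling retransmission schedule. The 1-second initial timeout and the doubling factor are hypothetical values picked for illustration; they are not taken from ngtcp2’s source or from the qlog.

```python
def retransmit_times(initial_timeout_s, deadline_s, factor=2.0):
    """Times (seconds after the first send) at which a packet is resent.

    The wait before each retransmission grows by `factor`, and resending
    stops once the connection deadline is reached.
    """
    times = []
    timeout = initial_timeout_s
    elapsed = timeout
    while elapsed < deadline_s:
        times.append(elapsed)
        timeout *= factor
        elapsed += timeout
    return times


# Hypothetical 1 s initial timeout against a 30 s connection timeout.
print(retransmit_times(1.0, 30.0))
```

With these assumed values, only four retransmissions fit inside the 30-second window, and the gaps between them grow quickly, so a handshake that never receives a reply spends most of that time simply waiting.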
EDIT: As of 7/23/20, after discussing these results with some folks at Facebook and ngtcp2, it appears the issue was that the Facebook servers were not resending handshake packets to the ngtcp2 client. After this issue was fixed, I obtained these results:
Once again, QUIC can achieve performance equal to that of TCP in the face of loss.
Loss + Delay
Now this is where things get interesting. Below are graphs showing results for 10 Mbps bandwidth, 5% loss, and 200 ms added RTT delay. Since we are dealing with random loss again, I increased the number of iterations for each endpoint to 40.
When compounding random packet loss with delay, QUIC performs better than TCP as the size of the requested webpage increases. What I find most interesting about the above graphs is the poor performance of Chrome H2 and the minimal standard deviation for Proxygen and Chrome H3 when dealing with a large RTT and random loss. These phenomena are definitely worth examining in the future.
Conclusion
From the results I collected, the performance of QUIC clients is equal to or better than the performance of TCP clients when it comes to requesting single-stream resources on limited bandwidth. In the real world, however, network connections rarely consist of only a single stream and can be on network links with gigabit bandwidth. Therefore, it’s important in the future to benchmark QUIC performance using ultra-high bandwidth links and to test multiple-stream resources on production servers.
One might ask, if QUIC’s performance is equal to that of TCP, why bother migrating? This is a great question and one that many organizations will face in the coming years. So far, Facebook, Google, and Uber have shown that using QUIC can greatly improve tail latency and p99 performance on the open internet. My prediction is that QUIC will see mainstream adoption for internet traffic after it becomes an RFC. However, when it comes to non-internet traffic (e.g., intra-datacenter traffic, communication between microservices) where bandwidth is often > 1 Gbps, TCP will still be widely used for the coming decade due to its kernel advantage.