Benchmarking QUIC

Summer 2020

Alex Yu
Jul 20, 2020

Introduction

QUIC is an emerging transport protocol built on top of UDP to improve security and performance over TCP/HTTP2. Officially introduced to the public in 2013 by this blog post from Google, QUIC is currently used in production by Google, Facebook, and Uber.

A general overview of the differences between QUIC and TCP. It’s important to note that QUIC is built on top of UDP, which means it can be implemented in user space as well as in the kernel. The image is slightly outdated, since TLS 1.3 is now being adopted for both TCP and QUIC. (Image from the Uber blog post linked above.)

The biggest difference between QUIC and TCP is that QUIC is implemented in user space whereas TCP is implemented in the kernel. The advantages of QUIC living in user space are that developers can write or choose their own implementation and can quickly release fixes and improvements to their QUIC stack via normal application updates (e.g. iOS/Android app updates on the client side and package updates on the server side). This is possible in large part because most companies only use their QUIC stack when communicating with their own services, in which case they own both the client and server implementations.

Side note. Interoperability between QUIC stacks is one of the most important tasks being worked on right now before QUIC becomes an RFC. Check out Robin Marx’s work to learn more.

On the other hand, one of the largest downsides to QUIC is the performance penalty of processing network packets in user space rather than in the kernel. To expand further, implementing a network stack in user space involves copying extra packet data from kernel memory to user memory, performing syscalls to write to a socket for non-data related packets such as ACKing and flow control, and frequent context-switching. Determining whether QUIC, through its careful design, can overcome these user-mode penalties is an important task for researchers, developers, and hobbyists.

Side note. Technically, there are ways to avoid kernel bottlenecks for processing UDP packets using XDP kernel-bypass or smart NICs, but this is really only applicable for internet giants such as Facebook and Google. I really don’t expect cloud providers such as AWS to be doing all these optimizations for your QUIC server running on some VM.

This is a basic overview of how a TCP packet traverses a Linux machine. For packets without any application data (e.g. SYN, empty ACKs, CLOSE, WND_UPDATE, etc.), the kernel does not need to copy data from kernel socket buffers to protocol receive buffers. Since QUIC is built on top of UDP, all packets are copied to user space; as a result, QUIC requires more (expensive) data copying. Here is an interesting video on Facebook’s experience with connection-establishment performance for QUIC versus TCP in production.

Benchmarking QUIC Clients

Preface

I actually attempted to benchmark QUIC servers before I moved the goalposts of my project to benchmarking QUIC clients. To make a long story short, the performance of Facebook’s, Chromium’s, and Cloudflare’s open-source QUIC servers was abysmal compared to that of Apache’s HTTP2. These QUIC server implementations are highly unoptimized for hosting web pages, since they only offer ‘toy’ HTTP server code in their open-source repos. All was not lost, however, since I learned that there were production QUIC endpoints being hosted on Facebook’s CDN for testing purposes. This meant that I could more accurately benchmark the client side instead.

Setup

Given that I was using public endpoints, I had a limited set of URLs available to test. These included various web pages that displayed random text, with sizes ranging from 0 bytes to 10 MB. Each webpage contained only one object, which translates to a single QUIC stream.

The QUIC clients I used were (H2 = HTTP2 = TCP, H3 = HTTP3 = QUIC):

  • Google Chrome Canary (H2 + H3)
  • Curl (H2)
  • Ngtcp2 (H3)
  • Facebook Proxygen (H3)

I used Puppeteer to automate Chrome. Curl, Ngtcp2, and Proxygen are command line clients so I simply used Python to measure the time elapsed after triggering a subprocess running these respective clients.

As for Firefox, I ended up dropping it from the client list after experiencing various issues with it. I am considering writing a future article detailing the various experiences I’ve had with Firefox HTTP3 this summer.

A Chrome idiosyncrasy. During benchmarking, I discovered that Chrome would take a very long time to load webpages with data > 5 MB. After some tinkering, I found out that this performance issue only occurred when loading large webpages in the foreground rather than in the background (another tab).

Loading a 10 MB page in the foreground
Loading a 10 MB page in the background. Crazy difference! Also note the protocol being used is h2, so this ‘feature’ is applicable regardless of the network protocol used.

So what exactly is causing this behavior that is unique to Chrome? By observing TCP packet traffic during both scenarios, the root cause of such behavior can be attributed to the change in network flow control (window size) during the request.

TCP Window Size over time when loading 10MB web page in Chrome foreground
TCP Window Size over time when loading 10MB web page in Chrome background. Note the x-axis scale difference!

What’s most likely happening is that when loading a page in the foreground, Chrome reads a large chunk of data from the underlying socket and proceeds to render that data. While rendering, Chrome is no longer reading from the socket, so the window size decreases as the socket buffer fills up with data. Only when Chrome finishes rendering a section of the page will it read from the socket again, which causes the increase in window size. This phenomenon occurs with either HTTP2 (TCP) or HTTP3 (QUIC), so it seems Chrome has ported this ‘partial read-render’ behavior to its QUIC stack. Remember that UDP has no built-in flow control, so the application’s QUIC stack is responsible for flow control and thus for the type of behavior shown above.

Network Simulation

Users will not always have high-speed, reliable internet, so it’s important to benchmark scenarios with varying levels of bandwidth, packet loss, and delay. On Linux, the most frequently used network simulation tool is tc-netem. There are some OSX equivalents, such as pfctl+dnctl, but the easiest network simulation tool to use on a Mac is by far Network Link Conditioner. While Network Link Conditioner does not provide the granularity of control that tc-netem or pfctl+dnctl provide, as it applies its rules to all IP packets that go through the Wi-Fi interface, it’s simple to use and provides the features I needed for benchmarking. If one wanted to throttle bandwidth or incur loss only on packets with specific protocols, hosts, ports, etc., then it would be necessary to use a lower-level tool.
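I used Network Link Conditioner, but for reference, the equivalent tc-netem rule on Linux could be built from Python along these lines. The interface name "eth0" is an assumption, and actually applying the rule requires root on a Linux machine:

```python
import subprocess

def netem_command(iface, loss_pct=0, delay_ms=0, rate_mbit=None):
    """Build a `tc qdisc` command that adds netem impairments to iface."""
    cmd = ["tc", "qdisc", "add", "dev", iface, "root", "netem"]
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]       # uniform random packet loss
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]     # one-way added delay
    if rate_mbit:
        cmd += ["rate", f"{rate_mbit}mbit"]   # bandwidth cap
    return cmd

# e.g. 5% loss, 100 ms one-way delay, 10 mbps cap (requires root):
# subprocess.run(netem_command("eth0", 5, 100, 10), check=True)
```

Deleting the qdisc afterwards (`tc qdisc del dev eth0 root`) restores the interface to normal.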

The network scenarios I tested were:

I actually tested more scenarios than the ones above but I found these scenarios to be generally representative of the overall results.

Results

Given the nature of benchmarking page-load times on the open internet with a 15 mbps home internet connection, one is bound to encounter variance in network performance between iterations. This variability can be offset by running numerous iterations, so I ran 10–20 iterations for each client. When testing probabilistic network conditions, such as random packet loss, I increased the number of iterations.
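Collapsing the per-iteration timings into the mean and standard deviation shown in the graphs is straightforward with Python’s standard library:

```python
import statistics

def summarize(samples):
    """Reduce per-iteration page-load times (seconds) to (mean, std dev)."""
    return statistics.mean(samples), statistics.stdev(samples)

# e.g. hypothetical load times from 5 iterations:
# mean_s, std_s = summarize([1.21, 1.18, 1.35, 1.22, 1.27])
```

The mean becomes a dot on the graphs below and the standard deviation becomes the error arrows.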

Bandwidth

When limiting bandwidth to 10 mbps, the performance of the various QUIC clients was generally equal to that of the TCP clients. Below are graphs showing results at 10 mbps bandwidth:

Graph showing page-load times with 0% loss, 0 ms added delay, and 10 mbps bandwidth. The Y-axis represents time elapsed; the X-axis represents the distinct server endpoints tested. Each dot on the graph represents the mean page-load time over ~10–20 iterations, and the arrows represent the std. dev. The dotted line is purely for visual purposes. Firefox is left out for being difficult to work with.
Same network conditions as the graph above. The only difference is the endpoints shown on the X-axis, which are much larger than those in the above graph.

At a glance, it is clear that the discrepancy in performance between QUIC and TCP when dealing with webpages ≥ 1 MB is negligible. Even when looking at the small-endpoints graph, Chrome’s QUIC and TCP performance are practically the same. What’s interesting, however, is Chrome’s clearly better performance for small endpoints.

This discrepancy is caused by the browser’s tendency to reuse the same TCP or QUIC connection even when ‘refreshing’ the page. As a result, on non-initial page requests, Chrome does not have to undergo a network handshake which saves it ~50ms when averaging the data out.

Wireshark capture of QUIC packets for 5 consecutive requests to speedtest-0B using Chrome. As you can see, the QUIC handshake is only performed once at the top. Also the same DCID (Destination Connection ID) is used for all requests. The DCID is intended to be a random sequence of bytes for each QUIC connection.
Wireshark capture of QUIC packets for 5 consecutive requests to speedtest-0B using ngtcp2. You can see the QUIC handshake being performed multiple times (once for each request). Also, different DCIDs are used throughout the packet capture, which indicates separate connections.
Wireshark capture of QUIC packets for 5 consecutive requests to speedtest-0B using Proxygen. Again, we see multiple sets of Initial packets and different DCIDs in the capture, which indicates multiple established connections.

Overall, when bandwidth is limited to 10 mbps, QUIC’s performance is equal to TCP’s under ‘ideal’ network conditions with negligible loss and delay.

It’s important to note, however, that this equality in performance may not hold under high-bandwidth scenarios (i.e. intra-datacenter links where > 1 gbps is common). 10 mbps can be easily handled by today’s CPUs in terms of processing packet data and copying it to user space. As a result, we may not actually be pushing these TCP or QUIC stacks to their capacity when benchmarking at relatively low network bandwidth. When network bandwidth is at intra-datacenter levels, the kernel plays a much larger role in maintaining high throughput and low latency, since it must handle 100–1000x more packets per second.

Delay

Below are graphs showing results for 0% loss, 200 ms RTT delay (100 ms each on the downlink and uplink), and 10 mbps bandwidth. I ran 20 iterations for each case.

As stated before, Chrome has an advantage over the command-line clients since it reuses the same connection. This explains why Chrome performs considerably better on small webpages than the other clients. Given that the initial handshake for QUIC or TCP takes even longer when delay is introduced, the gaps shown in this graph are justified.
We can see the effects of the handshake delay for large web pages too, albeit less pronounced due to the scale of the graph.

From the graphs shown above, the addition of delay does not introduce any new discrepancies between QUIC and TCP.

Loss

Given the nature of discarding packets using a uniform distribution, the results will be more variable and thus harder to interpret. To slightly offset this, I ran 50 iterations for each endpoint. Below are graphs for 5% loss, 0 ms added delay, and 10 mbps bandwidth.

A couple of interesting things stand out in this graph. First, ngtcp2 has an absurd amount of variation; its std dev arrows don’t even fit on the graph. Curl’s performance is also inconsistent. Lastly, Proxygen seems to catch up to Chrome in the face of loss. It might be worth examining Proxygen’s loss-handling logic in the future to see how it matches Chrome’s performance despite undergoing a handshake each iteration.

When examining the individual data points for ngtcp2, there were a couple of times where it took 30 seconds to finish loading speedtest-0B (0 bytes). The 30-second value comes from ngtcp2’s default connection timeout. What’s most likely happening is a sequence of ‘lost’ packets that prevents the connection from progressing. To confirm this, I examined the qlog of a connection that hit the 30-second timeout limit. This is what it showed:

The visualization above shows the sequence of packets between the ngtcp2 client (left) and the Facebook CDN server (right). We can see that ngtcp2 undergoes exponential backoff when resending the handshake packet. In this case, the client did not receive an ACK from the server for any of its 8 handshake attempts. As a result, the default timeout was hit and the connection aborted.

It seems that ngtcp2’s large variation is mainly caused by the client’s exponential backoff when resending handshake packets.
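To illustrate, a doubling retransmission schedule like the one visible in the qlog can be sketched as below. The initial timeout value is a made-up default for illustration; ngtcp2’s real timers follow QUIC’s probe-timeout rules:

```python
def backoff_schedule(initial_s=1.0, attempts=8, cap_s=None):
    """Return the wait before each retransmission, doubling every attempt."""
    delays = []
    d = initial_s
    for _ in range(attempts):
        delays.append(d if cap_s is None else min(d, cap_s))
        d *= 2  # exponential backoff: each wait is twice the previous one
    return delays

# backoff_schedule(1.0, 4) -> [1.0, 2.0, 4.0, 8.0]
```

With 8 unacknowledged attempts, the later waits dominate the total time, which is why a connection can sit nearly idle until the 30-second timeout fires.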

EDIT: As of 7/23/20, after discussing these results with some folks at Facebook and ngtcp2, it appears the issue was that Facebook’s servers were not resending handshake packets back to the ngtcp2 client. After this issue was fixed, I obtained these results:

The std dev may look just as large as the previous graph, but notice the Y-axis scale difference. Performance is much more consistent here.
In this graph, Curl continues to perform worse. The rest of the implementations perform equally.

Once again, QUIC achieves performance equal to TCP’s in the face of loss.

Loss + Delay

Now this is where things get interesting. Below are graphs showing results for 10 mbps bandwidth, 5% loss, and 200 ms added RTT delay. Since we are dealing with random loss again, I increased the number of iterations for each endpoint to 40.

Curl continues to perform poorly in the face of loss. This is the first time we see Chrome H2 perform poorly relative to other clients.
We finally see an instance where all QUIC implementations clearly perform better than any TCP implementation!

When compounding random packet loss with delay, QUIC performs better than TCP as the size of the requested webpage increases. What I find most interesting about the above graphs is the poor performance of Chrome H2 and the minimal std dev for Proxygen and Chrome H3 when dealing with a large RTT and random loss. These phenomena are definitely worth examining in the future.

Conclusion

From the results I collected, the performance of QUIC clients is equal to or better than that of TCP clients when requesting single-stream resources over limited bandwidth. In the real world, however, network connections rarely comprise only a single stream and can run over links with gigabit bandwidth. Therefore, it’s important in the future to benchmark QUIC performance on ultra-high-bandwidth links and to test multiple-stream resources on production servers.

One might ask: if QUIC’s performance is equal to that of TCP, why bother migrating? This is a great question and one that many organizations will face in the coming years. So far, Facebook, Google, and Uber have shown that using QUIC can greatly improve tail latency and p99 performance on the open internet. My prediction is that QUIC will see mainstream adoption for internet traffic after it becomes an RFC. However, when it comes to non-internet traffic (i.e. intra-datacenter traffic and communication between microservices) where bandwidth is often > 1 gbps, TCP will still be widely used for the coming decade due to its kernel advantage.
