Cloud Latency

I once wrote a popular article on Light Reading about whether latency considerations should affect the choice of data center location for cloud services. In writing that piece and responding to the large number of comments, it became clear that what I should have done first was write a tutorial on cloud latency considerations. Here it is.

Part 1. Laws of Physics

First, simple physics and even simpler mathematics. The speed of light in glass is two thirds the speed of light in a vacuum, so it travels through fiber at 125,000 miles/second. It also travels through fiber at 200,000 km/s because it doesn’t know which units you prefer. The circumference of the earth is about 25,000 miles or 40,000km. Divide the circumference by the speed, and you’ll end up with 200ms in units that we can all agree on.

So if it was just about the speed of light, the maximum ROUND TRIP anywhere on the planet would not much exceed 200ms (sure, cable paths are not straight, but there are very few populated areas exactly opposite each other either).

Part 2. The Human Animal

It happens that 200ms is also the maximum “ear-to-mouth” delay for “Very Satisfied” voice callers (according to ITU recommendation G.114 figure 1). So phone calls from Cape Town to Hawaii or Auckland to Madrid can sound just fine. Our planet is just the right size for voice calls, which has always struck me as a particularly odd coincidence.

When it comes to more typical cloud applications, most potential interactions are over much shorter distances. For a straight fiber cable, the light delay from Sydney to San Jose is 60ms (120ms round trip), while across the Atlantic (London to New York) is less than 30ms (60ms round trip).
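To put numbers on other routes, the same arithmetic is a one-liner. The sketch below is just the division above in Python; the distances are approximate great-circle figures chosen to match the examples in the text.

```python
# Back-of-the-envelope propagation delay through fiber, using the figures above.
SPEED_IN_FIBER_KM_PER_S = 200_000  # roughly two thirds of the speed of light in a vacuum

def one_way_delay_ms(distance_km: float) -> float:
    """One-way propagation delay over a straight fiber path, in milliseconds."""
    return distance_km / SPEED_IN_FIBER_KM_PER_S * 1_000

print(one_way_delay_ms(20_000))  # ~100 ms one-way to the antipode, so ~200 ms round trip
print(one_way_delay_ms(12_000))  # ~60 ms Sydney to San Jose (approximate distance)
print(one_way_delay_ms(5_600))   # ~28 ms London to New York (approximate distance)
```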

So what latency actually matters for interactive cloud services? Let’s first think about the latency in our own bodies for a bit of context: We’re all living in the past — about 80ms in the past, to be exact, which is the time it takes for our brains and nervous systems to synchronize stimuli arriving on different neural paths of different latencies. If you see a hand clap, you perceive the sound and sight at the same time even though the sound takes longer to arrive and to process. Your brain allows itself 80ms or so to reassemble events correctly. That’s why a synchronization delay between video and audio suddenly becomes annoying if it’s more than 80ms — your built-in sensory auto-correct flushes its proverbial buffer.

That doesn’t mean that you can’t perceive that there’s a gap between events that are 80ms apart — both events will be perceived 80ms late but they’re still 80ms apart. You can physically react on those kinds of timescales (think table-tennis), but at a cognitive level you certainly can’t tell the difference between a 100ms and a 200ms lag, and even page response times of up to 300ms just seem pretty much instant.

That provides a bit of perspective which helps simplify the very complex topic of latency. Ten milliseconds just doesn’t matter for interactive apps. So anything less we can pretty much ignore — CPE and network packet processing times (tens or hundreds of microseconds), packet latency due to serialization (1ms for a 1500 byte packet on a 10Mbps link), even the user-plane radio latency in LTE (less than 10ms assuming no radio congestion).

Part 3 — Protocols

TCP is still used for most things cloud, even streaming media. It is a protocol that is probably older than many of the people reading this article. It was first described four decades ago, became the sole ARPANET transmission protocol three decades ago, and first sat under HTTP to transport HTML-based web pages two decades ago. It has come a long way since then, but was not designed for a world where a single page is 1.2MB, and pulls in 80 resources from 30 domains, which is how Google characterize today’s typical page in describing their rationale for work on QUIC (a protocol workstream attempting to improve cloud latency).

The latency challenge with TCP is that a connection is established and acknowledged before the transmission of data starts — that’s an extra round trip. If the page uses encryption (https) then there’s another round trip for TLS (or another two if credentials are not cached). Only then do we start the round trip to request content and begin receiving it. So there are two (http) or three (https) round trips before data starts arriving for a single TCP connection. That’s if DNS and TLS credentials are cached and there are no redirects, which is true for many well-designed lightweight cloud apps (think Software-as-a-Service) but not the case for high-volume media-dense sites that require complex load-balancing and geographic distribution of content — they’ll require an extra round trip or two.
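To make that round-trip counting concrete, here is a minimal sketch in Python. The function name and the treatment of redirects are mine; it models the counting described above rather than any real handshake.

```python
def time_to_first_byte_ms(rtt_ms: float, https: bool = False,
                          tls_session_cached: bool = True,
                          redirects: int = 0) -> float:
    """Round trips before response data starts arriving on a fresh TCP
    connection, assuming DNS is already cached."""
    round_trips = 1                                    # TCP three-way handshake
    if https:
        round_trips += 1 if tls_session_cached else 2  # TLS handshake
    round_trips += 1                                   # HTTP request -> first byte of response
    round_trips += redirects                           # each redirect costs (at least) one more
    return round_trips * rtt_ms

print(time_to_first_byte_ms(150))              # 300 ms: plain http
print(time_to_first_byte_ms(150, https=True))  # 450 ms: https with cached credentials
```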

Connection-oriented protocols like TCP also introduce a special set of constraints related to the buffering of unacknowledged packets. I’ve kept this discussion of the “TCP Window” for a long separate section later in the post.

Part 4 — Page Structure

Here’s the most significant thing that people new to cloud latency often forget: We’re not dealing with a single TCP connection. Remember we just said 80 resources on the page? We need a TCP connection in support of each of those http/https requests (less any application-level multiplexing smarts performed by the browser, which are of minimal impact at present). Fortunately we don’t need to do those connections one after the other, but we can’t do them entirely in parallel either. First the basic HTML needs to get loaded so that the URIs for the other connections are known. Then the browser limits the number of concurrent connections per server because establishing more can decrease overall responsiveness. That limit is 2 to 8 depending on the browser, but typically 6 nowadays (for reference, see the detailed database at browserscope.org). So the high priority parts (CSS, key scripts, impact content) get loaded next in parallel. By now hopefully the user is feeling like they’re starting to get a response, and the rest of the content arrives in subsequent waves of connections. Not all cloud app pages are as complex as the one described above, but few would get away with less than three connection waves, and many popular web sites have much higher request counts.
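As a deliberately crude illustration of why fetches arrive in waves, the sketch below simply divides the resource count by the available parallelism. Real browsers reuse connections (keep-alive) and prioritize aggressively, so treat this as an upper bound on connection setup, not a prediction.

```python
import math

def fetch_waves(resources: int, per_host_limit: int = 6, hosts: int = 1) -> int:
    """Crude count of 'waves' of parallel fetches after the initial HTML,
    ignoring connection reuse, prioritization and pipelining."""
    return math.ceil(resources / (per_host_limit * hosts))

print(fetch_waves(80, hosts=1))  # 14 waves if everything came from a single server
print(fetch_waves(80, hosts=5))  # 3 waves spread across five domains
```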

Serious cloud applications will set static resources to “expire” only in the distant future (then change filenames if a change is to be rolled out). This means that infrequently changed elements like css and scripts are cached locally. This is the reason that scripts often represent a large percentage of the page size, but often a much smaller percentage of the response time. It is also common practice to “flush” the buffer on the server side so that the HTML <head> is sent while the rest of the HTML page is being prepared. This enables the CSS and javascript to be loaded in parallel to the rest of the HTML arriving.
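For illustration only, here is one common way to implement that far-future-expiry-plus-rename pattern. The helper name and the header values are my own choices, not anything prescribed by a particular framework.

```python
import hashlib

# Headers a server might attach to immutable static resources (values illustrative).
STATIC_CACHE_HEADERS = {"Cache-Control": "public, max-age=31536000, immutable"}  # ~1 year

def versioned_filename(filename: str, content: bytes) -> str:
    """Embed a short content hash in the filename so the URL (and therefore
    the cache entry) changes whenever the content changes."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"

print(versioned_filename("app.css", b"body { color: #333; }"))  # e.g. app.8f1c2a3b.css
```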

Part 5 — Serialization and Queuing

A bigger packet has a longer latency than a smaller packet because it takes longer to serialize (i.e. transmit as a series of clocked bits on a serial link). But even large packets (1500 bytes) add little round-trip latency — if the slowest link is 10Mbps downstream (and no one else is using it) a 1500 byte packet has a serialization delay of just over a millisecond. Serialization time matters a lot to overall page load time because for big pages there can be more than a thousand such packets, but its impact on round-trip latency is negligible.
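The serialization arithmetic is a one-liner; here is a minimal sketch using the same round numbers as above.

```python
def serialization_delay_ms(size_bytes: int, link_mbps: float) -> float:
    """Time to clock the given number of bytes onto a link of the given speed."""
    return size_bytes * 8 / (link_mbps * 1_000_000) * 1_000

print(serialization_delay_ms(1_500, 10))    # ~1.2 ms: one full-size packet on a 10 Mbps link
print(serialization_delay_ms(500_000, 10))  # ~400 ms: 500 kB of page content
```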

So what makes a big difference to round-trip latency? It’s not serialization, nor is it the packet processing time of CPE or network equipment (which add only fractions of milliseconds). What makes the biggest difference is queuing — a packet that sits in a router buffer somewhere while it waits for other packets that arrived before it (or that have higher priority) and need to use the same link for their next hop. The more complicated the route, the greater the likelihood of queuing, and the more hops of the route that are outside your control, the more likely it is that queuing becomes significant due to congestion. To manage that congestion and available queue space, routers may discard packets, which further increases latency, especially for our friend TCP, which must resend packets when acknowledgements are not received in time.

Ping is not an especially reliable tool for measuring latency since network equipment may prioritize it differently from other traffic (and this is inconsistent across vendors), but it does illustrate the variability of round-trip times for a single small packet. Pinging a major site in the same city might show a variation from, say, 5ms to 20ms, while a ping a quarter of the way around the planet will often be in the 100–400ms range. Assuming a good fixed broadband access connection, anything over and above the speed-of-light latency is mainly network queuing/congestion latency, and it matters because it can be several times the speed-of-light latency.

Part 6 — Mobile Matters

If you do a ping on WiFi from your smartphone to a well-connected server in the same city (like a peered content provider), then do the same ping with WiFi turned off on a modern cellular network, you’ll probably find a difference of 10–40 milliseconds (depending on whether it’s LTE or a 3G variant), and more if it’s a busy time for the radio network.

For LTE, the latency added to round trips should not exceed 10ms on uncongested cells (except for the first packet after the data connection has been idle, when re-establishing it can add up to 100ms). For earlier generations of radio technology, the added latency is more significant.

Part 7 — Endpoints and East-West Latency

Obviously if the computers at either end of a connection are overloaded that can add enormously to the latency. TCP packet and protocol handling is performed in the kernel on the web-server or offloaded to a network processor — meaning the latency of processing TCP packets is not significantly affected by how busy the web server is with other tasks. So if you do web page tests using a tool like webpagetest.org you might expect to see that the time taken to establish the initial TCP connection is roughly equal to the time between sending the http request and receiving the first byte of response (“time to first byte”). In practice there is often a significant difference, which is the application processing (as distinct from network processing) time on the server.
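A rough way to read that comparison off a waterfall is sketched below, treating the initial connect time as a stand-in for one network round trip. The function name and the example numbers are illustrative, not webpagetest’s actual output fields.

```python
def server_processing_ms(connect_ms: float, ttfb_ms: float) -> float:
    """Approximate application processing time on the server: the request-to-first-byte
    time minus one network round trip, using the TCP connect time as the RTT proxy."""
    return max(0.0, ttfb_ms - connect_ms)

print(server_processing_ms(connect_ms=150, ttfb_ms=160))  # ~10 ms: well-engineered app
print(server_processing_ms(connect_ms=150, ttfb_ms=550))  # ~400 ms: slow queries or east-west hops
```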

This server processing time may be long for several reasons. The task asked of the server may be genuinely time consuming, e.g. decryption or a set of complex queries. Or the web server or one of the supporting application or database servers may be overloaded. But what can have a much bigger impact on server response time than actual server load are the network considerations sitting behind the server. For example, it is reasonably common for one or more of the databases to be located in a different data center. A good example of this would be government or enterprise sites where the web front-end is located for good Internet peering but the databases remain in legacy data centers. The connection from client to server is sometimes called “North-South” and between servers (in one data center or several) is called “East-West”.

That said, well engineered applications running on well engineered infrastructure will start streaming their responses very quickly — including starting to stream the top of an HTML page before the rest of it has been computed. And developers will often use “pre-compute” to build personalized pages in advance of them being requested, with changes to the database just triggering recompute of the pages affected. This means that calls to remote databases are not required in order to compute the page.

Part 8 — Page Load Example

The term “Latency” in network-speak is usually defined as the time between sending a request and starting to get a response. As we’ve learnt, TCP requires at least two such round trips before real data starts flowing. If we take the case of a cloud application, that would be the time it takes for the HTML to start arriving — and it is important to understand that we don’t start loading anything else (cached or otherwise) until that HTML arrives and tells us what other elements need to be fetched. So the bulk of the data doesn’t start flowing until there have been four round trips.

Web pages vary dramatically in their design and size, but typically there’s less than 100kB for the HTML and CSS, around 400kB for scripts and then the rest is media. The main scripts and CSS are often cached, so now we’ve got to go fetch the media. First there’s a repeat performance of our two round trips (300ms) and then begins the heavy lifting of actual data transfer.

Let’s convert all that to actual response times, using a specific example. Let’s connect from a browser in San Jose on an uncongested fixed broadband connection (10Mbps downstream) to a 1MB page hosted in Sydney, that we’ve visited before. To start, we’ll assume no redirects, no encryption and no CDN or network caching. We’ll also assume no network congestion and a well-engineered application and server infrastructure.

We enter the URL and click. The DNS resolution time is negligible (cached) and we establish the TCP connection (one round trip) then send the HTTP request (another round trip). Each round trip is about 150ms (120ms light, 20ms queuing, 10ms server) so we’re at 300ms when the HTML starts arriving and the browser starts rendering — the CSS, key scripts and layout images are locally cached, and we have some text content in the HTML.

The HTML file might only be 20KB if scripts and style information are correctly externalized, so it arrives fast (less than 20ms of serialization delay) and the fetching of other elements begins. Let’s say we still have 50 elements and 500KB that are not cached locally and so need to come from Sydney. There’s two round trips before the first byte of those elements starts arriving (300ms), then the data transfer time. These fetches don’t necessarily occur in parallel because the browser limits the number of concurrent TCP connections to a particular host, but once the first wave of data starts arriving the limiting factor usually becomes the serialization delay in the last mile — we’ve got 500KB of data to transfer, which is going to take 400ms to arrive.

So from a user perspective, the page starts forming more or less instantly (300ms), then content images start rendering after about 700ms (four round trips plus serialization delay of the first image) and the page completes in a second (four round trips plus serialization of 500KB). If the page is much larger it will take longer, but as long as there is plenty for the user to focus on they won’t notice the fact that some content takes longer to arrive (especially if it’s off the bottom of the first page, or video content in a sidebar).
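Putting the example together as a toy timeline, under the same assumptions (150ms round trips, 10Mbps downstream, DNS/CSS/scripts cached, no congestion). This is a sketch of the arithmetic above, not a simulation.

```python
RTT_MS = 150           # 120 ms light + 20 ms queuing + 10 ms server
LINK_MBPS = 10

def transfer_ms(size_bytes: int) -> float:
    """Serialization time for a payload on the last-mile link."""
    return size_bytes * 8 / (LINK_MBPS * 1_000_000) * 1_000

html_starts   = 2 * RTT_MS                           # TCP connect + HTTP request
html_done     = html_starts + transfer_ms(20_000)    # ~20 kB of HTML
media_starts  = html_done + 2 * RTT_MS               # connect + request for uncached elements
page_complete = media_starts + transfer_ms(500_000)  # 500 kB of uncached content

print(html_starts)    # ~300 ms: the page starts forming
print(media_starts)   # ~620 ms: first bytes of images arrive (~700 ms once the first image is in)
print(page_complete)  # ~1020 ms: roughly a second for the whole page
```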

Part 9 — CDNs, Redirects and other complexities

We can shorten the latency (but not the serialization delay) by using CDNs and network caching to reduce the speed-of-light latency and the probability of hitting congestion. On the flip-side, redirects add a round-trip, encryption (SSL) adds a round trip (or two if credentials are not cached), LTE may require up to 100ms to establish the data connection if you’ve not been using it, and other mobile technologies add more latency.

And everything we’ve discussed becomes pretty irrelevant if radio or IP network congestion adds a few hundred milliseconds to each round trip, or if virtual servers are overloaded — neither of which are uncommon.

Part 10 — TCP Window Size

No discussion of cloud latency would be complete without mentioning this topic, even though it is much less of an issue than it used to be.

As with any connection-oriented protocol, the maximum throughput of a TCP connection is not just a function of the size of the pipe, but also of the ability of the end systems to buffer and process the amount of unacknowledged data that corresponds to keeping the pipe full.

The amount of buffer required to keep the pipe full is equal to the bandwidth*delay product. As an example, if there’s a one second round-trip delay between two systems then the sending system needs to store one second of data before it starts receiving acknowledgements. So on a 1Gbps link, that would require a buffer of one gigabit of data (125MB). TCP attempts to ensure that the rate at which each system sends data does not exceed what the receiving system can buffer, using the TCP Window Size mechanism.
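The bandwidth*delay arithmetic, as a quick sketch using the same figures:

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Unacknowledged data in flight needed to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

print(bandwidth_delay_product_bytes(1_000_000_000, 1.0))  # 125,000,000 bytes = 125 MB
print(bandwidth_delay_product_bytes(10_000_000, 0.2))     # 250,000 bytes on 10 Mbps at 200 ms RTT
```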

In some older operating systems there is a limit of 64k on the TCP Window Size, notably Windows XP — which as of writing still represents over a quarter of desktop users, even though it is no longer supported by Microsoft. To be specific, the issue is that these operating systems do not support the TCP Window Scaling option defined in RFC 1323 in 1992. This means that XP only has a two-byte (16-bit) window field, so it limits the window to 64kB rather than the 1GB protocol limit available when scaling is enabled (as it is in all other major operating systems, including Windows since Vista). You might be tempted to think that the 64k window on the client would be irrelevant for transactional scenarios since the data flow is asymmetric (the http requests are minuscule, so we care about the buffer on the server rather than the client). The problem is that if you don’t have the Window Scaling option enabled neither end can use it, so it’s a bigger deal than the OS just limiting the client-send window size to 64k. To be clear, having Window Scaling enabled does not mean the OS doesn’t limit TCP Window Size — in the case of Microsoft Windows Vista and later the default is to limit it to 16MB per TCP connection.

This was a bigger deal a couple of years ago when many enterprises still hadn’t migrated from XP, but it’s still an issue worth considering since XP represents over a quarter of the desktop market if you believe browser stats. That share will presumably now fall off more quickly, though it will still be propped up by the pirate market in parts of the world. Anyway, let’s see what happens if you have a 64k Window.

For interactive cloud services (which I keep coming back to since it is by far the main use case today) there is now a throughput limit of 64k per round trip (window size divided by RTT) for EACH TCP CONNECTION. This only becomes relevant (in terms of affecting overall performance) where the element being fetched is larger than the window size (64k), which is a small percentage of the elements on a typical page (especially if the main scripts and CSS are already cached locally, which is usually the case for frequently used applications).

But what if we are retrieving several large images all bigger than 64k — unusual, but a good illustration. As explained above (in part 4), these images are NOT fetched sequentially — they are fetched on parallel TCP connections. There’s typically a limit of six parallel TCP connections to a particular server (varies by browser, but converging on six for modern browsers). In a majority of pages the elements will get fetched from multiple different servers, but let’s assume the worst case where they’re all coming from one server. Then the throughput is limited to 64kB (512kbits)/200ms = 2.5Mbps x 6 connections = 15Mbps.

In other words, on a good broadband connection (say 10Mbps down, 1Mbps up) the limiting factor is the serialization rate of the broadband, even for an old-fashioned operating system. If you have a faster connection, the impact of using XP trans-ocean or even trans-continent could indeed be material to the experience for pages with many large elements.

On modern operating systems, the TCP stacks will negotiate a window of whatever is appropriate (up to at least 16MB by default on Windows, or up to 1GB if you wanted, though that would probably break other things and isn’t necessary). To put that in perspective — for six concurrent TCP connections on a 200ms RTT that’s 16MB x 8 (bits per byte) x 6 (concurrent connections) ÷ 200ms = nearly 4Gbps, i.e. not even close to a consideration for interactive apps.
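Both window-limited cases can be checked with the same formula (window divided by RTT, times the number of connections). The sketch below uses the same round numbers as the text.

```python
def window_limited_mbps(window_bytes: int, rtt_s: float, connections: int = 1) -> float:
    """Maximum throughput when the TCP window is the bottleneck: window/RTT per connection."""
    return window_bytes * 8 / rtt_s / 1_000_000 * connections

print(window_limited_mbps(64_000, 0.2, connections=6))      # ~15 Mbps: 64 kB window, six connections
print(window_limited_mbps(16_000_000, 0.2, connections=6))  # ~3,840 Mbps: 16 MB window, "nearly 4Gbps"
```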

The TCP Window and bandwidth*delay product are still an important consideration for continuous file transfers on very high bandwidth connections, for example between two data centers, but are hardly a consideration for interactive cloud services.

That’s all folks.

I hope you found the article useful. Please do post comments — both for errors/omissions and to let me know what other topics you’d like to see covered either in updates or separate posts.

About the author. Philip Carden is a founding partner of Number Eight Capital and a well-known figure in the telecommunications industry. He chaired two of the first telecoms industry conferences on customer experience, and is extensively published on that topic as well as security, telecommunications engineering and operations management. He was formerly the global head of the Consulting Services business division at Alcatel-Lucent.
