Web Performance: what you don’t know will hurt your customers

Siddharth Ram
The CTO’s toolbox
Jun 8, 2018

The impact of poor performance on business outcomes is startlingly consistent. Famously, Google increased the number of search results on each page from 10 to 30¹. Traffic and revenue dropped by 20%. Amazon estimated that a 100ms slowdown in performance causes a 1% revenue impact. If you are running an e-commerce site, 40% of customers will abandon the site after 3 seconds².

How do you ensure that you have the right performance built into your product? There are plenty of papers on the web that tell you about the engineering aspects of performance. What they miss is that performance thinking starts before the first line of code is written. This post covers key principles in the end-to-end story of great website performance.

Great Performance starts with great designs

Designers often don’t think about performance in their designs — that is just an engineering concern, so they think. It is most certainly not. A well-thought-out design takes performance considerations into account. Your customers may be around the world, with varying network latencies and CPUs. There is a strong correlation between JavaScript and the customer’s CPU: a large dose of JS will work fine on the standard developer machine but will suck on a CPU that is five years old, which is likely representative of many of your customers.

A great design team will take this into account and work with engineers to propose ‘light’ designs for network and CPU constrained customers, with an option to upgrade to the full experience.

Recommendation: Develop designs that will work for all customers. A key way to do this is shared later: adaptive designs that account for the network and CPU constraints your customers are likely to encounter. Flex your functionality and fix your performance budget, not the other way around.
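As a sketch of what “adaptive” can mean in practice, the snippet below picks a light or full experience based on what the browser reports. navigator.connection and navigator.hardwareConcurrency are real (if not universally supported) browser APIs; the two loader functions are hypothetical.

```js
// A minimal sketch of adaptive loading; fall back to the full
// experience when the APIs are unavailable.
const connection = navigator.connection || {};
const slowNetwork = ['slow-2g', '2g', '3g'].includes(connection.effectiveType);
const weakCpu = (navigator.hardwareConcurrency || 4) <= 2;

if (slowNetwork || weakCpu) {
  loadLightExperience(); // smaller bundle, fewer features, option to upgrade
} else {
  loadFullExperience();
}
```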

Measuring the 99th Percentile performance is key

Companies often measure the 50th and 90th percentile performance. The 50th percentile (median performance) is useless as a performance metric. Looking at median performance is equivalent to saying ‘we don’t really care about the bottom 50% of our queries/customers’.

At the 90th percentile, you still have a problem. Let’s assume that you have 10 million requests per day. The 90th percentile gives you no insight into 1M requests. How is that OK?

Performance tends to have an interesting characteristic. From the 50th percentile to the 90th percentile, it is typically easy to extrapolate: there is a linear relationship.

The linearity starts breaking down beyond the 90th percentile. By the 99th percentile, numbers have spiked significantly. By the 99.9th percentile, you are seeing a hockey stick of terror.
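You can see this shape in your own data with a tiny helper; a sketch, assuming you have raw latency samples in milliseconds:

```js
// Example samples (ms); in practice these come from your monitoring.
const latencies = [120, 130, 125, 140, 135, 128, 900, 132, 127, 2500];

// Compute the p-th percentile (nearest-rank method) of the samples.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Compare the shape: p50 vs p90 vs p99 vs p99.9.
[50, 90, 99, 99.9].forEach((p) => console.log(`p${p}:`, percentile(latencies, p)));
```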

The really bad news is that:

TP99 happens all the time

This is quite counter-intuitive. TP99 refers to the 99th percentile of performance. So how can it happen all the time? This is because of what I refer to as ‘The Tyranny of Microservices’. Even if each underlying microservice honors a single TP99 standard, collectively they will be unable to honor the SLA.

Let’s assume that user response is dependent on a single service (a monolith, for instance) and it has a 1% probability of responding outside your SLA. Then 1 in 100 requests is going to be impacted by poor system performance.

What happens if a page being rendered depends on 10 services with a 1% probability of responding outside SLA?

1 − (0.99)¹⁰ ≈ 9.6% of your requests will not respond in time.

Well, what if you actually were making 100 calls to services with 1% probability of violating SLA at TP99?

1 − (0.99)¹⁰⁰ ≈ 63% of your requests will violate your SLA.

At 160 calls, there is an 80% probability that a given request encountered a 99th percentile response.
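The arithmetic generalizes to any fan-out; here it is as a one-liner you can play with:

```js
// Probability that at least one of n calls lands in the slowest 1%
// (i.e., a TP99 response), assuming independent services.
const pSlaMiss = (n) => 1 - Math.pow(0.99, n);

console.log(pSlaMiss(10));  // ≈ 0.096 → ~9.6%
console.log(pSlaMiss(100)); // ≈ 0.634 → ~63%
console.log(pSlaMiss(160)); // ≈ 0.800 → ~80%
```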

Recommendation: Be wary of going too far with your microservices. Fine-grained services sound fantastic from a velocity and independence perspective but can seriously hurt your performance.

Think about the granularity of your calls. Making repeated calls for small amounts of data is bad from a resiliency and performance perspective.

In addition, world-class companies do not stop at TP99: they also monitor TP99.9, TP99.99, TP99.999. To solve for performance beyond the 99th percentile, you need to think outside the box. ‘The Tail at Scale’³ is an excellent read for patterns that will help solve for outliers.
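One such pattern from that paper is the hedged request: if the primary call has not answered within a chosen threshold, fire a backup request and take whichever returns first. A minimal sketch; the 50ms default is an assumption you would tune to something like your own TP95:

```js
// Hedged request: race the primary call against a delayed backup.
// This trades a small amount of extra load for a much shorter tail.
function hedgedFetch(url, hedgeAfterMs = 50) {
  const primary = fetch(url);
  const backup = new Promise((resolve, reject) => {
    // Only issue the second request if we are still waiting.
    setTimeout(() => fetch(url).then(resolve, reject), hedgeAfterMs);
  });
  return Promise.race([primary, backup]);
}
```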

Inconsistent performance happens consistently

Unexpected attributes of the customer environment can cause inconsistent behavior. Your customers’ networks can be inconsistent, especially on shared bandwidth like DSL. Overly aggressive microservice fan-out can result in TP99 behavior often, as explained in the previous section. Browser-cached assets can be slower than fetching from the network (yes, you read that right).

Customers typically work on multiple applications in parallel. This means that the OS time-shares across everything running locally, including the browser. We have learnt the hard way that cached assets in the browser can be abnormally slow to load due to the underlying scheduling algorithms in the OS, virus scanners, and other resource-intensive applications running in parallel. Actual faults in the underlying hardware can slow down disk accesses, something that happens intermittently⁴.

Network connectivity poses another problem. In our experience, users can face as much as 5% packet loss within a two-minute window on DSL connections. This is more common when the user is accessing web applications while mobile. All of this leads to underlying retries and suboptimal network performance.

Recommendation: A search for “chrome waiting for cache problem” will show a range of suggested solutions, including running the good old chkdsk. As application developers, implementing service workers that support a cache/network race strategy can alleviate the issue significantly⁵.
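A minimal service worker sketch of the cache/network race pattern⁵; note the cache lookup must reject on a miss, or an empty result would win the race:

```js
// sw.js — respond with whichever source answers first.
self.addEventListener('fetch', (event) => {
  event.respondWith(raceCacheAndNetwork(event.request));
});

async function raceCacheAndNetwork(request) {
  const fromCache = caches.match(request).then((response) => {
    if (!response) throw new Error('cache miss'); // let the network win
    return response;
  });
  const fromNetwork = fetch(request);
  // Promise.any settles on the first *fulfilled* promise, so a slow
  // disk loses to a fast network and vice versa.
  return Promise.any([fromCache, fromNetwork]);
}
```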

For network connectivity, something you have no control over, a few things can be done:

  • Push critical assets to users ahead of need via service workers and HTTP/2 server push⁶
  • Enable Brotli compression to reduce the time spent on the network⁷ (a server-side sketch follows this list)
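A sketch of Brotli in practice, assuming a Node origin compressing on the fly (in production you would typically pre-compress static assets instead):

```js
// Serve a JS bundle Brotli-compressed when the client supports it.
const http = require('http');
const fs = require('fs');
const zlib = require('zlib'); // Brotli support landed in Node 11.7

http.createServer((req, res) => {
  const acceptsBrotli = /\bbr\b/.test(req.headers['accept-encoding'] || '');
  res.setHeader('Content-Type', 'application/javascript');
  if (acceptsBrotli) {
    res.setHeader('Content-Encoding', 'br');
    fs.createReadStream('bundle.js').pipe(zlib.createBrotliCompress()).pipe(res);
  } else {
    fs.createReadStream('bundle.js').pipe(res); // uncompressed fallback
  }
}).listen(8080);
```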

Test your code on real world machines

Many of the problems above can be warded off by using hardware representative of what customers actually use. If you work with millions of customers, a substantial portion of them likely do not have the screaming-hot performance of your developer machine. Instead, they will have PCs running older OSes and hardware. The performance of a poorly designed page is likely to be significantly worse on a real-world computer.

Recommendation:

At Intuit, we have set up a real-world performance lab, consisting of machines scavenged off eBay. These are reflective of the actual machines used by customers across the world. This has helped us develop a sense of empathy in addition to understanding how code will actually perform. The Chrome developer tools⁸ are very useful both in simulating slow CPU/network conditions and in understanding performance. Intuit is also a big user of WebPageTest⁹, which helps us test under varying conditions; more on that in the next section.
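Where physical machines are not at hand, similar conditions can be approximated in automation. A sketch using Puppeteer and the DevTools protocol; the throttling numbers and URL are illustrative:

```js
// Simulate a slow CPU and a constrained network before measuring a page.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await client.send('Emulation.setCPUThrottlingRate', { rate: 4 }); // ~4x slowdown
  await client.send('Network.emulateNetworkConditions', {
    offline: false,
    latency: 150,                   // ms of added round-trip time
    downloadThroughput: 200 * 1024, // ~1.6 Mbps down
    uploadThroughput: 50 * 1024,    // ~0.4 Mbps up
  });
  await page.goto('https://staging.example.com'); // hypothetical URL
  await browser.close();
})();
```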

Testing Early

The best time to test for performance bugs is every time a PR is made. Every code check-in — front end and back end — results in a performance test job in our CI/CD pipeline. When a code change shows a degradation in performance, the change is quarantined until the results can be examined. This has resulted in us detecting many regressions and ensuring that customers never saw them. We make extensive use of WebPageTest for this validation. We cannot recommend it strongly enough.
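As a sketch of what such a gate can look like, using the public WebPageTest Node wrapper and a hypothetical budget (this is illustrative, not Intuit’s actual pipeline):

```js
// Fail the CI build when the tested page exceeds its performance budget.
const WebPageTest = require('webpagetest'); // npm: webpagetest

const wpt = new WebPageTest('www.webpagetest.org', process.env.WPT_API_KEY);
const BUDGET = { SpeedIndex: 3000, TTFB: 600 }; // ms, hypothetical numbers

wpt.runTest('https://staging.example.com/checkout', { pollResults: 5 }, (err, result) => {
  if (err) throw err;
  const median = result.data.median.firstView;
  const over = Object.keys(BUDGET).filter((m) => median[m] > BUDGET[m]);
  if (over.length) {
    console.error(`Budget exceeded: ${over.join(', ')}`);
    process.exit(1); // quarantine the change until it can be examined
  }
});
```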

Know your customer

Not everything is under your control. Your JavaScript is parsed on your customer’s machine. The specs of the customer’s machine have an impact on your performance.

Much of this can be measured. You have good insight into the customer’s hardware via browser APIs (crack open your developer console and check out navigator.hardwareConcurrency¹⁰, navigator.deviceMemory¹¹, and the Network Information API¹²). These will not always be available for customers using different browsers, given their current experimental status, but the data can be extrapolated based on known measurements.
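A sketch of collecting these hints with feature detection; the /perf-metrics endpoint is hypothetical:

```js
// Gather device/network hints where the browser exposes them.
function collectEnvironment() {
  const conn = navigator.connection;
  return {
    cores: navigator.hardwareConcurrency || null,    // logical CPU cores
    memoryGB: navigator.deviceMemory || null,        // approximate RAM, Chrome only
    effectiveType: conn ? conn.effectiveType : null, // e.g. '4g'
    rttMs: conn ? conn.rtt : null,                   // estimated round-trip time
  };
}

// Beacon the data home without blocking the page.
navigator.sendBeacon('/perf-metrics', JSON.stringify(collectEnvironment()));
```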

Recommendation: Measure information about your customers’ environments where possible. Over time, you can use this data to build a deep understanding of your customers’ machines and generalize from it. For example, customers in India often experience high latency, so the default experience served to customers in India can be a lighter one.

Measure the right thing

It is critical to start by measuring the customer experience, not what your servers are seeing. Between you and the customer are lots of devices: CPUs, networks, storage, all of which could cause the customer to have a poor experience. Real User Monitoring, i.e. measuring from the browser, is key.

This can be measured in several ways. There are a number of third-party tools that allow this measurement. Intuit measures it with a custom library that allows greater flexibility in filtering and changing metrics.

Measurement does not mean ‘what is my software performance’. It needs to be end to end. That means you understand DNS performance, TCP, SSL, and request/response processing. The W3C Navigation Timing diagram¹³ shows what you need to think about.

Intuit measures this data starting at fetchStart. The end event varies depending on the nature of the page: for some pages, it is loadEventEnd; for others, it is domInteractive.
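A minimal RUM sketch using the Navigation Timing API¹³, measuring from fetchStart as described; the /rum endpoint is hypothetical:

```js
// Compute key intervals relative to fetchStart and beacon them home.
window.addEventListener('load', () => {
  // Defer one tick so loadEventEnd has been populated.
  setTimeout(() => {
    const t = performance.timing;
    const metrics = {
      dns: t.domainLookupEnd - t.domainLookupStart,
      tcp: t.connectEnd - t.connectStart,
      ttfb: t.responseStart - t.fetchStart,
      domInteractive: t.domInteractive - t.fetchStart,
      loadEventEnd: t.loadEventEnd - t.fetchStart,
    };
    navigator.sendBeacon('/rum', JSON.stringify(metrics));
  }, 0);
});
```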

Recommendation: Understand your performance budget end to end. Measure it on all browsers. Understand what your budget is prior to the first byte being fetched. You might be surprised by how much of your budget is taken up by your CDN, DNS, TCP, SSL (and radio setup, if you are on mobile).

It is easy to plan for asset size. A simple spreadsheet will let you type in your latency, bandwidth, and a few other numbers, and it will tell you what your asset size should be to meet your goal.
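The same arithmetic fits in a few lines; all the numbers below are assumptions to replace with your own measurements:

```js
// Back-of-the-envelope asset budget.
const budgetMs = 3000;     // total experience budget (assumed)
const overheadMs = 800;    // DNS + TCP + SSL + first byte (assumed)
const downlinkKbps = 1600; // effective bandwidth for your customers (assumed)

const transferMs = budgetMs - overheadMs;
const budgetKB = (downlinkKbps / 8) * (transferMs / 1000);
console.log(`Compressed asset budget: ~${Math.round(budgetKB)} KB`); // ≈ 440 KB
```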

Statelessness allows web scale

Expressed differently: If you have stateful services, you will not be able to scale. If you are unable to scale, you will not have a good time.

Statelessness is a key component of great performance at scale. Statelessness means that you can keep scaling out to handle more traffic. If you have a stateful service, you will face performance bottlenecks during peak times. Your 99th percentile performance is tied to your behavior under stress.
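A sketch of what “stateless” means at the code level: session state lives in an external store so any instance behind the load balancer can serve any request. Redis, the host name, and the cookie scheme here are all illustrative choices.

```js
// Stateless HTTP service: no per-user data held in process memory.
const http = require('http');
const { createClient } = require('redis'); // npm: redis (v4 API)

const sessions = createClient({ url: 'redis://sessions.internal:6379' });

const server = http.createServer(async (req, res) => {
  const match = (req.headers.cookie || '').match(/sid=(\w+)/);
  const session = match ? await sessions.get(`session:${match[1]}`) : null;
  res.end(session ? `Welcome back, ${session}` : 'Hello, new visitor');
});

// v4 clients connect explicitly; serve only once the store is ready.
sessions.connect().then(() => server.listen(8080));
```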

The real cost of JS

Parsing JavaScript is expensive, and is very much a function of the CPU. The chart from Addy Osmani’s presentation at Chrome Dev Summit 2017¹⁴ shows the startling difference between an iPhone 8 processor at the top and an Asus Zenfone 5 at the bottom.

Recommendation:

Think hard about what the right JS size is to get the right user experience.
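One concrete lever is shipping less JS up front and lazy-loading the rest; a sketch using dynamic import(), where the module path and element id are hypothetical:

```js
// Keep the core bundle small; load heavy features only when needed,
// so low-end CPUs parse less JavaScript at startup.
document.getElementById('reports-tab').addEventListener('click', async () => {
  const { renderReports } = await import('./reports.js');
  renderReports();
});
```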

Take care of the fundamentals

It is surprising to see how often the fundamentals are not taken care of. Are all your assets on a CDN? If you are serving them out of your web tier, you are hurting performance. Do you share consistent CSS across the organization? When customers move from page to page, they should be able to reuse the same CSS. Do you have OCSP stapling enabled¹⁵? That helps save round trips from the client to the Certificate Authority. If you are using a cloud provider, do you have location affinity for customers?

In Conclusion

Performance is a fundamental quality attribute of your system. In a distributed system, performance needs to be thought through carefully. Microservices give you scale characteristics but require more careful thinking about performance.

Acknowledgements

The performance tiger team in the Small Business unit helped validate the content and, more importantly, helped put improvements in front of customers. Tapasvi Moturu reviewed and added content to the document.

References

1. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html
2. http://www.mcrinc.com/Documents/Newsletters/201110_why_web_performance_matters.pdf
3. https://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/abstract
4. http://bitvisuals.com/2017/05/19/fixed-chrome-waiting-cache-problem
5. https://developers.google.com/web/fundamentals/instant-and-offline/offline-cookbook/#cache-and-network-race
6. https://24ways.org/2016/http2-server-push-and-service-workers/
7. https://opensource.googleblog.com/2015/09/introducing-brotli-new-compression.html
8. https://developers.google.com/web/tools/chrome-devtools/
9. https://www.webpagetest.org/
10. https://www.chromestatus.com/feature/6248386202173440
11. https://developer.mozilla.org/en-US/docs/Web/API/Navigator/deviceMemory
12. https://developer.mozilla.org/en-US/docs/Web/API/Network_Information_API
13. https://www.w3.org/TR/navigation-timing/
14. https://www.youtube.com/watch?v=_srJ7eHS3IM
15. https://en.wikipedia.org/wiki/OCSP_stapling
