What does making software fast mean?

bkchung
Apr 13, 2015

A Performance Engineer’s perspective

Note that I used the word “fast” in the title. This is different from “s/fast/faster/g”. Faster means relatively fast; fast would mean, well, fast. Faster might seem easier to achieve than fast, because it already implies there is something to compare against, something to be faster “than”. And faster may indeed be easier than fast, but that doesn’t mean it’s easy. Now I’m playing with words. To be frank, the sentence itself doesn’t really matter. But there is a point to this.

If you know that the mobile phone you’re holding is, say, 3 years old (ancient in the mobile world) and you know how long it takes to do daily operations, changing to a new phone will probably be both “faster than the existing phone” and “perceived as fast”. If you had a six-month-old one and you’re switching to something brand new, the difference might not be noticeable, depending on what you do daily. You might already have felt that the existing phone was fast, yet the new phone could turn out to be “slower than the existing phone”. Or maybe, because it was “slower than”, you’d call it “slow”; who knows.

If you know who the Flash is, you know that he’s fast. But is he faster than the Reverse Flash, the guy in the title background image (image source)?

As you’ve guessed, being performant has at least two meanings in performance engineering: “being faster than” and “being perceived as fast”. The latter has the bigger scope, but the former is more actionable.

Faster vs Fast

In engineering terms, they are also two very different kinds of work. To know that something is perceived as fast, the element of “perceived” needs to be pinned down. Simply optimizing some code snippet won’t do it justice and might not even leave a dent in being performant. Reliability and scalability both affect perceived performance. Being resource-wise (or time-wise) feasible also belongs in that category of being perceived as fast: a calculation that takes months to yield a result needs to be optimized, or else re-evaluated as a non-solution. To know that something is faster, you need to find the right thing to compare it with and a way to do the comparison objectively. But still, faster doesn’t mean fast, and making a part faster doesn’t make the whole faster.

Simply think of the gradual increase of CPU clock speed. Computers have come a long way; they’re everywhere now, even in your pocket. But how much of that increase in speed do you actually get to perceive? You buy a new laptop and its processing power is probably hundreds or thousands of times that of the old one you bought quite some time ago. But does launching an application actually take a hundredth or a thousandth of the time it took on the old laptop? Probably not. Complexity has increased, OSes have become more sophisticated, and there are hundreds of other factors combined that affect your perception.

Let’s say you’re assigned to do performance work for a certain area or feature of some software. You define metrics, find the proper places to instrument, add the instrumentation, measure the software, gather some samples, compare against the proper baseline/control, and analyze the result. Now you have some idea of an aggregated metric value that you think is right (leaving out the fact that, in a lot of cases, people only do half of the work listed). With that idea or anecdote in hand, you find areas you can optimize or fix and start coding and/or refactoring. You measure again and conclude you gained 20% from the work, comparing the cases with and without it, which is a fairly good gain.
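
To make that workflow a bit more concrete, here is a minimal sketch of my own (not anything specific from this post) of timing one code path, collecting a handful of samples, and comparing the result against a stored baseline. The names `render_report` and `baseline.json` are hypothetical.

```python
import json
import statistics
import time

def measure(fn, iterations=10):
    """Run fn repeatedly and return per-iteration wall-clock times in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def compare_to_baseline(samples, baseline_ms):
    """Return the median of the samples and its delta (in %) against a baseline."""
    median = statistics.median(samples)
    delta_pct = (median - baseline_ms) / baseline_ms * 100.0
    return median, delta_pct

# Hypothetical usage: render_report is the code path under test and
# baseline.json holds the control build's median for that path.
# samples = measure(render_report)
# baseline = json.load(open("baseline.json"))["render_report_ms"]
# median, delta = compare_to_baseline(samples, baseline)
# print(f"median={median:.1f} ms, delta vs baseline={delta:+.1f}%")
```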

Benchmarking in a controlled environment (usually your own machine, but at least you’re comparing on the same machine) can be an indication that one thing is faster or slower than another. Profiling the execution is also a way to find performance ideas by looking at pieces relative to each other, and multiple profiles can reveal a pattern. Improving a frequently called code snippet from O(n^2) to O(n log n) by analyzing it asymptotically is an obvious win for that chunk of code.
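
As a generic illustration of that kind of asymptotic win (my own example, not a case from this post), here is the same “are there any duplicates?” question answered both ways:

```python
def has_duplicates_quadratic(items):
    # O(n^2): compare every pair of elements.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_sorted(items):
    # O(n log n): sort once, then only compare adjacent elements.
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```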

But software performance is much more complicated than that. You’re not working on a monolithic piece of sequential code; you’re working on a complex, organic body of software influenced by a lot of external and internal factors. What if the 20% gain was due to something specific to your machine setup? What if some other long-pole operation that someone else wrote diminishes that gain? What if the result turns out differently under a different load or stress, or the gain doesn’t matter at a different topology and scale of your system? Now what? How many people are working on your product, and how does your work affect theirs in terms of performance?

Then, unexpectedly (which is to be expected in any software), you find out that users very rarely exercise the functionality that goes through that code path. Does that mean you didn’t do your job? Of course not. But perhaps the engineering effort approached it too naively. There’s nothing wrong with the work that was done to gain 20%, but was it prioritized correctly? Were there bigger gains to pursue, or smaller gains with bigger impact in paths the users actually hit all the time? Did you do a good job of balancing and distinguishing faster and fast?

Illusions

I’ve been thinking about and working on this topic for years, about half my career. I never meant to do performance as a career, but when my organization at Microsoft shut down, it (the people working there) got merged into another org and, (un)fortunately, I landed on a performance team. Not that I had never done performance work before, but a dedicated performance team was different. It was a new way of looking at the big picture of performance, not as a property of a single feature or piece of work but as performance as a whole.

xkcd.com

One of the things I learned while working on “faster” is that people fall for the illusion of a constant performance number that represents the whole, and they ask for it all the time. At times, anecdotes win over statistics because an instance is easier to comprehend, forgetting that it is just an instance. I’m no statistician, but anything slightly more complex than a simple average, median, or quartile tends to hit a mental barrier. Counts are usually easier to comprehend and relatively stable, but mostly not representative by themselves.

It would be nice to have an aggregated number that looks simple, believable, and easily explainable, like a pass/fail, but wanting one doesn’t mean one exists. Yet there has to be this number, and that number gets compared to a constant threshold (which usually comes out of nowhere) and declared pass or fail (sometimes with confidence values such as p-values, but computed from 5 samples?). There will be some number, but there isn’t always a pass/fail formula that rules them all, which is what makes it an illusion. You end up believing what you believe, not because it’s a proven fact.

Some people compare two sets of results from two different machines, which might luckily show seemingly valid results but is usually not the right thing to do. Not to mention that in some cases people compare a single sample, and it’s well known that people prefer anecdotes to data mumbo jumbo. Even a well-defined metric value representing performance is a synthetic number tied to the exact environment it was measured in. Under a different load, on a different machine, at a different time, with a different configuration, with a different metric, and so on, it will most likely be different.

When you read some article or experiment result and try to apply it, do you make sure you read between the lines on the assumptions and whether it applies to your situation and environment? “We improved performance by 2x” is pure marketing language and a no-no; how do you even make “performance” 2x? “We improved server response time by 2x under a 100 tps average load across various profiles” would make a little more sense. Any difference or change of metric can simply show different results.

What people don’t (want to) understand usually ends up in a lower-priority bucket, and the job of helping people understand turns into something other than engineering, or maybe it is part of the engineering. This is a large chunk of the work: proving and convincing. (And, yes, it should be included in the job description of a performance engineer.) A simple performance exit criterion saying “no regressions” or “10% performance increase” can mean anything.

You measured the execution time of a component 5 times and let’s say the runs showed 5, 5, 5, 8, 5 seconds. You would assume the performance is closer to 5, but can you guarantee it stays closer to 5 when you run it 10 times? What if it shows 5, 5, 5, 8, 5, 8, 8, 8, 8, 8 (it could be a distribution with two peaks, or the 5s could have been some incidental condition)? You’ll never know if you always do 5 iterations and probably use 28/5 = 5.6 as your definitive number. That might work for one-off benchmarks, but what if you want to build something larger-scale for your whole software? Wouldn’t that be relying on luck?
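
Here is that arithmetic spelled out as a quick sketch, using the hypothetical numbers above; the point is just how much the summary shifts once the second mode shows up:

```python
import statistics

first_five = [5, 5, 5, 8, 5]
ten_runs = [5, 5, 5, 8, 5, 8, 8, 8, 8, 8]

print(statistics.mean(first_five), statistics.median(first_five))  # 5.6 and 5
print(statistics.mean(ten_runs), statistics.median(ten_runs))      # 6.8 and 8
```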

The situation will be different under load. Suddenly the routine you thought was fast gets quadratically slower under load for some reason. Would that 5.6 still be the correct indication of performance? Would you argue that load testing is for a later development phase? The instrumentation itself can be a problem if it interferes with the execution or adds too much latency.

How would you solve this if you want to scale and cover more? You would at least do your best to fit more in within the given constraints, focus on lowering variability (which might not always work), do the right math (or find the right person with the expertise), and keep things contained enough to be explainable. Getting more historical samples also helps with the math and the inference. There’s simply no magic number that rules them all. Will 10 samples work? Or 100? Or 1000? Or do you keep going until you get a statistically significant result? How much would yield enough power? How many samples you need depends on the metric and the complexity of what you’re measuring, plus the uncontrollable factors, assuming you’ve minimized the controllable ones.
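
One possible way to put a number on “is this difference real or just noise” is a simple bootstrap over the medians. This is a sketch of that idea under my own assumptions (the sample values are made up), not a prescription:

```python
import random
import statistics

def bootstrap_median_diff(baseline, candidate, iters=10_000, seed=0):
    """Resample both sets and return a rough 95% interval for the difference
    of medians (candidate - baseline). If the interval excludes zero, the
    difference is probably not just noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        diffs.append(statistics.median(c) - statistics.median(b))
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

# Hypothetical numbers: with only 5 samples per side the interval tends to be
# wide; more samples (and lower variability) tighten it.
# low, high = bootstrap_median_diff([5, 5, 5, 8, 5], [6, 6, 9, 6, 6])
```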

Running average of an artificial 30ms regression on an example metric.

Of course, engineering is about solving problems under compromises and constraints. A tight threshold/tolerance will always risk false positives that waste resources. A loose one can also be a trap that constantly allows real problems to be ignored (false negatives), killing performance by a thousand cuts and making it hard to win back. You also need to be careful of things like the average-of-averages trap.
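
A tiny illustration of the average-of-averages trap, with made-up numbers:

```python
# Two machines report an "average response time", but with very different
# request counts. Averaging the averages ignores that weighting.
groups = [
    {"count": 1000, "avg_ms": 50},   # busy machine
    {"count": 10,   "avg_ms": 500},  # mostly idle machine
]

avg_of_avgs = sum(g["avg_ms"] for g in groups) / len(groups)
weighted = sum(g["count"] * g["avg_ms"] for g in groups) / sum(g["count"] for g in groups)

print(avg_of_avgs)  # 275.0 -- looks alarming
print(weighted)     # ~54.5 -- what requests actually experienced on average
```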

The illusion also tricks you into believing that your controlled environment is enough. Maybe that daemon that helps you run things is not a big deal, maybe the network not being isolated won’t be much of a problem, maybe some small I/O won’t matter, maybe something is just OK. But these things keep biting you from behind later, when the cost is higher, just like any other bug found late rather than early. You might as well eliminate every factor you can find, however small it is. You need to look beyond the illusion.

Thousand Cuts

Then you might think about it the other way around: how much does “faster” contribute to the perception of being fast as a whole, and where do you draw the line? Prioritizing by the bigger gains that contribute to being fast is good, but does that mean small gains are supposed to be negligible? The whole point of prioritizing is that lower priorities can be dealt with later, and unless you have infinite resources, they can easily fall off the list of work that needs to be done.

As mentioned in the previous section, a loose tolerance lets through the little causes that interfere with being faster, and they add up. As they accumulate, they start to become noticeable, which brings us to the idiom “death by a thousand cuts”, or the “boiling frog”.

This happens all over the development cycle. Slip a couple of milliseconds here, more milliseconds there, or tens and hundreds of milliseconds sometimes masked by some other improvement bigger than the loss. There can also be code that chooses clarity/readability over optimization, which can be a good choice. You might struggle over whether an optimization is premature and plan to refactor later, and that might actually be the way to go at times. Or if it’s a more architectural problem that leaks tens of milliseconds, you might again decide to fix it later. It’s up to your judgement, but documenting or commenting what you decided will also help performance quality.

Quite some time ago, I used to work on what is called world readiness these days. Back then, there were so many extra code paths supporting all the customizations needed for different markets that we used to say an international version was expected to be 15% or 20% slower in general, and we took it for granted. This was a side effect of patching over English-first versions. It’s less the case these days, and modern OSes and tools do a much better job than before. But thinking about how long it took to get to the current state is a good example of what a thousand cuts can result in.

There are always ideas that can save more milliseconds, and if the cost is also small, I’d recommend not discouraging them by saying “who cares about a couple of milliseconds”, because you should care. A millisecond also means very different things depending on the context, especially the closer it happens to the user. Low-cost good practices are the best ones. I’ll talk about the culture aspect in a section below.

Synthetic

Again, this is engineering with a limited amount of resources, including time; we don’t have time to gather that many samples, and there’s never a shortage of scenarios or code paths to measure. Eliminating what you can and containing your measurements as early as possible will help you go the extra mile.

You always have to remember that you’re doing a synthetic measurement under certain resource constraints, and that’s the dilemma you need to think about. You would say that you do consider it (I’ve heard that a lot), but in reality you fall into the illusion again and act on it: allocating too little resource and believing it will do, assuming some small factor won’t affect the result, or picking whichever metric is easiest to obtain even if it might not represent what you want. (I guess that’s why you need a performance engineer to keep nudging that it’s not the case.)

Memory spikes, leaks and increased consumption, crashes, network or storage I/O contention, GC, locks, execution order, throttling, affinity, blocking operations, load, anti-patterns, and plain functional bugs: there are tons of causes and factors that can make things slower or make them feel slow, which also means they’re data points you can use to make your software fast. That’s a reason we try to measure, monitor, and analyze these as necessary, or control and contain them. A good strategy for gathering and storing this data is much needed. You know the drill (hint: big data).

Newsflash, your synthetic world has a gap.

The purpose of synthetic measurement is to estimate expected performance and prevent bad performance in real usage, that is, to find out whether the expectation matches the results in the wild. There’s always a gap between your machine, your synthetic environment, or your lab’s controlled sandbox and some random user.

But that doesn’t mean you always have to mimic the real world. Even approximating the real world can be too costly if the model is too complex, and even if you manage it, the complexity of the mimicry makes it hard to isolate things and find the right cause. If that’s the case, the effort should go the other way around: synthesize the experiment to be more isolated, designed so it’s easier to find the cause. Anything that isn’t actionable isn’t worth synthesizing. It’s like trying to reproduce a bug by synthesizing an environment as close as possible to the one where it happened.

There’s still a lot of value in stress/fuzz/leak/load/soak/peak testing and other aspects that require simulating real usage. You also need to make sure your model is not too loose and that you gather the right data to diagnose with, because being able to reproduce the issue matters as much as having the right code coverage prioritized. You might be able to synthesize the load/capacity, but capturing and replaying a single point in time won’t come close to the dynamics and patterns that show up in the real world over time. You might find a performance issue after running a server product for a couple of days, and without the data collected, you’d need a couple more days to reproduce it while trying to find the cause.

You create tools to synthesize scenarios and measure, all the time. There’s no one-size-fits-all tool that does what you need. Even when something luckily exists, in most cases you need to customize it to fit your needs and automate it to scale. Anyway, you cannot synthesize every possibility, so reasonable modeling and prioritizing is the fair way to mitigate that.

User Data

The best way to determine what to synthesize, of course, is to gather real usage data. You would have set some expectations while planning the software, but those usually have to be refined once you actually release/deploy. Analytics, RUM, production performance, telemetry, community feedback and bug reports, whatever you call it. Everything you do during development is for the user to use. A lot of the time you get too focused on the product itself and forget what the product is trying to achieve, and performance falls into that category too.

User Data.

Looking at real users’ performance data can yield really surprising results and insights. The mere fact that users are experiencing lower performance is a motivation to make your software faster. You might have assumed that the great results your synthetic tests show are what users experience too. How real users actually use your software (or service) is usually the key to finding and fixing performance inefficiencies.

Even with real-world data, defining what to measure, and dealing with the gap between that and what you can actually get, is another hard problem. Setting aside the privacy issues around PII, you can’t snoop into the user’s machine and pick whatever you want. There’s plenty of information that can help: for example, what’s installed on the machine (real-time malware or virus scanners, network firewalls, browser plugins, or even actual malware/adware that throttles things), the machine configuration (OS and hardware), the location (and network routes/speeds), and so on. It’s a constant cycle of analyzing what you have and refining what to collect over time without being intrusive.

You can also run into the problem of not having enough data, as with any data. When you analyze and start segmenting it, it’s always “not enough”, and you just have to infer as much as possible from what you have. Let’s say you got what you think are enough impressions from a web site. Divide them into network-speed groups, then OS groups, then browser groups, then browser-version groups. How many impressions are left in each group? Would you just stick to the aggregate and analyze without dividing? At the very least, though, you get a chance to match your synthetic data against the real user data you have.
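
With made-up numbers, the arithmetic of segmentation looks something like this:

```python
# Hypothetical counts: how quickly "enough" impressions stop being enough
# once you start segmenting.
impressions = 100_000
network_speeds = 4      # e.g. slow 2G / 3G / 4G / broadband buckets
os_groups = 5
browsers = 6
browser_versions = 8

per_segment = impressions / (network_speeds * os_groups * browsers * browser_versions)
print(per_segment)  # ~104 impressions per segment, and that assumes a uniform split
```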

There are frameworks for understanding the users, the humans, that are continually being refined and that you can use to aim your performance goals. The eye detects anomalies at a 10~20 ms grain; less than 100 ms of feedback for an action is said to be ideal; a blink is 300~400 ms and eye movement is much faster than that; while browsing, people start losing interest after a second, and 10 seconds is beyond the attention span, so they’re probably gone. Depending on how granular you want to think about perf, you can apply something like Fitts’s law, or 16.6 ms per frame. For more advanced topics, you might need to find a UX expert.
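
For what it’s worth, the 16.6 ms figure is simply the frame budget at a 60 Hz refresh rate:

```python
# Frame budget = one second divided by the target refresh rate.
for hz in (60, 90, 120):
    print(hz, "Hz ->", round(1000 / hz, 2), "ms per frame")
# 60 Hz -> 16.67 ms, 90 Hz -> 11.11 ms, 120 Hz -> 8.33 ms
```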

These frameworks can differ depending on circumstances, and since we’re not dealing with one individual who fits a framework, statistically there’s going to be a distribution, so A/B testing is a better option than speculating or falling into the trap of relying on some law that might not apply. If possible, that is; native apps don’t make this easy, but you can turn feature flags on and off across multiple pre-releases to loosely experiment. And I’ve seen plenty of performance speculation proven wrong.
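
A common way to do that kind of bucketing is to hash a stable user identifier into a deterministic variant. This is a generic sketch with made-up experiment names and percentages, not something specific to any product mentioned here:

```python
import hashlib

def ab_bucket(user_id, experiment="fast_path_rollout", treatment_pct=50):
    """Deterministically assign a user to 'treatment' or 'control' so the
    same user always sees the same variant across sessions."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Usage sketch: gate the optimized code path behind the flag, then compare the
# measured latency distributions of the two groups instead of speculating.
# if ab_bucket(current_user_id) == "treatment":
#     use_optimized_path()
```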

Solving

As long as you’ve found or diagnosed something actionable, or already have it in hand (a bug fix, tuning and optimization, an architectural change, new functionality or an improvement, applying best practices and patterns), there should be plenty of work to do in solving the problem. There’s no single scope the problem has to root from, and no single way to solve a broad performance problem or expectation, from a bad user experience down to kernel interrupts, from byte code and binaries down to hardware design and input. This part is a bit too broad, so I’ll just leave it at that.

Culture

Making things fast also means trying to hold the fort, or having a performance budget: preventing regressions from creeping in while allowing some room to breathe, which in turn can help shave off some of the thousand cuts. There’s less advantage in making something faster after it has gotten slower. It’s like a public company: if you miss your earnings expectation, your stock price goes down. That is, you’re “trying” to make sure that fixing or changing something doesn’t degrade performance more than expected. But there’s always some compromise to make: adopting a new technology, adding a new feature, enriching capabilities, side effects of a new dependency, and more.

Then you make a choice, and the priority should be balanced between “fast” and “faster”. Should new important features be blocked because of performance regressions? How early should you detect performance problems?
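
As a sketch of what “holding the fort” could look like in practice (my own illustration; the tolerance, metric name, and helper functions are hypothetical), a CI job might compare fresh samples against a stored baseline and fail past an agreed budget:

```python
import statistics
import sys

# Hypothetical budget check: the tolerance, the metric name, and the
# load_baseline / run_benchmark helpers are all made up for illustration.
BUDGET_TOLERANCE_PCT = 5.0  # agreed-upon room to breathe, not a magic number

def check_budget(baseline_samples, new_samples, tolerance_pct=BUDGET_TOLERANCE_PCT):
    """Return (ok, regression_pct) comparing medians of new vs baseline samples."""
    baseline = statistics.median(baseline_samples)
    current = statistics.median(new_samples)
    regression_pct = (current - baseline) / baseline * 100.0
    return regression_pct <= tolerance_pct, regression_pct

# ok, pct = check_budget(load_baseline("search_latency_ms"), run_benchmark())
# if not ok:
#     sys.exit(f"search_latency_ms regressed by {pct:.1f}% (budget {BUDGET_TOLERANCE_PCT}%)")
```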

© Gapingvoid LLC

If I could pick one thing for you to take away from this write-up, beyond distinguishing fast and faster, it would be the cultural aspect, which has been the hardest challenge but the one that pays off the most. A performance culture is what a performance engineer works towards. It’s not about arguing that “early optimization is bad”; in fact, the “how early” question is the wrong question. Performance shouldn’t be something applied only at some stage, and optimization is just one way of making things fast.

Working from anecdotes is different from building something systematic for performance. Without a culture that values performance as one of the pillars of quality, the work can feel like an unnecessary burden. Someone will introduce bugs regardless, of course not intentionally; controlling quality is about anticipating and acting on those bugs, and it’s the same for performance.

Sharing and maintaining best practices, guidelines, and education for performance; tools that make it easier to measure your code while developing; preventing check-ins with potential regressions as far as the given resources allow, without sacrificing too much productivity; continuous performance tests in continuous integration; performance considerations, criteria, and modeling in early designs; treating metric definition and instrumentation dev work as high-priority work items; reasonable performance SLAs/contracts across the development team; measuring rather than speculating and not treating performance as just a function of a single feature (and I would also slip in valuing milliseconds); and more.

You’ll probably be able to point out that a lot is missing, especially the processes, tools, vocabulary, and so-called rules found in performance engineering textbooks that are specific to particular areas. The emphasis will also differ depending on what kind of software you’re talking about. I guess having so many things to think about is part of why a lot of software is so slow or feels slow.

I tried to focus on things that happened over and over across the different parts of my limited experience, so these aspects of what making software fast means are likely subjective. Still, I hope whoever reads this will get on board with building a performance culture, try to understand what making software fast means, and think twice before treating performance as some one-off task to check off in a plan or a sign-off. I’m sure even a minimum viable product will turn out hugely different performance-wise between those who keep a performance culture in mind and those who don’t.

This was an attempt at summarizing my thoughts on this fun area called performance. I guess I’ve ended up mixing in too much, and some of it might be too obvious… and I know I’m not much of a writer, so it might not be that convincing, but FWIW.

--

bkchung

Principal MTS at Tableau Software, Data Modeling & Calculations