Branch Prediction — Everything you need to know.

Harshal Parekh
Jun 23 · 9 min read

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, implemented by machine instructions such as conditional branch, conditional call, conditional return, and branch tables. Predication works by executing instructions from both paths of the branch and only permitting those instructions from the taken path to modify architectural state. [1]

To better understand the concept, let’s take an example:
“Processing a sorted array is faster than processing an unsorted array”

The code above takes ~12 seconds to run. But on commenting line 15, not touching the rest, the same code takes ~33 seconds to run.
(running time may wary on different machines, but the proportion will stay the same).

This is very peculiar yet intriguing, why would processing a unsorted array take almost 3 times as the time taken to process a sorted array?

Let’s travel back to 1700s to consider a real-life scenario, and you are the operator of this junction and you hear a train coming. You don’t know where the train is headed, so you stop the train, ask the driver and switch tracks.
But stopping and restarting a train takes a lot of time.

What if you were able to predict where the train wants to go and switch beforehand? If you’re right, train continues; if not, driver backs up and you switch.

If you guessed right all the time, the train would not have to stop.

This is what happens at the processor level when it encounters an “if” statement — Modern processors are complicated and have long pipelines. So they take forever to “warm up” and “slow down”.

In essence, this is branch prediction. If the processor guessed it right, it continues executing or else backs up and restarts. Now looking back at the program above, the culprit is the “if” statement:

A quick visualization for the sorted array:

And now, a visualization for the unsorted array:

Now that you probably understood the problem, how would you strategically guess to minimize the number of times that the processor must back up and go down the other path? You try to identify a pattern and follow it. This is more or less how branch predictors work. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless. [2]

Is “(a*b != 0) is faster than (a != 0 && b != 0) in Java”?

The above code is another example of branch prediction. Running the commented code takes longer than the current code. This is for a simple reason that the commented code will compile to 2 memory loads and two conditional branches as compared to a multiply and one conditional branch.

The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. This raises the question if this is at compiler level or is it at the hardware level? The latter, I believe.

Does ordering of if…else if… else affect probability?

It seems obvious that the sorted version would be faster, however for readability or the existence of side-effects, we might want to order them non-optimally. It’s also hard to tell how well the CPU will do with branch prediction until you actually run the code. So, how to order the if… else if statements?

As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt’s work.

After that, the branch goes into a branch prediction cache, and past behavior is used to inform future branch prediction.

So in a tight loop, the effect of misordering is going to be relatively small. The branch predictor is going to learn which set of branches is most likely, and if you have non-trivial amount of work in the loop the small differences won’t add up much.

In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your code. Thus if statements are forward branches when they fail.

So you should order your branches in the order of decreasing likelihood to get the best branch prediction from a “first encounter”.

This raises a question, is “if” expensive?

At the hardware (lowest) level, yes, “if” statements are expensive. In order to understand why, you have to understand how pipelines work.

For most RISC architectures, instructions are all a constant length, so the Program Counter (PC) can be incremented by a constant amount. For most instructions, the PC of the next instruction is just the current PC plus the length of the current instruction.

For branch instructions, however, the next instruction to be executed is not the next location after the current instruction. Branches are gotos — they tell the processor where the next instruction is. Branches can either be conditional or unconditional, and the target location can be either fixed or computed.

Conditional vs. unconditional is easy to understand — a conditional branch is only taken if a certain condition holds (such as whether one number equals another); if the branch is not taken, control proceeds to the next instruction after the branch like normal.

The branch target is another important issue. Most branches have a fixed branch target — they go to a specific location in code that is fixed at compile time. This includes if statements, loops of all sorts, regular function calls, and many more. Computed branches compute the target of the branch at runtime. This includes switch statements (sometimes), returning from a function, virtual function calls, and function pointer calls.

When the processor sees a branch instruction appear in its pipeline, it needs to figure out how to continue to fill up its pipeline. In order to figure out what instructions come after the branch in the program stream.

Thus, the reason why if statements are expensive is due to branch mispredictions.

Is & faster than &&?

To put this clearly, are bitwise operators faster than logical operators?

&& lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas in Java & in this context guarantees strict evaluation of both (boolean) sub-expressions. The value result is the same either way.

when value >= x and value <= y are as likely true as false with no particular pattern, would using the & operator be faster than using &&?

Let’s look at the bytecode:

As seen, withANDAND() generates two conditional jumps however, withAND() generates three conditional jumps.

Though I’m not that very much experienced with Java bytecode and I may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at. [3]

What’s sets me thinking is the Google Guava’s Power of Two method:

Is this use of & (where && would be more normal) a real optimization?

Intel ASM code for the original method:

Intel ASM code for the modified method:

The JIT compiler generates far less assembly code for the && version than for Guava's & version. So everything points to Guava’s & method being less efficient than the more “natural” && version.

Or is it?

On compiling the same methods with Java 7, the assembly code generated for the & method by the JIT compiler, has only one conditional jump now, and is way shorter!

So, is this use of & (where && would be more normal) a real optimization?

The answer is the same, even for this (very!) specific scenario: it depends on your JVM implementation, your compiler, your CPU and your input data.

Now, let’s go back to the first piece of code where this all started and look at the solution on how to fix it i.e. make it “branchless”.

The “if” statement can be replace with some bitwise operators.

Comparing the results (AWS t2.micro, Java 8):

If you give the Intel compiler the branchless code, it just out-right vectorizes and is just as fast as with the branch (with the loop interchange).

To conclude, what is branch aware programming?

Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Highly unlikely, at least not for a single branch. What you can do is minimize the depth of your dependency chains so that branch mis-prediction would not have any effect.

What should I keep in mind to produce branch-friendly code in a high level programming language?

Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance critical part of your program and not to worry about it until you’ve measured and found a problem.

I’m not a micro-optimization wizard. I don’t know exactly how the hardware branch predictor works.

  1. Profiling:
    That’s the very first thing you should be looking into if you don’t know how to do this (profiling). Most of these micro-level hotspots are best discovered in hindsight with a profiler in hand.
  2. Branch Elimination:
    There is an easier, high-level way to mitigate branch mis-prediction, and that’s to avoid branching completely. Read more.
  3. Zero Cost Exception Handling:
    In the case that an exception is thrown, they’re now slower than ever before. Yet in the case where an exception isn’t thrown, they’re now faster than ever before and often faster in successful scenarios than code like:

There are many more ways to avoid failed branch predictions, that’s for you to explore!

I hope you enjoyed the article! Find more in my profile. Visit my portfolio here:

The Startup

Medium's largest active publication, followed by +479K people. Follow to join our community.

Harshal Parekh

Written by

The Startup

Medium's largest active publication, followed by +479K people. Follow to join our community.