# Branch Prediction — Everything you need to know.

Jun 23, 2019 · 9 min read

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. [1]

To better understand the concept, let’s take an example:
“Processing a sorted array is faster than processing an unsorted array”
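The benchmarked code itself isn't reproduced here, so below is a minimal Java sketch of the classic experiment (array size, threshold, and repeat count are my assumptions, not the article's):

```java
import java.util.Arrays;
import java.util.Random;

public class SortedBranch {
    // Sum every element >= 128; the inner "if" is the branch under test.
    static long sumAboveThreshold(int[] data) {
        long sum = 0;
        for (int i = 0; i < 100; i++) {        // repeat to magnify the timing difference
            for (int value : data) {
                if (value >= 128) {
                    sum += value;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[32768];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(256);
        }

        Arrays.sort(data);                     // comment this out and the run gets much slower

        long start = System.nanoTime();
        long sum = sumAboveThreshold(data);
        System.out.println("sum = " + sum);
        System.out.println("seconds = " + (System.nanoTime() - start) / 1e9);
    }
}
```

With the `Arrays.sort` call in place the branch is perfectly predictable (all false, then all true); without it, the outcome is random on every iteration.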

The code above takes ~12 seconds to run. But after commenting out line 15, and touching nothing else, the same code takes ~33 seconds.
(Running times may vary across machines, but the proportion stays roughly the same.)

This is peculiar yet intriguing: why would processing an unsorted array take almost three times as long as processing a sorted one?

Let's travel back to the 1800s for a real-life scenario: you are the operator of a railroad junction and you hear a train coming. You don't know which way the train is headed, so you stop the train, ask the driver, and switch the tracks.
But stopping and restarting a train takes a lot of time.

What if you could predict where the train wants to go and switch the tracks beforehand? If you guessed right, the train continues; if not, the driver backs up and you switch.

If you guessed right all the time, the train would not have to stop.

This is what happens at the processor level when it encounters an “if” statement. Modern processors are complicated and have long pipelines, so they take a long time to “warm up” and “slow down”.

In essence, this is branch prediction: if the processor guesses right, it keeps executing; otherwise it backs up and restarts. Looking back at the program above, the culprit is the “if” statement:

A quick visualization for the sorted array:

And now, a visualization for the unsorted array:

Now that you understand the problem, how would you strategically guess to minimize the number of times the processor must back up and go down the other path? You try to identify a pattern and follow it. This is more or less how branch predictors work. But when faced with unpredictable branches that have no recognizable pattern, branch predictors are virtually useless. [2]

# Is `(a*b != 0)` faster than `(a != 0 && b != 0)` in Java?

The above code is another example of branch prediction at work. Running the commented-out code takes longer than the current code, for the simple reason that the commented code compiles to two memory loads and two conditional branches, versus a multiply and one conditional branch.
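The snippet being discussed isn't shown, so here is a hedged reconstruction of the two variants (class and method names are mine):

```java
public class NonZeroCheck {
    // Current version: one multiply plus a single conditional branch.
    // Caveat: a * b can overflow to 0 for large non-zero ints, so the two
    // versions are only equivalent when the product stays in range.
    static boolean product(int a, int b) {
        return a * b != 0;
    }

    // Commented-out version: two loads and two conditional branches.
    static boolean shortCircuit(int a, int b) {
        return a != 0 && b != 0;
    }

    public static void main(String[] args) {
        System.out.println(product(3, 4) + " " + shortCircuit(3, 4));
        System.out.println(product(0, 7) + " " + shortCircuit(0, 7));
    }
}
```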

The multiply is likely to be faster than the second conditional branch when hardware-level branch prediction is ineffective. This raises the question of whether the effect happens at the compiler level or at the hardware level. The latter, I believe.

# Does ordering of if…else if… else affect probability?

It seems obvious that the sorted version would be faster; however, for readability, or because of side effects, we might want to order them non-optimally. It’s also hard to tell how well the CPU will do with branch prediction until you actually run the code. So, how should you order the if… else if statements?

As a general rule, most (if not all) Intel CPUs assume forward branches are not taken the first time they see them; see Matt Godbolt’s work on the subject.

After that, the branch goes into a branch prediction cache, and past behavior is used to inform future branch prediction.

So in a tight loop, the effect of misordering is relatively small. The branch predictor will learn which set of branches is most likely, and if there is a non-trivial amount of work in the loop, the small differences won’t add up to much.

In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your source. Thus, `if` statements become forward branches that are taken when the condition fails.
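As an illustration only (a hypothetical `classify` helper, not from the article), ordering a chain by expected frequency looks like this:

```java
public class BranchOrdering {
    // Putting the most frequent case first means the first comparison
    // usually succeeds and the later branches are rarely even reached.
    static String classify(int httpStatus) {
        if (httpStatus == 200) {            // by far the most common outcome
            return "ok";
        } else if (httpStatus == 404) {     // next most common
            return "not found";
        } else {                            // everything else
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(200) + ", " + classify(404) + ", " + classify(500));
    }
}
```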

# This raises a question: is “if” expensive?

At the hardware (lowest) level, yes, “if” statements are expensive. In order to understand why, you have to understand how pipelines work.

For most RISC architectures, instructions are all a constant length, so the Program Counter (PC) can be incremented by a constant amount. For most instructions, the PC of the next instruction is just the current PC plus the length of the current instruction.

For branch instructions, however, the next instruction to be executed is not the next location after the current instruction. Branches are gotos — they tell the processor where the next instruction is. Branches can either be conditional or unconditional, and the target location can be either fixed or computed.

Conditional vs. unconditional is easy to understand — a conditional branch is only taken if a certain condition holds (such as whether one number equals another); if the branch is not taken, control proceeds to the next instruction after the branch like normal.

The branch target is another important issue. Most branches have a fixed branch target — they go to a specific location in code that is fixed at compile time. This includes `if` statements, loops of all sorts, regular function calls, and many more. Computed branches compute the target of the branch at runtime. This includes `switch` statements (sometimes), returning from a function, virtual function calls, and function pointer calls.

When the processor sees a branch instruction appear in its pipeline, it needs to figure out what instructions come after the branch in the program stream so that it can keep filling the pipeline.

Thus, the reason “if” statements are expensive is branch misprediction.

# Is & faster than &&?

## To put this clearly, are bitwise operators faster than logical operators?

`&&` lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas in Java `&` in this context guarantees strict evaluation of both (boolean) sub-expressions. The value result is the same either way.
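A small sketch (names mine) that makes the difference observable by counting how often the right-hand side actually runs:

```java
public class LazyVsStrict {
    static int rhsCalls = 0;

    static boolean lhs() { return false; }

    static boolean rhs() {
        rhsCalls++;
        return true;
    }

    // Returns how many times rhs() ran under && and under &, respectively.
    static int[] demo() {
        rhsCalls = 0;
        boolean shortCircuit = lhs() && rhs();  // rhs() skipped: LHS is false
        int lazyCalls = rhsCalls;

        rhsCalls = 0;
        boolean strict = lhs() & rhs();         // rhs() always evaluated
        int strictCalls = rhsCalls;

        return new int[]{lazyCalls, strictCalls};
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println("&& ran rhs " + counts[0] + " times; & ran rhs " + counts[1] + " time");
    }
}
```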

When `value >= x` and `value <= y` are as likely to be true as false, with no particular pattern, would using the `&` operator be faster than using `&&`?

Let’s look at the bytecode:

```
public withANDAND(III)Z
   L0
    LINENUMBER 8 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L1
   L2
    LINENUMBER 9 L2
    ICONST_1
    IRETURN
   L1
    LINENUMBER 11 L1
   FRAME SAME
    ICONST_0
    IRETURN
   L3
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
    LOCALVARIABLE x I L0 L3 1
    LOCALVARIABLE value I L0 L3 2
    LOCALVARIABLE y I L0 L3 3
    MAXSTACK = 2
    MAXLOCALS = 4

// access flags 0x1
public withAND(III)Z
   L0
    LINENUMBER 15 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ICONST_1
    GOTO L2
   L1
   FRAME SAME
    ICONST_0
   L2
   FRAME SAME1 I
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L3
    ICONST_1
    GOTO L4
   L3
   FRAME SAME1 I
    ICONST_0
   L4
   FRAME FULL [test/lsoto/AndTest I I I] [I I]
    IAND
    IFEQ L5
   L6
    LINENUMBER 16 L6
    ICONST_1
    IRETURN
   L5
    LINENUMBER 18 L5
   FRAME SAME
    ICONST_0
    IRETURN
   L7
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
    LOCALVARIABLE x I L0 L7 1
    LOCALVARIABLE value I L0 L7 2
    LOCALVARIABLE y I L0 L7 3
    MAXSTACK = 3
    MAXLOCALS = 4
```

As seen, withANDAND() generates two conditional jumps, whereas withAND() generates three.

Though I’m not very experienced with Java bytecode and may have overlooked something, it seems to me that `&` will actually perform worse than `&&` in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly mispredict. [3]

What sets me thinking is Google Guava’s power-of-two method:
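The method wasn't reproduced in the text; the following is a sketch from memory of Guava's `LongMath.isPowerOfTwo`, alongside the short-circuiting variant, so treat the exact code as an assumption:

```java
public class PowerOfTwo {
    // Guava-style version (sketch): note the single '&', which evaluates
    // both sides unconditionally.
    static boolean isPowerOfTwoAND(long x) {
        return x > 0 & (x & (x - 1)) == 0;
    }

    // The more "natural" short-circuiting version.
    static boolean isPowerOfTwoANDAND(long x) {
        return x > 0 && (x & (x - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isPowerOfTwoAND(64) + " " + isPowerOfTwoAND(63) + " " + isPowerOfTwoAND(0));
    }
}
```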

Is this use of `&` (where `&&` would be more normal) a real optimization?

Intel ASM code for the original method:

```
# {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest'
# this:     rdx:rdx   = 'AndTest'
# parm0:    r8:r8     = long
...
0x0000000003103bbe: movabs rax,0x0
0x0000000003103bc8: cmp    rax,r8
0x0000000003103bcb: movabs rax,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103bd5: movabs rsi,0x108
0x0000000003103bdf: jge    0x0000000003103bef
0x0000000003103be5: movabs rsi,0x118
0x0000000003103bef: mov    rdi,QWORD PTR [rax+rsi*1]
0x0000000003103bf3: lea    rdi,[rdi+0x1]
0x0000000003103bf7: mov    QWORD PTR [rax+rsi*1],rdi
0x0000000003103bfb: jge    0x0000000003103c1b  ;*lcmp
0x0000000003103c01: movabs rax,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c0b: inc    DWORD PTR [rax+0x128]
0x0000000003103c11: mov    eax,0x1
0x0000000003103c16: jmp    0x0000000003103c20  ;*goto
0x0000000003103c1b: mov    eax,0x0            ;*lload_1
0x0000000003103c20: mov    rsi,r8
0x0000000003103c23: movabs r10,0x1
0x0000000003103c2d: sub    rsi,r10
0x0000000003103c30: and    rsi,r8
0x0000000003103c33: movabs rdi,0x0
0x0000000003103c3d: cmp    rsi,rdi
0x0000000003103c40: movabs rsi,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c4a: movabs rdi,0x140
0x0000000003103c54: jne    0x0000000003103c64
0x0000000003103c5a: movabs rdi,0x150
0x0000000003103c64: mov    rbx,QWORD PTR [rsi+rdi*1]
0x0000000003103c68: lea    rbx,[rbx+0x1]
0x0000000003103c6c: mov    QWORD PTR [rsi+rdi*1],rbx
0x0000000003103c70: jne    0x0000000003103c90  ;*lcmp
0x0000000003103c76: movabs rsi,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c80: inc    DWORD PTR [rsi+0x160]
0x0000000003103c86: mov    esi,0x1
0x0000000003103c8b: jmp    0x0000000003103c95  ;*goto
0x0000000003103c90: mov    esi,0x0            ;*iand
0x0000000003103c95: and    rsi,rax
0x0000000003103c98: and    esi,0x1
0x0000000003103c9b: mov    rax,rsi
0x0000000003103c9e: add    rsp,0x50
0x0000000003103ca2: pop    rbp
0x0000000003103ca3: test   DWORD PTR [rip+0xfffffffffe44c457],eax        # 0x0000000001550100
0x0000000003103ca9: ret
```

Intel ASM code for the modified method:

```
# {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in 'AndTest'
# this:     rdx:rdx   = 'AndTest'
# parm0:    r8:r8     = long
...
0x0000000003103438: movabs rax,0x0
0x0000000003103442: cmp    rax,r8
0x0000000003103445: jge    0x0000000003103471  ;*lcmp
0x000000000310344b: mov    rax,r8
0x000000000310344e: movabs r10,0x1
0x0000000003103458: sub    rax,r10
0x000000000310345b: and    rax,r8
0x000000000310345e: movabs rsi,0x0
0x0000000003103468: cmp    rax,rsi
0x000000000310346b: je     0x000000000310347b  ;*lcmp
0x0000000003103471: mov    eax,0x0
0x0000000003103476: jmp    0x0000000003103480  ;*ireturn
0x000000000310347b: mov    eax,0x1            ;*goto
0x0000000003103480: and    eax,0x1
0x0000000003103483: add    rsp,0x40
0x0000000003103487: pop    rbp
0x0000000003103488: test   DWORD PTR [rip+0xfffffffffe44cc72],eax        # 0x0000000001550100
0x000000000310348e: ret
```

The JIT compiler generates far less assembly code for the `&&` version than for Guava's `&` version. So everything points to Guava’s `&` method being less efficient than the more “natural” `&&` version.

Or is it?

On compiling the same methods with Java 7, the assembly code generated by the JIT compiler for the `&` method has only one conditional jump and is much shorter!

So, is this use of `&` (where `&&` would be more normal) a real optimization?

Now, let’s go back to the first piece of code, where this all started, and look at how to fix it, i.e. make it “branchless”.

The “if” statement can be replaced with some bitwise operators.
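A sketch of the standard branchless rewrite (assuming the benchmark's usual `>= 128` threshold; the class name is mine):

```java
public class Branchless {
    // Branchy original: if (value >= 128) sum += value;
    // Branchless rewrite: a sign-extended mask replaces the conditional.
    static long sumAboveThreshold(int[] data) {
        long sum = 0;
        for (int value : data) {
            int mask = (value - 128) >> 31; // all 1s if value < 128, else all 0s
            sum += ~mask & value;           // adds value only when value >= 128
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumAboveThreshold(new int[]{10, 127, 128, 200}));
    }
}
```

Because no control flow depends on the data, there is nothing for the branch predictor to get wrong.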

Comparing the results (AWS t2.micro, Java 8):

```
// Branch - Random
seconds = 10.93293813
// Branch - Sorted
seconds = 5.643797077
// Branchless - Random
seconds = 3.113581453
// Branchless - Sorted
seconds = 3.186068823
```

If you give the Intel compiler the branchless code, it just outright vectorizes it, and it is just as fast as the version with the branch (with the loop interchange).

# To conclude, what is branch aware programming?

Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Highly unlikely, at least not for a single branch. What you can do is minimize the depth of your dependency chains so that a branch misprediction has less effect.
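One generic illustration (my example, not the article's) of shortening a dependency chain: summing with two independent accumulators instead of one.

```java
public class DependencyChains {
    // One long dependency chain: every add waits on the previous one.
    static long sumSingle(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    // Two independent accumulators: adjacent adds can execute in parallel,
    // so less work is serialized behind any one stalled instruction.
    static long sumPaired(int[] data) {
        long a = 0, b = 0;
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            a += data[i];
            b += data[i + 1];
        }
        if (i < data.length) {      // odd-length tail
            a += data[i];
        }
        return a + b;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        System.out.println(sumSingle(data) + " " + sumPaired(data));
    }
}
```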

What should I keep in mind to produce branch-friendly code in a high level programming language?

Primarily, keep in mind that branch misprediction will only affect you in the most performance-critical parts of your program, and don’t worry about it until you’ve measured and found a problem.

I’m not a micro-optimization wizard. I don’t know exactly how the hardware branch predictor works.

1. Profiling:
Profiling is the very first thing to look into if you don’t already know how to do it. Most of these micro-level hotspots are best discovered in hindsight, with a profiler in hand.
2. Branch Elimination:
There is an easier, high-level way to mitigate branch misprediction: avoid branching completely.
3. Zero-Cost Exception Handling:
When an exception is actually thrown, modern exception handling is slower than ever before. But when no exception is thrown, it is faster than ever, and on the success path often faster than code like:
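The comparison code isn't shown; presumably it is manual status-code checking along these lines (a sketch with assumed names), where even the success path pays a branch on every call:

```java
public class StatusCheck {
    // Error-code style: the callee reports success or failure as a status int,
    // and every caller must branch on the result.
    static int divide(int a, int b, int[] out) {
        if (b == 0) {
            return -1;   // failure status
        }
        out[0] = a / b;
        return 0;        // success status
    }

    public static void main(String[] args) {
        int[] result = new int[1];
        if (divide(10, 2, result) == 0) {   // a branch taken even when nothing went wrong
            System.out.println(result[0]);
        } else {
            System.out.println("division failed");
        }
    }
}
```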

There are many more ways to avoid failed branch predictions, that’s for you to explore!

I hope you enjoyed the article! Find more in my profile. Visit my portfolio here: https://harshal.one.

## The Startup

Get smarter at building your thing. Join The Startup’s +787K followers.
