Branch Prediction — Everything you need to know.

Harshal Parekh
Jun 23, 2019 · 9 min read

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, as implemented by machine instructions such as conditional branch, conditional call, conditional return, and branch tables. Predication works by executing instructions from both paths of the branch and only permitting those instructions from the taken path to modify architectural state. [1] Branch prediction, the subject of this article, takes the other approach: the processor guesses the outcome of a conditional branch before it is resolved, so the pipeline can keep fetching instructions.

To better understand the concept, let’s take an example:
“Processing a sorted array is faster than processing an unsorted array”

The code above takes ~12 seconds to run. But after commenting out line 15, without touching the rest, the same code takes ~33 seconds.
(Running time may vary across machines, but the proportion stays roughly the same.)
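
The embedded code did not survive the page extraction; what follows is a minimal sketch of the classic benchmark (class and method names are mine), where sorting the input before the loop is the only difference between the fast and slow runs:

```java
import java.util.Arrays;
import java.util.Random;

public class BranchBenchmark {
    // Sum every element >= 128; the branch inside the loop is the one
    // the predictor has to guess on each iteration.
    static long sumAbove128(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] >= 128) {   // unpredictable on random data
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] data = new int[1 << 20];
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(256);
        }

        long t0 = System.nanoTime();
        long unsortedSum = sumAbove128(data);
        long t1 = System.nanoTime();

        Arrays.sort(data);  // sorting here is what the article's timings toggle
        long t2 = System.nanoTime();
        long sortedSum = sumAbove128(data);
        long t3 = System.nanoTime();

        System.out.println("unsorted: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("sorted:   " + (t3 - t2) / 1e6 + " ms");
        System.out.println("sums equal: " + (unsortedSum == sortedSum));
    }
}
```

Both passes compute the same sum; only the predictability of the branch changes.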

This is peculiar yet intriguing: why would processing an unsorted array take almost three times as long as processing a sorted one?

Let’s travel back to the 1800s and consider a real-life scenario: you are the operator of a railroad junction and you hear a train coming. You don’t know which way it is headed, so you stop the train, ask the driver, and switch the tracks accordingly.
But stopping and restarting a train takes a lot of time.

What if you could predict which way the train wants to go and switch the tracks beforehand? If you guessed right, the train continues; if not, the driver backs up and you switch.

If you guessed right all the time, the train would not have to stop.

This is what happens at the processor level when it encounters an “if” statement. Modern processors are complicated and have long pipelines, so “warming up” and “slowing down” (flushing and refilling the pipeline) takes a long time.

In essence, this is branch prediction. If the processor guesses right, it keeps executing; otherwise it backs up and restarts. Now, looking back at the program above, the culprit is the “if” statement:

A quick visualization for the sorted array:

This is easy for the processor to predict, with just one switch

And now, a visualization for the unsorted array:

This is difficult for the processor to predict, resulting in 3x more time to run

Now that you understand the problem, how would you strategically guess to minimize the number of times the processor must back up and go down the other path? You try to identify a pattern and follow it. This is more or less how branch predictors work. But when faced with unpredictable branches that have no recognizable pattern, branch predictors are virtually useless. [2]

Is (a*b != 0) faster than (a != 0 && b != 0) in Java?

The above code is another example of branch prediction. Running the commented code takes longer than the current code, for a simple reason: the commented code compiles to two memory loads and two conditional branches, whereas the current code compiles to two memory loads, a multiply, and one conditional branch.
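
The referenced snippet is also missing; here is a sketch of the two variants (method names are mine). Note that the two are not even strictly equivalent, because the multiply can overflow to zero:

```java
public class NonZeroCheck {
    // One conditional branch for the caller: the product is compared once.
    static boolean bothNonZeroMul(int a, int b) {
        return a * b != 0;  // caution: int overflow can wrap the product to 0
    }

    // Two conditional branches: && short-circuits, so the second
    // comparison is only reached when the first succeeds.
    static boolean bothNonZeroAnd(int a, int b) {
        return a != 0 && b != 0;
    }
}
```

For example, bothNonZeroMul(65536, 65536) returns false, because 65536 * 65536 wraps to 0 in 32-bit arithmetic, while bothNonZeroAnd returns true for the same inputs.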

The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. This raises the question: does this happen at the compiler level or at the hardware level? The latter, I believe.

Does the ordering of if… else if… else statements affect performance?

Should conditions be ordered this way, or the other way round?
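
The snippet is missing here too; the question concerns something like this sketch (thresholds and names are illustrative), where the same branches could equally be written rarest-first:

```java
public class OrderedChecks {
    // Most likely case first: on a "first encounter" the not-taken
    // forward branches fall straight through into the common path.
    static int classify(int x) {
        if (x < 1_000) {            // say, ~90% of inputs
            return 0;
        } else if (x < 1_000_000) { // ~9% of inputs
            return 1;
        } else {                    // rare
            return 2;
        }
    }
}
```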

It seems obvious that the sorted version would be faster; however, for readability, or because of side effects, we might want to order them non-optimally. It’s also hard to tell how well the CPU will do with branch prediction until you actually run the code. So, how should you order the if… else if statements?

As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt’s work.

After that, the branch goes into a branch prediction cache, and past behavior is used to inform future branch prediction.

So in a tight loop, the effect of misordering is going to be relatively small. The branch predictor will learn which set of branches is most likely, and if you have a non-trivial amount of work in the loop, the small differences won’t add up to much.

In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your source. Thus, when an if condition fails, execution takes a forward branch.

So you should order your branches in the order of decreasing likelihood to get the best branch prediction from a “first encounter”.

This raises a question, is “if” expensive?

At the hardware (lowest) level, yes, “if” statements are expensive. In order to understand why, you have to understand how pipelines work.

For most RISC architectures, instructions are all a constant length, so the Program Counter (PC) can be incremented by a constant amount. For most instructions, the PC of the next instruction is just the current PC plus the length of the current instruction.

For branch instructions, however, the next instruction to be executed is not the next location after the current instruction. Branches are gotos — they tell the processor where the next instruction is. Branches can either be conditional or unconditional, and the target location can be either fixed or computed.

Conditional vs. unconditional is easy to understand — a conditional branch is only taken if a certain condition holds (such as whether one number equals another); if the branch is not taken, control proceeds to the next instruction after the branch like normal.

The branch target is another important issue. Most branches have a fixed branch target — they go to a specific location in code that is fixed at compile time. This includes if statements, loops of all sorts, regular function calls, and many more. Computed branches compute the target of the branch at runtime. This includes switch statements (sometimes), returning from a function, virtual function calls, and function pointer calls.
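
In Java terms (a sketch; the exact mapping to machine branches depends on the JIT), both kinds of branch target show up in ordinary code:

```java
interface Shape {
    double area();
}

class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

public class BranchKinds {
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) { // loop back-edge: fixed branch target
            sum += s.area();     // virtual call: target computed at runtime
        }
        return sum;              // return: target comes from the call stack
    }
}
```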

When the processor sees a branch instruction appear in its pipeline, it needs to figure out what instructions come after the branch in the program stream, so that it can continue to fill up the pipeline.

Thus, the reason if statements are expensive is branch misprediction: a wrong guess forces the pipeline to be flushed and refilled.

Is & faster than &&?

To put this clearly, are bitwise operators faster than logical operators?

&& lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas in Java & in this context guarantees strict evaluation of both (boolean) sub-expressions. The value result is the same either way.

When value >= x and value <= y are as likely to be true as false, with no particular pattern, would using the & operator be faster than using &&?

Let’s look at the bytecode:

public withANDAND(III)Z
L0
LINENUMBER 8 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ILOAD 2
ILOAD 3
IF_ICMPGT L1
L2
LINENUMBER 9 L2
ICONST_1
IRETURN
L1
LINENUMBER 11 L1
FRAME SAME
ICONST_0
IRETURN
L3
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
LOCALVARIABLE x I L0 L3 1
LOCALVARIABLE value I L0 L3 2
LOCALVARIABLE y I L0 L3 3
MAXSTACK = 2
MAXLOCALS = 4
// access flags 0x1
public withAND(III)Z
L0
LINENUMBER 15 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ICONST_1
GOTO L2
L1
FRAME SAME
ICONST_0
L2
FRAME SAME1 I
ILOAD 2
ILOAD 3
IF_ICMPGT L3
ICONST_1
GOTO L4
L3
FRAME SAME1 I
ICONST_0
L4
FRAME FULL [test/lsoto/AndTest I I I] [I I]
IAND
IFEQ L5
L6
LINENUMBER 16 L6
ICONST_1
IRETURN
L5
LINENUMBER 18 L5
FRAME SAME
ICONST_0
IRETURN
L7
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
LOCALVARIABLE x I L0 L7 1
LOCALVARIABLE value I L0 L7 2
LOCALVARIABLE y I L0 L7 3
MAXSTACK = 3
MAXLOCALS = 4

As seen, withANDAND() generates two conditional jumps, whereas withAND() generates three conditional jumps.

Though I’m not very experienced with Java bytecode and may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at. [3]

What sets me thinking is Google Guava’s power-of-two method:
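
The embedded method is missing from the page; from memory, Guava’s LongMath.isPowerOfTwo looks essentially like this (treat it as a sketch rather than the exact source):

```java
public class PowerOfTwo {
    // Note the non-short-circuit &: both sides are always evaluated,
    // trading a conditional branch for a bitwise AND.
    public static boolean isPowerOfTwo(long x) {
        return x > 0 & (x & (x - 1)) == 0;
    }
}
```

The (x & (x - 1)) == 0 test works because a power of two has exactly one bit set, so clearing its lowest set bit leaves zero.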

Is this use of & (where && would be more normal) a real optimization?

Intel ASM code for the original method:

# {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103bbe: movabs rax,0x0
0x0000000003103bc8: cmp rax,r8
0x0000000003103bcb: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103bd5: movabs rsi,0x108
0x0000000003103bdf: jge 0x0000000003103bef
0x0000000003103be5: movabs rsi,0x118
0x0000000003103bef: mov rdi,QWORD PTR [rax+rsi*1]
0x0000000003103bf3: lea rdi,[rdi+0x1]
0x0000000003103bf7: mov QWORD PTR [rax+rsi*1],rdi
0x0000000003103bfb: jge 0x0000000003103c1b ;*lcmp
0x0000000003103c01: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c0b: inc DWORD PTR [rax+0x128]
0x0000000003103c11: mov eax,0x1
0x0000000003103c16: jmp 0x0000000003103c20 ;*goto
0x0000000003103c1b: mov eax,0x0 ;*lload_1
0x0000000003103c20: mov rsi,r8
0x0000000003103c23: movabs r10,0x1
0x0000000003103c2d: sub rsi,r10
0x0000000003103c30: and rsi,r8
0x0000000003103c33: movabs rdi,0x0
0x0000000003103c3d: cmp rsi,rdi
0x0000000003103c40: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c4a: movabs rdi,0x140
0x0000000003103c54: jne 0x0000000003103c64
0x0000000003103c5a: movabs rdi,0x150
0x0000000003103c64: mov rbx,QWORD PTR [rsi+rdi*1]
0x0000000003103c68: lea rbx,[rbx+0x1]
0x0000000003103c6c: mov QWORD PTR [rsi+rdi*1],rbx
0x0000000003103c70: jne 0x0000000003103c90 ;*lcmp
0x0000000003103c76: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c80: inc DWORD PTR [rsi+0x160]
0x0000000003103c86: mov esi,0x1
0x0000000003103c8b: jmp 0x0000000003103c95 ;*goto
0x0000000003103c90: mov esi,0x0 ;*iand
0x0000000003103c95: and rsi,rax
0x0000000003103c98: and esi,0x1
0x0000000003103c9b: mov rax,rsi
0x0000000003103c9e: add rsp,0x50
0x0000000003103ca2: pop rbp
0x0000000003103ca3: test DWORD PTR [rip+0xfffffffffe44c457],eax # 0x0000000001550100
0x0000000003103ca9: ret

Intel ASM code for the modified method:

# {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103438: movabs rax,0x0
0x0000000003103442: cmp rax,r8
0x0000000003103445: jge 0x0000000003103471 ;*lcmp
0x000000000310344b: mov rax,r8
0x000000000310344e: movabs r10,0x1
0x0000000003103458: sub rax,r10
0x000000000310345b: and rax,r8
0x000000000310345e: movabs rsi,0x0
0x0000000003103468: cmp rax,rsi
0x000000000310346b: je 0x000000000310347b ;*lcmp
0x0000000003103471: mov eax,0x0
0x0000000003103476: jmp 0x0000000003103480 ;*ireturn
0x000000000310347b: mov eax,0x1 ;*goto
0x0000000003103480: and eax,0x1
0x0000000003103483: add rsp,0x40
0x0000000003103487: pop rbp
0x0000000003103488: test DWORD PTR [rip+0xfffffffffe44cc72],eax # 0x0000000001550100
0x000000000310348e: ret

The JIT compiler generates far less assembly code for the && version than for Guava's & version. So everything points to Guava’s & method being less efficient than the more “natural” && version.

Or is it?

On compiling the same methods with Java 7, the assembly code generated for the & method by the JIT compiler has only one conditional jump now, and is much shorter!

So, is this use of & (where && would be more normal) a real optimization?

The answer is the same, even for this (very!) specific scenario: it depends on your JVM implementation, your compiler, your CPU and your input data.

Now, let’s go back to the first piece of code, where this all started, and look at how to fix it, i.e. make it “branchless”.

The “if” statement can be replaced with some bitwise operators.
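
The replacement code is missing from the page; the standard branchless trick for this benchmark (a sketch, assuming the loop sums elements >= 128 over values in 0..255) uses an arithmetic shift to build a mask instead of branching:

```java
public class BranchlessSum {
    static long sum(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            // t is -1 (all ones) when data[i] < 128, otherwise 0
            int t = (data[i] - 128) >> 31;
            // ~t & data[i] is data[i] when data[i] >= 128, else 0
            total += ~t & data[i];
        }
        return total;
    }
}
```

There is no branch in the loop body, so there is nothing for the predictor to get wrong; the cost is the same whether the input is sorted or not.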

Comparing the results (AWS t2.micro, Java 8):

//  Branch - Random
seconds = 10.93293813
// Branch - Sorted
seconds = 5.643797077
// Branchless - Random
seconds = 3.113581453
// Branchless - Sorted
seconds = 3.186068823

If you give the Intel compiler the branchless code, it outright vectorizes the loop and is just as fast as the branching version (with the loop interchange).

To conclude, what is branch aware programming?

Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Highly unlikely, at least not for a single branch. What you can do is minimize the depth of your dependency chains, so that a branch misprediction has less of an effect.

What should I keep in mind to produce branch-friendly code in a high level programming language?

Primarily, you should keep in mind that branch misprediction is only going to affect you in the most performance-critical parts of your program; don’t worry about it until you’ve measured and found a problem.

I’m not a micro-optimization wizard. I don’t know exactly how the hardware branch predictor works.

  1. Profiling:
    This is the very first thing you should learn if you don’t know how to do it. Most of these micro-level hotspots are best discovered in hindsight with a profiler in hand.
  2. Branch Elimination:
    There is an easier, high-level way to mitigate branch misprediction, and that’s to avoid branching completely.
  3. Zero-Cost Exception Handling:
    When an exception is thrown, exceptions are now slower than ever before. Yet when no exception is thrown, they’re now faster than ever, and often faster in successful scenarios than equivalent error-checking code.
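
The comparison snippet is missing from the page; presumably it showed error-code-style checking, something like this sketch (names are mine):

```java
public class HappyPath {
    // Error-code style: every caller pays a branch on the result,
    // even when parsing always succeeds.
    static int parseOrSentinel(String s) {
        if (s.isEmpty()) return -1;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) return -1; // sentinel
        }
        return Integer.parseInt(s);
    }

    // Exception style: the success path runs straight through;
    // the cost is paid only if a throw actually unwinds.
    static int parseOrThrow(String s) {
        return Integer.parseInt(s); // throws NumberFormatException on bad input
    }
}
```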

There are many more ways to avoid failed branch predictions, that’s for you to explore!

I hope you enjoyed the article! Find more in my profile. Visit my portfolio here: https://harshal.one.
