Consider this code (description below):
- We initialize an array of random data
- In the first benchmark, we sum its higher-than-average numbers
- We sort the array
- We run the same benchmark again.
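Those steps can be sketched as follows (a Python sketch with hypothetical names; note that CPython's interpreter overhead largely hides branch-prediction effects, so the original timings suggest a compiled language — treat this as an illustration of the steps, not a faithful reproduction):

```python
import random
import time

# 1. Initialize an array of random data.
data = [random.randint(0, 255) for _ in range(1_000_000)]
average = sum(data) / len(data)

def sum_above_average(values, threshold):
    """Sum the values that are higher than the threshold."""
    total = 0
    for v in values:
        if v > threshold:  # the conditional jump being predicted
            total += v
    return total

# 2. First benchmark: unsorted data.
t0 = time.perf_counter()
unsorted_result = sum_above_average(data, average)
print(f"unsorted: {time.perf_counter() - t0:.3f} s")

# 3. Sort the array.
data.sort()

# 4. Run the same benchmark again: sorted data.
t0 = time.perf_counter()
sorted_result = sum_above_average(data, average)
print(f"sorted:   {time.perf_counter() - t0:.3f} s")

assert unsorted_result == sorted_result  # same work, same answer
```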
Sounds simple, right?
However, the results can be a little unexpected:
- 0.03 sec for the first one
- 0.01 sec for the second one
What’s going on here?
If you read the title, you probably already know what’s going on:
Steve Jobs once tried to explain what branch prediction is:
“I don’t know what it does. It predicts branches”
“It’s a good thing”
Steve Jobs, 2003
Thank you, Steve!
I’m not sure that was enough, so I’ll try to explain a bit more:
When a CPU executes instructions, they are fed into a pipeline.
Consider this graphic:
Here, each square represents one instruction.
Even though this theoretical CPU can complete one instruction per cycle, the execution of a single instruction is spread over 4 cycles.
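As a rough model (my own simplification, not any real CPU): in an ideal pipeline, a new instruction enters every cycle, so N instructions take N + depth − 1 cycles in total — the latency of each instruction is hidden by the overlap:

```python
def ideal_pipeline_cycles(n_instructions, depth=4):
    """Total cycles for n instructions in an ideal pipeline:
    the first instruction takes `depth` cycles to travel through
    all the stages, then one more instruction completes per cycle."""
    return depth + (n_instructions - 1)

# 8 instructions, each with a 4-cycle latency, but overlapped:
print(ideal_pipeline_cycles(8))     # 11 cycles, i.e. ~1 instruction/cycle
print(ideal_pipeline_cycles(1000))  # 1003 cycles
```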
Here are some examples of real-world pipelines:
| Architecture | Pipeline Depth |
| --- | --- |
| Pentium 4 | Up to 31 |
| PPC G4 | 7 |
| Intel Core | 14 |
| ARM | 3 - 13 |
Why have a pipeline at all?
Splitting the work into smaller chunks makes it easier to increase the clock frequency, since each stage has less work to do per cycle.
So… why not make it even longer?
Well, there’s a practical limit on how finely you can split an instruction… And there are bubbles!
What if the next instruction needs the result of the current one?
Well, then this happens:
We need to wait for the previous instruction to produce its result, creating a “bubble” within the pipeline, during which some stages can do nothing useful.
The longer the pipeline, the worse the effect, since the bubble will span more and more stages.
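To see why longer pipelines make this worse, here is a toy model (my own simplification, assuming the worst case where a dependent instruction must wait for the previous one to fully complete, with no result forwarding):

```python
def cycles_with_bubbles(deps, depth):
    """deps[i] is True if instruction i needs the result of
    instruction i-1. An independent instruction retires one cycle
    after the previous one; a dependent one stalls in bubbles until
    the previous result is ready, costing `depth` cycles instead of 1."""
    cycles = depth  # the first instruction fills the pipeline
    for depends in deps[1:]:
        cycles += depth if depends else 1
    return cycles

independent = [False] * 8
chained = [False] + [True] * 7  # each one needs the previous result

print(cycles_with_bubbles(independent, depth=4))  # 11
print(cycles_with_bubbles(chained, depth=4))      # 32
print(cycles_with_bubbles(chained, depth=14))     # 112: deeper pipeline, worse
```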
If you thought that was bad, there’s worse to come:
Some instructions are called “conditional jumps” (think if … else).
What happens when the CPU encounters such an instruction? The next instruction is not just waiting on the previous one’s result: the CPU literally doesn’t know which instruction comes next.
So instead of stalling the pipeline until the result is known, the CPU tries to predict it.
If it made the right choice, it can keep going; if not, it discards everything and loads the correct branch. At best it gains several cycles; at worst… it loses nothing.
There are various kinds of prediction algorithms:
- Flip a coin (in practice: always assume the jump is not taken): 50% accuracy
- Look up the results of past iterations (usually with 1–2 bits of memory per branch)
- Replay a pattern
Modern branch predictors are estimated at over 95% accuracy!
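The “look up past results” idea can be sketched as the classic 2-bit saturating counter (a textbook scheme, not any specific CPU’s predictor):

```python
def two_bit_predictor_accuracy(branches):
    """2-bit saturating counter: states 0-3, predict 'taken' when
    state >= 2. The counter moves toward 3 on taken branches and
    toward 0 on not-taken ones, so a single surprise in a steady
    pattern doesn't flip the prediction."""
    state, correct = 2, 0  # start weakly predicting 'taken'
    for taken in branches:
        if (state >= 2) == taken:
            correct += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(branches)

print(two_bit_predictor_accuracy([True] * 100))        # 1.0 on a steady branch
print(two_bit_predictor_accuracy([True, False] * 50))  # 0.5: alternation hurts
```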
So… what about our benchmark?
Since our array contains random data, no matter how good the branch predictor is, it can only be right about 50% of the time.
But once the array is sorted, it becomes possible to “predict” the next branching result with >99% accuracy.
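One way to see why (a small sketch, with hypothetical names): once the data is sorted, the higher-than-average check goes one way for the whole first part of the array and the other way for the rest, so the branch direction changes exactly once:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
data = sorted(random.randint(0, 255) for _ in range(1000))
average = sum(data) / len(data)

# The branch outcome for each element, in iteration order.
outcomes = [v > average for v in data]

# Sorted data yields one long run of False, then one long run of True.
direction_changes = sum(
    1 for a, b in zip(outcomes, outcomes[1:]) if a != b
)
print(direction_changes)  # 1: any history-based predictor nails this
```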
And that explains our performance difference.