Processors and optimization 101

Peter Hrvola
5 min read · Mar 25, 2019


I decided to write this post because I enjoyed our last lecture on Computer System Performance, given by Pinar Tözün at ITU. We covered many interesting topics on how CPUs can affect performance and how to optimize applications for it. I want to summarize what I’ve learned and describe what effects a CPU can have on program execution.

I start by showing a simple Subscalar CPU. Then I show how we can improve the performance of a Subscalar CPU by using Instruction pipelining. Then I cover the essentials of branch prediction, which is closely connected with Instruction pipelining. Next, I describe how in some cases CPUs are able to execute instructions out of order. Last but not least, I cover the controversial feature of Simultaneous multithreading.

Modern CPUs use advanced techniques to improve performance through implicit parallelization and to avoid waiting for cache or memory. These techniques can improve throughput dramatically, but in some cases they can hurt more than help.

Let’s start by examining a Subscalar CPU. This type of CPU doesn’t offer any kind of parallelism: one instruction is executed at a time, and executing one instruction can take multiple CPU cycles. In the example below, we can see that executing three instructions takes 15 CPU cycles. For every instruction the processor has to perform the following steps: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back.

Executing 3 instructions on a sub-scalar CPU // CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=140179

The Subscalar model is not very efficient, and application throughput can be improved by implicitly parallelizing some steps. The CPU’s implicit parallelization guarantees the programmer the same result as if the instructions were executed one after another, and it is performed by the CPU without any programmer intervention.

Instruction pipelining

To improve on the Subscalar model we can use so-called Instruction pipelining. As we can see in the image below, at any given time four different stages of instruction execution are happening at the same time. This sounds nice in theory, but the instructions we prefetch don’t have to follow in linear order: a conditional jump (“jle” in assembly), for example, may transfer control somewhere else entirely. The CPU therefore has to predict which branch is going to be taken. This is called branch prediction.

Instruction pipelining example // By en:User:Cburnett — Own workThis W3C-unspecified vector image was created with Inkscape., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1499754

Branch prediction helps the CPU prefetch the instructions that are more likely to be executed. Every time a new if statement is encountered we have a 50% chance of guessing which branch is going to be taken, right?

No, it turns out that programs are more likely to take the branch with the lower address (code that was already executed) than the one with the higher address. This simple heuristic alone can give us about an 80% success rate. This type of prediction is called Static branch prediction, and it is mostly successful for loops.
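To see why the heuristic works so well for loops, consider the sketch below (my own illustration, not from the lecture). The loop body compiles to a conditional jump back to a lower address, and that backward jump is taken on every iteration except the last one:

```c
/* Sketch: why "backward branch is taken" is a good static prediction.
 * The compiler turns this loop into a conditional jump back to a lower
 * address, roughly:
 *
 *   loop:  add  sum, a[i]
 *          inc  i
 *          cmp  i, n
 *          jl   loop        ; jump to a LOWER address, taken n-1 times
 */
long sum_array(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```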

Once we have observed a condition multiple times, we can make better guesses based on the previous outcomes. There are many predictors that can utilize history, each offering a different accuracy/complexity trade-off. For regular applications a good predictor achieves about 96% accuracy, but this depends on the nature of the application, the binary size, the size of the processed data, and I/O devices.
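A simple way to feel the difference between a predictable and an unpredictable branch is to run the same loop over sorted and over shuffled data. The sketch below is a hypothetical micro-benchmark of mine, not something from the lecture; count_big and threshold are made-up names:

```c
#include <stdlib.h>

/* The same branch is easy or hard to predict depending on the data:
 * with sorted input the outcome forms two long runs (all "not taken",
 * then all "taken"); with random input it is close to a coin flip. */
long count_big(const int *a, int n, int threshold) {
    long count = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= threshold)  /* predictability depends on the data */
            count++;
    return count;
}

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Timing count_big before and after qsort(a, n, sizeof a[0], cmp_int)
 * typically shows the sorted run finishing noticeably faster, purely
 * thanks to branch prediction. */
```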

Out-of-order execution

Pipelining instructions is already a nice improvement, but we can do even better. What if we could recognize groups of instructions that do not depend on each other? That would mean we could execute more instructions at the same time and still guarantee to programmers the same result as linear execution.

Example of superscalar, 2 way CPU // By Amit6, original version (File:Superscalarpipeline.png) by User:Poil — Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9748747

Let’s say we have two instructions to be executed:
IR1 = addr1 / addr2
IR2 = addr3 * addr4
These two instructions don’t share any addresses, nor do their results depend on each other. This allows the CPU to perform out-of-order execution (OoO). CPUs can execute more than two instructions in parallel, usually four; such a CPU is called a 4-way superscalar CPU.
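The same idea shows up in ordinary code. In the sketch below (my own illustration, with made-up function names), the first loop forms one long chain of dependent additions, while the second splits the work into four independent accumulators that a superscalar CPU can execute in parallel. Note that the two versions may round floating-point results slightly differently, since the additions are reassociated:

```c
/* One serial dependency chain: each addition must wait for the previous one. */
double sum_dependent(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: a superscalar, out-of-order CPU can keep
 * several additions in flight at the same time. */
double sum_independent(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```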

All this seems nice, but how does it work in practice? Memory and caches are slow, at least compared to CPU speed. This means that if the data is not available in a register, we pay a large penalty: the CPU has to stall just to fetch the data.

In the image below we can see a more realistic scenario on a 2-way superscalar CPU, where one instruction needs to load data from the L1 cache. In this case, the CPU is stalled while the data is being loaded.

Example of CPU accessing L1 cache

This example includes only a very fast access to the L1 cache for simplicity, but accessing data from disk, for example, can take around 345 000 cycles. At that point, the operating system might decide to perform a context switch and execute some other task while the data is being loaded.
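Dependent loads make this even worse, because out-of-order execution cannot overlap them. The linked-list walk below is a sketch of mine illustrating the worst case: the address of each load is only known once the previous load has completed, so every cache miss is paid in full, whereas a sequential array scan lets the hardware prefetcher hide most of the latency:

```c
#include <stddef.h>

struct node {
    struct node *next;
    long payload;
};

/* Pointer chasing: a serial chain of loads the CPU cannot reorder,
 * because p->next is not known until the previous load finishes. */
long walk_list(const struct node *p) {
    long sum = 0;
    while (p != NULL) {
        sum += p->payload;
        p = p->next;  /* next address depends on this load */
    }
    return sum;
}
```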

Simultaneous multithreading

There is one more performance optimization included in many CPUs, and it can play a significant role in performance. It’s called Simultaneous multithreading (SMT). However, its use is controversial, as it can hurt performance more than it helps.

SMT allows a core to execute multiple threads at the same time; the operating system simply sees more logical cores and doesn’t have to perform an explicit context switch between them. SMT is implemented by adding another set of registers to the core while sharing all the caches, so the core can issue instructions from different threads in the same cycle and fill execution slots that a single thread would leave idle. The goal of SMT is to improve efficiency by hiding memory latency. Most common are CPUs that execute two threads per core, but in some cases it can be up to eight.

Some research shows that an extra thread can be used to proactively seed the caches for the main thread. However, if the executed binary cannot fit into the L1 cache, adding another binary executed at the same time will result in cache thrashing and decrease performance by adding more cache misses and more fetches of instructions/data from main memory. For some specific workloads it can be better to turn off SMT support; this largely depends on the workload, however, and requires testing.
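On Linux, this is easy to experiment with. A minimal sketch, assuming a kernel new enough (4.19+) to expose the sysfs SMT interface: /sys/devices/system/cpu/smt/active reads 1 or 0, and writing “off” to /sys/devices/system/cpu/smt/control as root disables SMT at runtime.

```c
#include <stdio.h>

/* Report whether SMT is currently enabled, using the Linux sysfs
 * interface (available since kernel 4.19). */
int main(void) {
    FILE *f = fopen("/sys/devices/system/cpu/smt/active", "r");
    if (f == NULL) {
        printf("SMT sysfs interface not available\n");
        return 1;
    }
    int active = 0;
    if (fscanf(f, "%d", &active) == 1)
        printf("SMT is %s\n", active ? "enabled" : "disabled");
    fclose(f);
    return 0;
}
```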

Modern CPUs perform a lot of complicated optimizations behind the curtain. These optimizations are designed for general use, and for specific needs they might hurt more than help. If in doubt, it’s always a good idea to run tests on realistic usage profiles and check whether we can witness cache thrashing or high misprediction rates. I hope you enjoyed this summary of what I’ve learned as much as I did.
