WTF is a SIMD, SMT, SIMT

Cody
4 min read · Oct 17, 2016


These are terms that get tossed around a lot in high-performance programming. I'm going to attempt to demystify them, and by the end of this blog post you should hopefully have a good mental model of how they work. I'm going to gloss over a lot of details of modern CPUs; fully covering loads/stores, caching, and OoO would take ages.

To start: what is a CPU?

A very simple processor

Basically you have registers (active data). Instructions are read from RAM and executed in order. Data is loaded from and stored to RAM as the instructions say. This is a very simple model: there are no tricks or magic. Load data, load data, process data, store result. Okay.

What developed next was OoO (Out-of-Order Execution). Besides looking like the coolest acronym ever, OoO simply means the processor decides what order to execute your instructions in (not in your order, i.e. out of order). The goal being that as processors became more complex, they could in some circumstances run 1, or even 2, instructions at the same time.

A simple model for this is the Intel Pentium:

FPU = Floating-Point Unit, ALU = Arithmetic Logic Unit (integer)

The Pentium introduced a second ALU and an OoO unit. The goal being that under the right circumstances the chip could do 2 integer operations in the same clock cycle.

This didn’t work. The first ALU was saturated, while the second ALU was used less than 1% of the time. There simply weren’t enough opportunities to do multiple things in 1 clock cycle. So Intel had to come up with a solution. Enter Hyper-Threading:

2 Independent Instructions streams = more parallelism

The idea here is that the OoO scheduler now pretends to be 2 separate processors. Since 1 ALU is saturated while the other is wasted, 2 threads should saturate both ALUs! The name for this model is SMT (Simultaneous Multi-Threading): 1 core simultaneously pretending to be multiple processors. Intel calls this Hyper-Threading.

So what is bad about the SMT architecture? Latency. When 1 of the SMT threads is blocked (on a disk or RAM read/write), around half the core is now doing nothing. Ideally the thread that is still running could step up and use the idle resources, but as Intel learned, this isn’t always the case. IBM and Oracle just decided to make cores even wider in the hope that SOMETHING will use the extra resources.

SMT-6 POWER8 Core

But ultimately this model still has the same fundamental limitation. Furthermore, it puts a lot of stress on the OoO unit, which is juggling a lot of instructions.

The solution to this was SIMT (Single Instruction, Multiple Threads). The idea being that the scheduler simply pauses thread execution, and stops performing OoO on threads while they’re blocked by RAM/disk access.

A very simple SIMT model

This seems like the dumbest idea, right? If 1 thread can’t saturate all the OoO functionality, why STOP using threads selectively?!? Are we trying to design a terrible processor? Will vendors literally buy anything that uses a fancy term?

So what is the advantage of SIMT? Well, vector processing. Vector processing, or SIMD (Single Instruction, Multiple Data), allows 1 instruction to do multiple things at once.

Classic Instructions
SIMD/Vector Instructions

SIMD is great for number crunching: when you have a simple math equation, but a TON of data. The repetitiveness allows the processor to do 2, 4, 8, 16, or 32 operations in 1 cycle instead of taking 2, 4, 8, 16, or 32 cycles. This can be a very non-trivial speedup.

The true power of SIMT comes when it is combined with SIMD. The idea being that threads are only run when they can execute a SIMD instruction, so the OoO/SIMT scheduler has to handle pooling data and looking even further ahead.

SIMT+SIMD is how a modern GPU works.

AMD GCN has 4x SIMD + 1 ALU (but Warp Scheduler is an Nvidia term)

A processor like AMD GCN has 64 register sets (register files), but only 4 compute units. So it can run 4 SIMD instructions at once, but only threads that can be immediately executed are run. The goal is that every cycle something should be able to run, regardless of RAM/cache latency or load. This is part of the reason modern GPUs are so fast: it is incredibly rare for cycles to be wasted waiting on resources.
