Under the hood CPU: Know your CPU

aarti gupta
software under the hood
5 min read · Dec 27, 2017

know the mechanics of your race car, part 1: CPU

Terminology used

CPU, or Central Processing Unit, executes the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by those instructions.

CPU cores: A core is usually the basic computation unit of the CPU. It can run a single program context (or multiple ones if it supports hardware threads, such as Hyper-Threading on Intel CPUs). A CPU may have one or more cores, so it can work on several tasks at the same time.

CPU clock speed, or clock rate, is measured in hertz, generally in gigahertz (GHz). A CPU’s clock rate is the number of clock cycles it performs per second. For example, a CPU with a clock rate of 1.8 GHz performs 1,800,000,000 clock cycles per second.

Hardware threads: A generic term for multithreading achieved mostly by duplicating thread state and sharing almost everything else in a processing core. Multithreading achieved by duplicating almost everything, the whole “core,” is what multicore and many-core designs are all about. An OS may have many threads to run, but the CPU can only run X of them at a given time, where X = number of cores * number of hardware threads per core. The rest have to wait for the OS to schedule them, whether by preempting currently running tasks or by other means.
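The formula for X is easy to sanity-check in a few lines of Python. This is only a sketch: the helper name `max_concurrent_tasks` and the 4-core, 2-thread machine are made up for illustration, while `os.cpu_count()` is the standard-library call that reports how many logical CPUs the OS sees.

```python
import os

def max_concurrent_tasks(cores: int, threads_per_core: int) -> int:
    """X = number of cores * number of hardware threads per core."""
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return cores * threads_per_core

# Hypothetical machine: 4 cores with 2 hardware threads each.
print(max_concurrent_tasks(4, 2))  # 8

# Logical CPUs the OS reports on this machine (may be None if unknown).
print(os.cpu_count())
```

On most systems the second number should equal cores times hardware threads per core, but containers and virtual machines can expose fewer logical CPUs than the hardware has.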

Hyperthreading: The CPU presents more logical cores than it physically has. Hyper-Threading lets two logical CPU cores share the physical execution resources of one core; in other words, the operating system sees two CPUs for each actual CPU core. This can speed things up somewhat: if one logical CPU is stalled and waiting, the other can use the execution resources it would otherwise leave idle. Hyper-Threading can help speed your system up, but it’s nowhere near as good as having actual additional cores. It is useful for workloads such as video editing, 3D rendering, and heavy multitasking. It also lets the operating system schedule light tasks, like background applications or browser windows, on one logical processor while a heavy application, like a game or full-screen video, runs on another.

Difference between hardware threads and hyperthreading: “Hyper-Threading” is Intel’s name for one specific way of implementing “hardware threads” (simultaneous multithreading). It is most commonly found on dynamic (a.k.a. out-of-order) execution engines.

Dynamic execution / out-of-order execution engines: Dynamic reordering of instructions lets the CPU hide memory latency by doing other useful work while it waits for data, which matters more and more as clock speeds rise. For every cache miss, a 3.6 GHz Pentium 4 has to wait around 230 clock cycles to get data from main memory, which is a lot of idle time in the eyes of the CPU. Reordering also gives an incremental increase in instruction-level parallelism (ILP): by reordering instructions on the fly, out-of-order architectures can extract ILP in areas where the compiler fails to.
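To put the 230-cycle figure in wall-clock terms, here is a small Python sketch that converts clock cycles to nanoseconds. The function name `cycles_to_ns` is my own; the numbers are the ones quoted above.

```python
def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Time in nanoseconds that `cycles` clock cycles take at `clock_hz`."""
    return cycles / clock_hz * 1e9

# One cache miss on a 3.6 GHz Pentium 4, using the ~230 cycles quoted above.
stall_ns = cycles_to_ns(230, 3.6e9)
print(f"{stall_ns:.1f} ns")  # 63.9 ns
```

Roughly 64 ns sounds tiny, but it is time in which the core could otherwise have completed a couple of hundred cycles of work, and that gap is what out-of-order execution tries to fill.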

Components of the CPU

1. Execution Units

Control unit

The control unit of the CPU contains circuitry that uses electrical signals to direct the entire computer system to carry out stored program instructions. The control unit does not execute program instructions; rather, it directs other parts of the system to do so. The control unit communicates with both the ALU and memory.

Arithmetic logic unit

The arithmetic logic unit (ALU) is a digital circuit within the processor that performs integer arithmetic and bitwise logic operations. The inputs to the ALU are the data words to be operated on (called operands), status information from previous operations, and a code from the control unit indicating which operation to perform. Depending on the instruction being executed, the operands may come from internal CPU registers or external memory, or they may be constants generated by the ALU itself.
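As a rough mental model, an ALU can be sketched as a function that takes an operation code and two operands and returns a result plus status flags. The toy below is an illustration only: the opcode names, the 8-bit default width, and the two flags are my own assumptions, and it leaves out the status input from previous operations (such as a carry-in) that a real ALU also receives.

```python
from typing import NamedTuple

class AluResult(NamedTuple):
    value: int
    zero: bool   # result is 0
    carry: bool  # result did not fit in the word (carry out or borrow)

def alu(op: str, a: int, b: int, width: int = 8) -> AluResult:
    """Toy ALU: apply `op` to two `width`-bit unsigned operands."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if op == "ADD":
        raw = a + b
    elif op == "SUB":
        raw = a - b
    elif op == "AND":
        raw = a & b
    elif op == "OR":
        raw = a | b
    elif op == "XOR":
        raw = a ^ b
    else:
        raise ValueError(f"unknown opcode: {op}")
    value = raw & mask  # keep only the bits that fit in the word
    return AluResult(value, zero=(value == 0), carry=(raw != value))

print(alu("ADD", 200, 100))  # AluResult(value=44, zero=False, carry=True)
print(alu("XOR", 5, 5))      # AluResult(value=0, zero=True, carry=False)
```

The flags are what later instructions, such as conditional jumps, look at to decide what to do next.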

Memory management unit

Most high-end microprocessors (in desktop, laptop, and server computers) have a memory management unit (MMU), which translates logical addresses into physical RAM addresses and provides the memory protection and paging that virtual memory relies on. Simpler processors, especially microcontrollers, usually don’t include an MMU.
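Address translation is easier to picture with a toy example. The Python sketch below assumes 4 KiB pages and a single-level page table stored as a dict; real MMUs use multi-level tables and a TLB in hardware, and the names `translate`, `PageFault`, and the sample mapping are hypothetical.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed 4 KiB pages

class PageFault(Exception):
    """Raised when a virtual page has no mapping."""

def translate(page_table: Dict[int, int], virtual_addr: int) -> int:
    """Map a virtual address to a physical one via a single-level table."""
    page, offset = divmod(virtual_addr, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(f"no mapping for virtual page {page}")
    return page_table[page] * PAGE_SIZE + offset

# Hypothetical mapping: virtual page number -> physical frame number.
table = {0: 5, 1: 2}
print(hex(translate(table, 0x1234)))  # 0x2234
```

Virtual address 0x1234 falls in virtual page 1 at offset 0x234, and page 1 maps to physical frame 2, hence 0x2234. An access to an unmapped page raises `PageFault`, which loosely mirrors how the hardware hands control to the OS.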

2. Registers

In computer architecture, a processor register is a quickly accessible location available to a computer’s central processing unit (CPU).

Registers are normally measured by the number of bits they can hold, for example, an “8-bit register” or a “32-bit register”. A processor often contains several kinds of registers, which can be classified according to their content or the instructions that operate on them, for example data registers, address registers, floating-point registers, and vector registers.
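To make the bit widths concrete, this short Python sketch shows the largest unsigned value an n-bit register can hold and what happens when a value is too big for it. The helper names `register_max` and `store` are my own.

```python
def register_max(bits: int) -> int:
    """Largest unsigned value an n-bit register can hold."""
    return (1 << bits) - 1

def store(value: int, bits: int) -> int:
    """Value actually kept when `value` is written to an n-bit register."""
    return value & register_max(bits)

print(register_max(8))   # 255
print(register_max(32))  # 4294967295
print(store(256, 8))     # 0: the ninth bit does not fit and is lost
```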

CPU Operations and Instruction Cycle

The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions that is called a program. The instructions to be executed are kept in some kind of computer memory. Nearly all CPUs follow the fetch, decode and execute steps in their operation, which are collectively known as the instruction cycle.

After the execution of an instruction, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter. If a jump instruction was executed, the program counter will be modified to contain the address of the instruction that was jumped to and program execution continues normally. In more complex CPUs, multiple instructions can be fetched, decoded, and executed simultaneously.
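The cycle is compact enough to sketch as a toy interpreter in Python. The four-instruction set below (LOAD, ADD, JNZ, HALT) and the single accumulator are invented for illustration and don't correspond to any real CPU.

```python
def run(program, max_steps=1000):
    """Toy CPU: fetch, decode, execute until HALT. Returns the accumulator."""
    pc = 0   # program counter: index of the next instruction
    acc = 0  # single accumulator register
    for _ in range(max_steps):
        opcode, operand = program[pc]  # fetch
        pc += 1                        # point at the next-in-sequence instruction
        if opcode == "LOAD":           # decode + execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JNZ":          # jump to `operand` if acc is non-zero
            if acc != 0:
                pc = operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("step limit reached without HALT")

# Count down from 3 to 0 using a jump back to instruction 1.
program = [
    ("LOAD", 3),
    ("ADD", -1),
    ("JNZ", 1),
    ("HALT", None),
]
print(run(program))  # 0
```

Note how the program counter is incremented on every fetch and only overwritten when the jump is taken, which is exactly the behaviour described above.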

SIMD: Single Instruction, Multiple Data

SIMD instructions allow the CPU to perform the same operation on multiple values simultaneously. For example, suppose we want to perform four multiplications on eight values:

z0 = x0 * y0
z1 = x1 * y1
z2 = x2 * y2
z3 = x3 * y3

Normally that would require eight instructions to load the values from memory into registers and four multiplication instructions. Using SIMD instructions, the CPU can load all four x values into the xmm0 register with a single MOVUPS instruction, load the four y values into the xmm1 register with another MOVUPS, and multiply them with a single MULPS instruction, which operates on four packed single-precision floats:

+-------+-------+-------+-------+
|  x3   |  x2   |  x1   |  x0   |  xmm0
+-------+-------+-------+-------+
    *       *       *       *
+-------+-------+-------+-------+
|  y3   |  y2   |  y1   |  y0   |  xmm1
+-------+-------+-------+-------+
    =       =       =       =
+-------+-------+-------+-------+
| x3*y3 | x2*y2 | x1*y1 | x0*y0 |  xmm0
+-------+-------+-------+-------+

The key feature here is that the multiplication is performed on all four values simultaneously, which in the ideal case is up to four times faster than doing them one at a time. SIMD instructions are often called vectorized instructions, because you can think of them as operating on vectors of values.
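The instruction counts can be modelled in a few lines of Python. To be clear, this is a model of the semantics only: Python will not emit MOVUPS or MULPS, and the function names and the "one instruction per load or multiply" accounting are simplifying assumptions of mine.

```python
def mul_scalar(xs, ys):
    """One multiply per pair; returns (results, instruction_count)."""
    results = []
    instructions = 0
    for x, y in zip(xs, ys):
        instructions += 2  # load x, load y
        results.append(x * y)
        instructions += 1  # multiply
    return results, instructions

def mul_packed(xs, ys, lanes=4):
    """Model of packed multiply: `lanes` values per register and per multiply."""
    results = []
    instructions = 0
    for i in range(0, len(xs), lanes):
        reg0 = xs[i:i + lanes]  # one packed load (like MOVUPS into xmm0)
        reg1 = ys[i:i + lanes]  # one packed load (like MOVUPS into xmm1)
        # one packed multiply (like MULPS) covers every lane at once
        results.extend(a * b for a, b in zip(reg0, reg1))
        instructions += 3
    return results, instructions

xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 6.0, 7.0, 8.0]
print(mul_scalar(xs, ys))  # ([5.0, 12.0, 21.0, 32.0], 12)
print(mul_packed(xs, ys))  # ([5.0, 12.0, 21.0, 32.0], 3)
```

Both versions produce the same products; the packed model simply needs 3 instructions where the scalar one needs 12. In practice you get this behaviour through a vectorizing compiler, intrinsics, or a numerical library rather than by writing the instructions yourself.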

Determining the equivalent for your computer

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping : 2
microcode : 0x3a
cpu MHz : 2400.223
cache size : 30720 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs :
bogomips : 4800.10
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
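
If you would rather pull out just the interesting fields than read the whole dump, a short Python sketch like the one below can do it. The function names `parse_cpuinfo` and `summarize` are my own, the sample string is a trimmed excerpt of the output above, and the set of SIMD flags checked is an arbitrary selection.

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text into one dict per logical processor."""
    processors = []
    current = {}
    for line in text.splitlines():
        if not line.strip():          # blank line ends one processor block
            if current:
                processors.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        processors.append(current)
    return processors

def summarize(cpu):
    """Pull out the fields discussed in this article from one processor dict."""
    flags = set(cpu.get("flags", "").split())
    return {
        "model": cpu.get("model name"),
        "cores": int(cpu.get("cpu cores", 0)),
        "siblings": int(cpu.get("siblings", 0)),
        "simd": sorted(flags & {"sse", "sse2", "avx", "avx2"}),
    }

# Excerpt of the output above; on Linux you could instead read
# open("/proc/cpuinfo").read().
sample = """\
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
siblings : 1
cpu cores : 1
flags : fpu sse sse2 ht avx avx2
"""
info = summarize(parse_cpuinfo(sample)[0])
print(info)
```

Here `cpu cores` is the number of physical cores in the package and `siblings` is the number of logical CPUs in it, so `siblings` greater than `cpu cores` means hardware threads are in use. In the output above both are 1: the `ht` flag is present, but this virtual machine (note the `hypervisor` flag) exposes a single logical CPU.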

aarti gupta
software under the hood

-distributed computing enthusiast, staff engineer at VMware Inc