CPU Extension Instruction Sets — Speeding up your CPU Computation

Faris Rahman
Feb 12, 2018 · 6 min read

As the complexity of computational problems grows over time, the demand for high-speed processors increases accordingly. One key to achieving the performance demanded is giving a system the ability to execute work in parallel, which enables higher throughput and faster operation.

A GPGPU (General-Purpose Graphics Processing Unit) is one way to do just that. GPGPUs provide a massive number of parallel cores that crunch multiple data elements at once: the data is distributed across the cores, and each core applies the same instruction to its share. On workloads that parallelise well, that capability can make GPGPUs around 10 times faster than a scalar CPU (possibly more, depending on the number of cores and the memory size of the GPU). Unsurprisingly, they are quite pricey.

Consumer/gaming GPUs are quite popular and could be another option for meeting the need for a computation accelerator. However, they are not designed to run high-load computations continuously, as is needed when we train a machine learning model, run inference for an AI application, sequence DNA, or run a computer simulation. Hence, their durability could be a concern.

One of the leading GPU manufacturers, Nvidia, has a product line whose sole purpose is advancing the capability of enterprise-level graphics cards in high-performance computation while preserving their durability. These GPUs are named after one of the greatest inventors of the late 19th and early 20th centuries, Nikola Tesla. The latest Tesla release, the Volta-based V100, has an outstanding single-precision performance of 15.7 TFLOPS [1].

Unlike gaming GPUs, Nvidia’s enterprise GPUs, such as the Tesla series, are not OEM (original equipment manufacturer) products. In other words, they are more expensive than gaming GPUs, including the ones produced under other brand names such as ASUS, Leadtek, and Digital Alliance. In short, enterprise GPUs could stand as the middle ground between GPGPUs and gaming GPUs in terms of price, durability, and computing power.

But what if I can’t afford a GPU yet desperately need speedy computation?

As you might have noticed, advanced compute accelerators (Google TPU, Intel Xeon Phi) and enterprise GPUs can be a little too pricey. Unless your massive computation workloads have sales potential, or your project is backed by a big pile of research money, they might not be the right choice.

But worry not. For those of you who own a notebook/laptop, even one without a GPU, and want to speed up computation by around 2–3 times, there is a hidden treasure waiting to be unlocked that can achieve just that.

Extension instruction sets have been added to Intel, AMD, ARM, and some other CPUs. An extension instruction set sits alongside the CPU’s default instruction set, and its purpose is to accelerate the arithmetic operations the CPU already performs.

Focusing mainly on streamlining data, such an instruction set uses wider registers to pack more data into a single batch for execution. Hence, data parallelism (and higher throughput) is achieved. These extension instruction sets are modeled after SIMD (Single Instruction Multiple Data), which is explained in the following section.

SIMD (Single Instruction Multiple Data)

SIMD (Single Instruction Multiple Data) is a type of operation that processes multiple data elements at once on a single arithmetic unit. SIMD is one of the classes in Flynn’s Taxonomy. The complete list of classes in Flynn’s Taxonomy is as follows:

  • SISD (Single Instruction Single Data)
  • SIMD (Single Instruction Multiple Data)
  • MISD (Multiple Instruction Single Data)
  • MIMD (Multiple Instruction Multiple Data)

SIMD is mostly used for vector operations such as addition, multiplication, and dot products.

With SIMD, we can pack several vector elements into one register, compute on them with a single dedicated instruction, and produce all of their results simultaneously.

A SIMD instruction set creates higher throughput in the execution pipeline by leveraging data parallelism. For that reason, SIMD has been widely used for speeding up computation.

Below is a list of the SIMD instruction sets available for each CPU brand:

SSE (Streaming SIMD Extensions)

Instruction set technology went through a long evolution before SSE was introduced. It began with the MMX instruction set, introduced with the Intel Pentium in 1997 as an additional set of instructions alongside the IA-32 (x86) instruction set. The MMX registers (mm0 to mm7) are 64 bits wide and can hold one 64-bit integer or multiple smaller integers.

SSE was first introduced with the Intel Pentium III in 1999 and was intended to supersede the MMX instruction set. SSE uses the XMM registers (xmm0 to xmm7), which are 128 bits wide. Compared with MMX, each register can therefore hold twice as much data, which doubles the peak throughput per instruction.

To use these SIMD instruction sets, Intel provides a C intrinsics library with new data types and a set of C functions. The data types that map onto the extension registers, and that are used to perform SIMD operations, are listed below:

  • __m64
  • __m128
  • __m128i
  • __m128d

__m64 is used by the MMX instruction set. It can pack one 64-bit value, two 32-bit values, four 16-bit values, or eight 8-bit values.

__m128 is used by the SSE instruction set. It holds four 32-bit single-precision floating-point values.

__m128i is used by the SSE integer instructions (introduced with SSE2). It holds integer values as (16 x 8-bit), (8 x 16-bit), (4 x 32-bit), or (2 x 64-bit).

__m128d is used by the SSE double-precision instructions (introduced with SSE2). It holds two 64-bit double-precision floating-point values.
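As a rough illustration of how these types map onto lanes, here is a minimal sketch. It assumes an x86 compiler where SSE2 is available through emmintrin.h; the function names are made up for this example and are not part of Intel's library. Note that the same __m128i type is used for both integer widths, and it is the instruction that decides how the 128 bits are split.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2; also pulls in the SSE header xmmintrin.h */

/* __m128i viewed as four 32-bit integers: out[i] = a[i] + b[i]. */
void add_4x_int32(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
    assert(sizeof(__m128i) == 4 * sizeof(int32_t));
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* The same __m128i type viewed as sixteen 8-bit integers. */
void add_16x_int8(const int8_t a[16], const int8_t b[16], int8_t out[16]) {
    assert(sizeof(__m128i) == 16 * sizeof(int8_t));
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* __m128d holds two 64-bit doubles: out[i] = a[i] + b[i]. */
void add_2x_double(const double a[2], const double b[2], double out[2]) {
    assert(sizeof(__m128d) == 2 * sizeof(double));
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

Each function performs a single SIMD addition, yet it produces 4, 16, and 2 results respectively, which is the lane count listed for each type above.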

Most SSE intrinsic functions follow a defined naming convention, as per the template below:

_mm_<operation>_<suffix>

For example, __m128 _mm_add_ps(__m128 a, __m128 b) adds the 128-bit values a and b lane by lane and returns the 128-bit result (dst).

add is the intrinsic operation which, as the name suggests, performs addition.

ps is the suffix, which denotes packed single-precision floating point: the “p” says that all packed elements are operated on, and the “s” says that each element is a single-precision floating-point value (32 bits).
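To make the suffix concrete, the sketch below contrasts _mm_add_ps with _mm_add_ss, a sibling SSE intrinsic that this article does not otherwise cover: the ss suffix means scalar single precision, so only the lowest lane is added. The wrapper names add_packed and add_scalar_lane are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE */

/* Packed form (ps): all four lanes are added, out[i] = a[i] + b[i]. */
void add_packed(const float a[4], const float b[4], float out[4]) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar form (ss): only lane 0 is added, out[0] = a[0] + b[0];
   lanes 1..3 are copied unchanged from a. */
void add_scalar_lane(const float a[4], const float b[4], float out[4]) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form gives {11, 22, 33, 44} while the scalar form gives {11, 2, 3, 4}.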

Intrinsic operations cover many common arithmetic operations (addition, multiplication, division, etc.), as well as logical, comparison, and conversion operations. To access Intel’s C intrinsic library, you need to include the matching C header file:

#include "xmmintrin.h" // for SSE
#include "emmintrin.h" // for SSE2
#include "nmmintrin.h" // for SSE4.2
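As a small sampler of those four categories, here is one intrinsic from each. This is only a sketch: the wrapper names are invented for this example, and it assumes SSE2 is available because the conversion intrinsic _mm_cvtps_epi32 lives in emmintrin.h.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2; includes the SSE intrinsics as well */

/* Arithmetic: element-wise product, out[i] = a[i] * b[i]. */
void mul4(const float a[4], const float b[4], float out[4]) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Comparison: returns a 4-bit mask with bit i set when a[i] < b[i]. */
int less_than_mask4(const float a[4], const float b[4]) {
    __m128 cmp = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(cmp); /* collects the sign bit of each lane */
}

/* Logical: clears the sign bit of every lane, i.e. an element-wise fabs. */
void abs4(const float a[4], float out[4]) {
    __m128 sign_bit = _mm_set1_ps(-0.0f); /* only the sign bit is set */
    _mm_storeu_ps(out, _mm_andnot_ps(sign_bit, _mm_loadu_ps(a)));
}

/* Conversion: four floats to four 32-bit integers, using the current
   rounding mode (round-to-nearest unless it has been changed). */
void to_int4(const float a[4], int out[4]) {
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(_mm_loadu_ps(a)));
}
```

The comparison does not return a boolean per element; it fills each lane with all ones or all zeros, which is why _mm_movemask_ps is used here to turn the result into an ordinary integer.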

For a more detailed list of the intrinsic functions in Intel’s intrinsic library, see the link below:


Let’s try a simple vector addition using SSE C intrinsics (the explanation is in the comments).

Compiling the SSE C intrinsic code above with GCC requires the additional argument -msse:

gcc --std=c99 -msse -o sse_add sse_add.c

Let’s compare the execution time taken by the SSE version with that of a simple vector addition in plain C.

A simple vector addition in C without intrinsics looks as follows:

gcc --std=c99 -o vec_add vec_add.c

Running the simple vector addition with a vector length of 2048 * 2048 takes 12.03 ms in plain C and 9.44 ms with SSE C intrinsics, a speed-up of about 1.27x (roughly 22% less time).

The speed-up is clear. Breaking the code down to compare execution on the plain x86 instruction set with execution on the extended instruction set shows why this acceleration happens.
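The article does not show how these timings were taken. One plausible way, sketched here purely as an assumption, is to wrap a kernel with the standard clock() function and average over several runs. The names vec_kernel, scalar_add, mean_time_ms, and speedup are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by the kernels being timed (hypothetical for this sketch). */
typedef void (*vec_kernel)(const float *a, const float *b, float *c, size_t n);

/* Plain scalar kernel, used here only as something to time. */
void scalar_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Runs the kernel `repeats` times and returns the mean CPU time per run in ms. */
double mean_time_ms(vec_kernel kernel, const float *a, const float *b,
                    float *c, size_t n, int repeats) {
    assert(repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        kernel(a, b, c, n);
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Speed-up factor of a faster time relative to a baseline time. */
double speedup(double baseline_ms, double faster_ms) {
    return baseline_ms / faster_ms;
}
```

Plugging in the numbers quoted above, speedup(12.03, 9.44) comes out at roughly 1.27. Timings of a few milliseconds are noisy, so averaging over many repeats gives a steadier figure than a single run.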

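// Excerpt from the SSE version: a, b, c and VECTOR_SIZE come from the full program.
// Each iteration handles `packed` floats, so the loop runs VECTOR_SIZE / packed times.
int packed = 4; // number of 32-bit floats in one 128-bit XMM register
for (int i = 0; i < VECTOR_SIZE / packed; i++) {
    __m128 vec_a = _mm_loadu_ps(a + i * packed); // load a[4i .. 4i+3]
    __m128 vec_b = _mm_loadu_ps(b + i * packed); // load b[4i .. 4i+3]
    __m128 vec_c = _mm_add_ps(vec_a, vec_b);     // four additions at once
    _mm_storeu_ps(c + i * packed, vec_c);        // store into c[4i .. 4i+3]
}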

The key to the performance gain in SIMD is the reduction in loop iterations: several data elements are processed at once, which maximizes throughput. The more data that can be packed into one register, the fewer iterations are needed and the greater the performance gain.

That is why SIMD technology has evolved by widening the registers that its instructions operate on.
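The iteration count only drops to exactly a quarter when the vector length is a multiple of four. For other lengths, a common pattern is to run the packed loop as far as it goes and finish the leftover elements one by one. The sketch below assumes that pattern; iteration_count and vec_add_any_length are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* Loop iterations needed for n floats: n / 4 packed steps plus n % 4 scalar steps. */
size_t iteration_count(size_t n) {
    return n / 4 + n % 4;
}

/* c[i] = a[i] + b[i] for any n, not just multiples of four. */
void vec_add_any_length(const float *a, const float *b, float *c, size_t n) {
    assert(a && b && c);
    size_t i = 0;
    /* Packed part: four elements per iteration while a full group remains. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    /* Scalar tail: at most three leftover elements. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

For the 2048 * 2048 = 4,194,304 elements used in the measurement, this means 1,048,576 iterations instead of 4,194,304.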


Nowadays, CPUs come equipped with SIMD units to speed up floating-point operations. SIMD is also widely used to speed up arithmetic and advanced numeric operations. Besides GPUs, a CPU with SIMD can be another option for hardware acceleration.


[1] NVIDIA Tesla V100 Specifications,

