Searching Gigabytes of Data Per Second With SIMD

Josh Weinstein
Published in The Startup
Aug 1, 2020

SIMD. Single instruction, multiple data. You may not have heard of these four words before, but they have the power to make software run at lightning speed. They can accelerate actions like copying or searching data to 10x, 20x, or even greater speeds than traditionally written code. The CPUs that power our computers today possess special sets of instructions that can process multiple pieces of data simultaneously, in parallel. In fact, these instruction sets have been around for a number of years. They are seldom explored or discussed, but they have the potential to provide unparalleled performance in a world of ever-growing demands on software.

Using them, though, is a bit easier said than done. Most programming works like this: you write code in a programming language to automate a task. That code is either compiled to instructions that run on a virtual machine or, in the case of languages like C or C++, compiled down to assembly, with a few more steps to form an executable. A programmer's role is to write programs in the programming language, leaving the process of transforming code into instructions to a compiler. Many high-level languages like Ruby or Python also provide wrappers around cross-platform system-level functions, so the programmer may not even be aware of which operating system their code runs on.

SIMD is not a feature of programming languages in any traditional sense. It dives far below the comforting layers of compilers, virtual machines, and even operating systems: SIMD runs at the processor level. SIMD is a paradigm of assembly instructions executed on CPUs that have special registers for handling vectors. The ordinary assembly instructions generated by compilers are called SISD, standing for single instruction, single data. One instruction, such as an addition, operates on one designated data value at a time. SIMD, on the other hand, operates on fixed-size groups of data called vectors. The pair of values at each position in the two vectors undergoes the same operation, simultaneously. The diagram below illustrates the comparison.

Fortunately, you don’t need to write assembly in order to use SIMD instructions. Several C and C++ compilers, like Clang and GCC, offer intrinsics: function-like constructs that the compiler replaces with specific SIMD instructions.

Vectorization: Is It Automatic?

One question is: can compilers just use SIMD instructions in place of SISD instructions? The answer is complicated. Automatic use of SIMD instructions by compilers is an area of academic research called automatic vectorization. Here, “automatic” means the compiler itself figures out which SIMD instructions to generate for the code you write in a programming language. In some cases this is possible, such as looping over an array of integers and summing them, or applying an identical action to each element. However, this approach runs into serious limitations when the loop contains many conditional statements or complex control flow. A method of processing data as single values does not always translate to a method of processing data as vectors, in the same way some languages have phrases that translate poorly, or not at all, into others.
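For instance, a loop like the following is a classic candidate for automatic vectorization, since it applies one independent operation uniformly to every element. (Whether it actually vectorizes depends on the compiler and flags; this sketch assumes an optimizing build such as gcc or clang with -O3.)

```c
#include <stddef.h>

/* Summing an array: the same operation applied to every element, with no
 * branching in the loop body. Optimizing compilers can often turn this
 * into SIMD instructions on their own. Add data-dependent branches and
 * that usually stops working. */
long sum_ints(const int *values, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += values[i];
    return sum;
}
```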

Programming with SIMD means switching the way you think to using vectors and groups of data. This article demonstrates how you can rethink a problem to be solved with vectors, and take the performance of your code to levels you never thought possible.

Data, Strings, and Counting

Big data processing and searching often deal with text contained in string objects. One problem we can use for a test run is counting characters in strings. A string is usually composed of bytes in some encoding, like UTF-8. A very simple task to perform on a string is to count the number of times some character appears in it. A traditional approach would loop over each character, compare it to the target character being counted, and increment a counter on a match.

The difference is that SIMD has no control flow corresponding to if { ... } else { ... } style statements in programming languages, primarily because such conditional statements are designed for single data flows. SIMD instead uses masks: vectors whose bits represent the boolean results of comparison operations. For example, the following diagram represents an equality operation, where the 8-bit unsigned integers of two vectors are compared.

Masks

The result is stored in the grey vector. For positions where the values were equal, all the bits of that position in the result vector are set. Where the two values in the operand vectors were not equal, all the bits of that position in the result vector are unset. Think of this as saying “true means one” and “false means zero”. Mask vectors can then be used in other SIMD operations to control the flow of data from one vector to another.
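The equality operation in the diagram can be sketched with SSE2 intrinsics as follows; the function name and the byte-array interface are my own, chosen so the per-lane 0xFF/0x00 results are easy to inspect:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Compare two 16-byte vectors lane by lane. Each equal lane becomes 0xFF
 * (all bits set) in `out`; each unequal lane becomes 0x00 (all bits unset). */
void equality_mask(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_cmpeq_epi8(va, vb));
}
```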

For this problem, we aren’t really interested in data flow specifically. We want to determine the count of a specific character within a string, so we need a way to take the mask vector produced by the comparison operation and turn it into a count of the matching positions. We will use two instructions from a fairly basic instruction set called SSE2. This instruction set operates on 128-bit vectors and is part of the x86 architecture found in Intel and AMD processors. For strings of 8-bit characters, that means SSE2 can process 16 characters in each operation. There are instruction sets capable of processing far more than 16 characters at a time, but those are beyond the scope of this article.

The next two instructions needed for our solution convert the 128-bit mask into a 16-bit mask, then count the number of set bits in that 16-bit mask. These are commonly referred to as the “move mask” and “pop count” instructions, respectively. In the diagram below, the first row of boxes represents the 8-bit integers within a 128-bit vector. The value FF is the hexadecimal value for 255, indicating all bits are set; conversely, the value 00 indicates no bits are set.
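Those two steps can be sketched in isolation like this, hand-building a mask with four matching lanes rather than taking one from a real comparison (the helper name is mine):

```c
#include <immintrin.h>  /* _mm_movemask_epi8 and _popcnt32 */

/* "Move mask" then "pop count": compress a 128-bit comparison mask into a
 * 16-bit integer (one bit per 8-bit lane, taken from each lane's top bit),
 * then count how many of those bits are set. */
int count_mask_lanes(__m128i mask) {
    int bits = _mm_movemask_epi8(mask);  /* 16 bits, one per lane */
    return _popcnt32(bits);              /* number of lanes that were FF */
}
```

For a mask whose four lowest lanes are FF, _mm_movemask_epi8 yields binary 1111 and _popcnt32 returns 4. Compiling _popcnt32 typically requires popcnt support (e.g. gcc/clang -mpopcnt).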

Now that the sequence of steps in the SIMD solution has been visualized and understood, let’s look at the code involved for both the normal and the SIMD approach.

The Code

First, the typical C approach to counting characters in a string. The intention here is to make the two functions as similar as possible, apart from the use of regular counting code in one and SIMD intrinsics in the other. In any software performance test, it’s vital to control other variables as tightly as possible.
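A minimal sketch of such a function, assuming a null-terminated C string (the name csv_count_norm comes from the test output later in the article; the exact signature is an assumption):

```c
#include <stddef.h>

/* Count the occurrences of `target` in the null-terminated string `text`. */
size_t csv_count_norm(const char *text, char target) {
    size_t count = 0;
    while (*text != '\0') {   /* stop once every character has been read */
        if (*text == target)  /* on a match, bump the match counter */
            count++;
        text++;               /* always consume one character and advance */
    }
    return count;
}
```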

Here, a while loop is used, with a termination condition of having read all the characters in the C string. If the current character matches the target character, the match count is incremented by one. In either case, the consumed-character counter is incremented and the pointer is advanced.

Now, for the SIMD function.
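The following is a sketch under a few stated assumptions: the name csv_count_simd128 comes from the test output later in the article, the signature is my guess, I use the unaligned load _mm_loadu_si128 so the sketch works for arbitrary pointers (the article discusses the aligned _mm_load_si128), and the comparison is an equality compare with _mm_cmpeq_epi8. A scalar loop handles any tail shorter than 16 bytes.

```c
#include <immintrin.h>  /* SSE2 intrinsics and _popcnt32 */
#include <stddef.h>

/* Count occurrences of `target` in `text` (of length `len`), 16 bytes at
 * a time. */
size_t csv_count_simd128(const char *text, size_t len, char target) {
    /* Broadcast the target character into all 16 lanes, once, up front. */
    __m128i needle = _mm_set1_epi8(target);
    size_t count = 0;
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        /* Load the next 16 characters from memory into a vector register. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(text + i));
        /* Per-lane equality mask: 0xFF where chunk == needle, else 0x00. */
        __m128i mask = _mm_cmpeq_epi8(chunk, needle);
        /* Compress the mask to 16 bits, then count the matching lanes. */
        count += (size_t)_popcnt32(_mm_movemask_epi8(mask));
    }
    /* Any leftover tail shorter than 16 bytes is counted the scalar way. */
    for (; i < len; i++)
        if (text[i] == target)
            count++;
    return count;
}
```

Compiling this typically needs popcnt support enabled, e.g. gcc/clang with -msse4.2 or -mpopcnt.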

This might look a bit complicated, as there are some components I haven’t yet explained. First, __m128i is a data type meant to embody a 128-bit SIMD register within the context of a language like C. In reality, an x86 CPU has specific, named registers for 128-bit SIMD operations, but the compiler here picks and chooses which registers to use on its own. The function call expressions beginning with _mm_ are compiler intrinsics for the x86 SIMD instructions. They let the compiler inline specific instructions that perform, in this scenario, 128-bit vector operations. They are not true function calls; they make no use of the stack pointer or base pointer registers.

The intrinsic _mm_set1_epi8 sets all sixteen 8-bit integers within a vector to a specific value. This step only needs to be done once, before the loop, because we will always be comparing against the same character throughout the loop. _mm_load_si128 loads 128 bits of data from memory into a 128-bit register; for data to undergo SIMD operations of any size or variety, it must first be explicitly loaded from memory into such a register. The next two intrinsics, _mm_cmpeq_epi8 and _mm_movemask_epi8, compare each chunk against the vector filled with the target character, and convert the resulting mask into the low 16 bits of a 32-bit int, respectively.

The _popcnt32 intrinsic counts the number of set bits in an integer. It does not have the _mm_ prefix because it does not operate on 128-bit registers; it operates on an ordinary integer in a general-purpose register.

For reference, you can view a list of Intel’s x86 intrinsics in the Intel Intrinsics Guide.

To measure the performance of these two functions, I will use a more precise timing facility specific to Unix-like operating systems. The timing function uses code from the <sys/time.h> header and measures time in microseconds, as follows:
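One possible form of that helper, a sketch built on gettimeofday from <sys/time.h> (the article’s exact code isn’t shown, and the function name is mine):

```c
#include <sys/time.h>

/* Current wall-clock time in microseconds since the Unix epoch. */
long long micros_now(void) {
    struct timeval tv;
    gettimeofday(&tv, (void *)0);
    return (long long)tv.tv_sec * 1000000LL + (long long)tv.tv_usec;
}
```

Timing a run is then a subtraction of two calls: take micros_now() before the function under test, again after it, and report the difference.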

The Test

The test for each function will be to count the number of lines in a CSV document that is exactly 1.6 GB, or 1,600,000,000 bytes, long. Since we only want to measure the counting portion of each function, the CSV document will exist as a string in memory, to avoid the variability of reading from disk. The number of lines in a CSV document corresponds to the number of newline characters, '\n', minus one.

For this test, I am using a 2.7 GHz quad-core Intel Core i7 processor. Performance tests of any kind, but especially of SIMD, are highly sensitive to changes in the environment, such as the processor, other applications running on the machine, and more. After running the same code 50 times, my average result was the following:

sse2 csv perf test
Using data size: 1600000000 bytes
Running: 'csv_count_norm' took 3922641 u/s
Running: 'csv_count_simd128' took 702249 u/s

My best result was:

sse2 csv perf test
Using data size: 1600000000 bytes
Running: 'csv_count_norm' took 3810724 u/s
Running: 'csv_count_simd128' took 658831 u/s

That means at best, the SIMD solution processed about 2.4 GB of data per second, while the regular solution processed only about 0.4 GB per second: nearly a sixfold speedup. That’s also using only a single core, with no additional threads.

In the end, SIMD can make an incredible improvement to the performance of programs that process and search big data. Using just the processor available on my Mac, and one of the more basic SIMD instruction sets, I was able to achieve a sixfold increase in performance over a traditionally written C function.

It doesn’t end there, however. There are SIMD instruction sets that can process 256 or even 512 bits of data at a time. In the basic example problem tested here, that means it might be possible to double or quadruple the speedup again over the default approach. Using SIMD instruction sets truly makes the sky the limit for software performance.
