An Eye-opening Single-core Parallelism Test on M1 Mac

Hong Jiang
Feb 23 · 5 min read

Since the M1 Macs were released, there have been numerous benchmarks comparing them with the Intel Macs. Most of the results heavily favor the M1 in both performance and power efficiency. When I received my new M1 MacBook Air, I was interested in testing one of the theories that explain why the M1 is so performant.

In computer engineering, a technique called dynamic execution (a.k.a. out-of-order execution) is used to parallelize the different stages of instruction processing (fetch, decode, dispatch, execute, etc.), so that a single CPU core can process as many instructions as possible at the same time in the instruction pipeline. With this technique, the processor executes instructions in an order determined by the availability of input data and processing units, rather than by their original order in the program. This requires that the CPU be able to fetch and decode a few instructions ahead of the one currently being executed.
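To make this concrete, here is a small illustration of my own (not from the article): in the first group of statements each line consumes the previous result, so the core must run them one after another; in the second group the statements are independent, so an out-of-order core can execute them at the same time.

    #include <stdio.h>

    int main(void) {
        long a = 1, b = 2, c = 3;

        /* A serial dependency chain: each statement needs the result of
           the previous one, so these cannot overlap in the pipeline. */
        long x = a + b;
        long y = x * 2;   /* waits for x */
        long z = y - c;   /* waits for y */

        /* Two independent statements: no value flows between them, so an
           out-of-order core is free to execute both simultaneously. */
        long p = a + b;
        long q = c * 2;

        printf("%ld %ld %ld %ld %ld\n", x, y, z, p, q);
        return 0;
    }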

The M1 uses the ARM instruction set, in which all instructions are 32 bits long, so it is relatively easy to fetch the next few instructions. Intel processors, however, use the much more complex x86 instruction set with variable instruction lengths, so it is generally not possible to determine where the next instruction starts without decoding the current one. Intel uses a brute-force technique that tries to fetch at all possible instruction boundaries and sees which one turns out to be correct, which makes the processor both more complicated and less efficient. I wanted to test how much this matters and whether it contributes to the M1’s performance gains.

The idea is to write two simple tasks, A and B, each of which consists of instructions that must be executed sequentially because of data dependencies. Run each of them in a loop 100 million times, and let the times consumed be t1 and t2 respectively. Then interlace the instructions from A and B to create task C. Because the instructions in C come from two independent tasks, this creates opportunities for them to be executed in parallel. Run C the same number of times and let the time consumed be t3. Because the loop adds some overhead, and half of the instructions in C can be opportunistically executed at the same time as the other half, we should expect t3 < t1 + t2. How much less t3 is than t1 + t2 shows how effective dynamic execution is. (A sketch of the full experiment appears after Task C below.)

The full program is on GitHub if you’re interested. This is my task A:

    i = 0;
    a = 0;
    while (i < n) {
        if (i % 3 == 0) {
            a += i*2;
        } else {
            a -= i;
        }
        i++;
    }

It is easy to see that each step depends on the result of the previous one and must be executed sequentially. Because we want to focus on the CPU, the code is simple enough that all variables fit in registers, but complex enough to throw off some common optimization techniques that might interfere with the test. In the full program I also output a after the loop as a side effect, to make sure the entire thing isn’t optimized away by the compiler. Before running the test, the program has a warm-up phase to make sure all data are loaded into the CPU cache, minimizing memory access.
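The article doesn’t show what that guard looks like. A minimal sketch of the idea, assuming a volatile sink plus a final print (my construction, not necessarily what the GitHub program does), could be:

    #include <stdio.h>

    volatile long sink;  /* compiler must assume this is read elsewhere */

    int main(void) {
        long n = 100000000L;  /* 100 million iterations */
        long a = 0, i = 0;
        while (i < n) {       /* same shape as Task A */
            if (i % 3 == 0) { a += i*2; } else { a -= i; }
            i++;
        }
        /* Without an observable side effect like this, an optimizing
           compiler would be free to delete the loop entirely. */
        sink = a;
        printf("a = %ld\n", sink);
        return 0;
    }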

Task B is quite similar:

    j = 0;
    b = 0;
    while (j < n) {
        if (j % 7 == 0) {
            b += j*6;
        } else {
            b -= j;
        }
        j++;
    }

Task C is just A and B interlaced together:

    i = 0;
    j = 0;
    a = 0;
    b = 0;
    k = 0;
    while (k < n) {
        if (i % 3 == 0) {
            a += i*2;
        } else {
            a -= i;
        }
        i++;
        if (j % 7 == 0) {
            b += j*6;
        } else {
            b -= j;
        }
        j++;
        k++;
    }
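For readers who don’t want to dig through the repository, here is a self-contained sketch of how the whole experiment could be wired together. This is my reconstruction from the description above, not the author’s actual program: the function names, the CLOCK_MONOTONIC timer, the microsecond units, and the warm-up arrangement are all my assumptions.

    #include <stdio.h>
    #include <time.h>

    #define N 100000000L  /* 100 million iterations, as in the article */

    volatile long sink;  /* keeps results observable so the loops survive -O2 */

    /* Task A: one serial dependency chain on a and i. */
    static void task_a(long n) {
        long i = 0, a = 0;
        while (i < n) {
            if (i % 3 == 0) { a += i*2; } else { a -= i; }
            i++;
        }
        sink = a;
    }

    /* Task B: the same shape with different constants. */
    static void task_b(long n) {
        long j = 0, b = 0;
        while (j < n) {
            if (j % 7 == 0) { b += j*6; } else { b -= j; }
            j++;
        }
        sink = b;
    }

    /* Task C: A and B interlaced, i.e. two independent dependency chains. */
    static void task_c(long n) {
        long i = 0, j = 0, a = 0, b = 0, k = 0;
        while (k < n) {
            if (i % 3 == 0) { a += i*2; } else { a -= i; }
            i++;
            if (j % 7 == 0) { b += j*6; } else { b -= j; }
            j++;
            k++;
        }
        sink = a + b;
    }

    /* Wall-clock time of one run, in microseconds. */
    static long time_task(void (*task)(long)) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        task(N);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1000000L
             + (t1.tv_nsec - t0.tv_nsec) / 1000L;
    }

    int main(void) {
        task_a(N); task_b(N); task_c(N);  /* warm-up runs */
        long t1 = time_task(task_a);
        long t2 = time_task(task_b);
        long t3 = time_task(task_c);
        printf("t1=%ld us  t2=%ld us  t3=%ld us  savings=%.1f%%\n",
               t1, t2, t3, 100.0 * (1.0 - (double)t3 / (double)(t1 + t2)));
        return 0;
    }

Compiling with optimizations enabled (e.g. clang -O2) matters here; without them, loop overhead and stack traffic would drown out the pipeline effect the test tries to isolate.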

When I ran the program on an x86_64 machine, the result wasn’t surprising:

[Figure: Time consumed on x86_64]

t1 + t2 was about 750,000, and t3 was about 18.3% less than that. There were definitely some savings contributed by dynamic execution.

When I ran the exact same program (compiled to a different binary for ARM, of course) on my M1 MacBook Air, the result was quite unexpected:

[Figure: Time consumed on M1]

The time it took to run C was only slightly longer than the time it took to run either A or B! I tried this experiment multiple times, and the results were consistently close. This implies that dynamic execution achieves near-perfect efficiency for this particular program on the M1, assuming there wasn’t some big flaw in this test.

Putting the numbers from both platforms together makes the difference more obvious:

I expected the M1 to be better at optimizing the pipeline, but the difference was much more significant than I thought. I plan to investigate further and confirm the results in assembly code when I have more time. You might see an article by me (or someone else if they beat me to it) later talking about how this article was wrong. ;-)
