Compiler Optimizations: Boosting code performance without doing much!

A hands-on showdown of compiler flags

Durganshu Mishra
Nerd For Tech
13 min read · Nov 21, 2023


Unlocking peak performance from your C++ code can be daunting, demanding meticulous profiling, intricate memory access adjustments, and cache optimization. Is there a trick to simplify this a bit?? Fortunately, there is a shortcut to achieving remarkable performance gains with minimal effort — provided you have the right insights and know what you’re doing. Enter compiler optimizations that can significantly elevate your code’s performance.

Modern compilers serve as indispensable allies in this journey toward optimal performance, particularly in automatic parallelization. These sophisticated tools possess the prowess to scrutinize intricate code patterns, especially within loops, and execute optimizations seamlessly. This article aims to spotlight the potency of compiler optimizations, focusing on the Intel C++ compilers — renowned for their popularity and widespread usage.

In this story, we unravel the layers of compiler magic that can transform your code into a high-performance masterpiece, requiring less manual intervention than you might think.

Highlights: What are compiler optimizations? | -On | Architecture targeted | Interprocedural Optimization | -fno-alias | Compiler Optimization reports

What are compiler optimizations?

Compiler optimizations encompass various techniques and transformations a compiler applies to the source code during compilation. But why? To enhance the performance, efficiency, and, in some instances, the size of the resulting machine code. These optimizations are pivotal in influencing various aspects of code execution, including speed, memory usage, and energy consumption.

Every compiler executes a series of steps to convert high-level source code into low-level machine code: lexical analysis, syntax analysis, semantic analysis, intermediate code (IR) generation, optimization, and code generation.

During the optimization phase, the compiler meticulously seeks ways to transform a program, aiming for a semantically equivalent output that utilizes fewer resources or executes more rapidly. Techniques employed in this process encompass but are not limited to constant folding, loop optimization, function inlining, and dead code elimination.
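As a minimal sketch (a hypothetical helper, not taken from the Jacobi code below), here is the kind of rewrite the optimizer performs on its own — folding a constant expression and removing a computation whose result is never used:

// What the programmer writes:
double scaled(double x) {
  double factor = 2.0 * 3.14159; // constant expression
  double unused = x * x;         // result is never used
  return x * factor;
}

// Roughly what the optimizer produces after constant folding
// and dead code elimination:
double scaled(double x) {
  return x * 6.28318;
}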

I’m not going to discuss all the available techniques, but rather how we can instruct the compiler to perform specific optimizations that might improve code performance. So, the solution? Compiler flags.

Developers can specify a set of compiler flags during the compilation process, a practice familiar to those who have used options like “-g” or “-pg” with GCC for debugging and profiling information. As we go ahead, we’ll discuss similar compiler flags we can use while compiling our application with the Intel C++ compiler. These might help you improve your code’s efficiency and performance.


So, what are we working with?

I won’t delve into dry theory or inundate you with tedious documentation listing every compiler flag. Instead, let’s try to understand why and how these flags work.

How do we accomplish this???

We’ll take an unoptimized C++ function responsible for calculating a Jacobi iteration, and step by step, we’ll unravel the impact of each compiler flag. Along this exploration, we’ll measure the speedup by systematically comparing each iteration with the base version — starting with no optimization flags (-O0).

The speedups (or the time of execution) were measured on an Intel® Xeon® Platinum 8174 Processor machine. Here, the Jacobi method solves a 2D partial differential equation (Poisson equation) for modeling the heat distribution on a rectangular grid.

u(x, y, t) is the temperature at point (x, y) at time t, and it evolves according to the 2D heat equation:

∂u/∂t = ∂²u/∂x² + ∂²u/∂y²

We solve for the stable state, i.e., when the distribution isn’t changing anymore:

∂²u/∂x² + ∂²u/∂y² = 0

A set of Dirichlet boundary conditions is applied at the boundary.

We essentially have a C++ code performing Jacobi iterations on grids of variable sizes (which we call resolutions). A grid size of 500 means solving a matrix of size 500x500, and so on.
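Discretizing the equation on such a grid with the standard 5-point stencil (assuming unit grid spacing and no source term) gives the Jacobi update rule that the code implements:

unew(i, j) = 0.25 * ( u(i−1, j) + u(i+1, j) + u(i, j−1) + u(i, j+1) )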

The function for performing one Jacobi iteration is as follows:

/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
  int i, j;

  for (j = 1; j < sizex - 1; j++) {
    for (i = 1; i < sizey - 1; i++) {
      unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] + // left
                                    u[i * sizex + (j + 1)] + // right
                                    u[(i - 1) * sizex + j] + // top
                                    u[(i + 1) * sizex + j]); // bottom
    }
  }

  for (j = 1; j < sizex - 1; j++) {
    for (i = 1; i < sizey - 1; i++) {
      u[i * sizex + j] = unew[i * sizex + j];
    }
  }
}

We keep performing Jacobi iterations (inside a loop) until the residual falls below a threshold value. The residual calculation and threshold evaluation are done outside this function and are not of concern here. So, let’s talk about the elephant in the room now!

How does the base code perform?

With no optimizations (-O0), we get the following results:

Runtime in seconds and MFLOP/s for the base case (“-O0”)

Here, we measure the performance in terms of the MFLOP/s. This will be the basis of our comparison.

MFLOP/s stands for “Million Floating Point Operations Per Second.” It is a unit of measurement used to quantify the performance of a computer or processor in terms of floating-point operations. Floating-point operations involve mathematical calculations with decimal or real numbers represented in a floating-point format.

MFLOP/s is often used as a benchmark or performance metric, especially in scientific and engineering applications where complex mathematical calculations are prevalent. The higher the MFLOP/s value, the faster the system or processor is at performing floating-point operations.
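For this kernel, one plausible way to compute the metric (an assumption on my part — counting the 3 additions and 1 multiplication per updated interior grid point) is:

MFLOP/s = 4 × (sizex − 2) × (sizey − 2) × iterations / (runtime in seconds × 10⁶)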

Note 1: To provide a stable result, I ran the executable 5 times for each resolution and took the average of the MFLOP/s values.

Note 2: The default optimization level on the Intel C++ compiler is -O2, so -O0 must be specified explicitly when compiling the base version.
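For example, the baseline executable could be built with something like the following (assuming the classic icpc driver and a placeholder file name jacobi.cpp):

icpc -O0 jacobi.cpp -o jacobi_O0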

Let’s go ahead and see how these run times will vary as we try different compiler flags!

The most common ones: -O1, -O2, -O3 and -Ofast

These are some of the most commonly used compiler flags when one begins with compiler optimizations. In an ideal case, the performance ordering would be -Ofast > -O3 > -O2 > -O1 > -O0. However, this doesn’t necessarily happen in practice. The critical points of these options are as follows:

-O1:

  • Goal: Optimize for speed while avoiding code size increase.
  • Key Features: Suitable for applications with large code sizes, many branches, and where execution time isn’t dominated by code within loops.

-O2:

  • Enhancements over -O1:
    — Enables vectorization.
    — Allows inlining of intrinsics and intra-file interprocedural optimization.

-O3:

  • Enhancements over -O2:
    — Enables more aggressive loop transformations (fusion, block-unroll-and-jam).
    — These optimizations only consistently outperform -O2 when loop and memory access transformations actually take place; they can even slow down the code.
  • Recommended For:
    — Applications with loop-heavy floating-point calculations and large data sets.

-Ofast:

  • Sets the following flags:
    — “-O3”
    — “-no-prec-div”: enables optimizations that give fast but slightly less precise results than full IEEE division. For example, A/B is computed as A * (1/B) to improve the computation speed.
    — “-fp-model fast=2”: enables more aggressive floating-point optimizations.

The official guide talks in detail about exactly which optimizations these options offer.
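To try these levels, one simply swaps the optimization flag on the compile line, for instance (same assumed driver and placeholder file name as before):

icpc -O3 jacobi.cpp -o jacobi_O3
icpc -Ofast jacobi.cpp -o jacobi_Ofast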

When using these options on our Jacobi code, we obtain these execution run times:

Clearly, all these optimizations are much faster than our base code (with “-O0”): the execution run time is 2–3x lower than in the base case. What about MFLOP/s??

Well, that’s something!!!

There is a big difference between the MFLOP/s of the base case and those with the optimization.

Overall, “-O3” performs the best, though only slightly.

The extra flags used by “-Ofast” (“-no-prec-div -fp-model fast=2”) aren’t giving any additional speedup.

Architecture targeted (-xHost, -xCORE-AVX512)

The machine’s architecture stands out as a pivotal factor influencing compiler optimizations. When the compiler knows the available instruction sets and the optimizations supported by the hardware (like vectorization, SIMD), it can significantly enhance performance.

For instance, my Skylake machine has 3 SIMD units: 1 AVX-512 unit and 2 AVX2 units.

Can I really do something with this knowledge???

The answer lies in strategic compiler flags. Experimenting with options such as “-xHost” and, more precisely, “-xCORE-AVX512” may allow us to harness the full potential of the machine’s capabilities and tailor optimizations for optimal performance.

Here is a quick description of what these flags are all about:

-xHost:

  • Goal: Specifies that the compiler should generate code optimized for the host machine’s highest instruction set.
  • Key Features: Takes advantage of the latest features and capabilities available on the hardware. It can give an amazing speedup on the target system.
  • Considerations: While this flag optimizes for the host architecture, it might result in binaries that are not portable across different machines with varying instruction set architectures.

-xCORE-AVX512:

  • Goal: Explicitly instructs the compiler to generate code that utilizes the Intel Advanced Vector Extensions 512 (AVX-512) instruction set.
  • Key Features: AVX-512 is an advanced SIMD (Single Instruction, Multiple Data) instruction set that offers wider vector registers and additional operations compared to previous versions like AVX2. Enabling this flag allows the compiler to leverage these advanced features for optimized performance.
  • Considerations: Portability is again the concern here. Binaries generated with AVX-512 instructions may not run optimally on processors that do not support this instruction set; they may not run at all!

AVX-512 instructions use zmm registers, a set of 512-bit-wide registers that serve as the foundation for vector processing.

By default, “-xCORE-AVX512” assumes that the program is unlikely to benefit from zmm register usage. The compiler avoids using zmm registers unless a performance gain is guaranteed.

If one plans to use the zmm registers without any restrictions, “-qopt-zmm-usage” can be set to high. That’s what we’ll be doing as well.

Don’t forget to check the official guide for their detailed instructions.
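Putting this together, the compile line for this experiment might look roughly like this (again assuming the icpc driver and the placeholder file name):

icpc -O3 -xCORE-AVX512 -qopt-zmm-usage=high jacobi.cpp -o jacobi_avx512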

Let’s see how these flags work for our code:

Wohoo!
We now cross the 1200 MFLOP/s mark for the smallest resolution. The MFLOP/s values for other resolutions have also increased.

The remarkable part is that we achieved these results without any substantial manual interventions — simply by incorporating a handful of compiler flags during the application compilation process.

However, it is essential to highlight that the compiled executable will only be compatible with a machine using the same instruction set.

The optimization-versus-portability trade-off is evident, as code optimized for a particular instruction set may sacrifice portability across different hardware configurations. So, make sure you know what you’re doing!!

Note: Don’t worry if your hardware doesn’t support AVX-512. Intel C++ Compiler supports optimizations for AVX, AVX-2 and even SSE. The documentation has everything you need to know!

Interprocedural Optimization (IPO)

Interprocedural Optimization involves analyzing and transforming code across multiple functions or procedures, looking beyond the scope of individual functions.

IPO is a multi-step process focusing on the interactions between different functions or procedures within a program. IPO can include many different kinds of optimizations, including Forward substitution, Indirect call conversion, and Inlining.

The Intel compiler supports two common types of IPO: single-file compilation and multi-file compilation (whole program optimization) [3]. Two compiler flags correspond to these, one for each:

-ipo:

  • Goal: Enables interprocedural optimization, allowing the compiler to analyze and optimize the entire program, beyond individual source files, during compilation.
  • Key Features:
    — Whole program optimization: “-ipo” performs analysis and optimization across all source files, considering the interactions between functions and procedures throughout the entire program.
    — Cross-function and cross-module optimization: the flag facilitates function inlining, synchronization of optimizations, and data-flow analysis across different parts of the program.
  • Considerations: It requires a separate link step. After compiling with “-ipo”, a dedicated link step is needed to generate the final executable; the compiler performs additional optimizations based on the whole-program view during linking.

-ip:

  • Goal: Enables interprocedural analysis and propagation, allowing the compiler to perform some interprocedural optimizations without requiring a separate link step.
  • Key Features:
    — Analysis and propagation: “-ip” enables the compiler to perform analysis and data propagation across different functions and modules during compilation. However, it does not perform all the optimizations that require a full-program view.
    — Faster compilation: Unlike “-ipo”, “-ip” doesn’t necessitate a separate linking step, resulting in speedier compilation times. This can be beneficial during development, when quick feedback is essential.
  • Considerations: Only some limited interprocedural optimizations occur, including function inlining.

“-ipo” generally provides more extensive interprocedural optimization capabilities, as it involves a separate link step, but comes at the cost of longer compilation times [4].

“-ip” is a quicker alternative that performs some interprocedural optimizations without requiring a separate link step, making it suitable for development and testing phases [5].

Since we’re only concerned with performance and different optimizations, and not with compile times or the size of the executable, we’ll focus on “-ipo”.
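Because “-ipo” defers part of the optimization work to link time, a typical build separates compilation and linking, with the flag present in both steps (file names are placeholders; a single-file build works the same way):

icpc -O3 -ipo -c jacobi.cpp -o jacobi.o
icpc -O3 -ipo -c main.cpp -o main.o
icpc -O3 -ipo jacobi.o main.o -o jacobi_ipo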

-fno-alias

All the above optimizations depend on how well you know your hardware and how much you’re willing to experiment. But that’s not all. If we try to see our code the way the compiler sees it, we may identify other potential optimizations.
Let’s have another look at our code:

/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
  int i, j;

  for (j = 1; j < sizex - 1; j++) {
    for (i = 1; i < sizey - 1; i++) {
      unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] + // left
                                    u[i * sizex + (j + 1)] + // right
                                    u[(i - 1) * sizex + j] + // top
                                    u[(i + 1) * sizex + j]); // bottom
    }
  }

  for (j = 1; j < sizex - 1; j++) {
    for (i = 1; i < sizey - 1; i++) {
      u[i * sizex + j] = unew[i * sizex + j];
    }
  }
}

The jacobi() function takes two pointers to double as parameters and operates on them inside the nested for loops. When any compiler sees this function in the source file, it has to be very careful.

Why??

The expression to calculate unew using u involves the average of 4 neighboring u values. What if both u and unew point to the same location?
This would become the classical problem of aliased pointers [7].

Modern compilers are very smart, and to ensure safety, they assume that aliasing could be possible. For scenarios like this, they avoid any optimization that may impact the semantics and the output of the code.

In our case, we know that u and unew are different memory locations and are meant to store different values. So, we can easily let the compiler know there won’t be any aliasing here.

How do we do that?

There are two methods. The first is the C “restrict” keyword, but it requires changing the code. We don’t want that for now.
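For reference, that change would look roughly like the sketch below — qualifying the pointer parameters so the compiler may assume they never overlap (“__restrict” is the common C++ spelling of the C99 “restrict” keyword and is accepted by the Intel compiler):

void jacobi(double *__restrict u, double *__restrict unew,
            unsigned sizex, unsigned sizey);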

Anything simple? Let’s try “-fno-alias”.

-fno-alias:

  • Goal: Instructs the compiler not to assume aliasing in the program.
  • Key Features: Assuming no aliasing, the compiler can optimize the code more freely, potentially improving performance.
  • Considerations: The developer has to be careful with this flag: if any unwarranted aliasing does occur, the program may produce unexpected outputs.

More details can be found in the official documentation.
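Assuming the flags from the previous sections are kept and “-fno-alias” is simply appended, the compile line for this experiment could look something like:

icpc -O3 -xCORE-AVX512 -qopt-zmm-usage=high -ipo -fno-alias jacobi.cpp -o jacobi_noalias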

How does this perform for our code?

Well, now we have something!!!

We’ve achieved a remarkable speedup here, nearly 3x over the previous optimizations. What’s the secret behind this boost?

By instructing the compiler not to assume aliasing, we’ve given it the freedom to unleash powerful loop optimizations.

A closer examination of the assembly code (though not shared here) and the generated compiler optimization report (see below) reveals the compiler’s savvy application of loop interchange and loop unrolling. These transformations contribute to a highly optimized performance, showcasing the significant impact of compiler flags on code efficiency.

Final graphs

This is how all the optimizations perform against each other:

Compiler Optimization report (-qopt-report)

The Intel C++ compiler provides a valuable feature that allows users to generate an optimization report summarizing all the adjustments made for optimization purposes [8]. This comprehensive report is saved in the YAML file format, presenting a detailed list of optimizations applied by the compiler within the code. For a detailed description, see the official documentation on “-qopt-report”.
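Generating the report is just a matter of adding the flag to the existing compile line, for example (the exact output format and default verbosity depend on the compiler version):

icpc -O3 -fno-alias -qopt-report jacobi.cpp -o jacobi_report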

What next?

We discussed a handful of compiler flags that can drastically improve the performance of our code without us actually doing much. The only prerequisite: don’t do anything blindly; make sure you know what you’re doing!!

There are hundreds of such compiler flags, and this story covers only a handful. So it is worth looking at your preferred compiler’s official guide (especially the documentation related to optimization).

Apart from these compiler flags, there are a whole bunch of techniques like Vectorization, SIMD intrinsics, Profile Guided Optimization and Guided Auto Parallelism which can amazingly improve the performance of your code.

Similarly, Intel C++ compilers (and all the popular ones) also support pragma directives, which are very handy. It’s worth checking out some of the pragmas like ivdep, parallel, simd, and vector in the Intel-Specific Pragma Reference.


Suggested reads

[1] Optimization and Programming (intel.com)

[2] High Performance Computing with “Elwetritsch” at the University of Kaiserslautern-Landau (rptu.de)

[3] Interprocedural Optimization (intel.com)

[4] ipo, Qipo (intel.com)

[5] ip, Qip (intel.com)

[6] Intel Compiler, Optimization and Other flags for use by SPEChpc

[7] Aliasing — IBM Documentation

[8] Intel® Compiler Optimization Reports
