Code Optimization Techniques

Riddhi Pandya
12 min read · Jun 2, 2020


Below are techniques for loop optimization, followed by some other general optimization techniques.

Inline Function

The compiler can be instructed to make functions inline, so that it replaces each call with the function body. This saves the overhead of pushing and popping variables on the stack for the call, and also removes the overhead of returning from the function. At the same time, inlining tends to increase code size.
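As a minimal sketch (the function names here are illustrative, not from the original):

```c
#include <stdint.h>

/* 'inline' asks the compiler to substitute the function body at each
   call site, avoiding the call/return and stack push/pop overhead.
   The compiler is free to ignore the hint. */
static inline uint32_t square(uint32_t x)
{
    return x * x;
}

uint32_t sum_of_squares(uint32_t a, uint32_t b)
{
    /* Both calls can be expanded in place, at the cost of code size. */
    return square(a) + square(b);
}
```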

Loop jamming

Loop jamming merges adjacent loops that run over the same range. It reduces code size and makes code run faster as well, by eliminating the repeated loop-iteration overhead.
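A minimal sketch, with hypothetical arrays, of jamming two loops that run over the same range:

```c
#define N 100

/* Before jamming (two loops, each paying its own increment, compare,
   and branch):
       for (i = 0; i < N; i++) a[i] = 0;
       for (i = 0; i < N; i++) b[i] = i;
   After jamming, one loop does both, halving the iteration overhead. */
void clear_and_index(int a[N], int b[N])
{
    for (int i = 0; i < N; i++) {
        a[i] = 0;
        b[i] = i;
    }
}
```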

Generic Loop Hoisting

To improve the performance of inner loops, hoist redundant constant calculations out of the loop so they are computed only once.
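For example (a hypothetical gain/offset computation, hoisted out of the loop):

```c
void apply_gain(float *out, const float *in, int n, float gain, float offset)
{
    /* 'factor' does not change inside the loop, so compute it once
       instead of re-evaluating gain * 0.5f + offset on every pass. */
    const float factor = gain * 0.5f + offset;

    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```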

Loop Overhead

The MCU has a conditional branch mechanism that works well when counting down from a positive number to zero. It is simple but effective:

for (I = 0; I < 10; I++) { … }

If we needn’t care about the order of the loop counter, we can do this instead:

for (I = 10; I--; ) { … }

This works because it is quicker to process I-- as the test condition, which says “Is I non-zero? If so, decrement it and continue”. For the original code, the processor has to “Compare I with 10. Is I still less than 10? If so, increment I and continue.” In tight loops, this makes a considerable difference.

Loop fusion, loop fission

There may be cases in which two separate loops can be merged (loop fusion), and cases in which a single loop can be split into two (loop fission). The following is an example:
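The original figure is not reproduced here; a hypothetical sketch of fission, splitting one loop into two:

```c
#define N 1024

/* Single loop: the two computations are interleaved, so accesses to
   p and q alternate. */
void combined(int p[N], int q[N])
{
    for (int j = 0; j < N; j++) {
        p[j] = 2 * j;
        q[j] = j * j;
    }
}

/* After fission: each loop touches only one array, improving the
   locality of the references to p, and the two loops are independent,
   so they could run in parallel. */
void split(int p[N], int q[N])
{
    for (int j = 0; j < N; j++)
        p[j] = 2 * j;
    for (int j = 0; j < N; j++)
        q[j] = j * j;
}
```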

The split (fissioned) version might lead to improved cache behavior (due to the improved locality of references to array p), and it also increases the potential for parallel computation within the loop body.

Loop permutation

For row-major layout, it is usually beneficial to organize loops such that the last index corresponds to the innermost loop. A corresponding loop permutation is shown in the following example:
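The figure is not reproduced here; a sketch of the permutation for a row-major C array:

```c
#define ROWS 64
#define COLS 64

/* Before: the first index varies fastest, so successive accesses are
   COLS elements apart in memory. */
void init_column_order(int a[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = 0;
}

/* After permutation: the last index (j) runs in the innermost loop,
   so successive accesses touch adjacent memory locations. */
void init_row_order(int a[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = 0;
}
```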

Caches are normally organized such that adjacent locations can be accessed significantly faster than locations that are further away from the previously accessed location.

Explicit Parallelism

Four-way unrolling can be chosen, for example, to exploit a four-stage fully pipelined floating-point adder: each stage of the adder is occupied on every clock cycle, ensuring maximum sustained utilization.
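A sketch of such an unrolled summation (assuming the iteration count is a multiple of four):

```c
/* Four independent partial sums let a four-stage pipelined adder
   accept a new operation every cycle, instead of stalling on the
   single dependency chain of s += x[i]. */
float sum4(const float *x, int n)       /* n must be a multiple of 4 */
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```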

However, unrolling increases code size. Unrolling is normally restricted to loops with a constant number of iterations.

Loop Inversion

Loop inversion improves performance at the loop boundary. Rather than checking the loop condition at the top of the loop, check it at the bottom, so the processor does not have to jump back to the top on the iteration where the condition finally fails. For that purpose, use an if guard plus do…while instead of a plain while.
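A sketch of the inversion on a hypothetical 100-iteration loop:

```c
/* Original: the condition is tested at the top, so every iteration
   ends with a jump back to the test. */
void fill_while(int a[100])
{
    int i = 0;
    while (i < 100) {
        a[i] = i;
        i++;
    }
}

/* Inverted: an if guards the first entry, and the do...while tests at
   the bottom, so the final iteration simply falls through. */
void fill_inverted(int a[100])
{
    int i = 0;
    if (i < 100) {
        do {
            a[i] = i;
            i++;
        } while (i < 100);
    }
}
```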

When the value of i is 99, the processor need not jump (this was required in the old code). Processors also have a special decrement instruction, and a test against zero is faster than a comparison of two different numbers, so execution time is reduced.

Loop Invariant Computations

Instructions that produce the same value on every iteration (loop-invariant computations) should be moved outside the loop, so they are not recalculated on every pass.
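For example, a string length that is invariant across iterations can be hoisted out of the loop condition:

```c
#include <string.h>

void copy_string(char *dst, const char *src)
{
    /* strlen(src) yields the same value on every iteration, so hoist
       it out rather than calling it in the loop condition each time. */
    size_t len = strlen(src);

    for (size_t i = 0; i < len; i++)   /* not: i < strlen(src) */
        dst[i] = src[i];
    dst[len] = '\0';
}
```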

Use Lookup Table

A Look-Up Table (LUT) is an array that holds a set of pre-computed results for a given operation, providing access to a result faster than computing it each time. LUTs accelerate operations that can be expressed as functions of an integer argument: the program immediately obtains a stored result instead of repeatedly performing the same time-consuming computation.

However, we have to be careful before deciding to use LUTs to solve a particular problem. We should carefully evaluate the cost associated with their use, in particular the memory space required to store the pre-computed results.
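A minimal sketch: squares of small integers, pre-computed once:

```c
#include <stdint.h>

/* 16 bytes of storage buy a single array access per lookup instead of
   a multiply. (A toy example; real LUTs typically replace far more
   expensive operations, such as trigonometry or CRC steps.) */
static const uint8_t square_lut[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225
};

uint8_t square_of(uint8_t x)
{
    return square_lut[x & 0x0F];   /* mask keeps the index in range */
}
```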

Some Other General Techniques:

Conditional statement

Usually, pre-decrement and post-decrement (or pre-increment and post-increment) make no difference in normal code lines. For example, “i--;” and “--i;” simply generate the same code. However, using these operators as loop indices or in conditional statements makes the generated code different.

“If-else” and “switch-case” are widely used in C code; a proper organization of the branches can reduce the execution time.

For “if-else”, always put the most probable condition first. The following conditions are then less likely to be executed, so time is saved in most cases.

Using “switch-case” may eliminate the drawbacks of “if-else”, because for a “switch-case” the compiler usually generates a lookup table of indices and jumps to the correct place directly.

Arranging Cases by Probability of Occurrence

Arrange switch statement cases by the probability of occurrence, from most probable to least probable.

Avoid switch statements in which the cases are not arranged by the probability of occurrence.
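The original example is not shown; a hypothetical sketch, assuming a packet handler where DATA makes up ~90% of the traffic:

```c
enum { DATA, ACK, RESET };   /* hypothetical message types */

/* Avoid: the rare case is tested first, so when the compiler emits a
   compare-and-branch chain, the common DATA case pays two failed
   comparisons on almost every call. */
int handle_bad(int type)
{
    switch (type) {
    case RESET: return 3;    /* rare, yet tested first   */
    case ACK:   return 2;
    case DATA:  return 1;    /* most common, tested last */
    default:    return -1;
    }
}

/* Preferred: the most probable case comes first. */
int handle_good(int type)
{
    switch (type) {
    case DATA:  return 1;
    case ACK:   return 2;
    case RESET: return 3;
    default:    return -1;
    }
}
```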

Avoid standard library routines

To reduce the size of the program, avoid large standard library routines. Many routines try to handle all possible cases, so their code size grows. The standard C library’s sprintf routine, for example, is very large because of its floating-point manipulation routines. If you don’t need to format and display floating-point values (%a, %e, %f, or %g), consider a trimmed-down sprintf implementation without floating-point support.

Goto Statements

Like global variables, good software engineering practice dictates against the use of goto statements. But in a pinch, a goto can be used to remove complicated control structures or to share a block of oft-repeated code.

In addition to these techniques, several of the ones described in the previous section may be helpful, specifically table lookups, hand-coded assembly, register variables, and global variables. Of these, the use of hand-coded assembly will usually yield the largest decrease in code size.

Constant Folding

Constant folding is the simplest code optimization to understand. Let us suppose that you write the statement x = 45 * 88; in your C program. A non-optimizing compiler will generate code to multiply 45 by 88 and store the value in x.

An optimizing compiler will detect that both 45 and 88 are constants, so their product will also be a constant. Hence it will find 45 * 88 = 3960 and generate code that simply copies 3960 into x.

This is constant folding and means the calculation is done just once, at compile-time, rather than every time the program is run.

Strength Reduction

Strength reduction is the replacement of an expression by a different expression that yields the same value but is cheaper to compute. Many compilers will do this for you automatically. The classic example:
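A sketch of the classic case, replacing a per-iteration multiply with an add:

```c
void scale_mul(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i * 8;          /* a multiply on every iteration */
}

void scale_reduced(int *a, int n)
{
    int v = 0;
    for (int i = 0; i < n; i++) {
        a[i] = v;              /* same values as i * 8 ...          */
        v += 8;                /* ... but the multiply became an add */
    }
}
```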

Else Clause Removal

In the if/else form, a jump is inevitable whichever way the condition goes; with the else clause removed, the more probable path (Condition A) gets sequential execution. This technique is possible only if the Do_That() operation can be undone.
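The original figure is missing; a sketch using a hypothetical flag variable to stand in for Do_This/Do_That:

```c
static int flag;

static void Do_This(void)      { flag = 1; }
static void Do_That(void)      { flag = 2; }
static void Undo_Do_That(void) { flag = 0; }

/* Original form: a jump is inevitable whichever way A goes. */
void select_with_else(int A)
{
    if (A)
        Do_This();
    else
        Do_That();
}

/* Else clause removed: Do_That() runs unconditionally; when the
   (assumed more probable) condition A holds, it is undone and
   Do_This() executes sequentially. Legal only because Do_That()
   can be undone. */
void select_without_else(int A)
{
    Do_That();
    if (A) {
        Undo_Do_That();
        Do_This();
    }
}
```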

Global and Local Variables

Global variables are never allocated to registers. Global variables can be changed by assigning them indirectly using a pointer, or by a function call. Hence, the compiler cannot cache the value of a global variable in a register, resulting in extra (often unnecessary) loads and stores when global variables are used. We should therefore not use global variables inside critical loops.

If a function uses global variables heavily, it is beneficial to copy those global variables into local variables so that they can be assigned to registers. This is possible only if those global variables are not used by any of the functions which are called.

o Global variables are stored in memory.

o Local variables can be kept in registers.

o Register access is faster than memory access.
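A sketch of caching a global in a local, safe here only because no called function touches the global:

```c
int error_count;   /* global: reloaded from memory on every use */

void count_errors(const int *codes, int n)
{
    int local = error_count;    /* copy into a register-friendly local */

    for (int i = 0; i < n; i++)
        if (codes[i] != 0)
            local++;            /* no memory traffic inside the loop */

    error_count = local;        /* write the global back once on exit */
}
```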

Dead code

Code that, while it may be executed, does not affect the output of the program is a prime candidate for removal. Dead code can arise as the result of previous optimization passes, as the result of macro expansion or other automated code generation, or simply as a result of changes to a large body of code that have effects outside the scope of the change.

Byte Craft’s Sized Integers

The Byte Craft compiler recognizes int8, int16, int24, and int32 data types. They are integers with the appropriate number of bits. These remove the ambiguity of varying or switchable integer sizes.

Array folding

It takes advantage of the limited set of components needed within an array at any one time.

Storage can be saved at the expense of more complex address computations.

1. Inter-array folding:

Arrays that are not needed during overlapping time intervals can share the same memory space.

2. Intra-array folding

Takes advantage of the limited set of components needed at any one time within a single array. Storage is saved at the expense of more complex address computation.

Instruction Scheduling

Non-pipelined machines do not need instruction scheduling: any order of instructions that satisfies data dependencies runs equally fast.

In pipelined machines, the execution time of one instruction depends on the nearby instructions: opcode, operands.

Instruction conflicts can be handled by rewriting the code with non-dependent instructions in between.

Power-Saving Techniques

Power consumption is a major concern for portable or battery-operated devices. Power issues, such as how long the device needs to run and whether the batteries can be recharged, need to be thought out ahead of time. In some systems, replacing a battery in a device can be a big expense. This means the system must be conscious of the amount of power it uses and take appropriate steps to conserve battery life.

There are several methods to conserve power in an embedded system, including clock control, power-sensitive processors, low-voltage ICs, and circuit shutdown. Some of these techniques must be addressed by the hardware designer in his selection of the different system ICs. There may be lower-power versions of certain peripherals. Some power-saving techniques are under software control.

Optimizing for Energy

• Use registers efficiently (more than memory)

• Moderate loop unrolling eliminates some loop overhead instructions.

• Eliminate pipeline stalls.

• In-lining procedures may help: it reduces linkage overhead.

• Energy-aware instruction selection

• Operation strength reduction: e.g. replace * by + and <<

Endianness

Endianness refers to the representation of multi-byte variables inside the embedded system: low-order byte first (little-endian) or high-order byte first (big-endian).

Care should be taken to maintain a consistent endianness throughout.

An endianness difference can cause problems if the processor tries to read binary data written in the opposite format from a shared memory location or file.
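A common runtime check, inspecting the first byte of a known 16-bit pattern:

```c
#include <stdint.h>

/* On a little-endian machine, the low-order byte (0x34) of the
   pattern 0x1234 is stored at the lowest address. */
int is_little_endian(void)
{
    uint16_t pattern = 0x1234;
    return *(uint8_t *)&pattern == 0x34;
}
```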

Embedded C++

You might be wondering why the creators of the C++ language included so many features that are expensive in terms of execution time and code size. You are not alone; people around the world have wondered the same thing, especially users of C++ for embedded programming. Many of these expensive features are recent additions that are neither strictly necessary nor part of the original C++ specification. These features have been added one by one as part of the ongoing “standardization” process.

In 1996, a group of Japanese processor vendors joined together to define a subset of the C++ language and libraries better suited for embedded software development. They called their industry standard Embedded C++ (EC++). EC++ generated a great deal of initial interest and excitement within the embedded community.

  • A proper subset of the draft C++ standard, EC++ omits pretty much anything that can be left out without limiting the expressiveness of the underlying language. This includes not only expensive features such as multiple inheritance, virtual base classes, runtime type identification, and exception handling, but also some of the newest additions such as templates, namespaces, and new-style casts. What’s left is a simpler version of C++ that is still object-oriented and a superset of C, but has significantly less runtime overhead and smaller runtime libraries.
  • Several commercial C++ compilers support the EC++ standard as an option. Several others allow you to manually disable individual language features, thus enabling you to emulate EC++ (or create your very own flavor of the C++ language).

The features of C++ that are typically too expensive for embedded systems are templates, exceptions, and runtime type identification. All three of these negatively impact code size, and exceptions and runtime type identification also increase execution time. Before deciding whether to use these features, you might want to do some experiments to see how they will affect the size and speed of your application.

Pass Class Parameters by Reference

Passing an object by value requires that the entire object be copied (copy constructor), whereas passing by reference does not invoke a copy constructor, though you pay a “dereference” penalty when the object is used within the function. This is an easy tip to forget, especially for small objects. As you’ll see, even for relatively small objects, the penalty of passing by value can be stiff. I compared the speed of the following functions:

template <class T> void ByValue(T t) { }

template <class T> void ByReference(const T& t) { }

template <class T> void ByPointer(const T* t) { }

For strings, passing by reference is almost 30 times faster! For the bitmap class, it’s thousands of times faster. What is surprising is that passing a complex object by reference is almost 40% faster than passing by value.

Only ints and smaller objects should be passed by value because it’s cheaper to copy them than to take the dereferencing hit within the function.

Postpone Variable Declaration as Long as Possible

In C, all variables must be declared at the top of a function. It seems natural to use this same method in C++. However, in C++, declaring a variable can be an expensive operation when the object has a non-trivial constructor or destructor. C++ allows you to declare a variable wherever you need to. The only restriction is that the variable must be declared before it’s used. For maximum efficiency, declare variables in the minimum scope necessary, and then only immediately before they’re used.

// Declare Outside (b is true half the time)
T x;
if (b)
    x = t;

// Declare Inside (b is true half the time)
if (b) {
    T x = t;
}

• Without exception, it’s as fast or faster to declare the objects within the scope of the if statement. The only time where it may make sense to declare an object outside of the scope where it’s used is in the case of loops. An object declared at the top of a loop is constructed each time through the loop. If the object naturally changes every time through the loop, declare it within the loop. If the object is constant throughout the loop, declare it outside the loop.

The Optimize Pragma

If an optimized section of code causes errors or a slowdown, you can use the optimize pragma to turn off optimization for that section.

Enclose the code between two pragmas, as shown here:

#pragma optimize("", off)

// some code here

#pragma optimize("", on)

Use Prefix Operators

• Objects that provide the concept of increment and decrement often provide the ++ and -- operators. There are two types of incrementing, prefix (++x) and postfix (x++). In the prefix case, the value is increased and the new value is returned. In the postfix case, the value is increased, but the old value is returned. Correctly implementing the postfix case requires saving the original value of the object, in other words, creating a temporary object. The postfix code can usually be defined in terms of the prefix code.

• The clear recommendation: avoid postfix operators. You may want to declare the postfix operators in the private section of the class so that the compiler will flag incorrect usage for you automatically.

• Strings aren’t included in the results because increment doesn’t make sense for strings. (It doesn’t make sense for bitmaps, either, but I defined increment to increase the width and height by one, forcing a reallocation.) Where this recommendation shines is for mathematical objects like complex. The prefix operator is almost 50% faster for complex objects.
