7 tips for code optimization
Author: Marcin Ławicki, Development Manager, Huuuge
Computer software is omnipresent and integral to our daily activities, whether for work or fun. We run it on computers, consoles, phones, watches, and plenty of other hardware. Because of that, we put a great deal of effort into making it look great (UI) and making it easy and pleasant to use (UX). Since competition on the market is fierce, these aspects sometimes overshadow other, less eye-catching but equally important concerns, such as an application's performance.
In this article, I will present 7 pieces of advice on how to tackle that problem and make your apps not only look but also work great.
Code examples are in C++, but many of the suggested techniques can be applied in other technologies.
What is code optimization?
Before we dive into specifics, let's clarify what we mean by code optimization. We define it as a modification process intended to make the software more efficient in terms of computation time and the amount of resources required (primarily memory). It is important to note here that the modification should not change the outcome that the code produces, nor any of its expected side effects.
In essence, optimization is hard because it requires knowledge and experience and quite often it is about keeping the balance between contradicting requirements.
Why should we optimize code?
You may wonder: shouldn't we rather direct our energy toward developing new features or fixing existing bugs than toward optimization?
We should take into account all pros and cons because almost all code can be optimized to some degree. However, it doesn’t mean that it always makes sense to do so — remember that perfect is the enemy of good.
Let’s list some examples of when one may want to double down on optimization:
1. To make the software run faster. In the case of games, the shorter the execution of the game loop, the more frames produced per second, which essentially leads to smoother gameplay and a better gaming experience.
In the case of utility applications, optimization may lead to better responsiveness or quicker delivery of the results the user expects.
In the case of backends or servers, it usually means the ability to serve more clients or return the required results faster.
2. To make the software work on limited hardware. In the case of embedded devices, such as ECUs (Electronic Control Units) in your car or devices falling into the broad category of IoT (Internet of Things), software needs to work on hardware far more limited than our PCs, consoles, or even smartphones, while still doing its job reliably. In such cases, code must be crafted carefully, with attention to limits on computational power, available memory, and, in some cases, even power consumption.
3. To make software solutions possible. Some applications need to handle large amounts of data that can’t fit into available memory or require heavily parallelized computations to produce results in a reasonable time. Those might be solutions crunching on huge data sets or applications used for simulations of complex systems.
How to find out if we need to optimize
So far we know the what and the why. Now we have to determine whether this is the right time to apply optimizations. To do so, we need to:
1. Use the app regularly and observe how it performs. If you notice that gameplay is choppy, UI responsiveness is poor, or it takes ages to get the results of your program's work, then it is time to pause adding new features and revisit your code.
2. Inspect performance regularly. While user tests may uncover issues with your application, you will likely find some of them faster if you regularly monitor your software using dedicated tools called profilers.
7 tips for optimizing your code
The stage is set. We know what, why, and when. Now let’s talk about how to optimize.
1. Do not optimize prematurely.
There is a saying attributed to Donald Knuth, one of the fathers of programming: "premature optimization is the root of all evil." It is sometimes abused, and even twisted in a way that lets lazy programmers abandon optimization altogether. The real meaning of those words is that programmers should first focus on two things: making the code work properly, and making it easy to read and understand. Only when those requirements are met should we consider code optimization. We shouldn't go into automatic mode and always try to optimize, but do it only when we need to. And be careful: optimization that makes your code messy and hard to follow may cost you a lot in the long run.
2. Run performance tests and use profilers
It doesn't make sense to optimize something just because it can be optimized, or to keep pushing past a sensible threshold. Define the goal you want to achieve and monitor performance to stay on target. If your game is expected to run at 60 fps, do not necessarily push it to 80, especially if that would be costly or would heavily impact code readability and maintainability.
3. Start with algorithms
Not all techniques to optimize code were made equal. You need to understand your domain well before you start fighting for better performance. If you have sufficient knowledge, start by looking at your algorithms. This is probably the single most important area where you can improve. If you have implemented a custom solution, try to understand its complexity in terms of big-O notation. What is the class of your algorithm? If it is O(n³) or O(n²), there is a good chance of vast improvements. Try to find classic problems that most resemble your own and adapt well-known, optimized solutions to your needs. Only when you exhaust all options in this area should you move on to other techniques.
Take a look at this (a bit simplified) example — we want to sort a collection of 1 million elements on a machine that performs 100 million operations per second.
The first algorithm selected for the task is selection sort, which has a complexity of O(n²). Calculating the actual amount of time required to sort the data, we get a pretty high number: 10⁶ · 10⁶ / 10⁸ = 10,000 seconds, almost three hours.
The second algorithm, quicksort, with average complexity O(n log n), solves the same problem in a fraction of a second: roughly 10⁶ · 20 = 2·10⁷ operations, or about 0.2 seconds.
Although quite simplified, this example shows how much can be changed just by applying a different algorithm.
4. Use your language better
Analyze your bottlenecks with regard to the features of the programming language you are using. If it is C++, maybe you can change your code so you won't need virtual functions, or maybe those smart pointers you put there are not really necessary. Check your data types: maybe there are some bytes to be saved. Verify that the data structures you picked from the STL are the best choice. Maybe there is a third-party library with a better implementation of the required data structure that fits your requirements. Think twice before you throw an exception, or even consider disabling exceptions altogether.
While doing such micro-optimizations you must remember that compilers are pretty smart beasts. Avoid "clever" improvements like using bit shifting instead of actual division. The compiler is smart enough to do it for you, and all you will achieve is making the code less readable.
In the example below we can see how different implementations of the same concept can vary in performance. The chart compares the lookup time of a std::string key in a map containing 1 million elements.
5. Use your platform better
Optimizations at this level fit only specific platforms. It may be that on a given platform you can use a compiler that supports a newer standard with performance-oriented mechanisms (like move semantics, introduced in C++11), or one that simply produces better-optimized code in general. Maybe there is a library available on a specific platform that will make your program run faster.
Below is a piece of code in C that produces different assembly on different compilers.
__m256i f(__m128i in)
{
    __m128i p1 = _mm_and_si128(in, _mm_set1_epi32(0xFF));
    __m128i p2 = _mm_packus_epi32(p1, p1);
    __m128i p = _mm_packus_epi16(p2, p2);
    return _mm256_set_m128i(p, p);
}
Assembly from clang:
vpshufb xmm0, xmm2, xmmword ptr [rip + .LCPI0_0]
vinserti128 ymm0, ymm0, xmm0, 1
from gcc:
vpand xmm0, xmm2, XMMWORD PTR .LC0[rip]
vpackusdw xmm0, xmm0, xmm0
vpackuswb xmm0, xmm0, xmm0
vinserti128 ymm0, ymm0, xmm0, 1
and from MSVC:
vmovdqu xmm0, XMMWORD PTR __xmm@000000ff000000ff000000ff000000ff
vpand xmm1, xmm0, XMMWORD PTR [r8]
vpackusdw xmm2, xmm1, xmm1
vpackuswb xmm3, xmm2, xmm2
vmovups xmm0, xmm3
vinsertf128 ymm0, ymm0, xmm3, 1
Clearly, the clang version stands out and points to possible improvements. Analyzing it, we can come up with a different C implementation:
__m256i f(__m128i in)
{
    __m128i p = _mm_shuffle_epi8(in, _mm_set1_epi32(0x0C080400));
    return _mm256_set_m128i(p, p);
}
This changes the generated assembly so that the gcc version looks pretty much the same as clang's, with the MSVC version only slightly different.
From gcc:
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
vinserti128 ymm0, ymm0, xmm0, 1
and from MSVC:
vmovdqu xmm0, XMMWORD PTR [rcx]
vpshufb xmm2, xmm0, XMMWORD PTR __xmm@0c0804000c0804000c0804000c080400
vmovups xmm1, xmm2
vinsertf128 ymm0, ymm1, xmm2, 1
6. Hit the metal
If the performance of your code is still not satisfactory, you can apply hardware-specific techniques. Basically, this boils down to using features specific to a given processor family, available via high-level functions called intrinsics. Through such functions we can access, for example, SIMD instructions: SSE and AVX on Intel silicon, or NEON on ARM chips. Intrinsics do not fit every task, but you can still use specific instructions through pieces of manually crafted assembly code. This, however, only works if you know exactly what you are doing.
The example presented below is an excerpt from the etcpak compression tool. The purpose of this code is to check whether a given block of pixels is a solid color.
const uint8_t* ptr = src + 4;
for(int i=1; i<16; i++)
{
    if(memcmp(src, ptr, 4) != 0) return false;
    ptr += 4;
}
return true;
The same logic expressed with intrinsics, although the code is a bit less readable and seemingly bigger, performs significantly better.
__m128i d0 = _mm_loadu_si128(((__m128i*)src) + 0);
__m128i d1 = _mm_loadu_si128(((__m128i*)src) + 1);
__m128i d2 = _mm_loadu_si128(((__m128i*)src) + 2);
__m128i d3 = _mm_loadu_si128(((__m128i*)src) + 3);
__m128i c = _mm_shuffle_epi32(d0, _MM_SHUFFLE(0, 0, 0, 0));
__m128i c0 = _mm_cmpeq_epi8(d0, c);
__m128i c1 = _mm_cmpeq_epi8(d1, c);
__m128i c2 = _mm_cmpeq_epi8(d2, c);
__m128i c3 = _mm_cmpeq_epi8(d3, c);
__m128i m0 = _mm_and_si128(c0, c1);
__m128i m1 = _mm_and_si128(c2, c3);
__m128i m = _mm_and_si128(m0, m1);
if (!_mm_testc_si128(m, _mm_set1_epi32(-1))) return false;
return true;
7. Follow the rule of diminishing returns
When choosing which optimization to apply, always start with the lowest-hanging fruit. The best changes are those that do not cost a lot of effort but bring noticeable results. Only when there are no easier options should you move to the ones that require more work. We call this approach the "rule of diminishing returns".
Remember also to make sure you have done as much as can be done at one level before moving to a deeper one.
If, in the process, you hit your target, stop; otherwise, continue. Assuming your goals are realistic and your skills sufficient, you will succeed.