How to make your C++ code faster with an extended stack
Allocating memory from the stack in C++ is blazingly fast, but also restrictive. It costs only a few arithmetic operations but
- The allocation cannot persist beyond its function’s scope.
- The allocation must be small enough in size that it fits into the program’s stack. If we’re unsure, we run the risk of crashing our program with a stack overflow.
Additionally, for dynamic allocations, where we only know the size at runtime, we need to use the non-standard alloca function. However, alloca can be cumbersome to work with
- It knows nothing about C++ constructors of destructors; so if we’re going to use it for non-trivial classes, we need to invoke those manually.
- alloca can only allocate memory for the function that calls it, which makes it difficult to compose higher-level blocks of functionality.
Frequently, code satisfies the lifetime requirement of stack allocation; but because of the dangers of stack overflow or the difficulties of working with alloca, programmers choose to use heap allocation instead. Consider, for example, a function that computes the median salary for a company. We might write the function like this:
It’s necessary for
compute_median_salary to allocate some workspace memory to use std::nth_element. This allocation meets the lifetime requirement of stack allocation; but if we used alloca, we could crash our program if someone were to call the function with a large workforce. It’s dangerous to use alloca here even if most of our use cases would fit into the stack just fine.
I’ll show an alternative approach, though, that allows us to keeps the performance benefits of stack allocation while avoiding the dangers of stack overflow and providing a much more usable interface than alloca. It works by setting aside additional memory upfront that can act as an extended stack
Like the real stack,
stack_allocator only supports allocating and deallocating memory from the top. We then define a class
scoped_allocator which allocates any memory requested from the extended stack and frees everything when it’s destructor is called. If more memory is requested than is available, it falls back to allocating the memory from the heap, which it also frees on destruction. And, unlike alloca, it invokes constructors and destructors automatically for non-trivial classes. Using
compute_median_salary changes to
Benchmarking the two approaches for randomly generated workforces of 1 to 10 employees, I get the result
BM_compute_median_salary_heap 82201 ns
BM_compute_median_salary_stackext 55363 ns
showing that the extended stack version runs roughly 48% faster on my machine.
If a program is such that we can afford to allocate a small amount of additional memory upfront, then we can use an extended stack to avoid unnecessary heap allocations and achieve significant performance gains. Source code for this approach is available in the header-only, open-source project stackext.