C++ and Embedded Systems, Part 1: ETL vs STL Algorithms

Richard Robinson
4 min readMay 19, 2020
The benchmark function for the `fill` function of ETL

The Embedded Template Library (ETL) serves to, amongst other offerings, replace the C++ STL algorithms with its own implementation in hopes of offering improved performance. ETL was created by John Wellbelove with the main goal of complementing the STL with statically-allocated containers to better serve the needs of embedded systems, where dynamically allocated memory is typically unwanted due to its lack of reliability.

The algorithms provided by the STL are extremely invaluable, and as such their performance, especially in the context of embedded systems, is a key measurement. ETL offers most of the algorithms STL has; some are reverse-engineered from STL and others have different implementations. ETL is also capable of directly using the STL algorithms if the former does not have a better implementation; this is controlled by defining the ETF_NO_STL flag in etl_profile.h. In benchmarking the performance, this flag was set.


The benchmarking was performed using GCC 7.50 and G++ 7.5.0 on the Ubuntu-based elementaryOS operating system on a machine with an Intel Core i7 processor and 16GB of DDR4 memory.

Both ETL and the Google C++ Benchmarking Library were added as dependencies to the project’s CMakeLists.txt file, with the -DETL_DEBUG flag set and CMAKE_CXX_STANDARD set to 20.

The code benchmarked every algorithm available in ETL, both with ETL and with STL. Before executing the code to benchmark, a statically allocated C-style array was created with a size of 10,000. The array was then filled with random numbers between 0 and 1000 using std::uniform_int_distribution<int> and the std::mt19937_64 random number generation engine (with a seed of 0 to ensure consistency).

Each benchmark measured the CPU and real-time execution speed of a particular algorithm when repeated with 10,000 iterations. The invoction of each algorithm was done in a minimally viable way, using beginand end iterators whenever possible to maximize the computational complexity on average. When an algorithm required a function as a parameter, simple lambda expressions were used, with no captures. Each benchmark test was performed 50 times (using the — benchmark-repetitions=50 argument). The results were then exported to a *.json file.

static void ETL_transform(benchmark::State& state) {
int array[SIZE];
int transformed[SIZE];

for (auto _ : state) {
for (int i = 0; i < REPETITIONS; ++i) {
[](int i) { return i + 1; }

An example of one of the benchmark functions used, in this case the benchmark for the etl::transform function. The code which is measured is within the outer range-based for loop.


The data for each trial was outputted to a JSON file, then parsed into a CSV file. Here is a simplified overview of the aggregated benchmarks. Each row corresponds to a particular algorithm, with the columns indicating the normalized mean CPU execution speed (for simplicity, the medians are not show in this overview table).

| algorithm | etl (mean, CPU time) | stl (mean, CPU time) |
| count | 1 | 1.139950596 |
| countif | 1 | 1.393046466 |
| copy | 1 | 1.012820164 |
| copyif | 1 | 1.084625663 |
| find | 1.013823127 | 1 |
| findif | 1 | 2.980586851 |
| equal | 1 | 1.013101107 |
| fill | 1 | 1.150081596 |
| filln | 1 | 1.134638761 |
| transform | 1 | 1.080240997 |
| swap | 1.004696685 | 1 |
| iter_swap | 1 | 1.001366021 |
| swap_ranges | 1 | 1.010655521 |
| rotate | 114.1459506 | 1 |

Of these averages, the median across all algorithms is 1:1.047; that is, STL algorithms are approximately 1.047 times slower than their ETL counterparts. The mean of all the algorithms is 7.297:1 with a standard deviation of 29.649 due to the extreme outlier of the range algorithm. As such, the median is a better representation of the aggregated data in this case.

While most of the algorithms are consistent with each other, the rotate algorithm is an outlier by a factor of over 100. Upon noticing this outlier, three further benchmarks were conducted for rotate, with each benchmark using a different iteratior position for the middle argument.

| `middle` argument | etl (mean, CPU time) | stl (mean, CPU time) |
| begin(array) + 1 | 1.34E+02 | 1.00E+00 |
| end(array) - 1 | 1.36E+02 | 1.00E+00 |
| begin(array) | 1.00E+00 | 1.62E+00 |

The results of these benchmarks were consistent with the original data, indicating that the ETL version of rotate was indeed astronomically slower than its STL counterpart. Upon further research, the author of the ETL library noted that:

The ETL implementation of ‘rotate’ “was general purpose and worked for any forward capable iterator. There are much faster ways for bidirectional and random iterations.” — John Wellbelove

This contrasts the STL implementation, which does in fact make optimizations for different types of iterators. Because C-style array iterators are both bidirectional and random access, the STL implementation would have been optimized for this case compared to the general purpose ETL implementation. Therefore, this is principle reason as to the stark difference and abnormality of this algorithm’s performance.

The full results are available here and the benchmarking code here.


Purely based on performance, ETL is a better alternative to the STL (excluding rotate), if only slightly. Combined with the other benefits ETL has to offer, it makes sense for embedded developers, if choosing only one library to use, to opt for ETL. The algorithms of ETL are also guaranteed to not use any dynamic memory allocation; the STL makes no such gurantee, meaning ETL is not only faster, but safer as well.



Richard Robinson

Fourth year Software Engineering student and Research Assistant. Embedded systems, mobile apps, and VR/AR experiment are my specialties.