A way to minimize errors and make your C code easier to read — The non-trivial parts

Markus Gothe
Published in HiMinds
16 min read · Oct 19, 2020

The previous article focused only on multi-threading; this article will focus on cryptography and discuss some other non-trivial programming aspects.

Cryptography is still an active research topic, and it is easy to implement it insecurely or in a way that has unforeseen side-effects which weaken the algorithm. These potential side-effects include information leakage and related side-channel attacks, such as timing-based and cache-based attacks. The relatively new WPA3 standard showed us in 2019, with the Dragonblood attacks, how difficult it can be to implement security standards and how different implementations can diverge in security even when built on the same standard. Dragonblood is not just a few discovered vulnerabilities but a total of 14 different disclosed ones, most of them timing-based, cache-based, and downgrade attacks.

One of the best minds of my generation once told me: “there might be 30 000 people working with IT security in the world, but only 200-300 individuals who really understand it”. I think that is close to the truth; IT security covers a wide range of topics, and having an in-depth understanding of almost every aspect of most of them is very rare. It requires a certain mindset that is not for everyone.

What we need to remember is that these 200–300 individuals also make the same mistakes as the rest of us from time to time; for example, hardware-based side-channel attacks can be complicated to reproduce in a test environment or without costly, specialized equipment. You don’t usually test for hardware-based side-channel attacks when developing code, and if you do, you usually have access to a hardware laboratory.

Security is Gödel’s incompleteness theorems put into practice: sooner or later someone figures out how to break the security. The next big challenge will come when we get practical quantum computing, and even if we design a quantum-safe cipher, we need to consider all the other aspects. It’s a new frontier, with new types of problems. Back in 2008, it was proven that the authentication used for quantum cryptography was, in theory, insecure. Although this particular problem was fixed, we don’t know what problems we will face in the future or in specific implementations.

Let’s encrypt, but let’s try to do it correctly…

As with the previous article, this is targeting an experienced audience. Having at least 5 to 7 years of professional (or equivalent) experience and fundamental knowledge of cryptography is expected from the reader.

Non-local jumps

Some features are more used than others, and one of the least used features in C is the setjmp()/longjmp() function pair. Think of it as a non-local variant of 'goto', and one should, in general, avoid using 'goto' except for error handling or quirky kernel behaviors. The pair is the C equivalent of what is called “call with current continuation”, with some restrictions. As an example of these restrictions, you cannot return from the function that called setjmp() and then call longjmp().

The biggest problem with the function pair is analogous to using malloc() inside an asynchronous signal handler: unless you know for a fact that the signal is blocked, you cannot safely free the memory inside the signal handler without facing potential problems. The analogy should give you a grasp of the pitfalls of using the function pair. Don’t use it from asynchronous signal handlers unless you are doing it for the fun of it (and it’s rarely fun with undefined behaviors). If you nevertheless need this functionality from a signal handler, use the analogous sigsetjmp()/siglongjmp() function pair.
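A minimal sketch of the latter pair, assuming a POSIX system; the handler and recovery logic are illustrative only:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sigint(int sig) {
    (void)sig;
    siglongjmp(env, 1); /* also restores the signal mask saved by sigsetjmp() */
}

int main(void) {
    signal(SIGINT, on_sigint);
    if (sigsetjmp(env, 1) != 0) { /* non-zero second argument: save the signal mask */
        puts("Interrupted; back at the recovery point");
        return 0;
    }
    for (;;)
        pause(); /* wait for SIGINT (Ctrl-C) */
}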

When using the function pair, it’s important to follow the safety rules for malloc()/free() discussed in previous articles, and if you declare a local pointer to the heap, free it before calling longjmp(). Avoid using both functions inside the same function; it’s better to use 'goto' if you have to jump to another part of the code within the same function, to avoid additional complexity. I will describe this additional complexity further down.

I will provide a dangerous example below.

#include <stdlib.h>
#include <setjmp.h>
#include <errno.h>

jmp_buf env; /* Global jump-frame holder (statically allocated) */

int main(void) {
    char *buf = NULL;
    int r = setjmp(env); /* holds 0 on the direct call and the value passed to longjmp() */
    buf = malloc((r+1)*sizeof(char)); /* Allocates r+1 elements in a char array */
    if (buf == NULL) {
        return -ENOMEM;
    }
    if (r == 0) {
        longjmp(env, 42); /* In this case a shortjmp() ;-) */
    }
    free(buf); /* Frees only the second allocation; the first one leaks */
    return 0;
}

The code above is at least not an infinite loop and does not pass 0 to malloc(), but it makes an allocation that is not freed before the next allocation overwrites the pointer. Luckily, in this example the OS will clean up the leak when the process exits, but that might not always be the case.

One traditional use of the function pair has been to provide global error handling for a program. In many languages, like C++ and Java, this is better done with the exception handling that is part of the language. But using C++ might not be feasible in an embedded environment, or you might have to disable exception handling for other reasons (e.g. the Google C++ style guide requires this); then emulating exceptions with the setjmp()/longjmp() function pair might be a viable alternative, so there are still valid and normal use cases for it, as sketched below.
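A minimal sketch of such exception-style error handling, with a single global jump buffer and hypothetical error codes:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf err_env;

enum { ERR_NONE = 0, ERR_PARSE, ERR_IO }; /* hypothetical error codes */

static void parse_config(const char *s) {
    if (s == NULL)
        longjmp(err_env, ERR_PARSE); /* "throw" */
}

int main(void) {
    switch (setjmp(err_env)) { /* "try": returns 0 on the direct call */
    case ERR_NONE:
        parse_config(NULL);
        puts("config parsed");
        break;
    case ERR_PARSE: /* "catch" */
        fputs("parse error\n", stderr);
        break;
    default:
        fputs("unknown error\n", stderr);
        break;
    }
    return 0;
}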

If you use the function pair, bear in mind that there might be unexpected side-effects because the stack is not unwound the way we are used to. The function pair works by restoring the stack pointer, the frame pointer, and the program counter. This also means that stack unwinding cannot be done the way it normally is, although in most cases this is transparent to the programmer.

Hard-float vs soft-float

When using a desktop computer, we normally don’t even consider the fact that it has an FPU (Floating-Point Unit). We take it for granted that floating-point numbers will be handled in the same way as integers in terms of primitive operations. Of course, we still need the extra functionality for mathematical operations provided by the standard C library.

For embedded systems, the FPU is not always part of the system, in order to reduce production costs; this makes sense since you can emulate the functionality of an FPU in software. The emulation can be implemented by the compiler, a mode of operation usually called “soft-float”. An in-kernel emulator can also be used to implement the functionality of the FPU. The emulator is, however, slower than the “soft-float” mode, and its availability depends on the hardware architecture and the OS. The most obvious reason to use the in-kernel emulator is to be able to run third-party binaries that require an actual FPU.

On a side note, the MIPS hardware architecture allows FPU implementations to omit certain floating-point functionality. Functionality that is not implemented in hardware will instead be trapped, via an exception, and handled by the emulator, making the in-kernel emulator mandatory even if the system has an FPU.

The “soft-float” mode, being implemented by the compiler at compile time, has a slightly different calling convention than the “hard-float” mode. The compiler represents floating-point numbers as integers and emulates the primitive operations done on them. All of this is hidden from the programmer but requires more assembly code to be emitted by the compiler. Common software floating-point routines are usually included in a separate library in this mode; for GCC they are part of ‘libgcc_s.so.1’ or ‘libgcc.a’, depending on the linking method. Since the calling convention differs between the two modes, the ABI (Application Binary Interface) differs as well. The implication is that binaries, kernels, and standard C libraries compiled for one mode don’t work with the other mode.

What is less well-known is that GCC by default performs some optimizations that are not strictly compatible with the IEEE 754 standard for floating-point arithmetic in this mode of operation. This might give us subtle bugs that are not observed during development and do not appear during traditional use of the code, only to show up when running it in “soft-float” mode.

An example of the aforementioned bugs is code comparing a floating-point number with 0. The bug manifests because of the internal representation used by the compiler, giving different behavior in certain corner cases when running the same code on a system with an FPU and on a system without one. Floating-point numbers can also be signed when 0, i.e. +0 or -0. So when comparing a floating-point number with 0, use the fpclassify() macro as below.

#include <math.h>

float x = 0.0f;

if (fpclassify(x) == FP_ZERO) {
    /* x is exactly zero; FP_ZERO matches both +0.0 and -0.0 */
}

Lack of entropy — The return of Shannon

Embedded systems are notorious for lacking good and reliable sources of entropy. If you are lucky, there is some sort of true random number generator, TRNG, built into the system. However, I consider a TRNG to be a rarity in most consumer and low-end devices.

The way the OS generates (seeds) entropy for the cryptographically secure (at least in theory) pseudo-random number generator, CSPRNG, has been shown to be not so cryptographically secure in, e.g., QNX (BlackBerry was responsible and fixed this weakness a long time ago). Usually the OS collects and combines entropy from interrupts, I/O subsystems, network traffic, date/clock… you get the idea. In the best of all worlds, the CSPRNG will reseed itself when more entropy is available, to increase the randomness. For consumer products a CSPRNG normally works pretty well, since there is usually some interaction with the environment; building up entropy is then only an issue for performance-critical tasks. For extremely tiny embedded systems, getting enough entropy, even after hours, can be more difficult. Nonetheless, it can be of great importance to have high entropy. For most embedded systems this is only an issue during boot-up, and at worst the first minutes after boot, given of course that the implementation of the CSPRNG is correct.

The problem with low entropy is that it makes it possible to run trivial brute-force attacks against functions relying on randomness. To illustrate this better, I will compare the ‘/dev/random’ and ‘/dev/urandom’ interfaces provided by most Unix-like operating systems. Their behaviors are OS-dependent, so I will describe the traditional implementation in Linux: the first file blocks and returns cryptographically secure randomness of high entropy, while the second provides cryptographically secure randomness with a twist and doesn’t block. The twist is that it cannot consume the entire entropy pool and hence can be made non-blocking, although the quality of the entropy may vary. This twist, and the fact that the interface is non-blocking, makes it much faster, especially on embedded devices.
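A minimal sketch of consuming the non-blocking interface (error handling abbreviated; on modern Linux the getrandom(2) system call is usually preferable, since it blocks only until the CSPRNG has been initially seeded):

#include <stdio.h>
#include <stddef.h>

/* Fills 'out' with n bytes from /dev/urandom; returns 0 on success, -1 on error */
static int urandom_bytes(unsigned char *out, size_t n) {
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(out, 1, n, f);
    fclose(f);
    return got == n ? 0 : -1;
}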

One common thing to do for a performance-critical application like OpenVPN, for example to forward packets faster on a consumer router, is to patch the source code to use the latter file. The rationale for such a patch is that since OpenVPN consumes many resources doing its job, the VPN tunnel may be unstable or slow under certain conditions if it blocks. Nota bene that this doesn’t mean a patched variant of OpenVPN is inherently insecure; it might just not be as secure as it theoretically could have been, which is something else. The ciphers used for the tunnel are, in this case, more likely to weaken the theoretical security by orders of magnitude than this optimization is.

In Linux, the behavior of ‘/dev/random’ has changed in recent years to block only until the CSPRNG has been initialized, so these days it behaves more like ‘/dev/urandom’ in terms of performance. The underlying logic of the interfaces has been unified as well, so their behaviors are identical except for the small difference mentioned above.

For functions that implement security per se, and that usually rely on “true randomness”, the above optimization can become a hazard depending on OS-specific implementation details. An example is key-generating functions, which are not supposed to be deterministic at all. We might then end up with keys that can be broken with a simple brute-force attack, especially when we don’t have enough entropy.

If unsure which interface to use or if performance doesn’t matter, always use what is considered to be the most secure one.

Secure memset()

Using memset() to clear memory typically works very well, but the compiler can optimize the call away without the programmer noticing. Most of the time this optimization is valid and doesn’t affect the running program. When dealing with cryptography, however, don’t count on the call surviving compilation: unless the cleared buffer is used later in the code path, the compiler may remove the memset() through dead-store elimination. The sensitive data will then remain accessible and exposed, since the memory was never cleared. Therefore C11 provides memset_s() as a secure variant, which also has more stringent error checking; a usage sketch follows below.
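A sketch of the C11 variant, assuming an implementation that provides the optional Annex K bounds-checking interfaces:

#define __STDC_WANT_LIB_EXT1__ 1 /* request Annex K before including string.h */
#include <string.h>

void wipe_key(void) {
    unsigned char key[32];
    /* ... use key for cryptographic operations ... */
#ifdef __STDC_LIB_EXT1__
    memset_s(key, sizeof(key), 0, sizeof(key)); /* guaranteed not to be elided */
#endif
}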

If memset_s() is not available, one can use the following snippet:

void *secure_memset(void *v, int c, size_t n) {
    volatile unsigned char *p = v;
    while (n--) {
        *p++ = (unsigned char)c; /* volatile write the compiler should not elide */
    }
    return v;
}

Bear in mind that the actual result of this specific implementation may vary depending on the compiler’s respect for the 'volatile' keyword. Some compilers are known to optimize away even code like the above when aggressive optimizations are enabled; a more robust sketch follows below.
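One common workaround, used in several cryptographic libraries, is to call memset() through a volatile function pointer; the compiler cannot prove what the pointer targets at the call site and therefore cannot elide the call. A sketch:

#include <string.h>

/* The volatile qualifier forces the compiler to actually perform the call */
static void *(*const volatile memset_fn)(void *, int, size_t) = memset;

static void secure_wipe(void *p, size_t n) {
    memset_fn(p, 0, n);
}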

This example of information leakage leads us to constant-time code and how not to write it.

Constant-time code

Some code in cryptographic computations is sensitive to timing attacks and must be written with extreme care, in a way that makes the function always take the same amount of time for inputs of the same length.

Even for very experienced programmers, writing constant-time code is most of the time extremely difficult. In a post-Spectre world (including similar speculative-execution attacks), it is harder than ever to write correct cryptographic code, since the branching itself might leak timing information. Hence the need to eliminate as many branches as possible, a related concept called branch-free code, sketched below.
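As a small sketch of branch-free building blocks, here are a constant-time equality test and a constant-time select, assuming 32-bit unsigned arithmetic (the function names are illustrative):

#include <stdint.h>

/* Returns 1 if x == y, otherwise 0, without data-dependent branches */
static uint32_t ct_eq(uint32_t x, uint32_t y) {
    uint32_t z = x ^ y;                /* zero iff x == y */
    return 1 ^ ((z | (0u - z)) >> 31); /* top bit of z | -z is set iff z != 0 */
}

/* Returns a when cond == 1 and b when cond == 0, without branching */
static uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = 0u - cond;         /* all ones or all zeros */
    return (mask & a) | (~mask & b);
}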

So let’s take a look at a common implementation of strcmp() below.

int strcmp (p1, p2)
     const char *p1;
     const char *p2;
{
  register const unsigned char *s1 = (const unsigned char *) p1;
  register const unsigned char *s2 = (const unsigned char *) p2;
  register unsigned char c1, c2;
  do
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

Constant-time code should follow the ARX principle (Add, Rotate, Xor/Or), to guarantee constant-time properties. These hardware constructs are widely regarded in textbooks as “safe” and deterministic (one could admittedly implement an ALU (Arithmetic Logic Unit) that takes a random amount of time before returning its results, but that’s a pretty bad sales pitch). As seen above, strcmp() instead uses subtraction combined with branching to make the comparison.

So let’s look at a constant-time variant.

int str_iseq(const char *s1, const char *s2)
{
    int m = 0;
    volatile size_t i = 0;
    volatile size_t j = 0;
    volatile size_t k = 0;

    if (s1 == NULL || s2 == NULL)
        return 1;

    while (1) {
        m |= s1[i] ^ s2[j]; /* accumulate differences without branching on them */
        if (s1[i] == '\0')
            break;
        i++;
        if (s2[j] != '\0')
            j++;
        if (s2[j] == '\0')
            k++; /* dummy work to keep each iteration's cost uniform */
    }
    return m;
}

It only tells us whether the strings are equal or not (zero when equal, non-zero otherwise), which might be insufficient, but it proves the difficulty of writing this kind of code. The interesting part is how it uses xor and or to combine and compare the individual characters. The variable 'k' ensures that the loop always ends with an increment operation and hence needs 'volatile' to not be optimized away; the compiler is the enemy when it comes to writing cryptographically correct code. More information explaining the function in detail can be found at this link.

A less contrived solution is to use the constant-time memcmp() in OpenSSL and derivatives as seen below.

int CRYPTO_memcmp(const void *in_a, const void *in_b, size_t len)
{
size_t i;
const unsigned char *a = in_a;
const unsigned char *b = in_b;
unsigned char x = 0;
for (i = 0; i < len; i++)
x |= a[i] ^ b[i];
return x;
}

Since the length is known a priori, the complexity is reduced; moreover, the code and implementation are clearer (the implementation is well done, though I’d personally initialize 'i' at declaration). The first example implicates branching issues which must be addressed; hence, use a constant-time memcmp() for cryptographic code where you know the sizes of the strings a priori! A hypothetical usage sketch follows.
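For example, comparing a received authentication tag against the expected one (buffer names and the 32-byte size, e.g. an HMAC-SHA-256 output, are illustrative):

unsigned char expected_tag[32], received_tag[32];
/* ... compute expected_tag, read received_tag from the wire ... */
if (CRYPTO_memcmp(expected_tag, received_tag, sizeof(expected_tag)) != 0) {
    /* authentication failure: reject the message */
}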

Remember that the theory behind cryptographic functionality is usually well-assessed and scrutinized by academics. What the attacker usually attacks is a specific implementation of the theoretical work (an example of the contrary is MD5’s predecessor, MD4, which was flawed by design, but let’s leave those things to Bruce Schneier and his peers).

Misc. cryptography

A note on perfect forward secrecy might be appropriate as well. Perfect forward secrecy (PFS) guarantees that if an attacker manages to break the key protecting one message in a stream of encrypted messages, they cannot read the rest. It’s a nice property which we usually want to have, and it can be implemented by using the ephemeral variants of the Diffie-Hellman key exchange. Just make sure that the cipher suites in use support this, and everything will be fine; a configuration sketch follows below.
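For example, with OpenSSL one can restrict a context to ephemeral (ECDHE) key exchanges. A sketch, assuming OpenSSL 1.1.x; the cipher string is illustrative:

#include <openssl/ssl.h>

SSL_CTX *ctx = SSL_CTX_new(TLS_method());
/* Only ephemeral ECDHE suites, so every session gets fresh key material */
if (ctx == NULL || SSL_CTX_set_cipher_list(ctx, "ECDHE+AESGCM:ECDHE+CHACHA20") != 1) {
    /* handle error */
}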

When transporting data via SSL/TLS or similar protocols, this can be done using block ciphers together with hash digests (for validating the authenticity of the data). This is rather slow in terms of performance unless hardware acceleration is available. There are symmetric cipher modes, called “authenticated encryption with associated data” (AEAD), that remove the need for separate hash digests. These modes reduce the complexity for the programmer and slightly increase performance. To truly increase performance when using symmetric encryption, it’s always better to use stream ciphers, like Salsa20 or ChaCha20, which are novel innovations developed by the famed cryptographer Daniel J. Bernstein. These ciphers are on par with 128-bit AES-GCM in terms of security but around 3–4 times faster in terms of performance. Most of the common cryptographic libraries support at least one of the two ciphers.
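A minimal sketch of AEAD in practice, using ChaCha20-Poly1305 via OpenSSL’s EVP interface (assumes OpenSSL 1.1.0 or newer; the function name and fixed sizes are illustrative, and error handling is abbreviated):

#include <openssl/evp.h>

int aead_encrypt(const unsigned char key[32], const unsigned char nonce[12],
                 const unsigned char *aad, int aad_len,
                 const unsigned char *pt, int pt_len,
                 unsigned char *ct, unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ok = (ctx != NULL);

    ok = ok && EVP_EncryptInit_ex(ctx, EVP_chacha20_poly1305(), NULL, key, nonce);
    ok = ok && EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len); /* feed the AAD */
    ok = ok && EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len);     /* encrypt */
    ok = ok && EVP_EncryptFinal_ex(ctx, ct + len, &len);
    ok = ok && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_GET_TAG, 16, tag);

    EVP_CIPHER_CTX_free(ctx);
    return ok ? 0 : -1;
}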

A common problem worth mentioning for embedded devices is that they sometimes store a fixed-size certificate/key in some sort of non-volatile RAM. This is done by design, and it is in most cases the most convenient way to avoid recreating a new certificate/key every time the device reboots. But when you later want or need to create a new certificate/key, it may not fit into the reserved area of the non-volatile RAM. If possible, reserve double the amount of memory for storing certificates compared to what you will actually use when starting production. There might be a viable attack on the cipher suites you use today disclosed tomorrow, and then changing the key size or the ciphers might be very difficult by design, since the new certificate/key will most likely be larger. Much of the future is uncertain, but the fact that we will need larger keys and certificates is for sure.

When using asymmetric cryptography in embedded systems, there is also a great misunderstanding of how to use the private key. It’s private in the sense that only one copy of it should exist. This copy should be used for signing binaries and for encrypting the passwords used for symmetric encryption, i.e. AES; symmetric encryption should then be used when encrypting/decrypting, for example, a firmware image. For this purpose, the private key of the key pair should never be on the device. There can, of course, be another private key on the device for setting up SSL/TLS etcetera, and that private key should be unique to the device. Since having a unique private key might not always be possible, the same key pair must never be used for these two separate purposes.

The private key, from which we can easily derive the public key, cannot encrypt large messages. This is confusing for some people, since that usually can be done with the public key. What I encountered in one case was a well-known router vendor that came up with the idea of reversing the roles of the keys: encrypt the firmware with their public key, and decrypt it with the private key. As the reader might have guessed, the private key could be extracted from the devices fairly easily, and given how the keys are mathematically related, the public key can be reproduced in a matter of seconds when you have the private key. Then anyone who has access to the private key can sign and encrypt their own firmware files for the devices. It goes without saying that not all people’s intentions are benevolent; handing them the private key is a bad idea, and a terrible one when you have used the same private key on hundreds of thousands of devices. A sketch of the correct on-device verification flow follows below.
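A sketch of on-device firmware verification: only the vendor’s public key lives on the device, verifying a signature made offline with the private key. Assumes OpenSSL 1.1.1 or newer; the function name, digest choice, and file layout are illustrative:

#include <openssl/evp.h>
#include <openssl/pem.h>
#include <stdio.h>

int verify_firmware(const char *pubkey_pem, const unsigned char *image,
                    size_t image_len, const unsigned char *sig, size_t sig_len)
{
    FILE *f = fopen(pubkey_pem, "r");
    if (f == NULL)
        return -1;
    EVP_PKEY *pkey = PEM_read_PUBKEY(f, NULL, NULL, NULL); /* public key only */
    fclose(f);
    if (pkey == NULL)
        return -1;

    EVP_MD_CTX *md = EVP_MD_CTX_new();
    int ok = (md != NULL)
          && EVP_DigestVerifyInit(md, NULL, EVP_sha256(), NULL, pkey) == 1
          && EVP_DigestVerify(md, sig, sig_len, image, image_len) == 1;

    EVP_MD_CTX_free(md);
    EVP_PKEY_free(pkey);
    return ok ? 0 : -1; /* 0: signature valid, -1: reject the image */
}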

Summary and moving forward

The main ideas from the article can be summarized as follows:

  • Cryptography is difficult; use the pre-existing, well-tested libraries, keeping in mind that even these have been shown to have weaknesses after many years in production environments.
  • Using established cryptography libraries also works in your favor in terms of responsibility: when an accident happens in a well-tested library, you are not the one to blame.
  • Be 100% sure that you know the impacts of what you are doing when using cryptography in your code.
  • If it’s possible, test your code with software floating-point enabled.
  • Be careful when using longjmp()/setjmp().

Cryptography is, as said, very difficult to implement in a way that’s both fast and secure; usually there is a trade-off between the two. Learn to distinguish when you should use an extra layer of security and when you shouldn’t. If a vulnerability related to cryptography is disclosed, then as a programmer you shouldn’t worry too much unless you were the one who implemented it; you are not alone in having the issue, and a fix will soon be ready. You should, of course, have the code patched as soon as possible. If history has taught us something in terms of computer programming, it is that even the most proficient experts can make mistakes. It’s rarely fair to anyone involved to play a blame game in these situations.

Mistakes will happen, like it or not. But provided the tools and knowledge of how they might happen, we can try our best to mitigate the risks. I have not touched on the concept of code review in my articles, because I got the feeling that it is a more generic subject than the topics I have covered. Of course, code review is a good tool, provided that it’s used correctly and that all parties understand most of the actual code and many of its implications. But we shouldn’t rely on code review alone, any more than we should rely solely on any other single method to mitigate the risks; what we need to do is combine them.

“Given enough eyeballs, all bugs are shallow.”

This is the final article in the series on advanced C concerning software bugs. I’ve got a vague idea of writing a couple more general articles on BSD sockets next.
