Fun with Strncpy

Mark Aylett
Published in Reactive Markets
Jun 18, 2020

The strncpy() function may result in buffer overruns when used incorrectly. Use of this function is understandably frowned upon by many developers. A fresh look at the API, however, reveals some genuine uses for this classic C function that are still applicable today.

This article aims to:

  1. rediscover some of the original motivations for this often misunderstood function;
  2. clarify how and when to use this function safely and effectively in modern code.

Background

A word of caution: the dangers of misusing strncpy() are real. It should only be used when the benefits outweigh the risks and it should always be used with extreme care. Higher level, safer APIs should be preferred for the vast majority of use-cases outside of specialist application domains.

The team at Reactive Markets have worked on many low-latency trading systems over the years. In this domain, applications often have latency requirements in the low microsecond or even nanosecond range. Achieving these latencies requires messaging protocols and encodings that are sympathetic to high performance data structures.

In some cases, it may be appropriate to simply copy bytes from memory to wire, as the example in the next section demonstrates, but this is not always practical or possible due to competing concerns and variable length data elements.

ITCH and SBE are good examples of financial protocols that attempt to balance the need for tight, compact data on the wire and contiguous, aligned in-memory data structures.

Binary Messaging

Let’s start with a motivational example to help set the stage for the discussion that follows. Imagine that we have been asked to design a binary message format with the following fields:

  • id: a 64 bit integer identifier;
  • user: an alpha-numeric string of no more than 8 bytes in length.

Arguably the most efficient way to represent this data structure in memory is as follows:

struct Message {
    int64_t id;
    char user[8];
};

Each field is 8 bytes in length and they are adjacent in memory with no padding between them. It turns out that this simple data structure also works reasonably well as a binary message format. In this case, the memory layout and the message format are the same, so no encoding or decoding is required when reading and writing messages; bytes can simply be copied to and from the network.
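As a minimal sketch of what "simply copied" could look like, the following checks the layout at compile time and moves the bytes with memcpy(). The encode() and decode() helpers are hypothetical names introduced for illustration, and the sketch assumes that both ends of the connection share the same byte order and structure layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct Message {
    std::int64_t id;
    char user[8];
};

// The layout properties the text relies on, checked at compile time.
static_assert(sizeof(Message) == 16, "expected no padding");
static_assert(offsetof(Message, user) == 8, "user should follow id directly");
static_assert(std::is_trivially_copyable_v<Message>, "required for memcpy");

// Hypothetical helper: copy the in-memory bytes into a wire buffer.
inline void encode(const Message& msg, unsigned char (&wire)[sizeof(Message)]) noexcept
{
    std::memcpy(wire, &msg, sizeof(msg));
}

// Hypothetical helper: copy the wire bytes back into a Message.
inline Message decode(const unsigned char (&wire)[sizeof(Message)]) noexcept
{
    Message msg;
    std::memcpy(&msg, wire, sizeof(msg));
    return msg;
}

// Round trip: what goes onto the wire comes back unchanged.
inline void round_trip_example()
{
    const Message in{42, {'m', 'a', 'r', 'k'}}; // remaining bytes are zero
    unsigned char wire[sizeof(Message)];
    encode(in, wire);
    const Message out = decode(wire);
    assert(out.id == 42);
    assert(std::memcmp(out.user, "mark\0\0\0\0", sizeof(out.user)) == 0);
}
```

The static assertions turn the layout assumptions into compile errors if a platform or a later edit to the structure breaks them.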

So far so good, but if we assume that the user field could be less than 8 bytes, or even empty, how do we know how long it is? We could introduce a separate length field, but this is redundant when the string is exactly 8 bytes in length and, perhaps more seriously, our data structure no longer packs neatly into 16 bytes; a typical compiler will pad it to 24:

struct Message {
    int64_t id;
    int8_t user_len;
    char user[8];
};
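The cost of the extra field can be checked at compile time. This is a small sketch only: the name MessageWithLen is introduced here to avoid clashing with the earlier Message, and the exact amount of padding is platform-dependent (7 bytes, for a total of 24, on common 64-bit targets).

```cpp
#include <cstddef>
#include <cstdint>

struct MessageWithLen {
    std::int64_t id;
    std::int8_t user_len;
    char user[8];
};

// The three fields need 8 + 1 + 8 = 17 bytes before any padding.
constexpr std::size_t field_bytes = sizeof(std::int64_t) + sizeof(std::int8_t) + 8;

// The structure can no longer fit in 16 bytes...
static_assert(sizeof(MessageWithLen) > 16);
// ...and its size is rounded up to a multiple of its alignment.
static_assert(sizeof(MessageWithLen) % alignof(MessageWithLen) == 0);

// Bytes of padding the compiler added on this platform (7 when the size is 24).
constexpr std::size_t padding_bytes = sizeof(MessageWithLen) - field_bytes;
```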

One solution to this problem, and the one heavily adopted by the C programming language, is to use a zero byte to indicate the end of a string. This is where the C family of string functions come in and why they are still useful when dealing with low-level data structures.

An important point to note regarding the user field in our example is that it is only zero-terminated when the string is less than 8 bytes in length.

Buffer Overruns

The C family of string functions are designed for manipulating zero-terminated strings. Before turning our attention to strncpy(), let's take a brief look at its close cousin, the strcpy() function.

The strcpy() function copies a zero-terminated source string to a destination buffer. It will result in a buffer overrun if the source string, including its zero terminator, is larger than the destination:

char buf[6];
strcpy(buf, "Aylett");

The problem in this example is that a zero terminator is written beyond the end of the buffer:

+---+---+---+---+---+---+
| A | y | l | e | t | t |\0
+---+---+---+---+---+---+ ^
  0   1   2   3   4   5   write overrun

Programmers often reach for strncpy() as a possible solution, believing it to be a "bounded" version of strcpy():

char buf[6];
strncpy(buf, "Aylett", sizeof(buf));

In this example, however, the function will stop copying characters when the destination buffer is full, and the resulting string will not be zero-terminated. This may result in a “read overrun” if the consumer expects a C-style (zero-terminated) string:

+---+---+---+---+---+---+
| A | y | l | e | t | t | ?
+---+---+---+---+---+---+ ^
  0   1   2   3   4   5   no zero terminator
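The missing terminator can be observed without reading past the buffer by searching only the six bytes that belong to it. The helper name terminated_after_strncpy() is hypothetical and exists purely for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Copies src into a 6-byte buffer with strncpy() and reports whether the
// buffer ends up holding a zero terminator.
inline bool terminated_after_strncpy(const char* src) noexcept
{
    char buf[6];
    std::strncpy(buf, src, sizeof(buf));
    return std::memchr(buf, '\0', sizeof(buf)) != nullptr;
}

inline void termination_example()
{
    assert(terminated_after_strncpy("Mark"));    // 4 characters: zero-padded
    assert(!terminated_after_strncpy("Aylett")); // 6 characters: buffer is full
}
```

Because memchr() is bounded by sizeof(buf), the check itself never touches memory outside the buffer.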

This often leads programmers to question why strncpy() was designed with such a seemingly broken contract and they perhaps abandon it altogether. In order to answer this question, we need to stop thinking of strncpy() as a bounded version of strcpy(), and revisit one of the original motivations for its design.

Data Leaks

An advantage of low-level languages is that they allow programmers to lay out their data structures in contiguous regions of memory:

struct Person {
    char forename[32];
    char surname[32];
};
static_assert(sizeof(Person) == 64);
static_assert(std::is_pod_v<Person>);

Tightly packed data such as this are ideal for binary messaging and file formats. When populating such structures, however, programmers must take care to avoid security vulnerabilities and data leaks:

Person person;
strcpy(person.forename, "Mark");
strcpy(person.surname, "Aylett");

If this data structure were copied verbatim onto the wire, then the use of strcpy() in this example would present a security risk. To understand why, first consider the memory locations immediately following the zero terminator:

+---+---+---+---+---+---+-------+---+
| M | a | r | k |\0 | ? |  ...  | ? |
+---+---+---+---+---+---+-------+---+
  0   1   2   3   4   5    ...    31

The problem is that the “person” object is not zero-initialised, so these memory locations will contain arbitrary data from the stack, which is subsequently leaked onto the wire. If these locations happen to contain sensitive data from old stack objects, then your system is now at risk of attack.
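Reading genuinely uninitialised memory is undefined behaviour, so the sketch below simulates stale stack contents by filling the object with a marker byte first. The stale_bytes() helper and the two demonstration functions are hypothetical names; the point is only to show that strcpy() leaves everything after the terminator untouched, and that value-initialising the object is one simple way to avoid the leak.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Person {
    char forename[32];
    char surname[32];
};

// Counts the non-zero bytes that follow the first zero terminator, i.e.
// the bytes that would leak if the field were sent verbatim.
template <std::size_t SizeN>
inline std::size_t stale_bytes(const char (&field)[SizeN]) noexcept
{
    std::size_t i{0};
    while (i < SizeN && field[i] != '\0') {
        ++i;
    }
    std::size_t count{0};
    for (; i < SizeN; ++i) {
        if (field[i] != '\0') {
            ++count;
        }
    }
    return count;
}

// Simulates an object placed on memory that still holds old data.
inline std::size_t leak_with_strcpy() noexcept
{
    Person person;
    std::memset(&person, 'S', sizeof(person)); // stand-in for stale stack data
    std::strcpy(person.forename, "Mark");
    return stale_bytes(person.forename);
}

// Value-initialising the object zeroes every byte before the copy.
inline std::size_t leak_with_zero_init() noexcept
{
    Person person{};
    std::strcpy(person.forename, "Mark");
    return stale_bytes(person.forename);
}

inline void leak_example()
{
    assert(leak_with_strcpy() == 27);   // bytes 5..31 still hold 'S'
    assert(leak_with_zero_init() == 0); // nothing left to leak
}
```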

Message Padding

To avoid exposing private memory locations, we must ensure that any unused gaps in our data structures are suitably filled or padded. What we need is a function that copies the source string to the destination, and then fills any remaining bytes in the destination buffer with a padding character:

void pstrcpy(char* dst, const char* src, size_t n) noexcept
{
    // Copy source to destination.
    size_t i{0};
    for (; i < n && src[i] != '\0'; ++i) {
        dst[i] = src[i];
    }
    // Pad remaining bytes with space.
    for (; i < n; ++i) {
        dst[i] = ' ';
    }
}

The API can be further simplified by inferring the size of the destination buffer using a function template:

template <size_t SizeN>
inline void pstrcpy(char (&dst)[SizeN], const char* src) noexcept
{
    // ...
}

This function can then be called as follows:

char buf[6];
pstrcpy(buf, "Mark");

Which will result in a “space padded” string:

+---+---+---+---+---+---+
| M | a | r | k |   |   |
+---+---+---+---+---+---+
  0   1   2   3   4   5

Note that it is technically more correct to refer to this string encoding as “space padded” rather than “space terminated”, because the string will not be space terminated when the source string length is greater than or equal to the destination.
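Putting the two overloads together, one plausible way to complete the array version is to forward the deduced size to the pointer version. That forwarding body is an assumption, since the original leaves it elided, and padded6() is a hypothetical helper used only to make the results easy to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pointer-and-size version, as described in the text.
inline void pstrcpy(char* dst, const char* src, std::size_t n) noexcept
{
    std::size_t i{0};
    for (; i < n && src[i] != '\0'; ++i) {
        dst[i] = src[i];
    }
    for (; i < n; ++i) {
        dst[i] = ' ';
    }
}

// Assumed body for the array overload: forward the deduced size.
template <std::size_t SizeN>
inline void pstrcpy(char (&dst)[SizeN], const char* src) noexcept
{
    pstrcpy(dst, src, SizeN);
}

// Hypothetical helper: copy src into a 6-byte field and return all 6 bytes.
inline std::string padded6(const char* src)
{
    char buf[6];
    pstrcpy(buf, src);
    return std::string(buf, sizeof(buf));
}

inline void padding_example()
{
    assert(padded6("Mark") == "Mark  ");         // padded with two spaces
    assert(padded6("Aylett") == "Aylett");       // exact fit: no padding at all
    assert(padded6("Aylett-Smith") == "Aylett"); // longer source is truncated
}
```

The second and third cases show why "space padded" is the more accurate term: a full buffer contains no trailing space.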

Strncpy Contract

We are finally ready to revisit the strncpy() contract. The FreeBSD man page states that:

strncpy() copies at most len characters from src into dst. If src is less than len characters long, the remainder of dst is filled with \0 characters.

Does this specification look familiar? It is precisely the function described in the previous section, except the pad character is now a zero (\0) rather than a space.
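The padding behaviour is easy to confirm by copying into a buffer that has been deliberately filled with a marker byte beforehand. In this sketch, copy_into_dirty_field() is a hypothetical helper and the 8-byte size simply mirrors the user field from the earlier Message example.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// Copies src into an 8-byte field pre-filled with 'X', so that any byte
// strncpy() left untouched would remain visible in the result.
inline std::array<char, 8> copy_into_dirty_field(const char* src) noexcept
{
    std::array<char, 8> field;
    field.fill('X');
    std::strncpy(field.data(), src, field.size());
    return field;
}

inline void zero_padding_example()
{
    // Short source: the four remaining bytes are all overwritten with zeros.
    const auto padded = copy_into_dirty_field("Mark");
    assert(std::memcmp(padded.data(), "Mark\0\0\0\0", padded.size()) == 0);

    // Source fills the field exactly: eight characters and no zero byte.
    const auto full = copy_into_dirty_field("Aylett12");
    assert(std::memcmp(full.data(), "Aylett12", full.size()) == 0);
}
```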

The following idiom is often used when zero termination is unconditionally required:

char buf[6 + 1];
strncpy(buf, "Aylett", sizeof(buf) - 1);
buf[sizeof(buf) - 1] = '\0';

This idiom is neatly encapsulated by the following function template, which avoids common programming errors regarding the buffer size:

template <size_t SizeN>
inline void zstrcpy(char (&dst)[SizeN], const char* src) noexcept
{
    strncpy(dst, src, SizeN - 1);
    dst[SizeN - 1] = '\0';
}
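A short usage sketch shows what callers can expect from zstrcpy(), in particular that an over-long source is silently truncated rather than reported. The copy6() helper is a hypothetical name, and the buffer is filled with a marker byte first so that the terminator cannot be present by accident.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

template <std::size_t SizeN>
inline void zstrcpy(char (&dst)[SizeN], const char* src) noexcept
{
    std::strncpy(dst, src, SizeN - 1);
    dst[SizeN - 1] = '\0';
}

// Hypothetical helper: copy src into a buffer with room for 6 characters
// plus a terminator, then return the resulting C string.
inline std::string copy6(const char* src)
{
    char buf[6 + 1];
    std::memset(buf, 'X', sizeof(buf)); // dirty buffer to prove termination
    zstrcpy(buf, src);
    return std::string{buf};
}

inline void zstrcpy_example()
{
    assert(copy6("Mark") == "Mark");
    assert(copy6("Aylett") == "Aylett");       // exact fit is still terminated
    assert(copy6("Aylett-Smith") == "Aylett"); // truncated to 6 characters
}
```

If truncation needs to be detected rather than tolerated, the caller would have to compare the source length against the buffer size separately.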

Read Safety

Although we now have a better understanding of the strncpy() contract, we still need a solution for reading these zero-padded strings without overrunning the buffer. Consider the following member function:

const char* surname() const noexcept
{
    return surname_;
}

If surname_ is a zero-padded string that fills the entire buffer, leaving no room for padding, then it will not be zero-terminated and the unwary user will likely overrun the buffer when reading. Assuming that we do not need a zero-terminated string, the cleanest solution is to return a std::string_view:

std::string_view surname() const noexcept
{
    // surname_ is assumed to be a fixed-size char array member.
    return {surname_, strnlen(surname_, sizeof(surname_))};
}
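The write and read sides can be combined into one self-contained sketch. The class name PersonRecord, the 8-byte field size, and the bounded_len() helper are all assumptions made for illustration; bounded_len() plays the role of POSIX strnlen() but is written with memchr() so that the sketch relies only on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Length of the string in a fixed-size field, never reading past the field.
template <std::size_t SizeN>
inline std::size_t bounded_len(const char (&s)[SizeN]) noexcept
{
    const void* end = std::memchr(s, '\0', SizeN);
    return end ? static_cast<std::size_t>(static_cast<const char*>(end) - s) : SizeN;
}

// Hypothetical class holding a zero-padded, fixed-length surname.
class PersonRecord {
  public:
    explicit PersonRecord(const char* surname) noexcept
    {
        // Zero-pads a short name; a name that fills the field is left
        // without a terminator, which is fine for this encoding.
        std::strncpy(surname_, surname, sizeof(surname_));
    }
    std::string_view surname() const noexcept
    {
        return {surname_, bounded_len(surname_)};
    }

  private:
    char surname_[8];
};

inline void read_safety_example()
{
    assert(PersonRecord{"Aylett"}.surname() == "Aylett");
    assert(PersonRecord{"Aylett12"}.surname() == "Aylett12");     // full field
    assert(PersonRecord{"Aylett-Smith"}.surname() == "Aylett-S"); // truncated
}
```

The returned view carries its own length, so callers never need to look for a terminator that may not be there.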

Again, the key point is to think of strncpy() as a function that copies a C-style string into a fixed length buffer, and then zero pads any remaining bytes.

Summary

Taking some time to revisit the fundamentals can lead to new insights and fresh perspectives.

In summary:

  • strncpy() is a genuinely useful function that is unfortunately often misunderstood;
  • before using strncpy(), decide whether you actually want a zero-terminated string or a zero-padded fixed length string;
  • if a zero-padded fixed length string is required, then use strncpy() to write the string and std::string_view to read the string.

The functions mentioned in this article and many more besides are freely available in our open source toolbox-cpp project. Please also check out the Reactive Markets website if you would like to know more about us and how our Platform is powering a new generation of professional traders.

See you next time!
