How to detect and fix a buffer overflow

Alexander Entinger
Arduino Engineering
7 min read · Aug 21, 2024

Buffer overflows are among the most difficult categories of bugs to detect and locate on an embedded system. A buffer overflow in one part of the code can have mysterious side effects in other parts of the code, and not necessarily at the time the overflow occurs.

I recently ran into a very peculiar error induced by a buffer overflow while working on an industrial communication stack for the Arduino Opta PLC family. For a feasibility study, I was adding Modbus RTU communication via RS485 to this stack. While the stand-alone sketch testing Modbus RTU communication with the sensor worked just fine, interesting side effects occurred after integrating the Modbus communication with the industrial communication stack: the Arduino Opta’s relays started to toggle every time the Opta initiated an RS485 communication.

My first thought was that there had to be a bug somewhere within ArduinoModbus or ArduinoRS485. When thorough static code analysis didn’t yield any insights, I extended my analysis to ArduinoCore-mbed, on which the Arduino Opta is based. Having run out of ideas about possible root causes for the observed behavior, I was left with no other option than breaking open my Opta’s case, locating the Serial Wire Debug interface, ordering and soldering a 10-pin SWD header, and finally attaching my Segger J-Link debug probe.

Arduino Opta WiFi being debugged using Segger JLink (RS485 Modbus device to the right)

Debugging an Arduino sketch requires a number of preparations, one of which is to rebuild your Arduino core so that it contains debug information. For ArduinoCore-mbed based cores this is done by invoking the following shell commands:

cd ~/Arduino/hardware
mkdir -p arduino-git
cd arduino-git
git clone --recursive https://github.com/arduino/ArduinoCore-mbed
cd ArduinoCore-mbed
./mbed-os-to-arduino -a -g OPTA:OPTA

After rebuilding and uploading the code, I carefully stepped through it using Segger’s Ozone (a debugger and performance analyzer with seamless integration of the company’s debug probes), paying special attention to the parts of the code base where RS485 communication was taking place.

While doing so I quickly found that — at the time of RS485 communication — two private member variables within the RS485 class called _txPin and _rePin had been corrupted. Both variables contain an integer value which is used to identify a specific GPIO pin required for RS485 communication.

Private member variables _txPin and _rePin of RS485 class contain corrupted values

I was able to detect the manipulation by comparing the values of the member variables immediately after initialization and during attempted Modbus communication.

Correct values of _txPin and _rePin captured immediately after initialization

While _txPin is only used during initialization, _rePin (receive enable) is used regularly throughout the program to disable data reception during transmission and to re-enable it after transmission:

void RS485Class::noReceive()
{
  if (_rePin > -1) {
    digitalWrite(_rePin, HIGH);
  }
}

void RS485Class::receive()
{
  if (_rePin > -1) {
    digitalWrite(_rePin, LOW);
  }
}

By Arduino convention, digital inputs and outputs are identified by integer values. The value 0 identifies D0, which on the Arduino Opta coincides with RELAY1.

#define D0  (0u)
/* ... */
#define RELAY1 (D0)

Consequently, with the corrupted _rePin holding the value 0, RELAY1 was activated during every RS485 communication instead of the actual receive enable pin. Having thus identified the source of the unexpected relay activation (and of the non-working Modbus communication), I set out to find out why the private member variables _txPin and _rePin had changed.
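The effect can be reproduced on a desktop machine with a small simulation. This is only a sketch: FakeRS485, fakeDigitalWrite and the initial pin number 7 are hypothetical stand-ins, not the ArduinoRS485 implementation, and the corrupted value is assumed to be 0, consistent with the NULL write shown further down in this article.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical host-side stand-ins for the Arduino API.
constexpr int kHigh = 1;
constexpr int kLow = 0;
constexpr int RELAY1_PIN = 0;  // D0 on the Opta, per the defines above

static int lastWrittenPin = -1;

// Records and prints the pin that would have been driven.
void fakeDigitalWrite(int pin, int level) {
    lastWrittenPin = pin;
    std::printf("digitalWrite(pin=%d, level=%d)\n", pin, level);
}

// Mirrors the guard logic of RS485Class::noReceive() and receive().
struct FakeRS485 {
    int rePin;
    void noReceive() { if (rePin > -1) fakeDigitalWrite(rePin, kHigh); }
    void receive()   { if (rePin > -1) fakeDigitalWrite(rePin, kLow); }
};

// Returns the pin that was actually driven during one simulated transmission.
int simulateCorruptedTransmit() {
    FakeRS485 bus{7};  // assumption: 7 is an arbitrary "real" receive enable pin
    bus.rePin = 0;     // simulated corruption: a NULL pointer was written here
    bus.noReceive();   // before transmitting
    bus.receive();     // after transmitting
    return lastWrittenPin;
}

// Convenience check; call it from main() or a test harness.
void runSimulation() {
    assert(simulateCorruptedTransmit() == RELAY1_PIN);
}
```

Because 0 is a perfectly valid pin number, the `_rePin > -1` guard cannot tell a corrupted value from a legitimate one, so both calls end up driving pin 0, which on the Opta is RELAY1.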

Carefully stepping through the code while keeping an eye on the values of _txPin and _rePin turned out to be a futile exercise. The only fact I managed to ascertain was that at some point in time the values of the private member variables miraculously changed. To further complicate matters, the data corruption occurred at different times during the debugging process, indicating that it could be caused by code being executed in another thread (ArduinoCore-mbed is based on ARM mbedOS and runs multiple threads besides the user sketch).

With both my frustration and caffeine consumption reaching dangerously high levels, I recalled that there was another debugging tool available to me: the data breakpoint. Data breakpoints enable the debug probe to halt execution when data is read from or written to a specific memory location. While this is a powerful tool, it can’t be used for every problem, because the same memory location can hold different variables over the lifetime of a program. In this special case, however, the RS485 class was instantiated as a global object, meaning that the memory locations of _txPin and _rePin are constant during the whole lifetime of the firmware.

Setting a data breakpoint to halt execution flow when data is written to _rePin

Re-running the firmware with the data breakpoint enabled, I was quickly able to identify the offending code. It was part of my port of the industrial communication stack to the Arduino Opta, and it performed an illegal write to both _txPin and _rePin:

int mbed_wrapper_addrinfo(const char* hostname, const char* portstr, struct addrinfo* hints, struct addrinfo** info)
{
  if (hostname == NULL)
  {
    static const char* localhost = "localhost";
    hostname = localhost;
    static SocketAddress _hints("localhost", atoi(portstr));
    _hints.set_ip_address("127.0.0.1");
    hints = (struct addrinfo*)&_hints;
  }

  auto ret = NetworkInterface::get_default_instance()->getaddrinfo(hostname, (SocketAddress*)hints, (SocketAddress**)info);
  hints->ai_addr = (struct sockaddr*)hints; /* _txPin corrupted. */
  hints->ai_next = NULL; /* _rePin corrupted. */
  info[0] = (struct addrinfo*)hints;
  return 0;
}

Taking a closer look at the memory addresses of hints->ai_addr (0x2404 AEF0) and hints->ai_next (0x2404 AEF8), it turns out that they are the same as those of _txPin (0x2404 AEF0) and _rePin (0x2404 AEF8).

Now what is going on here? How can writing values to a C struct concerned with network name resolution lead to data corruption in an RS485 communication class? Let’s take a closer look:

The C struct addrinfo, which is passed by pointer as an input parameter to mbed_wrapper_addrinfo, is defined as:

struct sockaddr {
  SocketAddress ai;
};

struct addrinfo {
  SocketAddress ai;
  int ai_flags;
  int ai_family;
  int ai_socktype;
  int ai_protocol;
  size_t ai_addrlen;
  struct sockaddr *ai_addr;
  char *ai_canonname;
  struct addrinfo *ai_next;
};

During the execution of this function (when hostname is NULL), the hints pointer is assigned a new value: the address of a static instance of the mbed SocketAddress class.

int mbed_wrapper_addrinfo(const char* hostname, const char* portstr, struct addrinfo* hints, struct addrinfo** info)
{
  /* ... */

  /* _hints is declared static and therefore located in the
   * same memory area as global variables such as RS485.
   */
  static SocketAddress _hints("localhost", atoi(portstr));
  _hints.set_ip_address("127.0.0.1");
  /* By using a cast to (struct addrinfo*) the address of the
   * static _hints object is stored inside the C-struct hints
   * pointer.
   */
  hints = (struct addrinfo*)&_hints;

  /* ... */

Since the C struct addrinfo contains a SocketAddress object as its first member, accessing that first member through the cast pointer is not an immediate problem. However, all the other struct members now refer to memory beyond the end of the _hints object. That memory belongs not to _hints but to the private member variables of the RS485 class, which just happened to be placed in memory right after the static SocketAddress _hints object.

struct addrinfo {
  SocketAddress ai; /* &_hints */
  int ai_flags; /* &_hints + sizeof(SocketAddress) */
  int ai_family; /* &_hints + sizeof(SocketAddress) + sizeof(int) */
  /* ... */
  struct sockaddr *ai_addr; /* also RS485._txPin */
  struct addrinfo *ai_next; /* also RS485._rePin */

This is also glaringly obvious when taking a look at the addresses of the struct members in the debugger:

struct addrinfo * members sorted ascending by memory location
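The same observation can be made without a debugger by comparing member offsets against the size of the object that is actually there. The following is a minimal host-side sketch under stated assumptions: FakeSocketAddress is a hypothetical 24-byte stand-in (the real mbed SocketAddress has its own size and layout), and FakeSockaddr, FakeAddrinfo and fitsInside are illustrative names that only copy the member order of the structs shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for mbed's SocketAddress (size chosen arbitrarily).
struct FakeSocketAddress {
    unsigned char bytes[24];
};

struct FakeSockaddr {
    FakeSocketAddress ai;
};

// Same member order as the addrinfo struct shown above.
struct FakeAddrinfo {
    FakeSocketAddress ai;
    int ai_flags;
    int ai_family;
    int ai_socktype;
    int ai_protocol;
    std::size_t ai_addrlen;
    FakeSockaddr* ai_addr;
    char* ai_canonname;
    FakeAddrinfo* ai_next;
};

// True if a member at memberOffset with memberSize bytes lies entirely
// inside an object that is only objectSize bytes large.
constexpr bool fitsInside(std::size_t objectSize, std::size_t memberOffset,
                          std::size_t memberSize) {
    return memberOffset + memberSize <= objectSize;
}

// The object that really exists is only a FakeSocketAddress.
constexpr std::size_t kRealSize = sizeof(FakeSocketAddress);

constexpr bool kAiFits = fitsInside(
    kRealSize, offsetof(FakeAddrinfo, ai), sizeof(FakeSocketAddress));
constexpr bool kAiAddrFits = fitsInside(
    kRealSize, offsetof(FakeAddrinfo, ai_addr), sizeof(FakeSockaddr*));
constexpr bool kAiNextFits = fitsInside(
    kRealSize, offsetof(FakeAddrinfo, ai_next), sizeof(FakeAddrinfo*));

// Prints where each member would land relative to the real object.
void printOverflowReport() {
    std::printf("real object size: %zu bytes\n", kRealSize);
    std::printf("ai      at offset %zu, in bounds: %d\n",
                offsetof(FakeAddrinfo, ai), kAiFits);
    std::printf("ai_addr at offset %zu, in bounds: %d\n",
                offsetof(FakeAddrinfo, ai_addr), kAiAddrFits);
    std::printf("ai_next at offset %zu, in bounds: %d\n",
                offsetof(FakeAddrinfo, ai_next), kAiNextFits);
}
```

Only the first member stays inside the object. Every write through ai_addr or ai_next lands in whatever the linker happened to place behind it, which in my case was the RS485 object.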

I leave the proper resolution of this bug as an exercise to the reader. However, it is quite important to me to share this particular version of a buffer overflow. In my experience, buffer overflows are associated with out-of-bounds array access, and while that is definitely the most common case, there are more insidious versions of buffer overflows, such as accessing memory belonging to other objects due to careless pointer casts.
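One possible starting point for that exercise, sketched here under assumptions and not taken from the actual port, is to back the hints pointer with storage that really has the size of the full struct, so that no cast to a larger type is needed. All names below (FakeSocketAddress, FakeSockaddr, FakeAddrinfo, FakeGlobals) are hypothetical host-side stand-ins. FakeGlobals only mimics the situation where the pin variables sit directly behind the hints storage.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins mirroring the member order of the structs above.
struct FakeSocketAddress { unsigned char bytes[24]; };
struct FakeSockaddr { FakeSocketAddress ai; };
struct FakeAddrinfo {
    FakeSocketAddress ai;
    int ai_flags;
    int ai_family;
    int ai_socktype;
    int ai_protocol;
    std::size_t ai_addrlen;
    FakeSockaddr* ai_addr;
    char* ai_canonname;
    FakeAddrinfo* ai_next;
};

// Mimics neighbouring globals: the pin variables follow the hints storage.
struct FakeGlobals {
    FakeAddrinfo hintsStorage;  // full-size storage, not a bare socket address
    int txPin;
    int rePin;
};

// Performs the same two writes as the wrapper function and reports
// whether the neighbouring pin values are still intact afterwards.
bool pinsSurviveHintWrites() {
    FakeGlobals globals{};
    globals.txPin = 3;  // arbitrary example pin numbers
    globals.rePin = 4;

    FakeAddrinfo* hints = &globals.hintsStorage;  // no size-changing cast

    // Storing the pointer is harmless; it is never dereferenced here.
    hints->ai_addr = reinterpret_cast<FakeSockaddr*>(hints);
    hints->ai_next = nullptr;

    return globals.txPin == 3 && globals.rePin == 4;
}

// Convenience check; call it from main() or a test harness.
void runCheck() {
    assert(pinsSurviveHintWrites());
}
```

Because hints now points at an object that is at least as large as the type it is accessed through, the writes to ai_addr and ai_next stay inside that object. Whether this is the right fix for the real wrapper depends on what the communication stack expects from the returned addrinfo, which is why this is a sketch and not a patch.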

Locating data corruption in any program is very difficult, as the effects can appear at different times and in different locations during execution. They may not even manifest immediately but much later during the development process, whenever the corrupted data finally causes the program to deviate from its intended behavior.

As a closing statement, I’d like to offer that the C and C++ programming languages are powerful instruments. However, with great power comes great responsibility, and such power needs to be wielded both thoughtfully and with surgical precision.
