Tracking a regression…

for educational purposes.. In this post I will illustrate my approach to identify the root cause for a regression and will touch git, C, OpenSSL, GDB, Debian and Pharo. Maybe it is teaching you a trick or two but if not please let me know and maybe I can learn something from you instead.

Something not working

I built Debian packages for a dynamic programming language called Pharo and a friend of mine reported that when using the Debian 9.0 package his code fails to open a TLS/SSL connection. The code produced the following log:

Error: SSL Exception: connect failed [code:-5] ZdcSecureSocketStream(Object)>>error: ZdcSecureSocketStream>>sslException:code: ZdcSecureSocketStream>>connect
ZnClient>>setupTLSTo:

Finding a reproducer

My first step is to reproduce the issue. If I can locally reproduce it, I can more quickly iterate to find the root cause. I started a Debian8 VM and installed a package and found a simple testcase. The below would download a PNG through a TLS connection.

ZnEasy getPng: 'https://pharo.org/files/pharo.png'

To my surprise this worked for me. This meant that the TLS integration is not completely broken but the same test failed for my friend. Now it is time to understand the difference. I suspected that my friend might not have X509 root certificates installed, he verified this by using dpkg -l and saw that certificates are already installed and we verified this by using the OpenSSL CLI application. The following opens a TCP connection and initiates a TLS handshake.

$ openssl s_client -connect pharo.org:443
...

I needed to understand the difference between my working setup and the failing one. My friend showed me the libraries the plugin links to on his machine. This can be done by using the ldd command which is actually a bash script that will invoke the GNU libc dynamic loader (try executing /lib/x86_64-linux-gnu/ld-linux-x86–64.so.2)

$ ldd /usr/lib/i386-linux-gnu/pharo6-vm/SqueakSSL.so
libssl.so.1.1 => /usr/lib/i386-linux-gnu/libssl.so.1.1
$ ldd /usr/lib/i386-linux-gnu/pharo6-vm/SqueakSSL.so
libssl.so.1.0.0 => /usr/lib/i386-linux-gnu/libssl.so.1.0.0

Okay so the difference seems to be the version of OpenSSL. Version 1.0.x vs 1.1.x. There were various API changes (e.g. sealing structs/forbidding to directly access struct members) so this kind of made sense.

Starting the search

Now that we know the issue might come from OpenSSL 1.1.0 I wanted to reproduce it. The easiest way for me was to set-up a chroot build-environment with Debian9 for i386 using debootstrap.

$ debootstrap --arch=i386 stretch debian9
...
$ chroot debian9
...

I then compiled the Virtual Machine in my chroot and tried to execute the reproducer. I had to slightly change it to execute without needing X11 but could see the error. Now that I can reproduce it, I can try to understand why it is failing.

Switching on the light

The TLS plugin is linking to OpenSSL and I had a brief look at the code. There are plenty of lines that check if debugging is enabled and then print a log message. I was too lazy to enable this logging through the testcase code and decided to unconditionally enable it.

@@ -279,6 +279,7 @@ sqInt sqCreateSSL(void) {
ssl = calloc(1, sizeof(sqSSL));
+ ssl->loglevel = 1;
ssl->bioRead = BIO_new(BIO_s_mem());

There should have been light but there wasn’t. The runtime code in the testcase will actually switch off logging all the time. So my next try was to handle setting the property and hardcode it to always on:

@@ -701,7 +702,7 @@ sqInt sqSetIntPropertySSL(sqInt handle, sqInt propID, sqInt propValue) {
if(ssl == NULL) return 0;

switch(propID) {
- case SQSSL_PROP_LOGLEVEL: ssl->loglevel = propValue; break;
+ case SQSSL_PROP_LOGLEVEL: ssl->loglevel = 1; /*...*/ break;
default:
if(ssl->loglevel) printf("sqSetIntPropertySSL: Unknown property ID %ld\n", (long)propID);
return 0;

I could finally see where in the connection set-up it got an error from OpenSSL and then returned without opening a connection.

Digging deeper

To understand what was going on I looked into OpenSSL. In Debian not every package has a -dbg variant but there seems to be a catch-all solution and it requires to add a new package repository.

$ echo "deb http://debug.mirrors.debian.org/debian-debug/ stretch-debug main" >> /etc/apt/sources.list
$ apt update
$ apt install libssl1.1-dbgsym:i386 gdb

Now I could try to run my testcase using gdb and then step into OpenSSL to see what is going on…

$ gdb --args ./products/.../pharo Pharo6.image eval THE_TEST_CASE
...
(gdb) b sqConnectSSL
Function "sqConnectSSL" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (sqConnectSSL) pending.
(gdb) r
...
warning: unable to open /proc file '/proc/98232/status'
warning: unable to open /proc file '/proc/98232/status'
...
warning: unable to open /proc file '/proc/98232/status'
warning: unable to open /proc file '/proc/98232/status'
...
[Inferior 1 (process 98232) exited normally]
(gdb)

Fixing the test fixture

It should have hit the function but it didn’t. The “warning” was a clue and in fact this is a symptom of running within a chroot. I had to mount the procfs first.

$ mount -tproc none /proc
$ gdb --args ./products/.../pharo Pharo6.image eval THE_TEST_CASE
...
(gdb) b sqConnectSSL
...
(gdb) r
...
Thread 1 "pharo" hit Breakpoint 1, sqConnectSSL (handle=1, srcBuf=0x56ba15d8 "", srcLen=0, dstBuf=0x56ba25e8 "", dstLen=17408)

After the breakpoint was hit I used“n” to execute the code line by line until the OpenSSL function was called and then tried to step into it…

A suspicion

In the above log message the srcLen=0 sticks out. I modified the plugin to skip the BIO_write completely and the errors where gone but the connection got stuck. I then modified the code to only write if srcLen is bigger than 0 and the testcase started to work. Good I have a fix or workaround. Now let’s understand why OpenSSL is breaking with a zero byte write.

Beauty of Free Software

The beauty of Free Software is that on debugging one is not stuck with a black box but can actually look at the code and even step in it. With Debian it is super easy to get the complete corresponding code.

$ echo "deb-src http://httpredir.debian.org/debian stretch main" >> /etc/apt/sources.list
$ apt update
$ apt-get source libssl1.1:i386
...

Now we have the patched sources that were used to build the OpenSSL libraries. The only issue is that gdb doesn’t know about it… but that is easy to solve as well…

(gdb) dir openssl-1.1.0f/debian/
Source directories searched: ...bld/openssl-1.1.0f/debian:$cdir:$cwd

Now I could use “list” and “n” in gdb to see the sourcecode and run to the next line. Beautiful.

Exploring OpenSSL

The method that failed was BIO_write. The code is located in crypto/bio/bio_lib.c and looks more or less like the one below (slightly simplified).

int BIO_write(BIO *b, const void *data, int dlen)
{
size_t written;
int ret;
    if (dlen < 0)
return 0;
   return bio_write_intern(b, data, (size_t)dlen, &written);
}
static int bio_write_intern(BIO *b, const void *data, size_t dlen,
size_t *written)
{
...
    ret = b->method->bwrite(b, data, dlen, written);
...
return ret;
}

So BIO_write will do some sanity checking, hand it off to bio_write_intern and then it will go through a “vtable” and call another routine. From reading the plugin code I see that this is a memory buffer and found code in bss_mem.c:mem_write. I could have used “step” in GDB but I was quite confident that this is the implementation and instead set a breakpoint there and on first hit let the method run to the end and look at the return value.

(gdb) b mem_write
...
breakpoint hit.. inl=0
...
(gdb) finish
Value returned is $1 = -1

Putting it together

OpenSSL 1.1 started to return an error when the memory backend is asked to write zero bytes. I used a mix of git gui blame bss_mem.c, git log -p bss_mem.c and git diff OpenSSL_1_0_2m..OpenSSL_1_1_0 … bss_mem.c to see what changed from version 1.0.2 to 1.1.0 in the code. The git gui to blame the parent commit is a neat feature and helped me to see when the function in question changed. In the git clone of the code I found the change.

commit e5bf3c923c184b19e8c3ef7043080955479a2325
Author: Dr. Stephen Henson <steve@openssl.org>
Date: Fri Feb 13 13:02:24 2015 +0000
size_t for buffer functions.

Change BUF_MEM_grow and BUF_MEM_grow_clean to return size_t.
Reviewed-by: Richard Levitte <levitte@openssl.org>
diff --git a/crypto/bio/bss_mem.c b/crypto/bio/bss_mem.c
index d190765dc2..56e0d2e323 100644
--- a/crypto/bio/bss_mem.c
+++ b/crypto/bio/bss_mem.c
@@ -187,7 +187,7 @@ static int mem_write(BIO *b, const char *in, int inl)

BIO_clear_retry_flags(b);
blen = bm->length;
- if (BUF_MEM_grow_clean(bm, blen + inl) != (blen + inl))
+ if (BUF_MEM_grow_clean(bm, blen + inl) == 0)
goto end;
memcpy(&(bm->data[blen]), in, inl);
ret = inl;

In this case blen+inl is zero and BUF_MEM_grow_clean will return 0. After the switch to size_t the code fails to differentiate error from zero…

Conclusion

I filed a bug report with the OpenSSL project about their behavior change and modified the SqueakSSL to check the srcLen before calling BIO_wirte. This was the pull request and it has been merged.

In the bigger picture it shows that API design is difficult. The behavior of a zero byte write is underspecified (not documented in the man page) and the lack of automated tests made this enter a release without being noticed. If you design an API make sure to document the corner cases and test them (minimum input, less than minimum, maximum).

I hope this journey was interesting!

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.