When you think you found a bug in TCP, don’t get cocky, kid.

Who do you blame when things go wrong? What was that quote by Arthur Conan Doyle?

Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.

Yeah…that’s definitely not what happened here. How cool would I be if I found a bug in TCP or Linux’s implementation of it?

This is how cool I would be.

No, this is the story of how I went down the debugging rabbit hole: reading code, taking TCP dumps, rushing to conclusions, tracing system calls, and finding the light. Debugging is my spiritual awakening.

I’m about to get technical, so bear with me. Here’s some background on our stack:

Cassandra is our database of choice, meant to be (insert fancy promotional keywords) and so far it has served us well. C# plus Cassandra means we need the .NET Cassandra driver for our application stack. Cassandra and its clients communicate through the CQL native protocol. We use protocol version 4.

The protocol is frame based. Clients and servers exchange frames in request-response cycles. The frame schema is relatively simple:

In other words, it uses length prefixing. It allocates

  • 1 byte for a protocol version, which also includes the request/response direction of the frame
  • 1 byte for some flags (compression, tracing, etc.)
  • 2 bytes for a stream id, which allows multiple outstanding requests on the same connection
  • 1 byte for an opcode which dictates the purpose of the frame
  • 4 bytes for the length of the body
  • length bytes for the body itself
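To make the layout concrete, here is a minimal sketch (mine, not the driver’s actual code) of assembling such a header in C#:

// Sketch only: lays out the 9-byte CQL v4 request header followed by the body.
// The body length is written big-endian, as the native protocol requires.
static byte[] BuildFrame(byte flags, short streamId, byte opcode, byte[] body)
{
    var frame = new byte[9 + body.Length];
    frame[0] = 0x04;                       // protocol version 4, request direction (high bit clear)
    frame[1] = flags;                      // compression, tracing, etc.
    frame[2] = (byte)(streamId >> 8);      // stream id, 2 bytes
    frame[3] = (byte)streamId;
    frame[4] = opcode;                     // e.g. 0x0A for EXECUTE
    frame[5] = (byte)(body.Length >> 24);  // body length, 4 bytes
    frame[6] = (byte)(body.Length >> 16);
    frame[7] = (byte)(body.Length >> 8);
    frame[8] = (byte)body.Length;
    Buffer.BlockCopy(body, 0, frame, 9, body.Length);
    return frame;
}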

We’re interested in that first byte, the protocol version. In production, our C# web application started reporting

Cassandra.ProtocolErrorException: Invalid or unsupported protocol version (95); highest supported is 4

The byte value in parentheses varied from log to log, but never in a predictable manner.

The .NET driver’s entry point is the Cluster. Client applications use a builder to configure and build the Cluster. One of the configuration options is ProtocolOptions, which lets you set the protocol version. We had configured it to use protocol version 4. The value 95 (and all the other values we saw) was way off.
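For reference, the configuration boils down to something like this (a sketch; I’m writing the builder calls from memory, so treat the exact method names as approximate):

// Sketch: pin the native protocol to version 4 when building the Cluster.
var cluster = Cluster.Builder()
    .AddContactPoint("cassandra-host")   // placeholder contact point
    .WithMaxProtocolVersion(4)           // force protocol version 4
    .Build();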

That was a client exception. What was Cassandra reporting on the server side? As it turns out, it was spitting out similar information

Cassandra had tried to decode the bytes it received (through its Netty-based server) into a CQL frame but failed because it detected an invalid protocol version. Unfortunately, that was old news, not much to go on.

It was pretty obvious, however, that the protocol exchange had somehow become corrupted. I browsed the various channels in the Cassandra community, particularly its JIRA, but didn’t find any relevant reports of an incorrect or alternating protocol version.

The .NET driver isn’t as popular as some of the other drivers, but the development team (props to Jorge Bay Gondra) is very responsive. I browsed their JIRA, but found no other issues exhibiting these symptoms.

It was time for a deep dive!

I had browsed the driver implementation before, to report smaller bugs, but had never really investigated the socket connection and exchange. That’s what I’ll describe next.

If you recall, the driver’s entry point is the Cluster. It lets you connect to a cluster and interact with it using an ISession. You use it to send requests, synchronously with the Execute family of methods or asynchronously with the ExecuteAsync family of methods. In reality, the synchronous implementation uses ExecuteAsync and just blocks, waiting for the result.
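In application code, that surface looks roughly like this (a simplified sketch; the query is just a placeholder):

// Synchronous: Execute blocks on the Task produced by ExecuteAsync.
RowSet rows = session.Execute("SELECT content FROM test.protocol_error WHERE pkey = 0");

// Asynchronous: ExecuteAsync returns a Task<RowSet> that can be awaited (inside an async method).
Task<RowSet> pending = session.ExecuteAsync(
    new SimpleStatement("SELECT content FROM test.protocol_error WHERE pkey = 0"));
RowSet sameRows = await pending;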

Hidden behind this public API is the infrastructure that serializes requests into frames (byte buffers) and streams them over a TCP socket. The driver’s internal Connection and TcpSocket classes are where the magic happens.

The driver establishes one or more connections to each Cassandra host. Each connection is backed by a TCP Socket. When applications execute a request through the ISession, the driver acquires a Connection and sends it the request.

At this point, the Connection goes into what is essentially an “async” loop. It places the request into a queue and, if the queue has elements and no requests are currently being streamed to the socket, it starts a new task to process an element from the queue. Then it returns. The task retrieves a request from the queue, serializes it to a list of byte[] buffers (which together represent a single frame), stores that in a SocketAsyncEventArgs, and uses Socket.SendAsync to send it over the network. On completion, SendAsync raises a Completed event on the SocketAsyncEventArgs. Finally, the event handler invokes a callback on the Connection, which restarts this whole process.
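Stripped of the driver’s bookkeeping, the send side follows the standard SocketAsyncEventArgs pattern, roughly like this sketch (OnSendCompleted stands in for the driver’s callback):

// Sketch of the async send pattern, not the driver's actual code.
void SendFrame(Socket socket, IList<ArraySegment<byte>> frameBuffers)
{
    var args = new SocketAsyncEventArgs();
    args.BufferList = frameBuffers;   // the serialized frame, possibly split across several buffers
    args.Completed += (sender, e) => OnSendCompleted(e);

    // SendAsync returns false when the operation completed synchronously;
    // in that case no Completed event is raised and we handle it inline.
    if (!socket.SendAsync(args))
    {
        OnSendCompleted(args);
    }
}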

After reading through this implementation, everything looked good, at least to my limited knowledge of .NET’s Socket API. The only thing left to do was to try to reproduce the issue. It took a couple of iterations (remember that this occurred across all types of requests and for all our tables), but here’s what I settled on: a simple CQL table with an int primary key and a text column

CREATE TABLE test.protocol_error (
    pkey int PRIMARY KEY,
    content text
)

and a C# program (run on Mono)

A program to reproduce the protocol version error.
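In outline, the program does something like this (a sketch; the exact buffer sizes, pooling settings, and API calls are from memory and illustrative):

// Sketch of the repro: one connection, tiny socket buffers, a thousand queued inserts.
var content = string.Join("_", Enumerable.Range(0, 5000));   // "0_1_2_..._4999"

var cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .WithPoolingOptions(new PoolingOptions()
        .SetCoreConnectionsPerHost(HostDistance.Local, 1)
        .SetMaxConnectionsPerHost(HostDistance.Local, 1))
    .WithSocketOptions(new SocketOptions()
        .SetReceiveBufferSize(4096)     // illustrative; the real minimum is system-dependent
        .SetSendBufferSize(4096))
    .Build();
var session = cluster.Connect();

var insert = session.Prepare("INSERT INTO test.protocol_error (pkey, content) VALUES (?, ?)");

var tasks = new List<Task<RowSet>>();
for (var i = 0; i < 1000; i++)
{
    tasks.Add(session.ExecuteAsync(insert.Bind(i, content)));
}

// Task.WaitAll rethrows any failures wrapped in an AggregateException.
Task.WaitAll(tasks.ToArray());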

This program essentially establishes a single connection to my Cassandra server running locally, sets small socket read and write buffer sizes (the smallest my RHEL6 environment allowed), and queues up a thousand requests, exiting once one fails. The content column’s value is simply a sequence of increasing integers separated by underscores. It looks like

0_1_2_3_4_5_6_7_8_9...4997_4998_4999

This will be useful soon.

This small program consistently failed in my local environment (RHEL6, Mono 3.12, Cassandra 2.2, CQL driver 3.1) with the same protocol version error.

System.AggregateException: One or more errors occurred. ---> Cassandra.ProtocolErrorException: Invalid or unsupported protocol version (95); highest supported is 4

I also tried it with Mono 4.6 (on RHEL 7) and that failed as well. Unfortunately, it didn’t fail on Mac (Mono 4.0.1). I could not make it fail on Mac at all. So what was it doing on Linux?!

The next step was to take a TCP dump and find out exactly what the program was sending and what Cassandra was receiving. I used the always useful tcpdump tool. I set up tcpdump to listen to packets destined for port 9042 (CQL native protocol port) on the loopback interface

sudo tcpdump -i lo 'port 9042' -vv -X

-vv gives us more verbose output and -X prints the headers and the content of each packet. Here’s a sample of the output

A single request frame. The text version is here.

This capture shows a single request frame, split into two packets. I’ve highlighted relevant parts of the protocol. The frame starts with a short header. The first byte, 0x04, represents the protocol version and the fact that this is a request. The next byte contains zero flags. The next two bytes indicate the request is using stream id 0. The opcode byte, 0x0A, marks this as an EXECUTE request. The following four bytes, 0x00, 0x00, 0x5d, 0x78, tell us the length of the body, i.e. 23928. If we do the math

16384 + 7553 = 23937 = 9 (header) + 23928 (body)

everything adds up. I’ve also highlighted some of the payload. We can see the value of the pkey, 0 for this first request, and the value of the content, our sequence of increasing integers.

In this frame, all the numbers in the content are accounted for, 0 to 4999. This frame is fine. The next one, however, is not.

The next two packets in the tcpdump capture look like

This is where the protocol breaks. The text version is here.

There are a lot of things to take away from this capture. Most importantly, these two packets begin two separate frames for two separate requests. The first packet represents the start of the second INSERT request. You can tell from the primary key’s value, 1, highlighted in blue. Additionally, the payload is the first part of the content column’s value, i.e. 0_1_2_…. However, the program never sends the complete frame. The second packet, instead of being the continuation of the previous request’s frame, actually begins the frame of the third request, with a primary key value of 2. Where’s the rest of the second request’s frame?

You’ll note that the sequence numbers and their acknowledgements are in order: the last packet in the previous snippet handled 16490:24043, while the first and second packets here handle 24043:40427 and 40427:56811, respectively.

(You’ll also note there’s a 40 ms delay between the two packets. More on that later.)

This is where I started blaming TCP. To me, it looked like TCP was truncating the program’s sent payload. TCP is meant to provide reliable, ordered, and error-checked delivery of a stream, and I wasn’t seeing that. I’ll come back to this incorrect conclusion later; my debugging wasn’t complete.

By now, Cassandra had received the frames sent and had responded with a response frame of its own (the byte at offset 0x34 in the dump, the start of the payload, has the value 0x84: protocol version 4 with the response direction bit set)

The error message that the driver spits back at us. The text version is here.

There’s our error message. The program receives this response, completes the Task returned by ExecuteAsync, invokes its callback, and exits. So what happened? And where did that protocol version 95 come from?

We know from the “bad” tcpdump output that the second request’s frame was truncated. We also know that the CQL protocol has no boundary markers to delimit the beginning and end of frames. Instead, it relies on length prefixes. When the program sent its second request, it declared a body length of 23928 bytes. 16375 of those were sent in the first packet. Cassandra expects 7553 more bytes and gets them from the second packet, even though that packet actually begins a different request. In other words, that frame’s header and the following 7544 bytes are interpreted as the body of the previous request’s frame.
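Seen as code, the server’s behavior is just a length-prefixed reader doing its job; a toy sketch (not Cassandra’s actual decoder) shows why it can’t notice that the bytes it’s consuming belong to a different frame:

// Toy length-prefixed frame reader, not Cassandra's decoder.
// Once a sender drops bytes, every later frame is parsed starting at the wrong offset.
static void ReadFrames(Stream stream)
{
    var header = new byte[9];
    while (ReadFully(stream, header, 9))
    {
        byte version = header[0];    // whatever byte happens to sit here becomes "the protocol version"
        if ((version & 0x7f) != 0x04)
            throw new InvalidDataException($"Invalid or unsupported protocol version ({version})");

        int bodyLength = (header[5] << 24) | (header[6] << 16) | (header[7] << 8) | header[8];
        var body = new byte[bodyLength];
        ReadFully(stream, body, bodyLength);   // happily consumes the next frame's bytes if they arrive first
    }
}

static bool ReadFully(Stream stream, byte[] buffer, int count)
{
    var offset = 0;
    while (offset < count)
    {
        var read = stream.Read(buffer, offset, count - offset);
        if (read == 0) return false;   // connection closed
        offset += read;
    }
    return true;
}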

That second packet has a length of 16384. Cassandra consumes the first 7553 bytes and considers that request complete. There are still bytes in the receive buffer, so it keeps reading, interpreting the remaining bytes as the beginning of a new request. Therefore, the byte at offset 7553 (0x1db5 in the tcpdump output, accounting for 52 bytes of IP/TCP headers) must represent the protocol version of a new request. What’s its value?

This packet is interpreted as two request frames. The text version is here.

It just so happens that, for this content, that byte had a value of 95 (0x5f, the ASCII code for the underscore separating the integers in our content column).

From this point forward, the connection stream was corrupted and unrecoverable, with Cassandra being offset by the missing part of the frame. All ensuing requests on this connection would fail with a similar protocol version error.

This explained the errors we saw but didn’t address why the TCP stream was truncated. I had to go deeper for that.

As much as I wanted to believe that TCP was to blame, that was very unlikely. Still, the CQL driver was clearly constructing the frames correctly, so maybe Mono’s native use of sockets was broken.

I don’t know C well, so Mono’s native code wasn’t very appealing. I just wanted to see the corresponding system calls. Fortunately, Linux provides the strace tool to trace a process’s system calls, including those that are network related. I wanted to attach strace to the program before it started sending bytes, so I added a short Thread.Sleep before the for statement. This gave me time to set up strace on the command line

sudo strace -p <pid> -s 100 -f -tt -e trace=network &> tmp.out

where

  • -p <pid> selects the process identified by pid
  • -s controls the maximum length of strings printed in the output before they are truncated
  • -f follows child processes, i.e. all threads started by Mono
  • -tt adds more precision to the timestamp logged
  • -e trace=network gives me all network related system calls

strace generates a lot of output. Here are the system calls (from a different run than the tcpdump one)

We can immediately tell Mono is using the system call sendmsg in its Socket implementation. The man page states

On success, these calls return the number of characters sent. On error, -1 is returned, and errno is set appropriately.

The sendmsg call on line 3 in the output above sends two buffers (msg_iov) of sizes 16384 and 7553. This call returned a value of 23937, indicating that all those bytes were sent. The call on line 5, however, returned a value of 16384 (in the sendmsg resumed). It seems to have been interrupted by a SIGPWR signal. The sendmsg on the last line then sends an entirely new request. The 7553 bytes from the sendmsg on line 5 are lost. I hadn’t expected this. (We also see a 40 ms delay, similar to the one in the tcpdump, after the call that gets “truncated”.)

Back in the .NET code, sendmsg’s positive return value does not represent an error, so C#’s SocketAsyncEventArgs::Completed callback does not handle it as such. The callback’s SocketAsyncEventArgs::SocketError has a value of Success. Mono’s implementation of SendAsync does, however, propagate that return value through SocketAsyncEventArgs::BytesTransferred

Gets the number of bytes transferred in the socket operation.

Unfortunately, the .NET CQL driver never checks it and therefore never attempts to send the remaining bytes. I can’t say for sure whether this is a bug in the driver, since the documentation for SendAsync doesn’t mention under which conditions it will send only part of the buffer. I asked this question on Stack Overflow to clarify, but I’ve yet to receive a satisfactory answer. I can only say that TCP is not to blame.
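Handling this on the client side means comparing BytesTransferred with what was queued and resending from the unsent offset, along these lines (a sketch, not the driver’s eventual fix; it assumes the frame was queued as a single buffer):

// Sketch of resuming a partial send after a Completed event.
void OnSendCompleted(Socket socket, SocketAsyncEventArgs args, byte[] frame, int alreadySent)
{
    if (args.SocketError != SocketError.Success)
    {
        // A genuine error: tear down the connection.
        return;
    }

    alreadySent += args.BytesTransferred;
    if (alreadySent < frame.Length)
    {
        // Partial send: point the buffer at the remaining bytes and send again.
        args.SetBuffer(frame, alreadySent, frame.Length - alreadySent);
        if (!socket.SendAsync(args))
        {
            OnSendCompleted(socket, args, frame, alreadySent);
        }
        return;
    }

    // The whole frame is on the wire; it's now safe to start streaming the next request.
}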

I also opened JIRA ticket CSHARP-526 with the driver team to request that they verify the number of bytes sent and act accordingly. Until then, fortunately, the CQL driver allows us to use a different socket API that does make a guarantee about the number of bytes sent: NetworkStream.

When your application calls BeginWrite, the system uses a separate thread to execute the specified callback method, and blocks on EndWrite until the NetworkStream sends the number of bytes requested or throws an exception.

We can enable this with SocketOptions::SetStreamMode. If we do, our problem finally goes away. One small code change, a pull request, and a deployment later, and production is happy.
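For reference, enabling it is a one-liner in the builder (a sketch):

// Use NetworkStream (BeginWrite/EndWrite) instead of Socket.SendAsync.
var cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .WithSocketOptions(new SocketOptions().SetStreamMode(true))
    .Build();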

But I’m not.

What is a SIGPWR signal? Why does it interrupt the sendmsg system call? Why is there a 40 ms delay between sendmsg calls? Why can’t I reproduce this issue on Mac OS X? The NetworkStream solution is nice and all, but I’ll only be satisfied once I’ve answered all these questions. So here we go.

SIGPWR is a Linux signal meant to indicate a power failure. Like most signals, however, it can be repurposed by applications as they wish. It turns out Mono uses SIGPWR as part of its garbage collector.

The garbage collector uses SIGPWR and SIGXCPU if it is used with Linux threads. These should not be touched by the client program.

The Mono runtime sends SIGPWR signals to all of its threads so that they pause for a garbage collection cycle. That 40 ms delay? That was the stop-the-world time for Mono’s Boehm GC.

The “interrupted” behavior of the system call is hard to pin down. It can vary depending on where the executing thread was in the system call at the time. It can vary from one kernel version to another. In my experience, for sendmsg specifically, it can vary depending on the socket buffer sizes (notice they were set low in my program above, though that could also have affected the memory footprint of the application and thus garbage collection). It can apparently also vary depending on the MTU size of the interface serving the network traffic. There are likely many more factors. On RHEL 6, I observed the system call sending a packet the size of the MTU, 16436 bytes, and returning that value as the number of bytes sent, while more remained.

What about on Mac OS X? As it turns out, on Mac OS X, Mono uses thread_suspend to pause threads. According to this document, it

[…] increments the suspend count for target_thread and prevents the thread from executing any more user-level instructions.

As far as I can tell, it won’t interrupt a system call like sendmsg. That’s why the behavior is not reproducible on Mac OS X.


There you have it. Now I’m happy.

Bugs like these are f̶r̶u̶s̶t̶r̶a̶t̶i̶n̶g interesting since they provide a reason to explore your entire application stack. I now know more about the CQL driver we use extensively in production. I know more about .NET’s Socket API. I’ve gained some experience using useful debugging tools like tcpdump and strace. I know more about signals and signal handlers.

I also know not to rush to conclusions that may embarrass me.