When you think you found a bug in TCP, don’t get cocky, kid.
Who do you blame when things go wrong? What was that quote by Arthur Conan Doyle?
Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.
Yeah…that’s definitely not what happened here. How cool would I be if I’d found a bug in TCP or Linux’s implementation of it?
No, this is the story of how I went down the debugging rabbit hole: reading code, taking TCP dumps, rushing to conclusions, tracing system calls, and finding the light. Debugging is my spiritual awakening.
I’m about to get technical, so bear with me. Here’s some background on our stack:
- C# home grown web application (ie. no fancy web application frameworks)
- Cassandra (on an old version, 2.2)
- .NET Driver for Cassandra (C# talks CQL to Cassandra, version 3.1)
- Mono Runtime Environment (on an old version, 3.12)
- Production servers running Red Hat 6
Cassandra is our database of choice, meant to be *insert fancy promotional keywords here*, and so far it has served us well. C# plus Cassandra means we need the .NET Cassandra driver for our application stack. Cassandra and its clients communicate through CQL, the native protocol. We use protocol version 4.
The protocol is frame based. Clients and servers exchange frames in request-response cycles. The frame schema is relatively simple:
In other words, it uses length prefixing. It allocates
- 1 byte for a protocol version, which also includes the request/response direction of the frame
- 1 byte for some flags (compression, tracing, etc.)
- 2 bytes for a stream id which enables multiple requests on the same connection
- 1 byte for an opcode which dictates the purpose of the frame
- 4 bytes for the length of the body
- length bytes for the body itself
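To make the layout concrete, here’s a minimal Python sketch (illustrative only, not driver code) that packs and unpacks this nine-byte header:

```python
import struct

# CQL native protocol v4 header: version, flags, stream id, opcode, body length.
# ">BBhBI" = big-endian: byte, byte, signed short, byte, unsigned int (9 bytes).
HEADER = struct.Struct(">BBhBI")

def pack_header(version, flags, stream_id, opcode, body_length):
    return HEADER.pack(version, flags, stream_id, opcode, body_length)

def unpack_header(data):
    version, flags, stream_id, opcode, body_length = HEADER.unpack(data[:9])
    return {"version": version, "flags": flags, "stream_id": stream_id,
            "opcode": opcode, "body_length": body_length}

# A request frame header: protocol version 4, no flags, stream 0,
# EXECUTE (0x0A), and a 23928-byte body.
header = pack_header(0x04, 0x00, 0, 0x0A, 23928)
assert header == b"\x04\x00\x00\x00\x0a\x00\x00\x5d\x78"
assert unpack_header(header)["body_length"] == 23928
```

The `>` in the format string matters: every header field is big-endian (network byte order).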
We’re interested in that first byte, the protocol version. In production, our C# web application started reporting
Cassandra.ProtocolErrorException: Invalid or unsupported protocol version (95); highest supported is 4
The byte value in parentheses varied from log to log, but never in a predictable manner.
The .NET driver’s entry point is the `Cluster`. Client applications use a builder to configure and build the `Cluster`. One of the configuration options is `ProtocolOptions`, which lets you set the protocol version. We had configured it to use protocol version `4`. The value `95` (and all the others) was way off.
That was a client exception. What was Cassandra reporting on the server side? As it turns out, it was spitting out similar information
Cassandra tried to decode some bytes it received from the Netty server into a CQL frame but failed because it detected an invalid protocol version. Unfortunately, this is old news, not much to go on.
It was pretty obvious, however, that the protocol exchange had somehow become corrupted. I browsed the various groups in the Cassandra community, particularly JIRA, but didn’t find any relevant reports about an incorrect or alternating protocol version.
The .NET driver isn’t as popular as some of the other drivers, but the development team (props to Jorge Bay Gondra) is very responsive. I browsed their JIRA, but found no other issues exhibiting these symptoms.
It was time for a deep dive!
I’d browsed the driver implementation before, to report smaller bugs, but I hadn’t really investigated the socket connection and exchange. That’s what I’ll describe next.
If you recall, the driver’s entry point is the `Cluster`. It lets you connect to a cluster and interact with it using an `ISession`. You use the `ISession` to send requests, synchronously with the `Execute` family of methods or asynchronously with the `ExecuteAsync` family. In reality, the synchronous implementation uses `ExecuteAsync` and simply blocks, waiting for the result.
Hidden behind this public API is the infrastructure that serializes requests into frames (byte buffers) and streams them over a TCP socket. The driver’s internal `TcpSocket` classes are where the magic happens.
The driver establishes one or more connections to each Cassandra host. Each connection is backed by a TCP `Socket`. When an application executes a request through the `ISession`, the driver acquires a `Connection` and sends it the request.
At this point, the `Connection` goes into what is essentially an “async” loop. It places the request into a queue and, if the queue has elements and no requests are currently being streamed to the socket, it starts a new task to process an element from the queue. Then it returns. The task retrieves a request from the queue, serializes it into a list of `byte` buffer objects (which together represent a single frame), stores them in a `SocketAsyncEventArgs`, and uses `Socket.SendAsync` to send them over the network. On completion, `SendAsync` raises a `Completed` event on the `SocketAsyncEventArgs`. Finally, the event handler invokes a callback on the `Connection`, which restarts this whole process.
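The loop described above can be sketched as follows (a Python toy model of the mechanism; the class and member names are made up, not the driver’s actual identifiers):

```python
from collections import deque

class ToyConnection:
    """Toy model of the driver's write loop (hypothetical names)."""

    def __init__(self, socket_send):
        self.queue = deque()      # frames waiting to be written
        self.streaming = False    # is a frame currently being streamed?
        self.socket_send = socket_send

    def send(self, frame):
        self.queue.append(frame)  # enqueue the serialized request
        self._process_queue()     # kick the loop if it is idle, then return

    def _process_queue(self):
        if self.streaming or not self.queue:
            return                # a send is in flight, or nothing to do
        self.streaming = True
        frame = self.queue.popleft()
        # Socket.SendAsync analog: hand the frame to the socket and
        # continue in the completion callback.
        self.socket_send(frame, self._on_completed)

    def _on_completed(self, bytes_transferred):
        # The Completed-event handler. Note: nothing here compares
        # bytes_transferred to the frame length -- the oversight at the
        # heart of this story.
        self.streaming = False
        self._process_queue()     # restart the loop for queued frames

# A stand-in socket that "sends" synchronously and reports full success.
wire = []
def fake_send(frame, on_completed):
    wire.append(frame)
    on_completed(len(frame))

conn = ToyConnection(fake_send)
conn.send(b"frame-1")
conn.send(b"frame-2")
assert wire == [b"frame-1", b"frame-2"]
```

With a callback that fires synchronously, queued frames drain one after the other; in the real driver the callback fires when `SendAsync` completes.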
After reading through this implementation, everything looked good, at least with my limited knowledge of .NET’s `Socket` API. The only thing left to do was try to reproduce the issue. It took a couple of iterations (remember that this occurred across all types of requests and for all our tables), but here’s what I settled on: a simple CQL table with an `int` primary key and a `text` column

```
CREATE TABLE test.protocol_error (
    pkey int PRIMARY KEY,
    content text
);
```
and a C# program (run on Mono)
This program essentially establishes a single connection to my local Cassandra server, sets small socket read and write buffer sizes (the smallest allowed in my RHEL6 environment), and queues up a thousand requests, exiting once one fails. The `text` content is simply a sequence of increasing integers separated by underscores: `0_1_2_…`. This will be useful soon.
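For the curious, the content can be recreated with a one-liner (Python here; the actual test program was C#):

```python
# Integers 0..4999 joined by underscores, mirroring the test program's content.
content = "_".join(str(i) for i in range(5000))

assert content.startswith("0_1_2_3_4_5_6_7_8_9_10")
assert len(content) == 23889
```

Those 23889 characters, plus the EXECUTE request’s serialization overhead, presumably account for the 23928-byte frame body we’ll see in the capture.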
This small program consistently failed in my local environment (RHEL6, Mono 3.12, Cassandra 2.2, CQL driver 3.1) with the same protocol version error.
System.AggregateException: One or more errors occurred. ---> Cassandra.ProtocolErrorException: Invalid or unsupported protocol version (95); highest supported is 4
I also tried it with Mono 4.6 (on RHEL 7) and that failed as well. Unfortunately, it didn’t fail on Mac (Mono 4.0.1). I could not make it fail on Mac at all. So what was it doing on Linux?!
The next step was to take a TCP dump and find out exactly what the program was sending and what Cassandra was receiving. I used the always useful `tcpdump` tool, setting it up to listen to packets destined for port 9042 (the CQL native protocol port) on the loopback interface

sudo tcpdump -i lo 'port 9042' -vv -X

`-vv` gives us more verbose output and `-X` prints the headers and the content of each packet. Here’s a sample of the output
This capture shows a single request frame, split into two packets. I’ve highlighted the relevant parts of the protocol. The frame starts with a short header. The first byte, `0x04`, represents the protocol version and the fact that this is a request. The next byte contains zero flags. The next two bytes indicate the request is using stream id `0`. The opcode byte, `0x0A`, marks this as an `EXECUTE` request. The following four bytes, `0x00 0x00 0x5d 0x78`, give the length of the body, i.e. `23928`. If we do the math

16384 + 7553 = 9 (header) + 23928 (body)

everything adds up. I’ve also highlighted some of the payload. We can see the value of `pkey`, `0` for this first request, and the value of `content`, our sequence of increasing integers. In this frame, all the numbers in the content are accounted for, 0 to 4999. This frame is fine. The next one, however, is not.
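Re-doing that arithmetic in Python, just to be thorough:

```python
# The four length bytes from the frame header, interpreted big-endian.
body_length = int.from_bytes(b"\x00\x00\x5d\x78", "big")
assert body_length == 23928

# The two packet payloads (16384 and 7553 bytes) carry the 9-byte
# header plus the body -- everything adds up.
assert 16384 + 7553 == 9 + body_length == 23937
```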
The next two packets in the `tcpdump` capture look like

There are a lot of things to take away from this capture. Most importantly, these two packets begin two separate frames for two separate requests. The first packet represents the start of the second `INSERT` request. You can tell from the primary key’s value, highlighted in blue: `1`. Additionally, the payload is the first part of the `content` column’s value, i.e. `0_1_2_…`. However, the program never sends the complete frame. The second packet, instead of being the continuation of the previous request’s frame, actually begins the frame of the third request, with a primary key value of `2`. Where’s the rest of the second request’s frame?

You’ll note that the TCP sequence numbers and their acknowledgements are in order: the last packet in the previous snippet handled `16490:24043`, while the first and second packets here handle
(You’ll also note there’s a 40 ms delay between the two packets. More on that later.)
This is where I started blaming TCP. To me, it looked like it was truncating the program’s sent payload. TCP is meant to provide reliable, ordered, and error-checked delivery of a stream and I wasn’t seeing that. I’ll come back to this incorrect conclusion later, my debugging wasn’t complete.
By now, Cassandra had received the frames and responded with a response frame of its own (byte `0x34`, with value

There’s our error message. The program receives this response, completes the `Task` returned by `ExecuteAsync`, invokes its callback, and exits. So what happened? And where did that protocol version `95` come from?
We know from the “bad” `tcpdump` output that the second request’s frame was truncated. We also know that the CQL protocol has no boundaries to delimit the beginning and end of frames. Instead, it relies on length markers. When the program sent its second request, it indicated that its body required `23928` bytes; only `16375` of those were sent in the first packet. Cassandra expects `7553` more bytes and gets them from the second packet, even though that packet is meant to represent a different request. In other words, that frame’s header and the following `7544` bytes are interpreted as the body of the previous request’s frame.

That second packet has a length of `16384`. Cassandra consumes the first 7553 bytes and considers that request complete. There are still bytes in the receive buffer, so it keeps reading, interpreting those remaining bytes as the beginning of a new request. Therefore, the byte at offset `7553` of that packet (in the `tcpdump` output, accounting for 52 bytes of IP/TCP headers) must represent the protocol version of a new request. What’s its value?

It just so happens that, for this content, that byte had a value of `95`: the ASCII code for the underscore character in our sequence of integers.
From this point forward, the connection stream was corrupted and unrecoverable, with Cassandra being offset by the missing part of the frame. All ensuing requests on this connection would fail with a similar protocol version error.
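To make the failure mode concrete, here’s a small simulation (synthetic frames, not real CQL traffic) of a length-prefixed reader consuming a stream in which part of one frame went missing:

```python
import struct

HEADER = struct.Struct(">BBhBI")  # version, flags, stream id, opcode, body length

def frame(stream_id, body):
    """Build a synthetic request frame with a correct length prefix."""
    return HEADER.pack(0x04, 0x00, stream_id, 0x0A, len(body)) + body

def read_versions(stream):
    """Walk the stream frame by frame, collecting each 'version' byte."""
    versions, offset = [], 0
    while offset + HEADER.size <= len(stream):
        version, _, _, _, length = HEADER.unpack_from(stream, offset)
        versions.append(version)
        offset += HEADER.size + length  # trust the length prefix, like Cassandra
    return versions

body = b"0_1_2_3_4_" * 20  # stand-in for the underscore-separated content

healthy = frame(0, body) + frame(1, body)
assert read_versions(healthy) == [4, 4]  # every frame starts with 0x04

# Drop the tail of the first frame, as the driver effectively did.
corrupted = frame(0, body)[:-50] + frame(1, body)
assert read_versions(corrupted) == [4, 95]
# The reader landed mid-body: 95 is ord("_"), misread as a protocol version.
```

The reader skips `length` bytes of the truncated frame, lands in the middle of the next frame’s body, and dutifully reports an underscore as the “protocol version” of a new request.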
This explained the errors we saw but didn’t address why the TCP stream was truncated. I had to go deeper for that.
As much as I wanted to believe that TCP was to blame, that was very unlikely. It was clear the frames were being constructed correctly in the CQL driver, but maybe Mono’s native use of sockets was broken.
I don’t know C, so Mono’s native code wasn’t very appealing. I just wanted to see the corresponding system calls. Fortunately, Linux provides the `strace` tool to trace a process’s system calls, including those that are network related. I wanted to attach `strace` to the program before it started sending bytes, so I added a short `Thread.Sleep` before the `for` statement. This gave me time to set up `strace` on the command line

sudo strace -p <pid> -s 100 -f -tt -e trace=network &> tmp.out

- `-p <pid>` selects the process identified by `<pid>`
- `-s` controls the length of strings (before being truncated) in the output
- `-f` includes all child processes, i.e. all threads started by Mono
- `-tt` adds microsecond precision to the timestamps logged
- `-e trace=network` gives me all network-related system calls
`strace` generates a lot of output. Here are the system calls (from a different execution than the `tcpdump` above)

We can immediately tell Mono is using the `sendmsg` system call in its `Socket` implementation. The man page states

> On success, these calls return the number of characters sent. On error, -1 is returned, and errno is set appropriately.

The `sendmsg` call on line 3 in the output above sends two buffers (`msg_iov`) of sizes 16384 and 7553. This call returned a value of `23937`, indicating that all those bytes were sent. The call on line 5, however, returned a value of `16384` (in the `sendmsg` resumed). It seems to have been interrupted by a `SIGPWR` signal. The `sendmsg` on the last line then sends an entirely new request. The remaining 7553 bytes from the `sendmsg` on line 5 are lost. I hadn’t expected this. (We also see a 40 ms delay, similar to the one in the `tcpdump`, after the call that is “truncated”.)
Back in the .NET code, `sendmsg`’s positive return value does not represent an error, so C#’s `SocketAsyncEventArgs::Completed` callback does not treat it as one. The callback’s `SocketAsyncEventArgs::SocketError` has a value of `Success`. Mono’s implementation of `SendAsync` does, however, propagate that return value through `SocketAsyncEventArgs::BytesTransferred`, documented as

> Gets the number of bytes transferred in the socket operation.

Unfortunately, the .NET CQL driver never checks it and therefore never attempts to send the remaining bytes. I can’t say for sure whether this is a bug in the driver, since the documentation for `SendAsync` doesn’t mention the conditions under which it sends only part of the buffer. I asked this question on Stack Overflow to clarify, but I’ve yet to receive a satisfactory answer. I can only say that TCP is not to blame.
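Whatever the answer, the defensive pattern is the same in any language: treat the “bytes sent” return value as authoritative and loop until the buffer is drained. A Python sketch of that pattern (illustrating the idea, not the driver’s C# internals):

```python
def send_all(sock, data):
    """Loop until every byte is handed to the socket.

    Like sendmsg, send() may accept only part of the buffer (small send
    buffers, interruptions, ...). Ignoring its return value is exactly
    the oversight described above.
    """
    view = memoryview(data)
    while view:
        sent = sock.send(view)  # number of bytes actually accepted
        view = view[sent:]      # retry with whatever remains

# A stand-in socket that accepts at most 16384 bytes per call, mimicking
# the truncated sendmsg in the strace output.
class ChunkySocket:
    def __init__(self, limit=16384):
        self.limit = limit
        self.received = bytearray()

    def send(self, data):
        chunk = bytes(data[: self.limit])
        self.received += chunk
        return len(chunk)

sock = ChunkySocket()
send_all(sock, b"x" * 23937)        # the 16384 + 7553 bytes from the capture
assert len(sock.received) == 23937  # nothing lost this time
```

This is exactly what Python’s built-in `socket.sendall` does on top of `send`.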
I also opened JIRA ticket CSHARP-526 with the driver team to request that they verify the number of bytes sent and act accordingly. Until then, fortunately, the CQL driver lets us use a different socket API that does make a guarantee about the number of bytes sent: `NetworkStream`. Its documentation states

> When your application calls `BeginWrite`, the system uses a separate thread to execute the specified callback method, and blocks on `EndWrite` until the `NetworkStream` sends the number of bytes requested or throws an exception.

We can enable this with `SocketOptions::SetStreamMode`. If we do, our problem finally goes away. One small code change, a pull request, and a deployment later, and production is happy.
But I’m not.
What is a `SIGPWR` signal? Why does it interrupt the `sendmsg` system call? Why is there a 40 ms delay between `sendmsg` calls? Why can’t I reproduce this issue on Mac OS X? The `NetworkStream` solution is nice and all, but I’ll only be satisfied once I’ve answered all these questions. So here we go.
> The garbage collector uses SIGPWR and SIGXCPU if it is used with Linux threads. These should not be touched by the client program.
The Mono runtime sends `SIGPWR` signals to all of its threads so that they pause for a garbage collection cycle. That 40 ms delay? That was the stop-the-world time for Mono’s Boehm GC.
The “interrupted” behavior of the system call is hard to pin down. It can vary depending on what the executing thread was doing in the system call at the time. It can vary from one kernel version to another. In my experience, for `sendmsg` specifically, it can vary depending on the socket buffer sizes (notice they were set low in my program above, though that could also have affected the memory footprint of the application and thus garbage collection). It can apparently also vary depending on the MTU size of the interface serving the network traffic. There are likely many more factors. On RHEL 6, I observed the system call sending a packet the size of the MTU, `16436` bytes, and returning that value as the number of bytes sent, while more remained.
> […] increments the suspend count for `target_thread` and prevents the thread from executing any more user-level instructions.

As far as I can tell, it won’t interrupt a system call like `sendmsg`. That’s why the behavior is not reproducible on Mac OS X.
There you have it. Now I’m happy.
Bugs like these are f̶r̶u̶s̶t̶r̶a̶t̶i̶n̶g interesting since they provide a reason to explore your entire application stack. I now know more about the CQL driver we use extensively in production. I know more about .NET’s `Socket` API. I’ve gained some experience using useful debugging tools like `strace`. And I know more about signals and signal handlers.
I also know not to rush to conclusions that may embarrass me.