The scredis Driver Malfunctions in Pathological Environments
We recently experienced an outage here at AppNexus involving an ephemeral user session store. This is a story of how visibly correct code can still fail in unexpected ways.
We have a Scala service based on the Play framework that is a simple wrapper around Redis storage of user sessions. This service is completely separate from authentication and exists merely to abstract the Redis KV store so that other services in our infrastructure can talk to it in a standard, signed, and RESTful fashion. The service runs on the JVM inside an OS-level virtual machine and uses the scredis driver to communicate with Redis.
The code can be essentially boiled down to:
Literally: call get on the scredis driver with some String key, optionally set an expiration if one is provided, and then return the result. Simple.
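In sketch form, that read path looks something like this (a simplified illustration rather than the production code; the class, method names, and exact scredis calls are only indicative):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scredis.Redis

// Simplified sketch of the session wrapper's read path: fetch a session blob
// by key and, if a TTL was supplied, refresh the key's expiration.
// Names and signatures are illustrative, not the production code.
class SessionStore(redis: Redis) {

  def fetch(key: String, ttlSeconds: Option[Int]): Future[Option[String]] =
    for {
      value <- redis.get(key)                              // GET key -> Future[Option[String]]
      _     <- ttlSeconds match {                          // optionally refresh the expiration
                 case Some(ttl) => redis.expire(key, ttl)  // EXPIRE key ttl (assuming this scredis call)
                 case None      => Future.successful(false)
               }
    } yield value
}
```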
Recently the root OS of the machine on which the VM was deployed came under memory pressure. As a result, the scheduling of each individual VM on the host machine experienced delays. The box itself was in fairly heavy distress, to the point where it was dropping network packets and suspending execution of individual VMs for seconds at a time.
(somekey and someindex above are obfuscations of the actual data)
Understandably, our simple wrapper application was experiencing timeouts in communication with the Redis server; the OS was not giving the VM enough scheduler priority to execute things. However, we also saw other strange behavior: RedisProtocolExceptions while trying to parse the responses we did get back, implying corruption or packet loss; and, strangest of all, completely mismatched data, where a request for key A would return data for key B! For those of you unfamiliar with KV storage, that is akin to executing a SQL statement for some primary key in a MySQL database and getting back data for a totally different row. The last behavior is the scary part, since it appeared with no related processing error.
We initially thought the problem must be in our code: maybe a buffer we were holding onto and not clearing on an Exception, maybe an off-by-one in our handling of requests and responses, or maybe some assumption about Actor execution order exposing a race condition. We were seeing RedisProtocolExceptions about unexpected responses for some requests:
This was telling us that a SET operation, which should return a simple string reply of 'OK' (on the wire, the RESP simple string +OK), was instead returning something that was not a simple string reply. This could still have been packet corruption somewhere lower in the network stack, but coupled with the mismatched-data behavior we were pretty sure it pointed to the scredis driver itself.
The frustrating part is that we could not reproduce this issue in our testing environments. The working theory was that packet loss and/or lack of process priority was exposing some race condition in the driver resulting in the weird behavior we were seeing. So we constructed a test.
Test setup
On CentOS 5
Then we wrote a test program (note: I am not a Scala dude, so forgive any non-Scalaisms); a rough sketch of it follows the list below. Simply:
- Loop forever
- Go through our 1 to 100 key set
- Get the value for the key from redis
- If the key does not match the value, print out that fact
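In sketch form, that loop looks something like the following (the key seeding, client setup, and timeouts here are illustrative rather than the exact program):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}
import scredis.Redis

object MismatchTest extends App {
  val redis = Redis()                      // connects to the local redis by default
  val keys  = (1 to 100).map(_.toString)

  // Seed each key with a value equal to the key itself, so any response that
  // does not echo the key back is a mismatch.
  Await.result(Future.sequence(keys.map(k => redis.set(k, k))), 30.seconds)

  while (true) {
    for (key <- keys) {
      Try(Await.result(redis.get(key), 10.seconds)) match {
        case Success(Some(value)) if value != key =>
          println(s"MISMATCH: asked for key $key but got value $value")
        case Success(_) =>
          () // matching value (or no value); keep looping
        case Failure(e) =>
          println(s"ERROR on key $key: ${e.getMessage}") // timeouts, protocol errors, etc.
      }
    }
  }
}
```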
We then ran this for a long time with the packet loss on the test instance and still could not reproduce it. Despair.
Eventually someone suggested sending SIGSTOP and then SIGCONT to the redis-server process to simulate the lack of responses or timeouts from the redis server. And then, glorious output was glorious:
We had reproduced the weird behavior of this driver! We were asking for key 5 and getting back value 99. It's possible that redis itself was malfunctioning, but we quickly discarded that idea: with so many users worldwide, we were pretty sure we would have heard about such a bug at some point, and Google searches revealed nothing.
At this point we set a breakpoint in the debugger in the ListenerActor, in handleReceiveTimeout, and added a condition that we only break if remainingByteStringOpt was set.
So after trying a few more times with SIGSTOP / SIGCONT we hit the condition again and now had it in the debugger.
The first thing the driver does is send a message to the current ioActor reference telling it to shut down. Based on how the actor model functions, this is really just a message enqueued into that actor's mailbox, to be executed the next time it gets a turn. Next the driver calls become to change the state machine of this actor into reconnecting. become is just passed up to the context:
This passes up to the ActorCell implementation in Akka, which replaces the top item in the behaviorStack:
Which eventually executes reconnecting:
Since isReceiveTimeout is true here, we will failAllSentRequests, which simply dumps everything in the requests queue (including the request that matches the data in remainingByteStringOpt), and then we go to reconnect:
This overwrites our ioActor reference with a new one, adds it to the context, and switches the state machine yet again. Note that remainingByteStringOpt still contains remnant data at this point:
In connecting we re-establish the connection to the Redis server and then sendAllQueuedRequests:
We then swing all the way back around to receive, where we prepend this remnant data to the buffer to be processed. This remnant data, which belonged to an older, terminated request, was never finished being processed by the driver:
All of this results in data that mismatches our original request.
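To make the failure mode concrete without all the actor machinery, here is a standalone toy model of a pipelined client (not scredis code): replies are matched to pending requests purely by order, so if a partially received reply survives a timeout-triggered reconnect, it gets matched to whatever request is sent next.

```scala
import scala.collection.mutable

// Toy model of the failure: replies are matched to pending requests purely by
// order, and a partially received reply ("remnant" bytes) is kept across a
// timeout/reconnect instead of being discarded.
object StaleBufferDemo extends App {
  val pending = mutable.Queue.empty[String]   // requests awaiting replies, oldest first
  var remnant: Option[String] = None          // unparsed reply data, like remainingByteStringOpt

  def send(request: String): Unit = pending.enqueue(request)

  // Pretend every reply is a single token terminated by '\n'.
  def onReplyData(data: String): Unit = {
    val parts    = (remnant.getOrElse("") + data).split("\n", -1)
    val complete = parts.dropRight(1)
    remnant      = Option(parts.last).filter(_.nonEmpty)
    complete.foreach { reply =>
      if (pending.nonEmpty) println(s"${pending.dequeue()} -> $reply")
      else println(s"orphan reply with no pending request: $reply")
    }
  }

  send("GET key:1"); onReplyData("value:1\n")   // normal operation

  send("GET key:2"); onReplyData("val")         // reply for key:2 arrives cut off mid-stream...
  pending.clear()                               // ...receive timeout: failAllSentRequests
  // remnant = None                             // <-- the missing step: discard the stale bytes

  send("GET key:3"); onReplyData("ue:2\nvalue:3\n")
  // Prints: GET key:3 -> value:2   (mismatched data, as in the outage)
  //         orphan reply with no pending request: value:3
}
```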
If you are having trouble following the execution path of the code above, do not despair; I have trouble following it as well. I believe this is a case where the complexity in the execution path introduced by using the actor model for a synchronous protocol makes it very difficult to reason about correct behavior. The state machine in an Actor is hard to get right, and when you split the state management for a connection to a database across multiple actors (ListenerActor, IOActor), you run the risk of getting the state machine wrong in one of them. It might simplify the execution path of this driver to use a single Actor to deal with connection state and network IO, centralising the state management in one place, and then fire off Actors to deal with each complete response.

I believe the fix for the above is simple: in handleReceiveTimeout, simply set remainingByteStringOpt to None and discard any unprocessed bytes. But this code path is so complex that there might actually be a valid case for not discarding it, making the fix more complicated.
The fix for the above has been submitted as a pull request to the scredis maintainers. I have also pre-shared this write-up with them.
Written by Riley Berton