Mining ZooKeeper’s transaction log to track down bugs
At Midokura, we use ZooKeeper to store the virtual network topology and device tables for our SDN product. A couple of weeks ago we started seeing an issue in one of our QA environments: our agents, which run on each hypervisor to provide the network overlay, were receiving “Packet len is out of range” errors when accessing ZooKeeper.
This is a relatively well-known error. It occurs when a client reads a znode, or the children of a znode, and the size of the response exceeds jute.maxbuffer (which defaults to 1MB). The agents were trying to read something that had grown very large, but we didn't know what. We connected to ZooKeeper, but no node was particularly big, either in terms of data size or number of children. This stumped us for a while. Clearly something large enough to exceed the 1MB limit had existed at that point in time, but it no longer seemed to exist. We needed a way to see exactly what the ZooKeeper state looked like at that point in time.
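As a rough illustration of how this failure arises, here is a minimal Java sketch against the standard ZooKeeper client API that piles enough ephemeral children under one znode for a single getChildren() response to exceed the default 1MB jute.maxbuffer. The connect string, paths and counts are made up for the example; it is not taken from our agent code.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PacketLenRepro {
    public static void main(String[] args) throws Exception {
        // Wait until the session is established before issuing requests.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        zk.create("/bloated", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // 20,000 children with ~100-character names serialize to well over
        // 1MB once they are all returned in a single getChildren() response.
        String prefix = "/bloated/child-" + "x".repeat(90) + "-";
        for (int i = 0; i < 20000; i++) {
            zk.create(prefix, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }

        // Listing the children now blows past jute.maxbuffer: the reading
        // client logs "Packet len ... is out of range", drops the connection,
        // and the caller typically sees a ConnectionLossException instead of
        // a result.
        List<String> children = zk.getChildren("/bloated", false);
        System.out.println(children.size());
    }
}
```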
Fortunately, this is possible with ZooKeeper. ZooKeeper is a replicated state machine. Writes are replicated to each node in the ensemble using a protocol called ZAB before being applied to the ZooKeeper state, and ZAB records these updates in a transaction log on disk on each ZooKeeper server. By starting with a clean ZooKeeper state and applying the entries in the log in order, you can rebuild the ZooKeeper state at any point in time. Of course, in a cluster that has been running for years you may not have a log reaching all the way back to the beginning of time, so ZooKeeper periodically takes a snapshot of its state and stores it on disk. In that case, you just need to find the last snapshot taken before the point in time you are interested in and play the transaction log forward from there.
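The core of that replay can be sketched in a few lines of Java using ZooKeeper's own persistence classes (FileSnap, FileTxnLog, DataTree). This is only a sketch of the technique, not necessarily how our tool is implemented: these are server-internal classes whose signatures have moved around between ZooKeeper versions, the directory paths and cut-off timestamp are placeholders, and for simplicity it assumes the newest snapshot on disk predates the point in time you care about.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.zookeeper.server.DataTree;
import org.apache.zookeeper.server.persistence.FileSnap;
import org.apache.zookeeper.server.persistence.FileTxnLog;
import org.apache.zookeeper.server.persistence.TxnLog.TxnIterator;
import org.apache.zookeeper.txn.TxnHeader;

public class ReplayToTimestamp {
    public static DataTree rebuildAt(File snapDir, File logDir, long cutoffMillis)
            throws Exception {
        DataTree tree = new DataTree();
        Map<Long, Integer> sessions = new HashMap<>();

        // Load the most recent valid snapshot into the tree; deserialize()
        // returns the last zxid the snapshot covers (-1 if there is none).
        long snapZxid = new FileSnap(snapDir).deserialize(tree, sessions);

        // Replay the transaction log from there, stopping at the first
        // transaction committed after the cut-off time.
        TxnIterator it = new FileTxnLog(logDir).read(snapZxid + 1);
        try {
            do {
                TxnHeader hdr = it.getHeader();
                if (hdr == null || hdr.getTime() > cutoffMillis) {
                    break; // ran out of log, or reached the point of interest
                }
                if (hdr.getZxid() > snapZxid) { // skip txns the snapshot already covers
                    tree.processTxn(hdr, it.getTxn());
                }
            } while (it.next());
        } finally {
            it.close();
        }
        return tree; // the ZooKeeper state as of cutoffMillis
    }
}
```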
This is exactly what we did, and we wrote a tool to do it easily. To run the tool, you just point it at your snapshot and transaction log directories and specify the point in time at which you want to see the ZooKeeper state. The tool then rebuilds the state and dumps it to the console.
Running the tool against the ZooKeeper data from our QA environment showed that one of our agents was creating a huge number of ephemeral nodes under a single znode. When the other agents tried to get the children of this znode (they were watching for changes to its children), they would get the “Packet len is out of range” exception. But when the agents restarted, after having been unable to access ZooKeeper for a period, these ephemeral nodes were cleaned up, since ephemeral nodes only live as long as the session that created them. This is why we couldn't find the large znode after the fact.
Not only were we able to see which nodes were being created, but we could also see what was in them. This gave us more clues as to which code path was at fault. We eventually narrowed it down to one rarely exercised code path, which we have since removed.
Some days later we were having another “problem” with ZooKeeper: something was creating a lot of short-lived connections. This left us suspicious. We had been having ZooKeeper problems around that time; maybe these connections were a contributing cause?
Our agent creates a ZooKeeper session which it maintains for the lifetime of the agent, which should be on the order of weeks to months. These connections were only living for 10 seconds or so. One way to figure out what was creating them would be to see what each of these ZooKeeper sessions was doing.
Having previously had success digging into ZooKeeper’s history, we figured we could pull this information out of the transaction log quite easily. We wrote a tool, GrepOps, which lets you grep the log by znode path, by session id, or by both, and shows you all the operations that match your criteria.
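A stripped-down version of the idea looks like the Java sketch below: walk the transaction log with ZooKeeper's internal FileTxnLog class and print every operation issued by a given session or touching a given path. The filtering and output format here are invented for illustration and are not necessarily what GrepOps does.

```java
import java.io.File;
import org.apache.jute.Record;
import org.apache.zookeeper.server.persistence.FileTxnLog;
import org.apache.zookeeper.server.persistence.TxnLog.TxnIterator;
import org.apache.zookeeper.txn.CreateTxn;
import org.apache.zookeeper.txn.DeleteTxn;
import org.apache.zookeeper.txn.SetDataTxn;
import org.apache.zookeeper.txn.TxnHeader;

public class GrepTxnLog {
    // Pass null for a filter you don't care about.
    public static void grep(File logDir, String pathFilter, Long sessionFilter)
            throws Exception {
        TxnIterator it = new FileTxnLog(logDir).read(0);
        try {
            do {
                TxnHeader hdr = it.getHeader();
                if (hdr == null) break; // no more transactions
                Record txn = it.getTxn();

                String path = pathOf(txn);
                boolean pathMatch = pathFilter == null
                        || (path != null && path.startsWith(pathFilter));
                boolean sessionMatch = sessionFilter == null
                        || hdr.getClientId() == sessionFilter;

                if (pathMatch && sessionMatch) {
                    System.out.printf("zxid=0x%x session=0x%x time=%d type=%d path=%s%n",
                            hdr.getZxid(), hdr.getClientId(), hdr.getTime(),
                            hdr.getType(), path);
                }
            } while (it.next());
        } finally {
            it.close();
        }
    }

    // Pull the znode path out of the transaction types we care about here.
    private static String pathOf(Record txn) {
        if (txn instanceof CreateTxn)  return ((CreateTxn) txn).getPath();
        if (txn instanceof DeleteTxn)  return ((DeleteTxn) txn).getPath();
        if (txn instanceof SetDataTxn) return ((SetDataTxn) txn).getPath();
        return null; // session events and other op types carry no path
    }
}
```

Every transaction header carries the session id, the zxid and a timestamp, which is what makes this kind of per-session grep possible.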
In this particular case, we found that the sessions were either creating or deleting port bindings. Knowing the exact sequence of operations let us narrow down the possible code paths that could be creating these connections, and we eventually discovered that it had to be the mm-ctl tool, which is used for integration with OpenStack Neutron. It turned out to have nothing to do with the ZooKeeper problems we had been seeing, but at least, with this tool, we could rule out the possibility.
Both the TimeMachine and GrepOps tools are available on GitHub: http://github.com/midonet/zktimemachine