Bigtable: from hero, to villain, back to hero
Bigtable (BT), or more specifically Google’s Cloud Bigtable product, is Google’s cloud-native NoSQL storage solution that runs at petabyte scale. You can read a lot more about it in the official documentation.
Given the massive scale at which ShareChat operates, we pick and choose different databases for different needs rather than taking a one-size-fits-all approach. Specific to BT, we have various use cases that cover the entire spectrum of what BT provides.
This article talks about some interesting scenarios we encountered with BT and the troubleshooting steps we followed through the entire process.
For a few days, we had been experiencing application aborts when writing to a specific BT table. Let’s call this the keyValue table. There was nothing special about the operation: it does a simple increment of a counter value and keeps track of the state/count of the application actions.
The latency for a particular service had spiked to a high value and was showing no signs of coming back down. We tried scaling the k8s cluster that runs the application pods and scaling the BT nodes in the cluster, but had no luck.
Errors and initial debugging
The specific application error looked like this (the highlight is mine):
ReadModifyWrite errors, with message “Error while mutating the row XXX : Reached maximum allowed row size.”
This looked like a BT bug, since we were simply trying to increment a value using the ReadModifyWriteRow class and the row was confirmed to hold a simple counter, which could not be such a large value. We ignored this and kept going.
We also saw increased latency around the ReadModifyWriteRow operations, almost as if the errors were the end result of timeouts.
The monitoring dashboard for this instance showed nothing spectacular, except that the problem had started out of the blue one evening. No code changes were rolled out, no experiments were turned on, and no interesting traffic patterns were observed at that time either.
The system error rate for this instance was a low 2–3% and did not seem big enough to cause the higher impact we were seeing on the service. But…when we flipped to the table-view, it was clear that it was this one table that was causing all the impact. It was getting averaged-out in the instance view.
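To see how one failing table can hide inside an instance-level average, here is a minimal sketch with made-up numbers. The tables other than keyValue, and all request and error counts, are hypothetical and are not our real metrics. It computes the error rate per table and the request-weighted error rate for the whole instance.

```python
# Hypothetical request and error counts per table (illustrative only,
# not real production metrics).
tables = {
    "keyValue": {"requests": 5_000, "errors": 2_500},
    "tableA": {"requests": 60_000, "errors": 0},
    "tableB": {"requests": 35_000, "errors": 0},
}


def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there were no requests."""
    return errors / requests if requests else 0.0


# Per-table view: each table is judged on its own traffic.
per_table = {
    name: error_rate(t["errors"], t["requests"]) for name, t in tables.items()
}

# Instance view: errors from one table are diluted by every table's traffic.
total_requests = sum(t["requests"] for t in tables.values())
total_errors = sum(t["errors"] for t in tables.values())
instance_rate = error_rate(total_errors, total_requests)

for name, rate in per_table.items():
    print(f"{name}: {rate:.1%}")
print(f"instance: {instance_rate:.1%}")
```

With these assumed numbers, half of the writes to keyValue fail, yet the instance view reports only 2.5%, because the busier, healthy tables dominate the denominator.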
Note to self: always open up the BT table-specific monitoring views during debugging.
Key Visualizer? Next steps?
Could it be a hot-spotting issue? Unfortunately, BT only provides the awesome Key Visualizer for tables greater than 30GB. Since ours did not meet that criterion, we could not take advantage of it.
And why was BT complaining about a large row size when all we were doing was a simple counter increment?
So, we put all the notes we had until now together:
- No code changes or traffic patterns were contributing to this.
- The sudden impact started one fine evening.
- QPS was not too high and other tables with higher QPS were performing better.
- Could the BT ReadModifyWriteRow API have a bug in it?
There was one key factor: we had fully migrated over to GCP within the last 10 days. Could this have been a ticking time bomb that had just exploded?
BT supports versions by default.
In simpler terms, every update to a value also keeps the previous values, up to a configurable count. These configurations are called garbage collection policies.
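The versioning behaviour described above can be sketched with a toy, in-memory model. This is not the BT client API: the `VersionedCell` class and its method names are hypothetical, and the model only assumes that an increment stores its result as a new 8-byte big-endian integer version while older versions stay around until a max-versions policy trims them.

```python
import struct


class VersionedCell:
    """Toy, in-memory model of one cell that keeps every written version."""

    def __init__(self):
        # (timestamp, value_bytes) pairs, oldest first, newest last.
        self.versions = []

    def latest(self):
        """Return the newest counter value, or 0 if nothing was written yet."""
        if not self.versions:
            return 0
        return struct.unpack(">q", self.versions[-1][1])[0]

    def increment(self, delta, timestamp):
        """Read the newest value, add delta, and store the result as a NEW version."""
        new_value = struct.pack(">q", self.latest() + delta)
        self.versions.append((timestamp, new_value))

    def stored_bytes(self):
        """Bytes held by all versions (values only, ignoring storage overhead)."""
        return sum(len(value) for _, value in self.versions)

    def apply_max_versions(self, n):
        """Mimic a maxversions=n policy by keeping only the n newest versions."""
        if n < 1:
            raise ValueError("n must be at least 1")
        self.versions = self.versions[-n:]


cell = VersionedCell()
for ts in range(1, 10_001):
    cell.increment(1, timestamp=ts)

# (latest value, number of versions, bytes stored)
before = (cell.latest(), len(cell.versions), cell.stored_bytes())
cell.apply_max_versions(2)
after = (cell.latest(), len(cell.versions), cell.stored_bytes())

print("before GC:", before)
print("after GC: ", after)
```

In this model the latest value is identical before and after trimming, so an application that only ever reads the newest version has no way of noticing how many old versions are piling up behind it.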
This is a great feature and can reduce application complexity greatly when used correctly. In our case, we were always fetching only the latest version, so we assumed the row size could not have reached the large limit BT was complaining about.
Unless…all the versions are being looked up each time and their total size checked before the write happens. Remember, our error message contained “…Reached maximum allowed…”. Google Support, who were already engaged by that time, came back with the observation that one of their tablets was blocked by a row whose size was exceeding limits.
Garbage Collection Policy
All theories now pointed to the possibility that thousands and thousands of updates had created so many versions that the tablet holding that data was failing a sanity check and rejecting the update. We were about 10 days past the migration and still in the process of applying optimizations. It looked like we had to apply the garbage collection policy for the table right now.
The policy looked like this:
cbt -instance=<instance-name> -project=sharechat-production setgcpolicy keyValue <column-family> maxversions=2
Note that BT does not guarantee when the GC policy will be applied. The documentation states that it can at times take about a week. In our case, it took about 30 minutes, after which we saw something interesting but expected: the table size came down dramatically to a very low number and the errors just disappeared.
So it looked like the tablet was running into a sanity check, where the millions of versions stored for our counter were causing it to error out.
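As a rough back-of-the-envelope check on the "millions of versions" figure, the sketch below estimates how many versions of a counter fit in one row before a size limit is reached. Everything here is an assumption for illustration: the 256 MB per-row limit is our reading of the documented BT limits, and the bytes-per-version figures and the increment rate are hypothetical, since we do not know how much overhead BT stores per version.

```python
# Assumed per-row size limit (our reading of the documented BT limits).
ROW_LIMIT_BYTES = 256 * 1024 * 1024
SECONDS_PER_DAY = 86_400


def versions_until_limit(bytes_per_version, limit=ROW_LIMIT_BYTES):
    """How many versions of a given size fit in one row before the limit."""
    return limit // bytes_per_version


def days_until_limit(bytes_per_version, increments_per_second, limit=ROW_LIMIT_BYTES):
    """Days of steady increments on one row before the limit is reached."""
    versions = versions_until_limit(bytes_per_version, limit)
    return versions / increments_per_second / SECONDS_PER_DAY


# 8 bytes is the bare counter value; 32 bytes is a hypothetical figure
# that includes some per-version overhead (timestamp, metadata).
for size in (8, 32):
    versions = versions_until_limit(size)
    days = days_until_limit(size, increments_per_second=10)
    print(f"{size} bytes/version: {versions:,} versions, about {days:.1f} days at 10 increments/s")
```

Under these assumptions it takes somewhere between several million and a few tens of millions of increments to a single row to reach the limit, which is consistent with a problem that builds silently and then appears without any change in code or traffic.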
- Learn which view of your monitoring tools is the most useful. In our case, we found the table metrics view more useful than the instance one (which is also the default).
- Besides traffic and code, usage patterns that build up over time can cause issues. They come without warning, like most other issues.
- Do not put off optimization for too long. Setting up a version limit had always been on our bucket list, but it had to be fast-tracked because of these issues.
- Lastly, error messages, even confusing ones, are a good hint to what is going on. If we had paid attention to ours from the beginning, we could have narrowed it down to the version issue faster.