FlightMight was having trouble with Cloud Bigtable performance from their clients around the globe. Their aircraft-monitoring system samples instrument data hundreds of times a second and submits it all back to a server for storage and analysis.
Sadly, though, their aircraft operate all over the world, over shoddy connections with rapidly changing latency, leaving their performance when talking to Cloud Bigtable less than ideal.
Location, location, location.
The issue here is that Cloud Bigtable, in order to be as performant as possible, is local to a specific region. As such, accessing Cloud Bigtable from clients around the world quickly puts you at the mercy of latency, which is why we recommend that any client accessing Bigtable be not only in the same region, but in the same zone.
To show this off, here’s a small test I ran (using a Cloud Bigtable loadtest tool written in Go) with a client in us-central1-c connecting to clusters in different zones:
Don’t get me wrong, a 24ms/26ms 95th-percentile read/write latency is extremely impressive, but it nearly doubles when we move the cluster to us-east4-a; and if you have a client that’s connecting to Bigtable repeatedly (say, pushing up time-stamped samples), this overhead starts to add up.
Bridging the gap.
Now, let’s be clear — This is not a challenge unique to Cloud Bigtable. The general challenge lies in the fact that we have a resource that’s geographically separated from our clients, and the latency overhead of repeated transactions can cause performance issues. Here on Cloud Performance Atlas, we’ve seen this same issue with respect to latency in Networking, Cloud Storage, Cloud Datastore and App Engine, so we’ve got a nice set of tools at our disposal.
Here are a few ideas we tossed around:
1) Use more clusters. From the Cloud Platform Console, if you create additional clusters, Cloud Bigtable will immediately start replicating data between those clusters (and we require the clusters to be in different zones). This is obviously the most straightforward way to solve this problem, as it allows your data to be replicated to geographically diverse locations. On the down side, this solution comes with extra cost, since you’ve got a second cluster running.
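For reference, adding a replica cluster to an existing instance is a one-liner with the gcloud CLI; the instance and cluster IDs below are hypothetical placeholders.

```shell
# Add a second cluster to an existing Bigtable instance, in a
# different zone, so data replicates closer to distant clients.
# IDs here are made-up examples.
gcloud bigtable clusters create flightmight-replica \
    --instance=flightmight-instance \
    --zone=us-east4-a \
    --num-nodes=3
```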
2) Update the clients to use multithreaded push/pull. This would mean that the overhead of latency wouldn’t be hurting the performance of the client, although we’d need to make the client robust enough to queue up samples, submit them in batches, and deal with situations where we lose connectivity. In this scenario, the latency is mitigated from the client’s perspective, but we end up with a lot of potential edge cases we might need to discover and code for. (Imagine a scenario where push latency is ~80ms, and we’ve got a device that’s sampling information every ~10ms.)
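The batching half of option #2 can be sketched with a channel and a flush loop. This is a local illustration only: `Sample` and `submit` are hypothetical stand-ins, not real Bigtable client calls, and the real version would also need timeouts and retry logic for lost connectivity.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a hypothetical instrument reading.
type Sample struct {
	Name  string
	Value float64
	TS    time.Time
}

// batcher drains samples from ch and flushes them in groups, so one
// high-latency push covers many samples instead of one push each.
// submit stands in for the real (batched) Bigtable write.
func batcher(ch <-chan Sample, batchSize int, submit func([]Sample)) {
	batch := make([]Sample, 0, batchSize)
	for s := range ch {
		batch = append(batch, s)
		if len(batch) == batchSize {
			submit(batch)
			batch = make([]Sample, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		submit(batch) // flush the remainder on shutdown
	}
}

func main() {
	ch := make(chan Sample)
	done := make(chan int)

	go func() {
		pushes := 0
		batcher(ch, 8, func(b []Sample) { pushes++ })
		done <- pushes
	}()

	// Simulate a device producing 20 readings.
	for i := 0; i < 20; i++ {
		ch <- Sample{Name: "altitude", Value: float64(i), TS: time.Now()}
	}
	close(ch)
	fmt.Println("pushes:", <-done) // 20 samples in batches of 8 -> 3 pushes
}
```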
3) Use Cloud Pub/Sub &amp; an f1-micro instance. Cloud Pub/Sub is a fantastic at-least-once message delivery service for GCP. If we have our clients push their sample data into Cloud Pub/Sub, then we don’t need any buffering overhead on the client. On the flip side, though, we would need to spin up a permanent Compute Engine VM instance, in the same zone, which pulls the messages from Cloud Pub/Sub and then submits them to Bigtable directly. The challenge with this setup is twofold. Firstly, we’d need to use an f1-micro Compute Engine VM instance in order to keep costs down, which might not have the performance we need for connections (thus we’d need to scale, incurring higher costs). Secondly, this really only works for write operations from the client; reading data would need a separate path, which adds complexity of its own to the endpoint.
4) Serverless endpoint. Cloud Functions and App Engine can offer the positives of the options above without the downsides. For both reads and writes, Cloud Functions and App Engine can scale to thousands of instances, each one properly connecting to the nodes in a Cloud Bigtable cluster (which, as we’ve talked about, are designed to handle 10k QPS of reads or writes per node if you’re using SSD storage). And placing the serverless systems in the same region as the Cloud Bigtable cluster means that per-request latency stays inside Google’s high-speed network, which can improve performance.
And for FlightMight, that was all they needed to hear. They adjusted their clients to push content to Cloud Pub/Sub, which triggers a Cloud Function deployed in the same region as their Cloud Bigtable cluster, giving them the performance they need and lowering their cost.