Monitoring UI: A Way to Understand Bigtable Performance

Irvi Aini
Google Cloud - Community
4 min read · Sep 3, 2022

When we are using Bigtable, there are some things that need to be considered, including its health and performance. Depending on the kind of application we build, different metrics will matter to different degrees when we decide what to monitor.

When an instance is created, Bigtable provides built-in monitoring tools that report the performance of nodes, clusters, and instances. The default monitoring includes, but is not limited to: CPU utilization, error rate, storage utilization, rows read/written, read/write requests, and read/write throughput.
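Beyond the charts in the console, the same metrics can be pulled programmatically through the Cloud Monitoring API. Below is a minimal sketch, assuming the google-cloud-monitoring Python library and a placeholder project ID, that reads the per-cluster average CPU load for the last hour:

```python
# A minimal sketch of reading a built-in Bigtable metric (average CPU load)
# through the Cloud Monitoring API. The project ID is a placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# bigtable.googleapis.com/cluster/cpu_load reports average CPU utilization
# per cluster as a fraction between 0.0 and 1.0.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/cpu_load"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    cluster = series.resource.labels.get("cluster")
    latest = series.points[0].value.double_value
    print(f"cluster={cluster} cpu_load={latest:.2%}")
```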

In Bigtable, the storage capacity of a cluster is determined by the storage type (SSD/HDD) and the number of nodes. As the stored data grows, Bigtable needs to redistribute it across the cluster. The recommendation is to keep disk utilization below 70% of the cluster’s storage capacity, and below 60% for latency-sensitive applications. To avoid delays, we can also add more nodes.
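As a rough illustration of that rule of thumb, the sketch below estimates the minimum node count for a given data size. The per-node capacities are assumptions (roughly 5 TB per SSD node and 16 TB per HDD node at the time of writing) and should be checked against the current Bigtable quotas:

```python
# A back-of-the-envelope sketch of the storage-utilization guidance above.
# The per-node capacities are assumptions; verify them against current
# Bigtable documentation before relying on the numbers.
import math

TB = 1024 ** 4
CAPACITY_PER_NODE = {"SSD": 5 * TB, "HDD": 16 * TB}  # assumed capacities

def recommended_nodes(data_bytes: int, storage_type: str = "SSD",
                      latency_sensitive: bool = False) -> int:
    """Minimum node count that keeps disk utilization within the target."""
    target = 0.60 if latency_sensitive else 0.70
    return math.ceil(data_bytes / (CAPACITY_PER_NODE[storage_type] * target))

# e.g. 12 TB of SSD data for a latency-sensitive workload -> 4 nodes
print(recommended_nodes(12 * TB, "SSD", latency_sensitive=True))
```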

The nodes of a cluster use CPU resources to handle read and write requests. A streaming job with heavy throughput, for example, will likely be CPU heavy. When resource usage exceeds the threshold, the cluster will no longer be performant and may return errors whenever data needs to be read or written. For optimal performance, average CPU utilization should remain below 70%, and CPU utilization of the most-used (hottest) node should remain below 90%.
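Expressed as a simple check (values are fractions between 0.0 and 1.0, matching how the Bigtable CPU metrics are reported), the guidance looks roughly like this:

```python
# A minimal sketch of the CPU guidance above; the thresholds come from the
# recommendations in this article, not from any API.
def cpu_pressure(avg_cpu: float, hottest_node_cpu: float) -> list:
    """Return warnings when a cluster exceeds the recommended CPU targets."""
    warnings = []
    if avg_cpu > 0.70:
        warnings.append(f"average CPU {avg_cpu:.0%} exceeds the 70% target")
    if hottest_node_cpu > 0.90:
        warnings.append(f"hottest node CPU {hottest_node_cpu:.0%} exceeds the 90% target")
    return warnings

print(cpu_pressure(0.65, 0.93))
# -> ['hottest node CPU 93% exceeds the 90% target']
```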

By observing both disk and CPU usage, we can determine how to configure an autoscaler for the cluster. When autoscaling is enabled, Bigtable adjusts the number of nodes that actually exist in the cluster. The value chosen as the maximum number of nodes should be the number of nodes the cluster needs to handle the workload’s heaviest traffic, while staying within the budget that has been prepared.
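Autoscaling can be enabled in the console, with gcloud, or from code. The sketch below assumes a recent google-cloud-bigtable release that exposes autoscaling parameters on Instance.cluster(); the project, instance, and cluster IDs are placeholders:

```python
# A hedged sketch of enabling Bigtable autoscaling from Python. Parameter
# names assume a google-cloud-bigtable version with autoscaling support;
# check the library's documentation if they differ.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)  # placeholder project
instance = client.instance("my-instance")                   # placeholder instance

cluster = instance.cluster(
    "my-cluster",
    min_serve_nodes=1,           # smallest, most cost-efficient size
    max_serve_nodes=10,          # enough nodes for peak traffic, within budget
    cpu_utilization_percent=70,  # the CPU target discussed above
)
cluster.update()
```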

As the number of nodes changes, Bigtable continuously optimizes storage, rebalancing data across the nodes to ensure that traffic is spread evenly and no node is overloaded. When a cluster is scaled up, Bigtable automatically rebalances the nodes for optimal performance. All requests continue to reach the cluster while scaling and rebalancing are in progress. If a cluster has scaled up to its maximum number of nodes and the CPU utilization target is exceeded, requests might have high latency or fail. If a cluster has scaled up to its maximum number of nodes and the storage utilization limit is exceeded, write requests will fail. When a cluster is scaled down, nodes are removed at a slower rate than when scaling up, to prevent any impact on latency.

We can set the minimum as low as 1 to ensure that the cluster can scale down to its smallest, most cost-efficient size when possible. The cluster never becomes too small, because the autoscaler prevents the node count from dropping below the minimum needed to maintain the CPU and storage utilization targets. When manually decreasing the number of nodes to scale a cluster down, try not to reduce the cluster size by more than 10% in a 10-minute period. Scaling down too quickly can cause performance problems, such as increased latency, if the remaining nodes in the cluster become temporarily overwhelmed.
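For manual resizing, that guidance translates into something like the sketch below, which assumes the google-cloud-bigtable admin client and placeholder IDs, and caps each step at roughly a 10% reduction with a ten-minute pause between steps:

```python
# A minimal sketch of gradual manual scale-down, assuming the
# google-cloud-bigtable admin client; IDs are placeholders.
import time

from google.cloud import bigtable

def scale_down_gradually(project, instance_id, cluster_id, target_nodes):
    client = bigtable.Client(project=project, admin=True)
    cluster = client.instance(instance_id).cluster(cluster_id)
    cluster.reload()

    while cluster.serve_nodes > target_nodes:
        # Remove at most ~10% of the current nodes (and at least one) per step.
        step = max(1, cluster.serve_nodes // 10)
        cluster.serve_nodes = max(target_nodes, cluster.serve_nodes - step)
        cluster.update()
        if cluster.serve_nodes > target_nodes:
            time.sleep(600)  # wait 10 minutes before the next reduction
            cluster.reload()

# e.g. scale_down_gradually("my-project", "my-instance", "my-cluster", 3)
```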

If there are problems with the schema design, adding nodes to the Bigtable cluster may not improve performance. For example, if we have a large number of reads or writes to a single row in a table, all of those reads or writes will go to the same node in the cluster; as a result, adding nodes will not improve performance. In contrast, if reads and writes are evenly distributed across the rows in the table, adding nodes will generally improve performance.
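Row key design is usually where this shows up. The hypothetical example below, using the google-cloud-bigtable data client with placeholder IDs, table and column-family names, contrasts a key led by a timestamp (which pushes every write to the same part of the key space) with a key led by a well-distributed device ID:

```python
# A hedged illustration of row-key design; all IDs and names are placeholders.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("events")

def write_event(device_id: str, payload: bytes):
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S")

    # Hotspot-prone: every write lands at the "end" of the key space,
    # so a single node absorbs all of the traffic.
    # row_key = f"{ts}#{device_id}".encode()

    # Better: leading with the device ID spreads writes across the cluster.
    row_key = f"{device_id}#{ts}".encode()

    row = table.direct_row(row_key)
    row.set_cell("data", "payload", payload)
    row.commit()
```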

When we want to see whether reads and writes are balanced, we can take a look at the monitoring UI. We can detect this situation in the monitoring tool by keeping an eye on the “CPU utilization of the hottest node” chart. If the hottest node is frequently above the recommended value, even when our average CPU utilization is reasonable, we might be accessing a small part of our data much more frequently than the rest of our data. One fix is to go back and check the schema design and access patterns to make sure they support a more even distribution of reads and writes across each table (see the sketch after the list below). When we really need to dig into the specifics of the performance, we can also use Key Visualizer for Cloud Bigtable to understand the following issues:

  • Check whether your reads or writes are creating hotspots on specific rows
  • Find rows that contain too much data
  • Look at whether the access patterns are balanced across all of the rows in a table
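To watch the hottest node programmatically, the same Cloud Monitoring approach used earlier applies. The sketch below assumes the google-cloud-monitoring library and a placeholder project ID, and filters on the per-cluster hottest-node CPU metric:

```python
# A minimal sketch that reads the hottest-node CPU metric for the last hour
# and flags clusters above the 90% guidance. The project ID is a placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/cpu_load_hottest_node"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    hottest = max(p.value.double_value for p in series.points)
    if hottest > 0.90:
        print(f"{series.resource.labels['cluster']}: hottest node peaked at {hottest:.0%}")
```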
