Custom Downscaler For Trino Cluster on Google Cloud

Ayush Bilala
Walmart Global Tech Blog
5 min read · Apr 19, 2021
Source: Trino, Google Cloud

Overview

We started our journey with Trino back in 2018 with a handful of users and the primary objective of serving only ad-hoc queries. We were able to get 60x performance improvements compared to Hive, and this is when Trino's adoption increased within Walmart. Different teams switched to Trino for different use-cases, and they were all happy.

As of today, thousands of dashboards, built on several BI tools, are powered by Trino, and over 2K active users run more than a million queries every month, processing several petabytes of data daily on this platform.

Our Trino cluster is on GCP and used to run in Autoscale mode, which automatically adds or deletes virtual machine (VM) instances from a managed instance group based on an increase or decrease in cluster load. This also helped us reduce costs when the need for resources was lower. The only concerning downside of using GCP's auto-scaler is the failure of tasks running on a host it terminates.

Though the majority of our SQL queries finish well within 5–10 seconds, certain long-running queries run for over 45 minutes. These long-running queries were the ones hit hardest when a virtual machine was terminated by the GCP auto-scaler, and query failures because of this were frequent.

We wanted to offer a much more reliable and pleasant experience to our platform users, meaning:

  • Zero query failure during cluster downscale
  • Minimal impact on query performance

At the same time, we wanted to make sure the Trino cluster is well utilized.

Problem of Autoscaling the Trino Cluster

There isn't any issue with the Trino cluster scaling up, but there certainly is one when running queries fail while the cluster scales down.

Even with a graceful shutdown of the Trino worker during scale-down, a query can fail if it is launched just before shutdown and takes longer to run than the 90-second grace period. Long-running queries are hit the hardest.
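The grace period is controlled by Trino's `shutdown.grace-period` property on each worker. A minimal worker `config.properties` sketch (the port and values shown are illustrative, not our production settings):

```properties
# etc/config.properties on a Trino worker (illustrative values)
coordinator=false
http-server.http.port=8081
# How long the worker waits after receiving a graceful-shutdown
# signal before it starts draining; queries launched just before
# shutdown must finish within this window or fail
shutdown.grace-period=90s
```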

Say a user submits a SQL query, it runs for 45 minutes, and it fails just because the cluster downscaled! The only option left to the user is to retry the query and wait another 45 minutes, hoping it runs successfully this time.

This adds a bad taste to the entire user experience and certainly creates a lot of noise in any organization.

In a nutshell, below are the two major concerns that we wanted to address:

  • Random worker node termination by GCP auto-scaler leading to frequent query failures
  • Wasted cluster resources when a query that has been running for, say, over 45 minutes fails because one of its nodes was taken away by the GCP auto-scaler

To avoid all these issues, we decided to manage cluster downscaling ourselves and let GCP take care of only scaling up the cluster.

The Birth of Custom Down-Scaler

Different strategies that we adopted to resolve the cluster down-scale issue

When we first saw this issue, we switched from an Autoscaling cluster to a static cluster. We used to tweak the cluster bounds twice a day, at specific times, based on the cluster load. But this was only an interim approach, and obviously not an efficient or cost-effective one. Our users were happy as they stopped seeing the PAGE_TRANSPORT_ERROR, REMOTE_HOST_GONE, and TOO_MANY_REQUESTS_FAILED errors, and the cluster was in a much more reliable state.

As a next step, we switched to running the cluster in Autoscale Only UP mode, where GCP takes care of only scaling up the cluster. We used to resize the cluster forcefully once every day to make sure we did not waste cluster resources during off-peak hours.

Even this wasn't the best of solutions: the cluster used to scale up really fast whenever there was a sudden load, but only got downscaled once, during off-peak hours, meaning it used to run at the upper bound for at least 10–12 hours even after the load had reduced.

This is when we decided to write a custom down-scaler. We allow GCP to only scale up the instances based on CPU utilization; scaling down is taken care of by us, by gracefully shutting down the Trino workers.

How does the Custom Down-scaler work?

The custom down-scaler looks at various metrics, such as overall CPU utilization, memory utilization, running query count, and queued query count, to decide whether a downscale operation can be triggered.
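A decision of this kind boils down to a threshold check over those metrics. A minimal sketch in Python; the thresholds below are illustrative assumptions, not the actual production values:

```python
# Sketch of the downscale decision. All thresholds here are
# illustrative assumptions, not Walmart's production values.

def can_downscale(cpu_util: float, mem_util: float,
                  running_queries: int, queued_queries: int) -> bool:
    """Return True when the cluster looks under-utilized enough
    to safely shrink: low CPU/memory, nothing queued, light load."""
    return (
        cpu_util < 0.40           # overall CPU utilization below 40%
        and mem_util < 0.50       # overall memory utilization below 50%
        and queued_queries == 0   # never shrink while queries are waiting
        and running_queries < 20  # only light query traffic
    )
```

For example, `can_downscale(0.25, 0.30, 5, 0)` would trigger a downscale, while any queued query blocks it.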

As soon as the custom down-scaler sees an opportunity to scale down the cluster, it sends a shutdown signal to the Trino workers.

The Trino Worker then enters a special state in which it

  1. stops serving new requests,
  2. continues processing the query tasks already scheduled on it, and
  3. shuts down after finishing that work.
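The shutdown signal itself is Trino's graceful-shutdown endpoint: a `PUT` of the string `"SHUTTING_DOWN"` to `/v1/info/state` on the worker. A standard-library-only sketch; the port and the user header value are assumptions:

```python
import json
import urllib.request

def build_shutdown_request(worker_host: str,
                           port: int = 8081) -> urllib.request.Request:
    """Build the PUT request that asks a Trino worker to drain and
    shut down gracefully. Port 8081 and the user are assumptions."""
    return urllib.request.Request(
        url=f"http://{worker_host}:{port}/v1/info/state",
        data=json.dumps("SHUTTING_DOWN").encode(),  # JSON string body
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Trino-User": "admin",  # a user allowed to shut workers down
        },
    )

def send_shutdown(worker_host: str) -> None:
    # Actually deliver the signal to the worker
    urllib.request.urlopen(build_shutdown_request(worker_host))
```

Building the request separately from sending it keeps the logic testable without a live cluster.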

The custom down-scaler continuously polls the running task count on the draining Trino workers. As soon as the count drops to zero, the custom down-scaler marks the virtual machine for termination.
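One way to obtain those counts is to query the coordinator's `system.runtime.tasks` table and aggregate running tasks per node. The termination decision itself is a pure step; a sketch (the worker names are hypothetical):

```python
def safe_to_terminate(draining: set[str],
                      task_counts: dict[str, int]) -> set[str]:
    """Given the set of workers asked to drain and the latest
    running-task count per worker, return those with no tasks left.
    A draining worker missing from task_counts is treated as done."""
    return {w for w in draining if task_counts.get(w, 0) == 0}
```

Each poll cycle, the workers this returns can be deleted from the managed instance group (e.g. via `gcloud compute instance-groups managed delete-instances`), while the rest keep draining.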

This is not the end of the Story

Along with the custom down-scaler running 24/7, we have also adopted a scheduled down-scaler, since our Trino cluster load drops significantly during off-peak hours.

Say the cluster is running at full capacity during peak business hours and the load then drops; the custom down-scaler will take a few hours to downscale the cluster to a smaller, appropriate size. Rather than relying only on the custom down-scaler, the scheduled down-scaler performs a forceful resize of the Trino cluster by resizing the GCP instance group.

We make sure that when the resize operation is performed, there are no queries running on the cluster, or only a handful, so that the impact isn't huge.
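The forceful resize is a single managed-instance-group operation. A sketch that shells out to `gcloud` (the group and zone names are placeholders, and running it requires an authenticated gcloud installation):

```python
import subprocess

def resize_mig(group: str, target_size: int, zone: str,
               dry_run: bool = False) -> list[str]:
    """Resize a GCP managed instance group to target_size.
    Returns the gcloud command; executes it unless dry_run is set."""
    cmd = [
        "gcloud", "compute", "instance-groups", "managed", "resize",
        group, f"--size={target_size}", f"--zone={zone}",
    ]
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs authenticated gcloud
    return cmd

# e.g. resize_mig("trino-workers", 10, "us-central1-a")
```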

How did it help?

Well, we managed to reduce our monthly spend on Trino infrastructure by almost 50% without compromising on user experience, and the cluster is well utilized almost all the time.

Source: GCP Trino Cluster CPU Utilization (Before Custom Downscaler)
Source: GCP Trino Cluster CPU Utilization (After Custom Downscaler)

The screenshots above depict cluster utilization before and after the introduction of the custom down-scaler. Before, utilization was very inconsistent, at times even below 30%. After the custom down-scaler came into the picture, it improved a lot and is consistently around 80% almost all the time.

We no longer see any query failures during cluster downscaling.

Happy business, Happy we!
