Cost ShowBack For Trino using Machine Learning

Vijay Mehra
Walmart Global Tech Blog
6 min read · Jul 7, 2021
Source: Trino

Overview

We provide Trino as a multi-tenant managed platform service on cloud platforms like GCP and Azure, as well as on-premise, to our internal teams at Walmart. As of today, we process several petabytes of data across 1 million+ queries every month. Our customers use this service for various use cases like reporting and dashboards, ad hoc querying, federated querying, and many more.

Because our managed cluster is multi-tenant, it was a challenge for us to show back the cost of the service to customers based on their usage. We had also received multiple questions from customers about the cost associated with the service, to help them plan the budget for their use cases.

As part of the Cost ShowBack process, we wanted to show back the cost of our service to customers based on their usage, so we came up with a pay-per-query model where customers pay only for the queries they run.

Approach for Cost ShowBack

We narrowed it down to the two best approaches for our pay-per-query model.

1. Calculate cost based on the resource utilisation metrics of the query

We collect query metrics for every query submitted in our Trino cluster. This includes metrics like execution time, CPU time, memory usage and many more.

Limitation:

If the same query runs at different times of day, its resource-utilisation metrics (CPU, memory, etc.) will vary depending on cluster workload, network speed, and so on. This can result in different costs for the same query.

For example, we submitted a single query multiple times against the same data size at different intervals and observed some variation in CPU time but high variation in cumulative memory and execution time. CPU time is the cumulative CPU time of all the tasks for a query, whereas cumulative memory is the cumulative memory used throughout query processing.
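A quick way to see why CPU time is the more stable metric is to compare the spread of each metric across repeated runs. This is a minimal sketch with made-up measurements (the values are illustrative, not our actual numbers), using the coefficient of variation as the stability measure:

```python
import statistics

# Illustrative measurements from repeated runs of the same query at
# different times of day (values are made up for this sketch).
cpu_time_s  = [41.2, 43.0, 40.5, 44.1, 42.3]
exec_time_s = [65.0, 120.0, 80.0, 210.0, 95.0]

def cv(xs):
    """Coefficient of variation: standard deviation relative to the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

print(f"CPU time CV:       {cv(cpu_time_s):.1%}")
print(f"execution time CV: {cv(exec_time_s):.1%}")
```

In this toy sample, CPU time varies by only a few percent across runs while execution time varies by roughly half its mean, mirroring the pattern we observed.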

2. Calculate cost based on total bytes processed by the query

As the total bytes processed by a query is not affected by cluster workload, this metric is ideal for Cost ShowBack. Customers also relate to this metric more easily than to metrics like CPU or memory.

Limitation:

We would not be directly considering metrics like CPU and memory, which form the majority of the cost of a VM instance.

Final Approach

We finalised the second approach and overcame its limitation by using a machine learning model (polynomial regression) that treats total bytes processed as the independent variable and the cumulative CPU time of the query as the dependent variable.

Regression Model for Cost ShowBack

As we wanted to make sure CPU time and memory were considered in our Cost ShowBack, we started analysing our vast query-statistics data to figure out the right attributes for training our model.

Feature Engineering

Based on our understanding of the query attributes, we filtered out most of them and kept cumulative memory and cumulative CPU time.

We then computed the correlation of these attributes with total bytes processed to understand their relationships.

Correlation Chart

Based on the above chart, there is a high correlation (0.8) between CPU time and total bytes processed, but cumulative memory did not show any correlation with total bytes, so we considered only CPU time for our model. This is because cumulative memory is highly affected by cluster workload, as we saw before.
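The correlation check above can be sketched with pandas. The column names and values here are hypothetical stand-ins for our query-statistics attributes, chosen so that CPU time tracks bytes processed while cumulative memory does not:

```python
import pandas as pd

# Hypothetical query-statistics sample; column names and values are
# illustrative, not the actual Trino metric names or our real data.
stats = pd.DataFrame({
    "total_bytes_processed": [1e9, 5e9, 2e10, 8e10, 3e11],
    "cpu_time_ms":           [4e4, 2.1e5, 7.9e5, 3.4e6, 1.2e7],
    "cumulative_memory":     [3e8, 9e9, 2e8, 6e10, 1e9],
})

# Pearson correlation of each attribute with total bytes processed
corr = stats.corr()["total_bytes_processed"]
print(corr)
```

An attribute with high correlation to bytes processed (here, CPU time) is a good candidate to predict from bytes; one with near-zero correlation (here, cumulative memory) is dropped.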

We also plotted the average CPU time for queries of different complexities per GB of data processed, and observed an overall increasing trend in CPU time.

Model Overview

We used polynomial regression to develop our model, with a custom sampling technique for training and testing.

We created separate models for “SELECT” and “WRITE” queries, as we saw different patterns of resource utilisation. For the “WRITE” query model, we considered data written by the query in addition to data scanned.

We also added a few constraints to the model to fit our use case. These include forcing the cost curve to be increasing, as cost should rise with the number of bytes processed, and adjusting the intercept of the function to set a sensible minimum cost for a query.
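The overall shape of such a model can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not our actual model: the coefficients, sampling, and constraint handling are invented for the sketch, and the prediction floor stands in for the intercept adjustment described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic training data: GB processed vs cumulative CPU seconds.
# The quadratic relationship and noise level are illustrative only.
gb = rng.uniform(0.1, 500, 1000)
cpu_s = 0.5 + 0.04 * gb + 1e-5 * gb**2 + rng.normal(0, 0.5, 1000)

poly = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(poly.fit_transform(gb.reshape(-1, 1)), cpu_s)

def predict_cpu_seconds(gb_processed, floor=0.1):
    """Predict CPU seconds for a query; the floor stands in for the
    minimum-cost intercept adjustment (value is a placeholder)."""
    x = poly.transform(np.atleast_1d(gb_processed).reshape(-1, 1))
    return np.maximum(model.predict(x), floor)

print(predict_cpu_seconds([1, 100, 400]))
```

A separate model instance would be fitted for “WRITE” queries with bytes written as an additional feature.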

Accuracy

Every use case has its own way of measuring accuracy. For ours, it meant considering the cumulative CPU time of queries of different complexities executed against different datasets in our cluster.

We measured accuracy by comparing the overall CPU utilisation of all queries submitted in a day with the predicted CPU utilisation. We are able to achieve accuracy above 80%.
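The day-level comparison can be expressed in a few lines. The figures here are hypothetical per-query CPU values, just to show the accuracy calculation:

```python
import numpy as np

# Hypothetical daily figures: actual vs model-predicted CPU seconds per query.
actual_cpu    = np.array([120.0, 45.0, 300.0, 80.0, 600.0])
predicted_cpu = np.array([110.0, 50.0, 270.0, 95.0, 640.0])

# Day-level accuracy: how close the predicted total is to the actual total.
accuracy = 1 - abs(predicted_cpu.sum() - actual_cpu.sum()) / actual_cpu.sum()
print(f"day-level accuracy: {accuracy:.1%}")
```

Because over- and under-predictions on individual queries partially cancel out in the daily total, day-level accuracy is typically higher than per-query accuracy.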

We consistently monitor our accuracy at the day level, and we plan to update our model regularly to include new queries against old and new datasets. We have set up a recurring workflow on the Walmart machine learning platform to retrain the model periodically.

Converting Predicted CPU time to Cost

As overall Trino cluster utilisation varies daily, we calculate the cost at the end of each day by distributing the total cost of the cluster across all queries based on each query’s percent share of the predicted CPU time. Our focus is on an accurate show back that distributes costs to teams based on what is billed to the platform, rather than on keeping the cost per query constant.
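The end-of-day distribution is a straightforward proportional split. A minimal sketch, with a made-up daily bill and query names, might look like:

```python
# Hypothetical end-of-day show-back: distribute the day's cluster bill
# across queries in proportion to their predicted CPU time.
daily_cluster_cost = 1000.0  # placeholder dollars billed to the platform

predicted_cpu = {"team_a_q1": 500.0, "team_b_q1": 300.0, "team_b_q2": 200.0}
total_cpu = sum(predicted_cpu.values())

query_cost = {q: daily_cluster_cost * cpu / total_cpu
              for q, cpu in predicted_cpu.items()}
print(query_cost)
```

By construction the per-query costs always sum back to exactly what was billed for the cluster that day, which is the point of a show-back (as opposed to a fixed price list).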

How can users estimate the cost for running a query in our service?

We developed a cost-estimation application that helps users estimate the cost of running a query in our Trino cluster.

The app takes the following inputs from the user:

· Expected number of queries to be submitted to the cluster

· Expected total bytes read and total bytes written by the queries

As cluster utilisation varies daily, the cost of a query will vary accordingly, so we provide end users with estimation attributes for their inputs to help them plan the budget for their use case.
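A sketch of the estimator behind such an app, mirroring the two inputs above. The per-GB rates here are invented placeholders, not actual rates, and the function name is hypothetical:

```python
# Hypothetical cost-estimation helper; the per-GB rates are placeholders
# standing in for rates derived from historical cluster billing.
READ_RATE_PER_GB  = 0.002   # assumed $/GB read
WRITE_RATE_PER_GB = 0.005   # assumed $/GB written

def estimate_cost(num_queries, avg_gb_read, avg_gb_written=0.0):
    """Rough cost estimate for a planned workload."""
    per_query = (avg_gb_read * READ_RATE_PER_GB
                 + avg_gb_written * WRITE_RATE_PER_GB)
    return num_queries * per_query

print(f"${estimate_cost(10_000, avg_gb_read=5):.2f}")
```

In practice the app would present this as a range rather than a single number, since the realised cost depends on the day's total cluster utilisation.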

Cost Estimation in Dollars

Conclusion

The Cost ShowBack model helped us show back the infrastructure cost of our managed service to internal teams through a pay-per-query model. It also helps end users compare the cost of our service with other alternatives available in Walmart.
