Trino on Google Kubernetes Engine | Big Data Analytics at Scale

Sharanya Aithal
NoBroker Engineering
6 min read · Apr 1, 2022

NoBroker.com is the largest C2C real estate marketplace in India. Every month, we facilitate around 2.7 million connections and save up to ₹570 million in brokerage. NoBroker.com, along with NoBroker Home-services & NoBrokerHood, is an all-encompassing solution for all housing needs.

The products and businesses we operate rely heavily on technology and data. All our business functions run on technology products, most of them consumer facing. With the numerous products and business operations we have, there is a massive amount of data flowing in at great volume and in a wide variety. To keep up with this growth, the ability to aggregate data rapidly and accurately becomes even more important.

Starship is what we call our entire ecosystem of data management, warehousing, ETLs, pipelines, data lake, etc. An integral part of Starship is enabling business and product functions to use the data lake for reporting, BI, ad-hoc querying and KPI tracking.

It is important that we provide the necessary analytics experience for non-technical folks, in terms of scale, speed and flexibility of reporting, so they can actively participate in monitoring and innovating on the data. For this, we adopted Metabase, interfacing with our data lake in GCS via Trino (formerly PrestoSQL). Metabase comes with a neat interface and interactive query builders that are easy to work with. However, to facilitate real big data analytics, it is important to set up the underlying execution engine (Trino, in this case) to account for the scale. While it is possible to have very powerful querying capabilities with Trino, there is always a cost vs. performance trade-off. In this blog, we will discuss how we have configured our Trino infrastructure and resources to be scalable and optimal for our use.

Why Trino ?

At NoBroker.com, we embrace open source. Trino is a very actively developed open source query engine for big data analytics. It is highly parallel and distributed, which makes it popular for low latency analytics, and it is extremely scalable. Some of the largest organizations have adopted Trino for querying exabyte-scale data lakes. It exposes standard ANSI SQL, which makes it easy for analysts and anyone with SQL knowledge to work with. It is also supported by a variety of BI tools, including Tableau, Power BI & Metabase.

All of this, along with its vast community, made Trino an easy choice for us.

Why Trino on Kubernetes ?

Trino is designed to scale to any size of data, and Kubernetes to any size of resources. Running Trino on Kubernetes helps us orchestrate stateless Trino workloads conveniently at scale. Autoscaling capabilities, along with cost-effective resource options like preemptible VMs, give us not only the flexibility to scale massively but also a huge cost advantage.

Deployment

At NoBroker, we have almost everything deployed on Kubernetes using GKE (Google Kubernetes Engine). But it is a tedious task to write and maintain Kubernetes YAML files for all the essential Kubernetes objects. Deploying with a Helm chart has simplified this process and reduced the time taken to onboard a new setup. A Helm chart is a collection of files that describe a related set of Kubernetes resources like deployment, configmap, HPA, service, etc.

A Trino cluster can easily be deployed into Kubernetes with the help of the trinodb Helm chart.
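If you do not have the chart files locally yet, one way to fetch them is from the upstream trinodb charts repository. The repository URL and chart name below are the upstream defaults at the time of writing; adjust them if yours differ.

helm repo add trino https://trinodb.github.io/charts
helm pull trino/trino --untar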

Trino chart structure

  • Chart.yaml
  • values.yaml
  • templates

Chart.yaml
A YAML file containing information about the chart.

values.yaml
The values.yaml file contains the default configuration values for the chart, which are consumed by the templates: the Trino Docker image and tag, worker count with autoscaling details, common properties for the coordinator and workers (config.properties, jvm.config, node.properties, password-authenticator.properties), resources, node selectors, etc. A trimmed example is shown after the chart structure below.

templates
A directory of templates that, when combined with values, will generate valid Kubernetes manifest files.
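For illustration, here is a trimmed values.yaml sketch for a small cluster. The key names follow the upstream trinodb chart as we use it and may differ between chart versions, and the numbers are placeholders rather than our production values.

image:
  repository: trinodb/trino
  tag: "375"              # pin the Trino release you have tested
server:
  workers: 3              # baseline worker count
  autoscaling:
    enabled: true
    maxReplicas: 10
    targetCPUUtilizationPercentage: 60
worker:
  resources:
    requests:
      cpu: 2
      memory: 8Gi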

After updating the chart files according to our requirements, and assuming trino is the directory where the chart files are kept, we are good to deploy Trino on Kubernetes in a single step.

helm install trino `pwd`/trino -n trino --create-namespace

To apply changes to the chart files after the deployment, run the command below.

helm upgrade trino `pwd`/trino -n trino

And that's it!

Autoscaling

In a Trino cluster, it is really challenging to predict the number of worker nodes required to run our queries. Queries are not all the same, and deciding on the number of worker nodes in advance is not always possible. This is one of the reasons we chose Kubernetes, which can scale infrastructure based on our requirements. We use cluster autoscaling on the node pool and Horizontal Pod Autoscaling (HPA) on our Trino worker deployment, which automatically allocates more pods and resources when the load is high and removes them when there are fewer requests, based on CPU usage, memory consumption or any external metric. This helps our Trino cluster answer queries more efficiently without using unnecessary resources.
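As an illustration, a bare-bones HPA for the worker deployment could look like the following. The deployment name and namespace are assumptions based on our release name, and older clusters may need the autoscaling/v2beta2 API version instead.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trino-worker
  namespace: trino
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trino-worker          # worker deployment created by the chart
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # scale out when average CPU crosses 60%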

Cost cutting

Having a Trino setup with multiple workers running on a high-resource node pool adds directly to our cost. This is where preemptible VMs came in handy. We run our clusters on preemptible VMs, since they are priced up to 70% lower than regular standard VMs, and they work well for us. Autoscaling is another contributor to cost reduction without compromising on performance; in fact, it enables our workloads to deliver better. Enabling autoscaling on our node pools and workloads helps add and remove a node or pod as required and avoids wasting resources. Since traffic to our Trino cluster is lower at night than during the day, reducing the HPA minimum and maximum replica counts at night was important. For this, we use Kubernetes Schedule Scaler to scale deployments on a schedule and reduce the number of Trino workers running during off-peak hours.
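Going back to the preemptible node pools: the worker pods can be pinned to the cheaper nodes with a node selector along these lines. The cloud.google.com/gke-preemptible label is applied by GKE automatically; where exactly the nodeSelector goes in the chart values depends on the chart version, and the toleration is only needed if you taint the pool yourself.

nodeSelector:
  cloud.google.com/gke-preemptible: "true"
tolerations:
  - key: cloud.google.com/gke-preemptible
    operator: Equal
    value: "true"
    effect: NoSchedule          # only if the preemptible pool is tainted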

File-based access control

By default, the Trino coordinator allows any user to run queries against any catalog, schema or table. To restrict this, we use the file-based access control plugin, which allows us to specify authorization rules, defining access to data and operations, in a JSON file.

For example, suppose we want the admin user to access every catalog and schema with all privileges on tables, user A to access only schema Z with SELECT privileges on its tables, user B to access schemas Y and X with SELECT and INSERT privileges, and all other access to be denied. We can use the following rules:

{
  "catalogs": [
    {
      "user": "admin",
      "catalog": ".*",
      "allow": "all"
    },
    {
      "user": "A",
      "catalog": "Z",
      "allow": "all"
    },
    {
      "user": "B",
      "catalog": "(Y|X)",
      "allow": "all"
    }
  ],
  "schemas": [
    {
      "user": "admin",
      "schema": ".*",
      "owner": true
    },
    {
      "user": "A",
      "schema": "Z",
      "owner": true
    },
    {
      "user": "B",
      "schema": "(Y|X)",
      "owner": true
    }
  ],
  "tables": [
    {
      "user": "admin",
      "schema": ".*",
      "privileges": ["SELECT", "INSERT", "DELETE", "OWNERSHIP"]
    },
    {
      "user": "A",
      "schema": "Z",
      "privileges": ["SELECT"]
    },
    {
      "user": "B",
      "schema": "(Y|X)",
      "privileges": ["SELECT", "INSERT"]
    }
  ]
}
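To activate these rules, the coordinator needs an access-control.properties file pointing at the JSON rules file. The property names below are the standard ones from the Trino documentation; the mount path is just an example and depends on how the rules file is mounted into the pod.

access-control.name=file
security.config-file=/etc/trino/rules.json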

Scale of queries

Not all of our data is organised in the data lake yet; it is an ongoing process. At present we have around 26 TB of compressed data in our data lake, which serves as an organised warehouse for convenient analytics. Currently, around 5,500 queries run on Trino per day on average, scanning around 2.6 TB of data per day. We expect this to scale up 5–6 times in the coming few months. With the setup we have configured, achieving that scale should be a piece of cake.

If you would like to know more about the kind of work we do, follow our engineering blog. If this kind of work excites you, we are actively hiring. Please reach out to our team for a referral. We will be happy to talk to you.
