Presto on Google Cloud Platform
Presto is one of the fastest-growing distributed SQL query engines, and its architecture resembles that of a classic massively parallel processing (MPP) database management system.
Presto can be deployed on many different platforms and in many locations. Whether in the cloud or on-premises, the technology is platform agnostic. Combining Presto with the cloud computing services provided by GCP lets the cluster grow with demand, so you don’t run short of resources while you keep extracting useful insights from your data.
This article walks through setting up an automated, autoscaling 3-node Presto cluster on Google Cloud Platform. We’ll be making use of Google Compute Engine, Managed Instance Groups, and the Autoscaler.
Disclaimer: The Presto configuration used here is not suitable for a production setup by any means. Follow the article for a quick development environment setup.
Environment
- We need 3 virtual machines, one for Presto coordinator and two for Presto workers
- Set up Presto on all the machines
- Create a Presto worker image and template
- Create a Presto worker Managed Instance Group
- Set the autoscaling policy
- Validate the setup
Let’s start with 3 VMs
We need at least 3 virtual machines: one for the Presto coordinator and 2 for the Presto workers.
Install Java
You need to install Java on both the coordinator and the workers. Presto needs at least Java 8, though I recommend Java 11 for better GC.
How do you install Java? There are tons of articles on the internet about setting up Java. Follow one and get it installed.
Next, Configure Presto Coordinator
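As a rough sketch of what a development coordinator's etc/config.properties could contain, the Python below renders the file from a dictionary. The property names are standard Presto deployment settings; the memory limits and the `localhost` discovery URI are assumed development values, not the ones used for this cluster.

```python
def render_properties(settings):
    """Render a dict as Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Assumed development values; size the memory limits to your VM.
coordinator_config = {
    "coordinator": "true",
    # Keep query work off the coordinator so it only plans and schedules.
    "node-scheduler.include-coordinator": "false",
    "http-server.http.port": "8080",
    "query.max-memory": "2GB",
    "query.max-memory-per-node": "1GB",
    # The coordinator runs the embedded discovery service that workers register with.
    "discovery-server.enabled": "true",
    "discovery.uri": "http://localhost:8080",
}

print(render_properties(coordinator_config))
```

Writing the rendered text to etc/config.properties under the Presto installation directory is left out on purpose, since the install path depends on how you unpacked Presto.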
Start the Coordinator
At this point, you are all set with your Presto coordinator. Run the Presto launcher script and check the Presto UI at http://{coordinator_ip}:8080
Hey! Aren’t we missing something?
As you can see in the Presto UI, the Active Workers count is 0. We have not yet set up the Presto workers! Let’s set them up next.
Steps 1 and 2 are the same as coordinator setup instructions. Step 3 is what makes workers different from the coordinator.
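To make the difference concrete, here is a minimal sketch of a worker's configuration, again generated with Python. A worker sets `coordinator=false`, does not enable the discovery server, and points `discovery.uri` at the coordinator. The IP address `10.128.0.2`, the memory limits, the environment name, and the data directory are all hypothetical placeholders. Because every worker in the instance group boots from the same image, the sketch generates a fresh `node.id` each time it runs instead of baking one into the image.

```python
import uuid


def worker_config_text(coordinator_ip, port=8080):
    """Return etc/config.properties text for a worker (assumed dev values)."""
    lines = [
        "coordinator=false",
        f"http-server.http.port={port}",
        "query.max-memory=2GB",
        "query.max-memory-per-node=1GB",
        # Workers announce themselves to the coordinator's discovery service.
        f"discovery.uri=http://{coordinator_ip}:{port}",
    ]
    return "\n".join(lines) + "\n"


def node_properties_text(environment="development", data_dir="/var/presto/data"):
    """Return etc/node.properties text with a freshly generated node.id."""
    lines = [
        f"node.environment={environment}",
        f"node.id={uuid.uuid4()}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical internal IP of the coordinator VM.
print(worker_config_text("10.128.0.2"))
print(node_properties_text())
```

One way to use this is to run it from the instance's startup script, so each VM created from the template writes its own files before the Presto launcher starts.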
Next, Create a Managed Instance Group for Workers
At this point, if you refresh the Presto UI, you’ll see the Active Workers count go up as the workers register with the coordinator!
Voila! We have a Presto cluster ready.
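If you prefer to validate the setup from a script rather than the UI, a sketch like the following could read the same number. It assumes the coordinator exposes cluster statistics as JSON at `/v1/cluster` with an `activeWorkers` field; that holds for the Presto releases I'm aware of, but treat the endpoint and field name as assumptions and check them against your version.

```python
import json
from urllib.request import urlopen


def parse_active_workers(payload):
    """Extract the active worker count from a cluster-stats JSON document."""
    stats = json.loads(payload)
    return int(stats["activeWorkers"])


def fetch_active_workers(coordinator_ip, port=8080, timeout=5):
    """Query the coordinator (assumed /v1/cluster endpoint) for active workers."""
    url = f"http://{coordinator_ip}:{port}/v1/cluster"
    with urlopen(url, timeout=timeout) as response:
        return parse_active_workers(response.read().decode("utf-8"))


# Offline demonstration with a made-up payload; no network call is made here.
sample_payload = '{"runningQueries": 0, "queuedQueries": 0, "activeWorkers": 2}'
print(parse_active_workers(sample_payload))
```

Calling `fetch_active_workers("{coordinator_ip}")` with your coordinator's address should return 2 once both workers have registered.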
Let’s continue playing around and set auto-scaling
Autoscaling helps your applications gracefully handle increases in traffic and reduce costs when the need for resources is lower.
Per the autoscaling policy, GCP will add up to 10 instances to your instance group when there is more load (upscaling) and delete instances when the need for them drops (downscaling).
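To get a feel for how a CPU-based policy behaves, here is a simplified model of the sizing decision, not GCP's actual algorithm. It assumes the autoscaler tries to bring average CPU utilization back to a target, that 10 is the group's maximum size, that 2 workers is the minimum, and that the target is 60%; the target and the minimum are my assumptions, not values from the policy.

```python
def recommended_size(current_size, avg_cpu_percent,
                     target_percent=60, min_size=2, max_size=10):
    """Estimate the group size that would bring average CPU back to the target.

    Simplified model: total load is current_size * avg_cpu_percent, and the
    group needs enough instances for each to sit at or below target_percent.
    """
    if current_size <= 0:
        return min_size
    total_load = current_size * avg_cpu_percent
    # Ceiling division on integers avoids floating-point rounding surprises.
    needed = -(-total_load // target_percent)
    return max(min_size, min(max_size, needed))


for size, cpu in [(2, 90), (8, 95), (4, 15)]:
    print(size, cpu, recommended_size(size, cpu))
```

The real autoscaler also applies stabilization and initialization periods before it removes or counts instances, so expect scale-in to lag behind what this arithmetic suggests.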
What about Presto CLI?
You can use the Presto CLI to query data from Presto. We have configured the JMX catalog, so you can query JMX metrics.
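As an illustration, the sketch below assembles a Presto CLI invocation for a JMX query. It assumes the CLI executable is on your PATH under the name `presto` and that the coordinator listens on `localhost:8080`; both are placeholders. The `--server`, `--catalog`, `--schema`, and `--execute` options and the `java.lang:type=runtime` table in the JMX catalog's `current` schema come from the Presto documentation.

```python
import shlex


def presto_cli_command(server, catalog, schema, query, executable="presto"):
    """Build the argument list for a one-shot Presto CLI query."""
    return [
        executable,
        "--server", server,
        "--catalog", catalog,
        "--schema", schema,
        "--execute", query,
    ]


# JVM name and version of every node, read through the JMX connector.
jmx_query = 'SELECT node, vmname, vmversion FROM "java.lang:type=runtime"'
command = presto_cli_command("localhost:8080", "jmx", "current", jmx_query)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```

Passing `command` to `subprocess.run` would execute the query directly; building the list first keeps the quoting of the table name out of the shell's hands.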
If you want to connect to Hive, which you most probably will, go ahead and configure a Hive catalog and restart the cluster.
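A Hive catalog is a small properties file under etc/catalog on every node. The sketch below renders a minimal one, assuming the `hive-hadoop2` connector and a Hive metastore reachable over Thrift; the host name `metastore.example.internal` is a hypothetical placeholder, and 9083 is the metastore's conventional default port.

```python
def hive_catalog_properties(metastore_host, metastore_port=9083):
    """Return minimal etc/catalog/hive.properties text for the Hive connector."""
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metastore host; replace it with your own.
print(hive_catalog_properties("metastore.example.internal"))
```

The file name determines the catalog name, so saving this as hive.properties makes the tables available under the `hive` catalog after the restart.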
Now that the Hive catalog is configured, let’s query Hive data from the Presto CLI.
CLI is not the only option!
Presto gives you the freedom to use the BI tools that work best for you. It could be Superset, Tableau, Power BI, Zeppelin, Jupyter notebook, and the list goes on...
Well, this is it. You now have an autoscaling-enabled Presto cluster set up on GCP for all your analytical needs.
Stay tuned for more insights into Presto on GCP, as I am planning to write more such posts.