Presto On Azure
Presto is a distributed SQL query engine designed to query vast amounts of data efficiently through distributed execution. Thanks to its distributed in-memory processing architecture, it has become one of the fastest-growing SQL query engines.
Like other data processing engines, Presto can be deployed on multiple platforms. It fits particularly well with systems that decouple compute from data, which is the main reason it works well in the cloud.
In this article, we will do three things:
· Set up a 3-node autoscaled Presto cluster on Azure.
· Configure the Hive catalog to access multiple types of Azure storage.
· Look at other available options for creating a Presto cluster on Azure.
Running Presto on Azure VMs
Here is what we are going to do to set up the cluster:
1. Spin up 2 VMs and set up a Presto coordinator and a Presto worker.
2. Create an image from the Presto worker VM set up in the previous step.
3. Create a scale set for Presto workers using the Presto worker VM image.
4. Configure the auto-scaling policy.
5. Validate the setup.
Let’s get 2 VMs created for Presto setup
First we need 2 Azure VMs. We will install Java on both and then set up the Presto coordinator on one and a Presto worker on the other.
You can find the steps to set up azure-cli here.
You can use the commands below to spin up the Azure VMs:
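As a rough sketch of this step, the script below creates a resource group and two Ubuntu VMs with azure-cli. The resource group name, region, VM names, image alias, size and admin user are placeholders I picked for illustration; adjust them to your environment. The small run helper only prints each command unless DRY_RUN=0 is set, so you can review the commands first.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
# (The printed form is for review only; argument quoting is not preserved.)
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical names and sizes; replace with your own.
RG="${RG:-presto-rg}"
LOCATION="${LOCATION:-eastus}"
VM_IMAGE="${VM_IMAGE:-Ubuntu2204}"
VM_SIZE="${VM_SIZE:-Standard_D4s_v3}"

create_presto_vms() {
  run az group create --name "$RG" --location "$LOCATION"
  # One VM for the coordinator, one for the first worker.
  for vm in presto-coordinator presto-worker; do
    run az vm create \
      --resource-group "$RG" \
      --name "$vm" \
      --image "$VM_IMAGE" \
      --size "$VM_SIZE" \
      --admin-username azureuser \
      --generate-ssh-keys
  done
}

create_presto_vms
```

Run it once as-is to inspect the commands, then again with DRY_RUN=0 once the names and size look right.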
Both the Presto coordinator and the workers need at least Java 8 installed. I recommend using Java 11 for its better GC.
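To make the Java requirement easy to verify on each VM, here is a small sketch that parses a Java version string and checks it against the minimum (8) and the recommended version (11). The function names are my own, and the install command in the comment assumes an Ubuntu VM.

```shell
# On Ubuntu (an assumption), Java 11 could be installed with:
#   sudo apt-get update && sudo apt-get install -y openjdk-11-jdk-headless

# Extract the major version from strings such as "1.8.0_292" or "11.0.19".
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.x means Java 8
  esac
  printf '%s\n' "${v%%[._-]*}"
}

# Version reported by the local JVM (java -version writes to stderr).
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}

# Succeeds for Java 8 or newer and recommends 11 when older than that.
check_java() {
  major="$(java_major "$1")"
  case "$major" in
    ''|*[!0-9]*) echo "unknown Java version: $1"; return 2 ;;
  esac
  if [ "$major" -ge 11 ]; then
    echo "ok: Java $major"
  elif [ "$major" -ge 8 ]; then
    echo "ok (minimum): Java $major, consider Java 11"
  else
    echo "too old: Java $major"
    return 1
  fi
}
```

On a VM you would call it as check_java "$(installed_java_version)".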
Presto coordinator and worker setup
Follow the steps below to set up the Presto coordinator on a VM:
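A minimal sketch of the coordinator configuration is shown below. The property names are standard Presto deployment settings, but the coordinator IP, the environment name and the node id are placeholders of mine, and JVM and memory settings are left out. The files are written to a local staging directory first so they can be reviewed before being copied into the Presto etc directory.

```shell
# Staging directory; copy the result into the Presto etc directory afterwards.
ETC_DIR="${ETC_DIR:-./presto-etc}"
# Hypothetical private IP of the coordinator VM.
COORDINATOR_IP="${COORDINATOR_IP:-10.0.0.4}"

write_coordinator_config() {
  mkdir -p "$ETC_DIR"
  # The environment name must be identical on every node of the cluster.
  cat > "$ETC_DIR/node.properties" <<EOF
node.environment=production
node.id=presto-coordinator
node.data-dir=/var/presto/data
EOF
  # The coordinator also runs the discovery service that workers register with.
  cat > "$ETC_DIR/config.properties" <<EOF
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://$COORDINATOR_IP:8080
EOF
}

write_coordinator_config
```

Setting node-scheduler.include-coordinator=false keeps query processing off the coordinator, which suits a cluster where workers scale independently.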
Follow the steps below to set up the Presto worker on a VM:
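The worker configuration can be sketched the same way. Again the coordinator IP and environment name are placeholders. One assumption worth calling out: because this VM will later be turned into an image, I derive node.id from the machine's host name so that each instance created from the image can get a distinct id when the function is re-run at boot.

```shell
# Staging directory; copy the result into the Presto etc directory afterwards.
ETC_DIR="${ETC_DIR:-./presto-etc}"
# Hypothetical private IP of the coordinator VM.
COORDINATOR_IP="${COORDINATOR_IP:-10.0.0.4}"

# $1 = node id (optional); defaults to this machine's host name.
write_worker_config() {
  node_id="${1:-$(uname -n)}"
  mkdir -p "$ETC_DIR"
  # The environment name must match the one used on the coordinator.
  cat > "$ETC_DIR/node.properties" <<EOF
node.environment=production
node.id=$node_id
node.data-dir=/var/presto/data
EOF
  # Workers do not run discovery; they only point at the coordinator.
  cat > "$ETC_DIR/config.properties" <<EOF
coordinator=false
http-server.http.port=8080
discovery.uri=http://$COORDINATOR_IP:8080
EOF
}

write_worker_config
```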
If you face any issue while starting the Presto coordinator or worker, you can find the logs at ‘/var/presto/data/var/log/server.log’ on the respective VM.
At this point, we are all set to check the Presto UI, which should show a worker count of 1. You can launch the Presto UI at http://{presto-coordinator-ip}:8080
Let’s prepare Presto worker image
Now that the Presto worker VM is set up, we will create a Presto worker image from it and use that image for the auto-scaled deployment.
First, log in to the Presto worker VM that we set up in the last step, then run the script below:
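One common way to prepare a Linux VM for capture, assuming it runs the Azure Linux agent (waagent), is to deprovision it. The sketch below shows that approach; treat it as an illustration rather than the only option. Deprovisioning is destructive (it removes machine-specific data and the provisioned user account), so the command is only printed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

prepare_worker_for_image() {
  # Run this on the worker VM itself, as the last step before capturing it.
  run sudo waagent -deprovision+user -force
}

prepare_worker_for_image
```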
Once the above script has run, the VM is ready, with cloud-init in place, for custom image creation. Enter exit to close the SSH client, then follow the steps in the script below to create the Presto worker image:
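The image capture itself can be sketched with three azure-cli commands: deallocate the VM, mark it as generalized, and create an image from it. The resource group, VM and image names are placeholders of mine. As before, commands are printed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical names; replace with your own.
RG="${RG:-presto-rg}"
WORKER_VM="${WORKER_VM:-presto-worker}"
IMAGE_NAME="${IMAGE_NAME:-presto-worker-image}"

create_worker_image() {
  # The VM must be stopped and deallocated before it can be generalized.
  run az vm deallocate --resource-group "$RG" --name "$WORKER_VM"
  run az vm generalize --resource-group "$RG" --name "$WORKER_VM"
  # Capture the generalized VM as a reusable image.
  run az image create --resource-group "$RG" --name "$IMAGE_NAME" --source "$WORKER_VM"
}

create_worker_image
```

Note that a generalized VM can no longer be started, so the worker in the final cluster comes from the scale set, not from this VM.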
We are now ready to create our scale set for the cluster using the Presto worker image
In this step, we will create a scale set for our cluster, starting with a minimum of 2 nodes, and set up autoscale rules for it.
With this setup, the autoscaler scales out by 2 instances when the average CPU percentage across instances is greater than 75 over 5 minutes, and scales in by 1 instance when the average CPU percentage across instances is less than 30 over 5 minutes.
How you configure the autoscaling rules for your setup is up to you.
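The scale set and the two rules described above can be sketched with azure-cli as follows. The thresholds and the minimum of 2 instances come from this article; the resource names and the maximum of 10 workers are assumptions of mine. Commands are printed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
# (The printed form is for review only; argument quoting is not preserved.)
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical names and upper bound; replace with your own.
RG="${RG:-presto-rg}"
IMAGE_NAME="${IMAGE_NAME:-presto-worker-image}"
VMSS="${VMSS:-presto-workers}"
AUTOSCALE="${AUTOSCALE:-presto-autoscale}"
MAX_WORKERS="${MAX_WORKERS:-10}"

create_worker_scale_set() {
  # Scale set built from the worker image, starting with 2 instances.
  run az vmss create \
    --resource-group "$RG" --name "$VMSS" --image "$IMAGE_NAME" \
    --instance-count 2 --admin-username azureuser --generate-ssh-keys
  # Autoscale profile bounding the number of workers.
  run az monitor autoscale create \
    --resource-group "$RG" --resource "$VMSS" \
    --resource-type Microsoft.Compute/virtualMachineScaleSets \
    --name "$AUTOSCALE" --min-count 2 --max-count "$MAX_WORKERS" --count 2
  # Scale out by 2 when average CPU > 75% over 5 minutes.
  run az monitor autoscale rule create \
    --resource-group "$RG" --autoscale-name "$AUTOSCALE" \
    --condition "Percentage CPU > 75 avg 5m" --scale out 2
  # Scale in by 1 when average CPU < 30% over 5 minutes.
  run az monitor autoscale rule create \
    --resource-group "$RG" --autoscale-name "$AUTOSCALE" \
    --condition "Percentage CPU < 30 avg 5m" --scale in 1
}

create_worker_scale_set
```

Scaling out faster than scaling in is a deliberate choice here: adding capacity quickly protects running queries, while removing it slowly avoids flapping.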
We now have a 3-node Presto cluster (1 coordinator and 2 workers) ready, with autoscaling enabled based on CPU usage.
Validation
Let’s validate by opening the Presto UI at http://{presto-coordinator-ip}:8080
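If you prefer to validate from a terminal, the sketch below polls the coordinator's /v1/info REST endpoint with curl until it responds. The function names, retry count and 2-second delay are choices of mine, and the check only confirms that the coordinator is reachable; the worker count is still easiest to read from the UI.

```shell
# Build the info URL for a coordinator host ($1) and optional port ($2).
presto_info_url() {
  printf 'http://%s:%s/v1/info\n' "$1" "${2:-8080}"
}

# Poll until the coordinator answers or the attempts ($2, default 30) run out.
wait_for_presto() {
  url="$(presto_info_url "$1")"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "Presto is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "Presto did not respond at $url" >&2
  return 1
}
```

For example, wait_for_presto {presto-coordinator-ip} waits up to about a minute (longer if requests time out) for the coordinator to come up.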
Prepare configs for Hive catalog in Presto
This section assumes you already have an Azure storage account and that an Azure registered application has been given access to it.
If you require only read access to the storage, assign the Storage Blob Data Reader role to the Azure registered application under Access control (IAM).
If you require both read and write access to the storage, assign the Storage Blob Data Owner role to the Azure registered application under Access control (IAM).
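The same role assignment can also be done from azure-cli instead of the portal. The sketch below maps "read" and "write" to the two roles named above; the function name is mine, and the application ID and storage account resource ID are values you supply. The command is printed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
# (The printed form is for review only; argument quoting is not preserved.)
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# $1 = access mode (read|write)
# $2 = application (client) ID
# $3 = resource ID of the storage account
grant_storage_access() {
  case "$1" in
    read)  role="Storage Blob Data Reader" ;;
    write) role="Storage Blob Data Owner" ;;
    *) echo "usage: grant_storage_access read|write APP_ID SCOPE" >&2; return 2 ;;
  esac
  run az role assignment create --assignee "$2" --role "$role" --scope "$3"
}
```

A call would look like grant_storage_access read {client-id} /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-account}, with the placeholders replaced by your own values.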
The next step is to configure the Hive catalog for the Presto cluster. Keep the following values from your Azure registered application handy:
· Application (client) ID
· Client secret, which can be generated in the app
· OAuth 2.0 token endpoint (v1)
The placeholders {Oauth2.0-endpoint}, {client-id} and {client-secret} in azure-site.xml must be replaced with your app information before use.
Place your configured azure-site.xml in the Presto installation directory (the default location is /usr/lib/presto/etc/) on all the hosts (Presto coordinator and workers).
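For illustration, here is a sketch that writes an azure-site.xml template using the three placeholders above. It assumes ADLS Gen2 accessed through the Hadoop ABFS driver with OAuth client credentials; other Azure storage types use different property names, so treat this as one possible variant. The file is written to a local staging directory of my choosing.

```shell
# Staging directory; copy the result into the Presto etc directory afterwards.
ETC_DIR="${ETC_DIR:-./presto-etc}"

write_azure_site_template() {
  mkdir -p "$ETC_DIR"
  # The quoted 'EOF' disables shell expansion inside the template.
  cat > "$ETC_DIR/azure-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>{Oauth2.0-endpoint}</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>{client-id}</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>{client-secret}</value>
  </property>
</configuration>
EOF
}

write_azure_site_template
```

Since the file holds a client secret, it is worth restricting its permissions to the user that runs Presto.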
Configuring Hive catalog
It’s time to get your hive catalog configured for accessing tables created on Azure storage through Presto. This will require a cluster restart.
Don’t forget to replace placeholders with your configurations.
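A minimal sketch of the Hive catalog file follows. It assumes a Hive metastore reachable over Thrift on the default port 9083; {metastore-host} is a placeholder I introduced, and the azure-site.xml path uses the default installation directory mentioned earlier. The file is staged locally and belongs in the catalog subdirectory of the Presto etc directory on every node.

```shell
# Staging directory; copy the result into the Presto etc directory afterwards.
ETC_DIR="${ETC_DIR:-./presto-etc}"

write_hive_catalog_template() {
  mkdir -p "$ETC_DIR/catalog"
  # The quoted 'EOF' disables shell expansion inside the template.
  cat > "$ETC_DIR/catalog/hive.properties" <<'EOF'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://{metastore-host}:9083
hive.config.resources=/usr/lib/presto/etc/azure-site.xml
EOF
}

write_hive_catalog_template
```

The catalog name seen in queries is taken from the file name, so hive.properties becomes the hive catalog.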
Finally, time to query through Presto
Presto comes bundled with Presto-CLI, which provides a terminal-based interactive shell for running queries. Log in to the coordinator, install Presto-CLI if it is not already available, and start it by running the command ‘presto’. You are then all set to query through Presto.
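Besides the interactive shell, the CLI can also run a single statement non-interactively. The sketch below wraps that in a small helper; the helper name, the default server address and the default schema are assumptions of mine. The command is printed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
# (The printed form is for review only; argument quoting is not preserved.)
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# $1 = SQL statement to run against the hive catalog.
presto_query() {
  run presto \
    --server "${PRESTO_SERVER:-localhost:8080}" \
    --catalog hive \
    --schema "${PRESTO_SCHEMA:-default}" \
    --execute "$1"
}

presto_query "SHOW TABLES"
```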
Access to Presto is not limited to Presto-CLI. You can also connect from BI and SQL tools such as Superset, DBeaver, Looker, Tableau and Power BI, among others.
Other available options for Presto setup on Azure
It is also possible to run Presto with Azure HDInsight: you can run a custom Script Action on an existing or new HDInsight Hadoop cluster (version 3.5 or above) with your bash script URI.
Running Presto with Azure HDInsight works, but Presto is already a resource-intensive process, and sharing its resources with HDInsight processes is not efficient. If you want to manage resources effectively for Presto, I strongly recommend setting up Presto on plain Azure VMs.