Presto integration with HDInsight

Ankur
Aug 16 · 3 min read

Presto is a distributed SQL query engine that can query data from multiple sources such as Hadoop, S3, Azure Blob Storage, and many more. This blog will help you integrate Presto with an HDInsight cluster.

HDInsight is an Azure-managed big data platform built on popular open-source tools such as Spark, Hive, and Kafka. This blog will not focus on the HDInsight cluster itself.

We are using a Spark 2.4 (HDI 4.0) cluster. To launch the cluster you need an Azure account; go to the HDInsight service and create a cluster with the required resources.

The Presto architecture has the following components:

  • Presto Coordinator
  • Presto Worker
  • Presto CLI

Let’s start with the Linux instance (CentOS). Here we are using a single instance that acts as both the coordinator and a worker node. Once we spin up the instance, we need to install Java. Presto supports Oracle Java 1.8 or later.

For your reference, please use the link. Once you have downloaded the Java RPM, install it using the command below:

sudo yum install jdk-8u121-linux-x64.rpm

Here we are using Presto version 0.149. Download the server tarball:

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.149/presto-server-0.149.tar.gz

Unpack the tar file and do the initial configuration:

tar -xvf presto-server-0.149.tar.gz
mv presto-server-0.149 presto
sudo mv presto /opt/

Now export the Presto path by adding the entries below to /etc/profile:

export PRESTO_HOME="/opt/presto"
export PATH=$PRESTO_HOME/bin:$PATH
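
After adding these entries, reload the profile (or export the same variables in your current shell) so the change takes effect. A quick sanity check:

```shell
# Export the Presto paths (the same entries added to /etc/profile)
export PRESTO_HOME="/opt/presto"
export PATH=$PRESTO_HOME/bin:$PATH

# Confirm the variables are set as expected
echo "$PRESTO_HOME"
```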

Presto needs the following configuration files:

  • config.properties (Presto configuration)
  • core-site.xml (Azure blob storage)
  • node.properties (Environment and Data)
  • hive.properties (Hive details)

Start with the properties files. Create or update them under the /opt/presto/etc path:

config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=2GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

node.properties

node.environment=environment_name
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/opt/presto/data

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.AbstractFileSystem.wasb.impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.BLOB_STORAGE_NAME.blob.core.windows.net</name>
<value>KEY_DETAILS</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>wasbs://CONTAINER_NAME@BLOB_STORAGE_NAME.blob.core.windows.net/</value>
</property>
</configuration>

*The blob storage above should be the same storage account used to configure the HDInsight cluster.

Now we need to create a catalog directory under the /opt/presto/etc/ path to store the Hive connector details:

mkdir /opt/presto/etc/catalog
cd /opt/presto/etc/catalog

hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://xxxxxxx.xxxxxx.xxxx.internal.cloudapp.net:9083,thrift://xxxxxxx.xxxxxx.xxxx.internal.cloudapp.net:9083
hive.config.resources=/opt/presto/etc/core-site.xml

*You can get the hive metastore URI details from the HDInsight cluster.

That completes the configuration. We can start Presto using the command below:

launcher start

To debug issues, you can run Presto in the foreground instead, so the logs go to the console:

launcher run

To access Presto we need the Presto CLI. Please use the following link to download the CLI, then move it to the bin directory to use it. Note that the CLI version below (0.258) differs from the server version we installed (0.149); ideally use a CLI version that matches your server.

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.258/presto-cli-0.258-executable.jar
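
The downloaded jar is self-executing; rename it, mark it executable, and move it into Presto's bin directory so the presto command is on your PATH (paths assume the layout used above):

```shell
# Rename the CLI jar, make it executable, and move it onto the PATH
mv presto-cli-0.258-executable.jar presto
chmod +x presto
sudo mv presto /opt/presto/bin/
```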

Let's connect with Presto:

[azureuser@Presto ~]$ launcher start
Started as 1895
[azureuser@Presto ~]$ launcher status
Running as 1895
[azureuser@Presto ~]$ presto --server localhost:8080 --catalog hive
presto> show schemas;
Schema
--------------------
default
information_schema
test
(3 rows)
Query 20210816_015817_00000_ikr35, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:02 [3 rows, 44B] [1 rows/s, 28B/s]

TL;DR

Set up an HDInsight cluster in Azure. Launch an instance in Azure. Install and configure the required packages and configuration files described above. Configure the Presto CLI and query the data.

Opstree

Opstree is an end-to-end DevOps consultancy that helps organizations implement DevOps practices.

Written by

Ankur

DevOps Engineer with 10+ years of experience in the IT Industry. In-depth experience in building highly complex, scalable, secure and distributed systems.
