How to Install Presto Over a Cluster
This post describes the steps to follow to install Facebook’s Presto on your cluster.
Although the Presto documentation describes a simple enough installation procedure, I found some of the points vague and potentially confusing. So here's a modified version.
Prerequisites
Presto requires you to have the following configured:
- A working Hadoop installation: you can set one up by following the steps mentioned here.
- Hive: required for running the Hive server in later steps and for modifying the database. Presto can only be used for running queries, not for creating tables; for that purpose, a Hive installation is required. You can install it by following the steps mentioned here.
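Before going further, it is worth confirming that the prerequisite commands are actually on the PATH of every node. A minimal sketch (the launcher names `hadoop` and `hive` are the usual ones; adjust if your installation differs):

```shell
# Sketch: check that the prerequisite commands are available, collecting
# the results into $report instead of exiting so you can see all misses.
report=""
for cmd in hadoop hive java; do
    if command -v "$cmd" >/dev/null 2>&1; then
        report="$report $cmd:found"
    else
        report="$report $cmd:missing"
    fi
done
echo "$report"
```

Run this on the master and each slave; anything reported as missing should be installed before continuing.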
The installation part is described in three major parts:
1. Deploying Presto
2. Discovery Service
3. Command Line Interface
Deploying Presto
For the sake of clarity, we'll refer to the directory used to install Presto as the home directory.
Start by downloading the Presto tarball on the master as well as the slaves, and extract it.
cd home
wget http://central.maven.org/maven2/com/facebook/presto/presto-server/0.60/presto-server-0.60.tar.gz
tar zxvf presto-server-0.60.tar.gz
This will create a directory presto-server-0.60. Let's call this the Presto directory. Now create a data directory inside the Presto directory; Presto needs it for storing logs, local metadata, etc. Also create a metastore directory, required for storing metadata.
cd presto-server-0.60
mkdir data
mkdir metastore
Configuring Presto
Create a directory inside the Presto directory named etc. This will be used to hold all your configuration files.
cd presto-server-0.60
mkdir etc
Following files have to be created inside the etc directory.
node.properties
This file is typically created by the deployment system when Presto is first installed. The following is a minimal etc/node.properties (same for master and slaves):
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/presto-server-0.60/data
Note: the node.id can be obtained by running the uuid command on your system. Don't change it once you've configured it in this file.
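The steps above can be scripted per node. A sketch, with the uuid source as an assumption (/proc/sys/kernel/random/uuid on Linux, falling back to uuidgen), and the data-dir path matching the layout used in this post:

```shell
# Sketch: generate a node.id once and write etc/node.properties.
# Run this from inside the Presto directory on each node.
mkdir -p etc
NODE_ID=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)
cat > etc/node.properties <<EOF
node.environment=production
node.id=$NODE_ID
node.data-dir=/home/presto-server-0.60/data
EOF
```

Since node.id must stay stable across restarts, run this only once per node rather than on every deploy.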
jvm.config
The following provides a good starting point for creating etc/jvm.config:
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:PermSize=150M
-XX:MaxPermSize=150M
-XX:ReservedCodeCacheSize=150M
-Xbootclasspath/p:/home/presto-server-0.60/lib/floatingdecimal-0.1.jar
config.properties
The following is a configuration for the master.
coordinator=true
datasources=jmx
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/home/presto-server-0.60/metastore
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://<master>:8080
and this is the configuration for the workers.
coordinator=false
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/home/presto-server-0.60/metastore
task.max-memory=1GB
discovery.uri=http://<master>:8080
where <master> is the IP of the master being used here.
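Since every worker gets the same file apart from the master's address, the worker config can be generated with a small script. A sketch, where MASTER_IP is a placeholder for your coordinator's IP and the script is run from each worker's etc directory:

```shell
# Sketch: render the worker config.properties with the master's IP
# substituted in. MASTER_IP is a hypothetical value; replace it.
MASTER_IP=10.0.0.1
cat > config.properties <<EOF
coordinator=false
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/home/presto-server-0.60/metastore
task.max-memory=1GB
discovery.uri=http://$MASTER_IP:8080
EOF
```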
log.properties
Add this single line inside the log.properties file:
com.facebook.presto=DEBUG
catalog.properties
Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/jmx.properties with the following contents to mount the jmx connector as the jmx catalog:
connector.name=jmx
Presto includes Hive connectors for multiple versions of Hadoop:
- hive-hadoop1: Apache Hadoop 1.x
- hive-hadoop2: Apache Hadoop 2.x
- hive-cdh4: Cloudera CDH4
Create etc/catalog/hive.properties with the following contents to mount the hive-cdh4 connector as the hive catalog, replacing hive-cdh4 with the proper connector for your version of Hadoop and example.net:9083 with the correct host and port for your Hive metastore Thrift service:
connector.name=hive-cdh4
hive.metastore.uri=thrift://<master>:10000
You can have as many catalogs as you need, so if you have additional Hive clusters, simply add another properties file to etc/catalog with a different name (making sure it ends in .properties).
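For example, a second Hive cluster could be mounted as a catalog named hive2. This is only a sketch; the catalog name and metastore host below are hypothetical:

```shell
# Sketch: mount a second Hive cluster as the "hive2" catalog.
# "second-master:9083" is a placeholder for that cluster's metastore.
mkdir -p etc/catalog
cat > etc/catalog/hive2.properties <<EOF
connector.name=hive-cdh4
hive.metastore.uri=thrift://second-master:9083
EOF
```

The catalog then becomes addressable from queries by its file name, e.g. hive2.default.some_table.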
Running Presto
To run presto as a foreground process, run
bin/launcher run
You can also run it as a daemon (in which case logs are written to files under the data directory rather than to stdout/stderr):
bin/launcher start
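The launcher also has stop and status subcommands for managing the daemon. To confirm the server actually came up, a quick liveness check can be run against the coordinator's HTTP port; a sketch, where the URL is an assumption (substitute your master's address for localhost):

```shell
# Sketch: liveness check against the coordinator's HTTP port after
# startup. PRESTO_URL is a placeholder matching the configs above.
PRESTO_URL=http://localhost:8080
if curl -sf "$PRESTO_URL" >/dev/null 2>&1; then
    status=up
else
    status=down
fi
echo "presto is $status"
```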
Discovery Service
Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup.
Discovery is configured and run the same way as Presto. Download discovery-server-1.16.tar.gz, unpack it to create the installation directory, create the data directory, then configure it to run on a different port than Presto.
cd home
wget http://central.maven.org/maven2/io/airlift/discovery/discovery-server/1.16/discovery-server-1.16.tar.gz
tar zxvf discovery-server-1.16.tar.gz
cd discovery-server-1.16
Again, create a data directory for the Discovery server; it is referenced below in its node.properties.
mkdir data
Configuring Discovery
As with presto, create an etc directory inside the discovery server directory to hold the following files:
node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/discovery-server-1.16/data
jvm.config
-server
-Xmx1G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
config.properties
http-server.http.port=8411
Running Discovery
Run Discovery the same way as Presto, either as a foreground process:
bin/launcher run
or as a daemon:
bin/launcher start
Command Line Interface
To access the Presto CLI, you first need to start the Hive Thrift server. Go to the Hive directory and run the following:
bin/hive --service hiveserver -p 10000
Now, download presto-cli-0.60-executable.jar, rename it to presto, make it executable, then run it:
wget http://central.maven.org/maven2/com/facebook/presto/presto-cli/0.60/presto-cli-0.60-executable.jar
mv presto-cli-0.60-executable.jar presto
chmod +x presto
./presto --server <master>:8080 --catalog hive --schema default
This will start a Presto CLI for you to run your queries.
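The CLI can also run a single query non-interactively via its --execute flag, which is handy for scripting. A sketch of a small helper, where MASTER is an assumption standing in for your coordinator's IP:

```shell
# Sketch: helper wrapping the CLI's --execute flag for one-shot queries.
# MASTER is a hypothetical value; set it to your coordinator's address.
MASTER=10.0.0.1
run_query() {
    ./presto --server "$MASTER:8080" --catalog hive --schema default \
        --execute "$1"
}
# usage: run_query "SHOW TABLES;"
```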