Running Presto with Hive Metastore on a Laptop in 10 Minutes

Bin Fan
4 min read · Jul 3, 2020


This tutorial guides beginners through setting up Presto and Hive Metastore on a local machine to query data on S3.

Presto is the SQL engine that plans and executes queries, S3 is the storage service for the table partition files, and Hive Metastore is the catalog service Presto consults for table schema and location information. This tutorial walks through installing Presto and Hive Metastore step by step and configuring them to query data in a public S3 bucket.

Step 1: Download and Launch Hive Metastore

Download and extract the Hive binary tarball. In this tutorial, we download and use apache-hive-2.3.7-bin.tar.gz.

$ cd /path/to/tutorial/root
$ wget https://downloads.apache.org/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz
$ tar -zxf apache-hive-2.3.7-bin.tar.gz
$ cd apache-hive-2.3.7-bin

We only need to launch the Hive Metastore service, which serves Presto catalog information such as table schemas and partition locations.

If this is your first time launching Hive Metastore, prepare the configuration files and environment, and initialize a new metastore schema:

$ export HIVE_HOME=`pwd`
$ cp conf/hive-default.xml.template conf/hive-site.xml
$ mkdir -p hcatalog/var/log/
$ bin/schematool -dbType derby -initSchema

Configure Hive to access S3 by appending the following lines to conf/hive-env.sh. Hive needs the corresponding jars to access files with “s3a://” addresses, as well as AWS credentials to access an S3 bucket (even if the bucket is public):

export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar
export AWS_ACCESS_KEY_ID=<Your AWS Access Key>
export AWS_SECRET_ACCESS_KEY=<Your AWS Secret Key>

If you do not have a Hadoop installation that includes the above jars, you can also download them from the Maven Central repository: aws-java-sdk-core-1.10.6.jar, aws-java-sdk-s3-1.10.6.jar, and hadoop-aws-2.8.4.jar.
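If you prefer to script the downloads, Maven Central URLs follow a predictable groupId/artifactId/version layout. Here is a minimal bash sketch; the coordinates below are the standard Maven coordinates for these three libraries, and you should adjust the versions to match your Hadoop installation:

```shell
# Print Maven Central download URLs for the jars listed above.
BASE=https://repo1.maven.org/maven2
urls=""
for gav in \
    com.amazonaws:aws-java-sdk-core:1.10.6 \
    com.amazonaws:aws-java-sdk-s3:1.10.6 \
    org.apache.hadoop:hadoop-aws:2.8.4; do
  g=${gav%%:*}           # groupId
  rest=${gav#*:}
  a=${rest%%:*}          # artifactId
  v=${rest##*:}          # version
  # groupId dots become path separators in the repository layout
  urls="$urls $BASE/${g//.//}/$a/$v/$a-$v.jar"
done
printf '%s\n' $urls
```

Pipe the printed URLs to xargs -n1 wget to fetch the jars into the current directory.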

Start the Hive Metastore; it will run in the background and listen on port 9083 by default.

$ hcatalog/sbin/hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].

To verify that the Metastore is running, check the Hive Metastore logs under hcatalog/var/log/.
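Besides reading the logs, a quick scriptable check is to probe the Thrift port directly. This is a minimal sketch, assuming the default port 9083 and a bash shell (/dev/tcp is a bash-only feature):

```shell
# Try to open a TCP connection to the metastore's default Thrift port.
if (exec 3<>/dev/tcp/localhost/9083) 2>/dev/null; then
  msg="Hive Metastore is listening on port 9083"
else
  msg="Hive Metastore is not reachable on port 9083"
fi
echo "$msg"
```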

Step 2: Download and Launch the Presto Server

Go to the Presto server installation page, then download and untar the pre-built server tarball. We will use version 0.237.1 as an example in this tutorial.

$ cd /path/to/tutorial/root
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.237.1/presto-server-0.237.1.tar.gz
$ tar -zxf presto-server-0.237.1.tar.gz
$ cd presto-server-0.237.1

Create a configuration file etc/config.properties to include the basic Presto configuration.

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Create etc/jvm.config to specify the following JVM configuration.

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Create etc/node.properties to include the following lines.

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data

Lastly, configure the Presto Hive connector in etc/catalog/hive.properties, pointing to the Hive Metastore service we just started. We also need to supply the AWS credentials here again so Presto can read the input files from S3.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Access Key>
hive.s3.aws-secret-key=<Your AWS Secret Key>

Run the Presto server in the background:

$ ./bin/launcher start

To verify that the Presto server is running, open http://localhost:8080 in your browser and check the server status from its web UI.
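The same check can be scripted against the coordinator's REST API. A minimal sketch, assuming the default port 8080 (v1/info is the coordinator's standard info endpoint):

```shell
# Query the coordinator's info endpoint; -sf makes curl quiet and
# fail on HTTP errors, so the exit status reflects server health.
if curl -sf http://localhost:8080/v1/info >/dev/null 2>&1; then
  msg="Presto coordinator is up"
else
  msg="Presto coordinator is not reachable (yet)"
fi
echo "$msg"
```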

Step 3: Launch the Presto CLI and Run Queries

Download the Presto CLI, which is a separate executable from the server.

$ cd /path/to/tutorial/root
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.237.1/presto-cli-0.237.1-executable.jar
$ mv presto-cli-0.237.1-executable.jar presto
$ chmod +x presto

Connect to the server started in the previous step.

$ ./presto --server localhost:8080 --catalog hive --debug
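For scripting, the CLI can also run a single statement and exit via its --execute option. A sketch, assuming the presto binary downloaded above is in the current directory and the server is up:

```shell
# Run one statement non-interactively; the CLI exits after the query.
if [ -x ./presto ]; then
  out=$(./presto --server localhost:8080 --catalog hive \
        --execute 'SHOW SCHEMAS;' 2>&1)
else
  out="presto CLI not found in the current directory"
fi
echo "$out"
```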

Use the default schema:

presto> use default;
USE

Create a new external table with a schema matching the files in S3. This table definition will be sent to the Hive Metastore.

presto:default> CREATE TABLE reason (
r_reason_sk integer,
r_reason_id varchar,
r_reason_desc varchar
) WITH (
external_location = 's3a://apc999/presto-tutorial/example-reason',
format = 'PARQUET'
);
CREATE TABLE

Scan the newly created table:

presto:default> SELECT * FROM reason limit 3;
r_reason_sk | r_reason_id | r_reason_desc
-------------+------------------+------------------------
1 | AAAAAAAABAAAAAAA | Package was damaged
2 | AAAAAAAACAAAAAAA | Stopped working
3 | AAAAAAAADAAAAAAA | Did not get it on time
(3 rows)
Query 20200703_074406_00011_8vq8w, FINISHED, 1 node
http://localhost:8080/ui/query.html?20200703_074406_00011_8vq8w
Splits: 18 total, 18 done (100.00%)
CPU Time: 0.5s total, 6 rows/s, 2.06KB/s, 27% active
Per Node: 0.1 parallelism, 0 rows/s, 279B/s
Parallelism: 0.1
Peak User Memory: 0B
Peak Total Memory: 219B
Peak Task Total Memory: 219B
0:04 [3 rows, 1002B] [0 rows/s, 279B/s]

Step 4: Stop the Servers

$ cd /path/to/tutorial/root
$ presto-server-0.237.1/bin/launcher stop
$ apache-hive-2.3.7-bin/hcatalog/sbin/hcat_server.sh stop

Summary

In this tutorial, we demonstrated how to set up Presto and Hive Metastore to run SQL queries against data in a public S3 bucket. We hope you find it useful.


Bin Fan

VP of Open Source, Founding Member@Alluxio, ex-Google, PhD from CMU,