Integrating Presto with HUE

How to get PrestoDB accessible via HUE with Prestogres and bit of hackery

In the world of Big Data Analytics, the possibilities are endless. Tools like Hadoop and Hive offer magnificent capabilities in exploring this data with existing SQL skill sets. Exploring terabytes on terabytes of data in a matter of tens of minutes is impressive from an engineering point of view — not so much from an Analyst perspective. The downside of having access to a massive amount of data, is that queries can last even hours.

Enter Presto, a SQL-on-Files offering by Facebook, designed from the ground up to be a fast interface for the same files residing in HDFS. Developed as an internal tool for analysts, it has grown to be a mature open source offering, with connectors to products such as Hive, MySQL, Cassandra and Apache Kafka.

HUE — shorthand for Hadoop User Experience, is a wonderful tool from the team at Cloudera for interacting with Hadoop. It contains web-based user interfaces for such things as Hive, Pig, Solr, Oozie and a host of other Hadoop related tools. It also comes bundled with Cloudera Hadoop, and often is a natural choice for many use cases.


Presto exacto!

I was recently faced with a problem of thinking about what was the best way of testing Presto against Hive. Presto is a SQL-on-everything product developed by Facebook and open sourced for everyone’s benefit.

The prime directive of the Presto project is SPEED and FAMILIARITY. Presto is a fully SQL-92 compliant tool that can access all the files stored in Hadoop, and also talk to Hive to access its’ tables. Presto performs most actions twice as fast as Hive, and often with more complex queries speed improvements can be to the tune of 10x.

These improvements are achieved by Presto by omitting MapReduce, the familiar method from Hadoop, completely. Instead, it uses a Directed-action-graph algorithm.

Presto also supports reading data from sources other than Hadoop, and most impressively allows joining data from two distinct sources. You may have a table in Hive and another in Cassandra, and Presto will help you combine these. A powerful tool to have as an analyst. The up-to-date list of connector as of Presto 0.79 is:

  • Cassandra — a distributed NoSQL store
  • Hive — a SQL Interface to Hadoop
  • Apache Kafka — a distributed publish-subscribe messaging service
  • MySQL — The bog-standard of SQL servers
  • PostgreSQL — The golden standard of SQL servers
  • TPCH Connector for testing and benchmarking
  • JMX Connector for internal monitoring

One of the problems faced was the user interface, which presto does not offer natively. Instead it exposes a JDBC driver or a Command Line Interface that will be familiar to MySQL users.

The client already being a big user of Hue it made sense to integrate Presto in to the user interface.

Hackin’ it away

Hue contains an App for accessing traditional RDMBs and other databases. The list of supported Databases include MySQL and PostgreSQL, but sadly no support for JDBC connectors has been provided yet.

Enter Treasure Data, with their wonderful open source project Prestogres. With the help of a patched PostgreSQL server and clustering aide pgpool II, Prestogres enables accessing Presto with ANY tool that can speak PostgreSQL.

By leveraging this ability we can extend Hue’s user interface to include access to Presto as well:

Logical architecture of the setup. Adapted from the image at (https://github.com/treasure-data/prestogre)

Step 1 — Install Presto

Installation of Presto is fairly straightforward. Just follow the steps here to unpack all your files and create configurations. A good rule of thumb is to have separate servers for the Presto Coordinators and Prestogres.

Create a catalog file for your desired Hive server. Point the endpoint to the instance where the Hive Metastore lives, and usually at port 10000. Test using the Presto CLI to see if a connection has been established. You should be able to already query hive tables like this with dizzying speed.

Often the biggest problem you’ll face with Presto is that it needs Java 7 or higher to function, so please check your java version with:

java -version

If like me you’re stuck running Ubuntu 12.04 (I know, rite?) on your servers follow these steps to get a later version:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Step 2 — Install & Configure Prestogres

Installing Prestogres is a bit trickier. Prestogres requires Postgres 9.3 or higher. Ubuntu 12.04 does not have versions this high available, so installation on it requires adding custom repositories:

sudo apt-get update
sudo apt-get install python-software-properties
wget —quiet -O — https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

Add this line to your /etc/apt/sources.list:

deb http://apt.postgresql.org/pub/repos/apt/ codename-pgdg main

Finally, run :

sudo apt-get update
sudo apt-get install postgresql-9.4

When you’ve got a Postgresql version that’s high enough, follow the instructions at https://github.com/treasure-data/prestogres to get the system initially set up.

When you’re done with initial installation, configure the server in the following files:

data_dir/pgpool/pgpool.conf:

Configuration here is fairly straightforward — Point to the right Presto server URI and Port.

  • set listen_addresses to ‘*’ to enable global access. Don’t worry, you’ll limit this to particular nodes in another config
  • presto_server: should be the uri and port of your new presto installation
  • presto_catalog: should be the name of the catalog file that contains the hive configurations for presto
  • presto_schema: The schema to point to in the Hive database you want to access. Default is a good start.

data_dir/pgpool/pool_hba.conf

Pool_hba is a modified pgpool access control file. This file will help you define who has access to the postgres instance and to which databases. Add the following line to the file, replacing your ip with the XXX’s:

host all all XX.XX.XX.XX/32 prestogres_trust pg_database:postgres,pg_user:pg

This configuration will allow passwordless access to a user named pg to the database run by postgresql.

Run Prestogres

Follow the instructions to the end on the prestogres repository and start Presto. Test on the same server according to their instructions.

Step 3 — Configure and Patch HUE for Prestogres

On to the good stuff! If you’re a future-person running Hue 3.7.0 you can skip the next chapter. If not, Proceed

1.Patch Hue 3.6.0

Hue 3.6.0 has a little bug in the postgresql connector that causes the table listings not to display correctly. To fix this, log on to your hue server, and use this command to find the file:

sudo locate postgresql_lib.py

it usually lives somewhere like this: desktop/libs/librdbms/src/librdbms/server

Apply these changes to the file:

https://github.com/Ile2/hue/commit/6d02db1f0b58b4103bc131526f4734b4a9a5a2d5

2. Configure Hue

If you’re running Cloudera Hadoop with the Cloudera manager, go to :

Hue -> Configuration -> Advanced -> Safety Valve

If you’re running a custom install of Hue, find

desktop/conf/pseudo-distributed.ini

Add the following lines to the configuration:

[librdbms]
[[databases]]
[[[postgresql]]]
nice_name=”Presto”
name=default
engine=postgresql
host=ip.of.prestogres.server
port=9900
user=pg
password=

3. Restart Hue

And that is it! Enjoy a fully working integration of Hue and Presto. You can now query Presto databases with the Hue interface. Grant access to your analysts, technicians, engineers or data scientists!

Ah, this is the life!
One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.