Using Amazon Athena with Apache Zeppelin

Yuta Imai
3 min read · Jan 12, 2017


Here is how we can use Amazon Athena as a backend for Apache Zeppelin. Amazon Athena is essentially Presto as a service, providing a web UI and a JDBC interface. With it we can easily run queries on data sitting in Amazon S3. Apache Zeppelin is a web-based notebook that enables interactive data analytics. We can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Motivation:

To run Apache Zeppelin on my laptop, offloading the heavy-lifting workload to the cloud. The architecture looks like this:

Zeppelin on laptop > (jdbc) > Athena > (presto/hive metastore) > S3

This is very convenient, in particular for ad-hoc analysis, because we can leverage the power of the cloud and run analytical workloads without spinning up a Hadoop or Spark cluster.

Approach:

Amazon Athena provides a JDBC driver; you can find it and some sample code in Accessing Amazon Athena with JDBC. With it we can connect to Athena either through Spark code on Zeppelin, or through Zeppelin’s JDBC interpreter.

Unfortunately the first approach is not achievable right now, because some functionality is not implemented yet in the JDBC driver. You can find more about it in the forum thread prepareStatement in JDBC Driver? (Maybe we could implement a wrapper class that provides prepareStatement?)

Thus, this time I took the second approach: using Apache Zeppelin’s JDBC interpreter.

Interpreter configuration:

Here is the interpreter configuration. Based on Amazon’s documentation, the items we have to configure are the following (a filled-in sketch follows the list):

  • default.driver: com.amazonaws.athena.jdbc.AthenaDriver
  • default.url: jdbc:awsathena://athena.${region}.amazonaws.com:443/
  • default.user: YOUR_AWS_ACCESS_KEY. Could be removed if you are using an instance profile, though I haven’t tested that.
  • default.password: YOUR_AWS_SECRET_KEY. Could be removed if you are using an instance profile, though I haven’t tested that.
  • default.s3_staging_dir: your S3 directory for staging. This must be in the same region as your Athena endpoint.
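
For example, with placeholder credentials and a staging bucket in us-east-1 (every value below is a made-up example, not a real key or bucket), the filled-in settings would look like:

    default.driver          com.amazonaws.athena.jdbc.AthenaDriver
    default.url             jdbc:awsathena://athena.us-east-1.amazonaws.com:443/
    default.user            AKIAXXXXXXXXXXXXXXXX
    default.password        YOUR_AWS_SECRET_KEY
    default.s3_staging_dir  s3://your-bucket/athena-staging/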

You also have to add the driver jar as a dependency, as below.
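
As a sketch: in the interpreter’s Dependencies section, add the local path to the driver jar you downloaded (the file name below is the one Amazon distributed at the time of writing; adjust it to whatever version you actually have):

    /path/to/AthenaJDBC41-1.0.0.jar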

Creating a table

Athena seems to have its own built-in Hive metastore, so we have to tell it the table schema using CREATE EXTERNAL TABLE. What’s a little bit tricky is that we cannot run CREATE EXTERNAL TABLE through JDBC, because it’s not standard SQL but HiveQL. So we may have to run such DDL via the Athena UI, like below. This time I used the public example data on S3 from AMPLab.
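
As an illustration, the DDL for the AMPLab benchmark’s rankings table might look like this (the table name, columns and S3 path follow the public AMPLab big-data-benchmark layout; treat the exact path as an assumption and check the dataset’s docs):

    -- rankings: one row per page, stored as comma-delimited text files on S3
    CREATE EXTERNAL TABLE rankings (
      pageURL     STRING,
      pageRank    INT,
      avgDuration INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://big-data-benchmark/pavlo/text/tiny/rankings/';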

BTW, I also tried ORC tables, and they worked as well!
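
The ORC variant only changes the storage clause; a minimal sketch, assuming you have converted the data to ORC files under your own bucket:

    -- same schema, but backed by ORC files instead of delimited text
    CREATE EXTERNAL TABLE rankings_orc (
      pageURL     STRING,
      pageRank    INT,
      avgDuration INT
    )
    STORED AS ORC
    LOCATION 's3://your-bucket/rankings-orc/';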

Running queries

Now we can run queries!
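
For example, a Zeppelin paragraph bound to the JDBC interpreter could look like this (the table and columns come from the hypothetical rankings DDL above):

    %jdbc
    SELECT pageURL, pageRank
    FROM rankings
    ORDER BY pageRank DESC
    LIMIT 10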

Remark

While I was playing with this, I found that INSERT OVERWRITE TABLE is not supported in Athena. So we have to build the data ingestion pipeline outside of Athena.
