How to Connect S3 Data to Looker using Starburst Presto

Senthil Anandarangan
Hashmap, an NTT DATA Company
11 min read · Aug 4, 2020

by Senthil Anandarangan and Ganesh Hegde

As the adoption of data-driven solutions grows, fast data access becomes increasingly critical for data analysts.

This blog post takes you through a ‘source’ data access solution using the Starburst Presto SQL query engine and Looker Analytics, with AWS S3 as our data source.

What is Starburst Presto?

Starburst Enterprise Presto (SEP) is the commercial distribution of the Presto distributed SQL query execution engine. It is maintained by Starburst and free to download and use. You will need to purchase an Enterprise License to enable additional security features (such as Ranger and Okta integration), more connectors (the Delta Lake connector is one of the most powerful), an enhanced cost-based query optimizer, tech support, and more.

Starburst Presto simplifies the Presto cluster installation steps: it can be self-installed on-prem or on cloud VMs, or packaged onto Kubernetes clusters, which adds scalability and flexible runtime deployment options.

In addition to that, Starburst Presto has marketplace offerings in AWS and Azure, which enable easy installation and subscription pricing models. For some companies, cloud subscription pricing that includes support and maintenance of the clusters works out better than going with a self-service model. For AWS Marketplace pricing, refer to this link.

The demo solution in this blog post uses Starburst Enterprise Presto on Azure Kubernetes Service (AKS). You may choose to use AWS Elastic Kubernetes Service (EKS), Google Kubernetes Engine, or Red Hat OpenShift by choosing a relevant image from Starburst DockerHub.

Let's take a look at our goals:

  • A successful Starburst Enterprise Presto cluster installation.
  • Connect Presto to AWS S3 object storage containing a public weather dataset.
  • Query weather data directly from S3 into Looker BI using its inbuilt Presto Connector.

Are you ready for action? Hang tight, there is quite a bit of detail to absorb, especially around the Presto concepts and the cluster setup!

If you have not done so already, request a download and an Enterprise trial license here: https://www.starburstdata.com/

NOTE — The technical part of this blog post is written under an assumption that readers have some basic familiarity with Azure CLI commands, Kubernetes and Looker. And NO experience with Presto!

Let’s start creating a test DEMO environment with a minimal configuration:

  1. AKS cluster with 2 nodes of standard DS3_v2 size [4 cores, 14 GB] Linux VMs.
  2. Azure Database for PostgreSQL server [purpose explained later in this article].
  3. Azure VNet and subnet in the same region as the AKS cluster and the database [you can use an existing VNet as long as it is in the same region].
  4. AWS S3 object storage with a CSV/JSON/Parquet/TEXTFILE dataset to query.
  5. Looker 7.x or above, which supports the PrestoDB connector.

Step 1: Create AKS cluster Nodes

Login to Azure Portal and set the following variables in Azure bash Cloud Shell:

SUBSCRIPTION=[Choose Subscription name]
RES_GROUP_NAME=[Choose Resource group name]
AKS_NAME=[Choose AKS name]
LOCATION=[Choose Location]
VNET_NAME=[Choose/Create VNet]
SUBNET_NAME=[Choose/Create Subnet]

SUBNET_RES_ID=$(az network vnet subnet list \
  --vnet-name $VNET_NAME \
  --resource-group $RES_GROUP_NAME \
  --query "([?name=='$SUBNET_NAME'])[0].id" | cut -d '"' -f 2)
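For illustration, the values could look something like this (all of these names are placeholders):

SUBSCRIPTION="my-azure-subscription"
RES_GROUP_NAME="starburst-demo-rg"
AKS_NAME="starburst-demo-aks"
LOCATION="eastus2"
VNET_NAME="starburst-demo-vnet"
SUBNET_NAME="starburst-demo-subnet"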

Run the script below:

az aks create --resource-group $RES_GROUP_NAME \
  --name $AKS_NAME \
  --enable-cluster-autoscaler \
  --node-count 2 \
  --min-count 2 \
  --max-count 2 \
  --node-vm-size Standard_DS3_v2 \
  --generate-ssh-keys \
  --vnet-subnet-id $SUBNET_RES_ID \
  --network-plugin azure \
  --enable-addons monitoring

Verify that 2 nodes were created in the K8s cluster using the command:

> kubectl get nodes
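If kubectl cannot reach the new cluster yet, fetch its credentials first with the standard Azure CLI command:

> az aks get-credentials --resource-group $RES_GROUP_NAME --name $AKS_NAME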

Step 2: Prepare for Starburst Presto installation

2a) Download the 2 YAML files below, which contain the Starburst container image URL

Missioncontrol.yaml : https://starburstdata.s3.us-east-2.amazonaws.com/mission-control/0.20/k8s/missioncontrol.yaml

Postgres.yaml : https://starburstdata.s3.us-east-2.amazonaws.com/mission-control/0.20/k8s/postgres.yaml
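Both files can be pulled straight from the URLs above, for example with curl:

curl -O https://starburstdata.s3.us-east-2.amazonaws.com/mission-control/0.20/k8s/missioncontrol.yaml
curl -O https://starburstdata.s3.us-east-2.amazonaws.com/mission-control/0.20/k8s/postgres.yaml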

2b) Setup Starburst Hive Connector

  • The Hive connector allows querying data stored in a distributed file system such as Amazon S3. It supports data files in varying formats, typically Parquet, JSON, CSV, or text files.
  • SEP on Kubernetes provides automatic configuration of the Hive connector. The connector can either access an external metastore or use the built-in internal metastore on the Presto cluster (a sketch of the equivalent catalog properties follows below).
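Mission Control generates this configuration for you, but for orientation, a Hive catalog in Presto ultimately boils down to a properties file roughly like the one below. The property names are standard Presto Hive connector settings; the host and key values are placeholders.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore-host>:9083
hive.s3.aws-access-key=<access-key-id>
hive.s3.aws-secret-key=<secret-access-key>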

2c) Setup storage database for metastore data

  • The metastore requires a PostgreSQL database to store its data.
  • Create an Azure Database for PostgreSQL server [let me call it ‘starburst-metastore’, as it will be referenced later in this article] with the Basic configuration, 1 vCore, 10 GB [for demo purposes]; a CLI sketch follows after the tip below. For production metastore storage, the recommended configuration is 8 GB of memory and 2 CPU cores.
  • An alternate option for metastore storage is the built-in ephemeral PostgreSQL instance; however, it is not recommended for production use.
  • As in traditional Hive deployments, the main purpose of the metastore is to hold metadata about how the data files map to schemas and tables in the source system. That metadata is stored in a database such as PostgreSQL and is accessed via the Hive Metastore Service (HMS) running on the Presto cluster nodes.

Tip: Pods running on the Presto nodes must have network access to the Hive metastore and the storage system, so make sure the Hive metastore is running in the same virtual network.
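As a sketch, the demo-sized metastore server described above could be created with the Azure CLI; the admin user and password here are placeholders, and the SKU/storage values match the Basic, 1 vCore, 10 GB configuration:

az postgres server create \
  --resource-group $RES_GROUP_NAME \
  --name starburst-metastore \
  --location $LOCATION \
  --admin-user <admin-user> \
  --admin-password <admin-password> \
  --sku-name B_Gen5_1 \
  --storage-size 10240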

Step 3: AWS S3 configuration

  • The Hive connector can read and write tables stored in S3. This is accomplished by giving a table or database a location that uses an S3 prefix, for example s3://mybucket/data.
  • The example dataset we used for this article is a public weather dataset capturing the current weather for a specific US location, collected every 2 minutes.
  • The data format is JSON, and a sample looks like this:
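An illustrative record containing the fields extracted later in the SQL view might look roughly like this (the values are made up; the original post shows the actual payload). Since the Presto table is defined as a TEXTFILE with a single varchar column, each JSON record is typically stored as one line in the file.

{"latitude": 40.7128, "longitude": -74.0060, "timezone": "America/New_York", "currently": {"time": 1593471785, "temperature": 78.42, "apparentTemperature": 80.11, "humidity": 0.63, "windSpeed": 6.91}}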

Now we are ready to install the Starburst Presto container image on the AKS nodes.

Step 1) Edit the missioncontrol.yaml file in the following section, replacing the values with credentials from the Azure database ‘starburst-metastore’

[Recommended practice is to source the username and password from the Azure Key Vault service]

Step 2) Apply the missioncontrol.yaml file

> kubectl apply -f missioncontrol.yaml

Step 3) Apply the postgres.yaml file

> kubectl apply -f postgres.yaml

Verify the services installed and their status:

> kubectl get pods

This should show you the pod status as follows:

If you see all the pods in Running status, congratulations! You have successfully installed all the services.

The next step is to start and run the cluster nodes through the Mission Control UI.

Mission Control is a web-based management console for Starburst Enterprise Presto.

Launching the Mission Control UI:

> kubectl get svc

This returns the external IP of the load balancer attached to the mission-control service.
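If the service is named mission-control (as in the manifests above), you can also pull the external IP directly, for example:

> kubectl get svc mission-control -o jsonpath='{.status.loadBalancer.ingress[0].ip}'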

Open a browser window and point it to the following URL:

http://<load balancer ip>:5042 [username/password :: admin/admin]

Mission Control UI

‘hive’ is the mandatory name for a data source that uses the internal metastore.

Steps to add a new data source:

  1. Click the + to add a new data source in Mission Control.
  2. Fill in the basic information.
  3. In the Metastore connection, choose Internal Metastore and fill in the ‘starburst-metastore’ Azure DB credentials.
  4. In the S3 connection to a dataset, choose the type of access: Secret Keys, IAM roles, or Default credentials. For Secret Keys access, an Access Key ID and Secret Access Key have to be created for the S3 location (a CLI sketch follows after this list).
  5. In the Security tab, choose the access type, ‘Read-Only access’ for example.
  6. In the Advanced tab, choose the ‘Enable writing to external Hive tables’ option.
  7. Click Create/Update.
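If you need a dedicated key pair for step 4, one way (assuming an IAM user already exists with access to the bucket; the user name is a placeholder) is:

aws iam create-access-key --user-name starburst-demo-user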

Steps to Start Cluster:

  1. Click Run; the cluster should move to a Running status and show something like below:

Steps to Query S3 data source:

  1. Connect your Starburst Presto cluster to your favorite SQL database management tool, such as DbVisualizer, and start querying your S3 dataset as below [refer to the JSON structure in the setup section]:

create table jsondata (
  data varchar
)
with (
  format = 'TEXTFILE',
  external_location = 's3://<bucket-name>/'
);

create view weather_data_vw AS
SELECT
  cast(json_extract_scalar(data, '$.latitude') AS DECIMAL(37,4)) AS latitude,
  cast(json_extract_scalar(data, '$.longitude') AS DECIMAL(37,4)) AS longitude,
  cast(json_extract_scalar(data, '$.timezone') AS VARCHAR) AS timezone,
  cast(json_extract_scalar(data, '$.currently.time') AS DECIMAL) AS time,
  cast(json_extract_scalar(data, '$.currently.temperature') AS DECIMAL(37,2)) AS temperature,
  cast(json_extract_scalar(data, '$.currently.apparentTemperature') AS DECIMAL(37,2)) AS apparentTemperature,
  cast(json_extract_scalar(data, '$.currently.humidity') AS DECIMAL(37,2)) AS humidity,
  cast(json_extract_scalar(data, '$.currently.windSpeed') AS DECIMAL(37,2)) AS windSpeed
from jsondata;

SELECT * from weather_data_vw;
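As a quick sanity check on the view, an illustrative aggregation (not from the original post) could look like this:

SELECT timezone,
  avg(temperature) AS avg_temperature,
  max(windSpeed) AS max_wind_speed
FROM weather_data_vw
GROUP BY timezone
ORDER BY avg_temperature DESC;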

Launching the Starburst Cluster Overview, the web console for monitoring query execution plans and results:

> kubectl get svc

This time, take the external IP of the load balancer attached to the Presto coordinator service.

Open a browser window and point it to the following URL:

http://<load balancer ip>:8080 [username/password :: admin/no password]

The Cluster Overview console is a great tool for monitoring query execution logs, currently running queries, and active worker counts, sorting results, and validating live query plans.

Click on the Query ID to view Live Query plans and loads of other useful information.

Connecting Looker to Starburst Presto

To enable access from the Looker application into our Starburst Presto cluster, we must “whitelist” the Looker IP addresses.

More info: https://docs.looker.com/setup-and-management/enabling-secure-db

Now we are ready to connect to Presto.

Under the Admin -> Database -> Connections section, create a new connection and choose PrestoDB as the database. Enter the public IP address of your Presto coordinator along with the port.

Developing LookML

Under “Develop -> (your Project)” you should see the tables appear for writing LookML Models and Views (In this case “weather_data_vw”)

Note: If the data in S3 is in JSON format, it might be helpful if each field is mapped in the Presto table for easier modeling in LookML.

Time values represented as epoch seconds (e.g., 1593471785) might need to be cast to a ‘number’ or converted to a ‘timestamp’ before they can be used to sort the data chronologically in Looker; otherwise, Looker may not recognize them as date/time fields.
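For example, in Presto SQL an epoch-seconds value can be converted with from_unixtime (shown here against the jsondata table created earlier):

SELECT from_unixtime(cast(json_extract_scalar(data, '$.currently.time') AS BIGINT)) AS observed_at
FROM jsondata;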

Querying sample data

Select the dimensions and measures on the left panel and click ‘Run’ to query the data:

Explore the data in Looker before creating a dashboard by selecting multiple dimensions and measures with sorting and filters:

Under the Browse tab, create a dashboard by adding multiple tiles with the views created in LookML:

To summarize what we learned…

Starburst Enterprise Presto [SEP] provides simplified setup options to get started quickly, compared to a manual client-server installation of open-source Presto. In addition, what we like is the flexibility in deployment options: Kubernetes cluster deployment on all major cloud providers, DockerHub container image installation, and the marketplace offerings for easy installation on AWS and Azure.

SEP provides tech support for large customers in production, numerous built-in connectors, and enterprise subscription add-ons like ODBC, Ranger, Sentry, Oracle, Teradata, K8s, and an enhanced cost-based optimizer. This blog post has not explored the performance aspects of Starburst Presto; that will be covered in an upcoming post.

Some of the notable challenges include:

  1. The sizing of the VMs cannot be determined upfront in order to estimate running costs. Although auto-scaling is supported to handle unexpected loads, the cost to maintain the VMs may increase.
  2. As there is no data storage or caching layer, every Presto query against object storage like AWS S3 may incur data transfer-out costs (though these are usually nominal).
  3. Support and maintenance of the clusters, such as security patches and updates, has to be considered.
  4. As data is read directly from heterogeneous sources, there is no curation process, which adds complexity when creating mapping tables. Changes to source data fields may break existing Presto SQL queries and force them to be rewritten.
  5. SQL expertise is required in the Presto environment to write performant queries. For example, a Looker query was taking too long with an ORDER BY clause on a large result set.

The views above are limited to SEP installation, using SEP for direct queries on AWS S3, and consuming the results in Looker views.

Competition and Pricing

AWS Athena provides a ‘managed’ Presto SQL query engine where you pay only for the queries you run, without having to take on the responsibility of managing the Presto environment. However, if you are a data-driven organization where a group of data scientists and data analysts will be running cross-source SQL queries across multiple data sources on a daily basis, then Athena becomes a costly affair compared to running your own enterprise production-grade Starburst Presto cluster environment. A cost factor analysis should be done to evaluate the long-term cost of ownership of these two competing products.

Refer to this table for per-node monthly pricing of a production-grade Starburst Presto environment:

[Recommended Configuration for PRODUCTION Starburst Presto Kubernetes Cluster Node]

* Per-node pricing; the node count is determined based on load. The minimum node count is 2, and the recommended minimum for production is 5.

Closing Thoughts

This blog post covers how to set up and run Starburst Enterprise Presto to run SQL queries directly over S3 data sources, along with minimal sizing requirements and the pricing involved. We have also shown how to connect Starburst Presto with Looker to create quick visualizations over the source data and analyze data patterns. This gives data analysts an advantage: quick data insights without waiting on the ETL usually performed by their data engineering counterparts. Presto may not heavy-lift an entire OLAP workload, but it can co-exist with a data warehouse and take over querying data that doesn't necessarily need to be moved into a full-fledged warehouse first!

Need Help with Your Cloud Initiatives?

If you are considering migrating or modernizing data and analytics products and applications in the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our service offerings.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Senthil Anandarangan and Ganesh Hegde are Cloud, Data, and UI Engineers with Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
