Marrying Presto with Pinot and visualizing data on Superset.

Sachin Tripathi
Jan 25, 2020 · 4 min read


Recently, Haibo Wang of Uber shared a wonderful article on how they use Presto with Pinot.

From the article:

We engineered a solution that allows Presto’s engine to query Pinot’s data stores in real time, optimized for low query latency. Our new system utilizes the versatile Presto query syntax to allow joins, geo-spatial queries, and nested queries, among other requests. In addition, it enables queries of data in Pinot with a freshness of seconds. With this solution, we further optimized query performance by enabling aggregate pushdown, predicates pushdown, and limit pushdown, which reduces unnecessary data transfer and improves query latency by more than 10x.

This solution enabled greater analytical capabilities for operations teams across Uber. Now, users can fully utilize the flexibility of SQL to represent more complex business metrics, and render query results into a dashboard using in-house tools. This capability has improved our operations efficiency and reduced operations cost.

So I decided to get hands-on and explore Pinot’s basic functionality.

After this, you will be able to:

  • Set up a cluster on GKE
  • Deploy Pinot, which consumes data from a real-time stream such as Kafka
  • Deploy Presto
  • Deploy Superset, which connects to Pinot via Presto
  • Visualize the data in Superset

Prerequisites:

Use Helm to deploy Pinot:

Helm is a tool for managing Charts. Charts are packages of pre-configured Kubernetes resources.

To be able to use Helm 2, its server-side component, Tiller, needs to be installed on your cluster.

helm init --service-account tiller
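On an RBAC-enabled cluster such as GKE, helm init assumes the tiller service account already exists. A minimal sketch of creating it first (binding to cluster-admin is a common quickstart shortcut, not a production recommendation):

```shell
# Create the tiller service account in kube-system (assumes an RBAC-enabled cluster).
kubectl create serviceaccount tiller --namespace kube-system

# Bind it to cluster-admin. Quickstart shortcut only; too broad for production.
kubectl create clusterrolebinding tiller-cluster-rule \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:tiller

# Now install Tiller using that service account.
helm init --service-account tiller
```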

Deploy the Pinot cluster (run this from the Pinot Helm chart directory, since the chart path below is “.”):

helm install --namespace "pinot-quickstart" --name "pinot" .

Check deployment status:

kubectl get all -n pinot-quickstart

Bring up a Kafka cluster for real-time data ingestion:

helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --namespace "pinot-quickstart" --name kafka incubator/kafka

After this step, you should see the required service, statefulset.apps, and pod resources.

You can verify this with:

kubectl get all -n pinot-quickstart

Create Kafka topic:

kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic flights-realtime --create --partitions 1 --replication-factor 1
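To confirm the topic was created, you can list or describe it against the same Zookeeper address:

```shell
# List all topics registered in Zookeeper; flights-realtime should appear.
kubectl -n pinot-quickstart exec kafka-0 -- \
  kafka-topics --zookeeper kafka-zookeeper:2181 --list

# Show partition and replication details for the new topic.
kubectl -n pinot-quickstart exec kafka-0 -- \
  kafka-topics --zookeeper kafka-zookeeper:2181 --describe --topic flights-realtime
```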

Load data into Kafka and create the Pinot schema and table:

kubectl apply -f pinot-realtime-quickstart.yml

Query Pinot data:

Port-forward and open the Pinot query console in your web browser:

./query-pinot-data.sh

Create a StorageClass:

Update the storageClassName field in presto-coordinator.yaml and superset.yaml to reference it.

A StorageClass provides a way for administrators to describe the “classes” of storage they offer.
Each StorageClass contains the fields provisioner, parameters, and reclaimPolicy, which are used when a PersistentVolume belonging to the class needs to be dynamically provisioned.
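As a sketch, a StorageClass for GKE’s persistent-disk provisioner might look like the following. The name pinot-ssd and the pd-ssd disk type are assumptions; use whatever name you then reference from storageClassName:

```shell
# Apply an SSD-backed StorageClass using GKE's GCE persistent-disk provisioner.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pinot-ssd        # hypothetical name; match storageClassName in your YAMLs
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd           # SSD persistent disks; use pd-standard for HDD
reclaimPolicy: Delete    # delete the underlying disk when the PVC is removed
EOF
```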

Deploy Presto with the Pinot plugin:

kubectl apply -f presto-coordinator.yaml

Presto uses the Pinot connector, with Pinot acting as the data store.

Port-forward Presto to run queries:

./presto.sh
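Once the port-forward is up, you can query Pinot through Presto’s pinot catalog. A sketch, assuming the quickstart’s airlinestats table, column names from the quickstart flight dataset, and that the script forwards the coordinator to localhost:8080:

```shell
# Count rows in the Pinot table; the aggregation is pushed down to Pinot.
./presto-cli --server localhost:8080 --catalog pinot --schema default \
  --execute "SELECT COUNT(*) FROM airlinestats"

# Predicate and limit pushdown: only the matching rows leave Pinot.
./presto-cli --server localhost:8080 --catalog pinot --schema default \
  --execute "SELECT FlightNum, Origin, Dest FROM airlinestats WHERE Origin = 'SFO' LIMIT 10"
```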

Deploy Superset:

kubectl apply -f superset.yaml

Set up the admin account (first time only):

kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'export FLASK_APP=superset:app && flask fab create-admin'
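The command above prompts for the account details interactively; flask-appbuilder’s create-admin also accepts them as flags if you prefer a one-liner (the credentials below are placeholders):

```shell
# Non-interactive admin creation; replace the placeholder credentials.
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c \
  'export FLASK_APP=superset:app && flask fab create-admin \
     --username admin --firstname Admin --lastname User \
     --email admin@example.com --password admin'
```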

Initialize Superset (first time only):

kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset db upgrade'
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset init'

Access the Superset UI:

./open-superset-ui.sh

Add a Presto source in Superset

Under Sources → Databases, add a new database:
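Superset connects to Presto through a SQLAlchemy URI. A sketch of what to enter, assuming the Presto coordinator is exposed as a service named presto-coordinator in the pinot-quickstart namespace on port 8080 (adjust host and port to your deployment):

```
presto://presto-coordinator.pinot-quickstart:8080/pinot/default
```

The path segments after the host are the Presto catalog (pinot) and schema (default).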

[Screenshot: database source added]

Add the airlinestats table from this database:


Create a visualization of the table:

[Chart: count of flights]

Hopefully this overview has helped you get started with Pinot and Presto.

Special thanks to Xiang Fu and Kishore Gopalakrishna for their help.

If you found this helpful please share it on your favorite social media so other people can find it, too. 👏

I write about distributed systems, Python, Docker, data science, life lessons, and more. If any of that is of interest to you, read more here and follow me on LinkedIn and YouTube.

