Getting started with Presto and Apache Kudu backend on Kubernetes

Sabarish Sasidharan
4 min read · Aug 7, 2020


Here we will see how to quickly get started with Apache Kudu and Presto on Kubernetes.

Architecture

Introduction

We will do this in three parts:

  • Install Kudu
  • Install Presto
  • Connect Presto to Kudu

If you just want the Helm charts, they are here.

Installing Kudu

I used the Helm chart from the Apache Kudu GitHub repository. It mostly worked.

git clone https://github.com/apache/kudu
cd kudu/kubernetes/helm

I had to make some changes so that it would fit in the Kubernetes cluster on my MacBook. These are the resource limits that worked for me:

Master: 200m cpu/256Mi memory
Tablet server: 200m cpu/756Mi memory

Also remember to update the storage class to whatever your cluster's default storage class is.
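
If you are not sure which storage class is the default, it is the one marked `(default)` in the `kubectl get storageclass` listing. Here is a minimal sketch that picks it out, assuming the usual tabular output; `default_storage_class` is a hypothetical helper name, not part of kubectl or the chart.

```shell
# Print the name of the storage class marked "(default)".
# Reads `kubectl get storageclass` output on stdin.
default_storage_class() {
  awk '$2 == "(default)" { print $1; exit }'
}

# Usage against a live cluster:
#   kubectl get storageclass | default_storage_class
```

The printed name is what goes into the chart's storage class setting.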

Note that the Kudu Helm chart provisions three disks for each tablet server but configures only one of them for use. Ideally, we would use different disks for the WAL, the data, and the metadata.
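
Kudu takes these locations through the `--fs_wal_dir`, `--fs_data_dirs` and `--fs_metadata_dir` flags. The sketch below only builds the flag string. The three mount paths are placeholders, `kudu_fs_flags` is a hypothetical helper name, and where the flags get passed depends on the chart's templates, which I have not verified here.

```shell
# Build tablet-server flags that put the WAL, data and metadata on separate mounts.
# Arguments: WAL dir, data dir, metadata dir.
kudu_fs_flags() {
  printf -- '--fs_wal_dir=%s --fs_data_dirs=%s --fs_metadata_dir=%s\n' "$1" "$2" "$3"
}

# Placeholder paths; replace them with the mount paths of the three provisioned disks.
kudu_fs_flags /mnt/disk0 /mnt/disk1 /mnt/disk2
```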

Finally, we can install Kudu. The following installs it into the default namespace and forwards the master UI to local port 8050:

helm install apache-kudu ./kudu
kubectl port-forward svc/kudu-master-ui 8050:8051

While I was trying different CPU and memory values, the masters kept going up and down in a loop. My guess is that the masters need to come up together to reach each other, so I killed them all at once so they could start in unison.

kubectl delete pods kudu-master-0 kudu-master-1 kudu-master-2 --force --grace-period=0

And voila! The Kudu dashboard is now accessible at http://localhost:8050.

Installing Presto

And now on to Presto (the prestosql distribution).

helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm pull stable/presto --untar

You will want to change the following in values.yaml, as the default memory settings are quite high:

workers: 0
...
query:
  maxMemory: "600MB"
  maxMemoryPerNode: "100MB"
  maxTotalmemoryPerNode: "501MB"
jvm:
  maxHeapSize: "1G"

Setting workers to 0 means that the scheduler also schedules work on the coordinator, so a single pod is enough. I was constrained on resources, so I used this approach.

Also, the liveness and readiness probes were failing, so I commented them out ;-)

helm install presto ./presto
kubectl port-forward $(kubectl get pods|grep presto|awk '{print $1}') 8080:8080

And Presto was up and running. It was not actually this smooth: I had to adjust maxMemory, which depends on the other memory values, and it took some trial and error to even figure out what the error message was saying.
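
To shorten that trial and error, the dependencies can be checked before deploying. As far as I understand Presto's startup validation, the per-node query limit must not exceed the per-node total limit, and the per-node total limit plus a heap headroom must fit within the JVM heap. The sketch below encodes that in MB. `check_presto_memory` is a hypothetical helper, the 30% headroom is an assumption based on the Presto default, and the real check uses the heap the JVM actually reports, so treat the result as approximate.

```shell
# Sanity-check Presto memory settings; all arguments are in MB.
# Arguments: maxMemoryPerNode, maxTotalmemoryPerNode, JVM max heap.
# Assumes heap headroom is 30% of the heap (the Presto default).
check_presto_memory() {
  per_node=$1
  total_per_node=$2
  heap=$3
  headroom=$(( heap * 30 / 100 ))
  if [ "$per_node" -gt "$total_per_node" ]; then
    echo "maxMemoryPerNode exceeds maxTotalmemoryPerNode"
    return 1
  fi
  if [ $(( total_per_node + headroom )) -gt "$heap" ]; then
    echo "maxTotalmemoryPerNode plus headroom (${headroom}MB) exceeds the heap"
    return 1
  fi
  echo "ok"
}

# The values used in values.yaml above: 100MB, 501MB and a 1G (1024MB) heap.
check_presto_memory 100 501 1024
```

With a 1G heap the assumed headroom is 307MB, which leaves roughly 717MB for the per-node total limit, so 501MB fits.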

I could now access the Presto UI at http://localhost:8080. You can use any user name.
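
The `grep presto` lookup used for the port-forward matches every pod with presto in its name, including ones that are still crash-looping. Here is a slightly stricter sketch that only returns a pod in the Running state, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); `running_pod` is a hypothetical helper name.

```shell
# Print the first pod whose name contains the given pattern and whose STATUS is Running.
# Reads `kubectl get pods` output on stdin.
running_pod() {
  awk -v pat="$1" 'index($1, pat) && $3 == "Running" { print $1; exit }'
}

# Usage against a live cluster:
#   kubectl port-forward "$(kubectl get pods | running_pod presto)" 8080:8080
```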

Connecting Presto to Kudu

Next, we configure the Kudu catalog in Presto. We need to add the catalog directory to node.properties, and in that catalog directory we need a properties file with the Kudu master addresses so that Presto can talk to Kudu.

In templates/configmap-coordinator.yaml, add the catalog config dir (and a node id) to node.properties:

data:
  node.properties: |
    node.environment={{ .Values.server.node.environment }}
    node.id=4f81a5ea-3025-4181-b94d-9ebc877adbc6
    node.data-dir={{ .Values.server.node.dataDir }}
    plugin.dir={{ .Values.server.node.pluginDir }}
    catalog.config-dir=/usr/lib/presto/catalog

Then define the following Kudu catalog config, to be mounted later in the Presto coordinator as a volume:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: presto-catalog
  labels:
    app: {{ template "presto.name" . }}
    chart: {{ template "presto.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
    component: coordinator
data:
  kudu.properties: |
    connector.name=kudu
    kudu.client.master-addresses=kudu-master-0.kudu-masters.default.svc.cluster.local:7051,kudu-master-1.kudu-masters.default.svc.cluster.local:7051,kudu-master-2.kudu-masters.default.svc.cluster.local:7051
    kudu.client.default-admin-operation-timeout=300s
    kudu.client.default-operation-timeout=300s
    kudu.client.default-socket-read-timeout=100s
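
The master address list is long and easy to mistype, and it changes if Kudu runs in another namespace or with a different number of masters. Here is a small sketch that generates it, assuming the pod and headless service names follow the `kudu-master-N.kudu-masters` pattern shown above; `kudu_master_addresses` is a hypothetical helper name.

```shell
# Build the comma-separated value for kudu.client.master-addresses.
# Arguments: number of masters, namespace (default: default), RPC port (default: 7051).
kudu_master_addresses() {
  replicas=$1
  namespace=${2:-default}
  port=${3:-7051}
  list=""
  i=0
  while [ "$i" -lt "$replicas" ]; do
    host="kudu-master-$i.kudu-masters.$namespace.svc.cluster.local:$port"
    list="${list:+$list,}$host"   # prepend a comma only after the first entry
    i=$(( i + 1 ))
  done
  echo "$list"
}

# Three masters in the default namespace, as in kudu.properties above.
kudu_master_addresses 3
```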

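Now we need to mount this ConfigMap as a volume in deployment-coordinator.yaml, so that Presto sees kudu.properties in /usr/lib/presto/catalog. Besides the volumeMounts entry shown here, the pod spec needs a catalog-volume entry under volumes that references the presto-catalog ConfigMap.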

          volumeMounts:
            - mountPath: {{ .Values.server.config.path }}
              name: config-volume
            - mountPath: /usr/lib/presto/catalog
              name: catalog-volume

Now redeploy presto or simply delete the pod.

Test it out

Once the pod is back, you can log in to Presto:

kubectl exec -it $(kubectl get pods|grep presto|awk '{print $1}') -- bash
presto --catalog kudu --schema default

You can now create tables, insert and query.

presto:default> CREATE TABLE rivers (
             ->   name varchar WITH (primary_key = true),
             ->   length_km integer,
             ->   destination varchar WITH (encoding = 'dictionary'),
             ->   discharge_m3 integer WITH (nullable = true)
             -> ) WITH (
             ->   partition_by_hash_columns = ARRAY['name'],
             ->   partition_by_hash_buckets = 16,
             ->   number_of_replicas = 3
             -> );
presto:default> select * from rivers;
    name     | length_km |     destination     | discharge_m3
-------------+-----------+---------------------+--------------
 Amazon      |      6575 | Atlantic Ocean      |       209000
 Ganges      |      2704 | Bay of Bengal       |        12037
 Nile        |      6650 | Mediterranean Ocean |         2800
 Brahmaputra |      3969 | Ganges              |        19800
(4 rows)
Query 20200810_064156_00019_yenh5, FINISHED, 1 node
Splits: 32 total, 32 done (100.00%)
0.22 [4 rows, 164B] [18 rows/s, 752B/s]
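
The session above does not show how the four rows got into the table. One way to load them is a multi-row INSERT passed to the Presto CLI with `--execute`. In this sketch the values are copied from the query output above, and `rivers_insert_sql` is a hypothetical helper that only prints the statement.

```shell
# Print an INSERT statement for the rivers table.
# The rows mirror the SELECT output shown above.
rivers_insert_sql() {
  cat <<'SQL'
INSERT INTO rivers (name, length_km, destination, discharge_m3) VALUES
  ('Amazon', 6575, 'Atlantic Ocean', 209000),
  ('Ganges', 2704, 'Bay of Bengal', 12037),
  ('Nile', 6650, 'Mediterranean Ocean', 2800),
  ('Brahmaputra', 3969, 'Ganges', 19800)
SQL
}

# Usage inside the coordinator pod:
#   presto --catalog kudu --schema default --execute "$(rivers_insert_sql)"
```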

You also get to use your favorite BI tools, like Tableau, on your Kudu data via Presto.
