CDAP in Kubernetes Deployment Guide

Terence Yim
Jul 8

CDAP Operator is a new open source project that enables easy deployment of CDAP in a Kubernetes cluster. In this blog, we will walk through the steps to deploy CDAP in a Minikube setup. Similar steps can be applied to deploying CDAP in any Kubernetes cluster.

Start Minikube

Since we are going to demonstrate running on Minikube, install Minikube before proceeding with the CDAP setup. After installing Minikube, start it with enough CPUs and memory for the VM to provide a stable environment for hosting all the necessary resources. We recommend using at least 4 CPUs and 8 GB of memory:

$ minikube start --cpus 4 --memory 8192

After Minikube starts, run the following commands to fix a service routing issue in Minikube:

$ minikube ssh
$ sudo ip link set docker0 promisc on
$ exit

Install CDAP Controller

To install the CDAP Controller, first check out the CDAP operator project from GitHub:

$ git clone https://github.com/cdapio/cdap-operator.git

After cloning the CDAP operator project, install the CDAP CRD and RBAC needed by the controller, followed by deploying the CDAP controller inside the “system” namespace.

$ kubectl apply -f cdap-operator/config/crds
$ kubectl apply -f cdap-operator/config/default/rbac
$ kubectl apply -f cdap-operator/config/default/manager

You can verify the controller is running correctly by getting the pod status:

$ kubectl get pod --namespace=system
NAME                READY   STATUS    RESTARTS   AGE
cdap-controller-0   1/1     Running   0          30s
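
Rather than polling the pod status by hand, you can let kubectl block until the pod reports Ready. The following is a minimal sketch, assuming the controller pod is named cdap-controller-0 as in the output above. The function name wait_for_controller and the default 120 second timeout are illustrative choices, not part of the CDAP operator.

```shell
# Block until the controller pod is Ready, or fail once the timeout expires.
# Assumption: the pod is named cdap-controller-0 in the "system" namespace.
# The optional first argument overrides the default timeout of 120s.
wait_for_controller() {
  kubectl wait --for=condition=Ready pod/cdap-controller-0 \
    --namespace=system --timeout="${1:-120s}"
}
```

For example, calling wait_for_controller 300s waits up to five minutes and returns a non-zero exit code if the pod never becomes ready.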

Install Supporting Services

Skip this section if the services mentioned below are already available to use.

Before starting CDAP in Kubernetes, we need to set up a couple of supporting services. We are going to use PostgreSQL as the storage engine, and Elasticsearch as the metadata search backend. CDAP also needs a distributed file system for sharing files among pods, such as HDFS, GCS, S3, or Azure Blob Storage. To simplify configurations and dependencies, we are going to start a single node Apache Hadoop inside Minikube to provide the HDFS service.

Installing PostgreSQL

We use a Helm chart to install PostgreSQL in Minikube. If you don’t have Helm, follow the Helm installation guide.

To install PostgreSQL using Helm, run the following commands:

$ helm init
$ helm install --name postgres stable/postgresql --set postgresqlPassword=secretpass,postgresqlDatabase=cdap

Verify that PostgreSQL is running and already has a database called “cdap”.
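
One way to run this check is to list the databases from inside the PostgreSQL pod. This is a sketch under assumptions: the pod name postgres-postgresql-0 is our guess at what the chart creates for the release name postgres (confirm it with kubectl get pods), and has_database is a hypothetical helper name. The password is read from the postgres-postgresql secret created by the chart.

```shell
# Succeeds if the named database appears in the output of "psql -l".
# Assumption: the PostgreSQL pod is called postgres-postgresql-0.
has_database() {
  # Read the password that the Helm chart stored in the release secret.
  db_password=$(kubectl get secret postgres-postgresql \
    -o 'jsonpath={.data.postgresql-password}' | base64 --decode)
  # -l lists databases, -q is quiet, -t prints rows only (no header).
  # The first column is the database name; match it exactly.
  kubectl exec postgres-postgresql-0 -- \
    env PGPASSWORD="$db_password" psql -U postgres -lqt \
    | cut -d '|' -f 1 | tr -d ' ' | grep -qx "$1"
}
```

Calling has_database cdap returns a zero exit code when the database exists, so it can be combined with a message, for example has_database cdap && echo "cdap database found".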

Installing Elasticsearch

We also use a Helm chart to install Elasticsearch in Minikube.

$ helm repo add elastic https://helm.elastic.co
$ helm install --name elasticsearch elastic/elasticsearch --version 6.5.3-alpha1 --set replicas=1 --set minimumMasterNodes=1 --set resources.requests.memory=500Mi

Installing Single Node Apache Hadoop

To start a single node Hadoop, deploy a pod with a single container using the Hadoop Docker image and expose the HDFS service.

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: hadoop
  labels:
    app: hadoop
spec:
  containers:
  - name: hadoop
    image: sequenceiq/hadoop-docker:2.7.1
---
apiVersion: v1
kind: Service
metadata:
  name: hadoop
spec:
  selector:
    app: hadoop
  ports:
  - protocol: TCP
    port: 9000
    targetPort: 9000
EOF

Since the Hadoop Docker image is quite large, it will take a while before the pod transitions into the Running state.

$ kubectl get pod/hadoop
NAME     READY   STATUS    RESTARTS   AGE
hadoop   1/1     Running   0          88s

After the pod is running, create an HDFS directory for CDAP to use.

$ kubectl exec -it hadoop -- /usr/local/hadoop/bin/hdfs dfs -mkdir /cdap
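
To confirm that the directory was created, you can ask HDFS to test for it. This is a small sketch that reuses the hadoop pod and the hdfs binary path from the command above; hdfs_dir_exists is a hypothetical helper name.

```shell
# Succeeds if the given HDFS path exists and is a directory.
# "hdfs dfs -test -d" exits with 0 for an existing directory,
# and kubectl exec passes that exit code back to the caller.
hdfs_dir_exists() {
  kubectl exec hadoop -- /usr/local/hadoop/bin/hdfs dfs -test -d "$1"
}
```

For example, hdfs_dir_exists /cdap && echo "HDFS directory is ready".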

Creating CDAP instance

Now we are ready to create our first CDAP instance in Kubernetes.

Create Secret for CDAP

First, we need to set up a secret for storing the PostgreSQL database user name and password.

$ export CDAP_SECURITY=$(cat << EOF | base64 | tr -d '\n'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>data.storage.sql.jdbc.username</name>
    <value>postgres</value>
  </property>
  <property>
    <name>data.storage.sql.jdbc.password</name>
    <value>$(kubectl get secret postgres-postgresql -o 'jsonpath={.data.postgresql-password}' | base64 --decode)</value>
  </property>
</configuration>
EOF
)
$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: cdap-secret
type: Opaque
data:
  cdap-security.xml: $CDAP_SECURITY
EOF
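
The tr -d '\n' step matters because GNU base64 wraps its output at 76 characters by default, and a wrapped value would break the single-line YAML field in the Secret. The sketch below shows the same encoding on a sample string and checks that it decodes back to the original. The helper name encode_secret and the sample value are illustrative only.

```shell
# Encode stdin the same way as the command above:
# base64 with all line breaks removed.
encode_secret() {
  base64 | tr -d '\n'
}

# Round trip on a sample value (illustrative, not a real credential).
sample='<value>secretpass</value>'
encoded=$(printf '%s' "$sample" | encode_secret)
decoded=$(printf '%s' "$encoded" | base64 --decode)
[ "$decoded" = "$sample" ] && echo "round trip ok"
```

To inspect what was actually stored in the cluster, you can decode the secret with kubectl get secret cdap-secret -o 'jsonpath={.data.cdap-security\.xml}' | base64 --decode, where the backslash escapes the dot in the key name.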

Service Account and Cluster Role Binding

CDAP interacts with the Kubernetes API server for service discovery and also for deploying system applications. We need to set up an appropriate cluster role for the service account used by CDAP pods. We create a new service account with the “edit” cluster role to provide better isolation.

$ kubectl create serviceaccount cdap
$ kubectl create clusterrolebinding cdap --clusterrole=edit --serviceaccount=default:cdap
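
You can check that the binding took effect by asking the API server what the service account is allowed to do. This is a minimal sketch using kubectl auth can-i with impersonation; cdap_can is a hypothetical helper name, and the check relies on kubectl returning a non-zero exit code when the answer is "no".

```shell
# Succeeds if the cdap service account in the default namespace
# may perform the given verb on the given resource.
# Usage: cdap_can <verb> <resource>
cdap_can() {
  kubectl auth can-i "$1" "$2" \
    --as=system:serviceaccount:default:cdap >/dev/null
}
```

For example, cdap_can create pods && echo "allowed" should print "allowed" once the cluster role binding exists.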

Deploy a new CDAP instance

A new CDAP instance can be deployed with the following minimal CDAP resource:

$ cat << EOF | kubectl apply -f -
apiVersion: cdap.cdap.io/v1alpha1
kind: CDAPMaster
metadata:
  name: test
spec:
  locationURI: hdfs://hadoop:9000
  serviceAccountName: cdap
  securitySecret: cdap-secret
  config:
    enable.preview: "true"
    data.storage.implementation: postgresql
    data.storage.sql.jdbc.connection.url: jdbc:postgresql://postgres-postgresql:5432/cdap
    data.storage.sql.jdbc.driver.name: org.postgresql.Driver
    metadata.storage.implementation: elastic
    metadata.elasticsearch.cluster.hosts: elasticsearch-master
    hdfs.user: root
EOF

Please refer to the CRD for all the available configurations for the CDAP resource. You can also deploy more than one CDAP instance to the same Kubernetes cluster with different names, or in different namespaces.

It will take some time for CDAP to start up completely, so please be patient. Once the user interface pod is up and running, you can get the UI service URL from Minikube and open it with your browser.

$ minikube service cdap-test-userinterface --url
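
If the command above fails because the service does not exist yet, you can poll for it before asking Minikube for the URL. The following is a sketch only: wait_for_ui_service is a hypothetical helper, and the defaults of 60 attempts with 5 seconds between them are arbitrary. Note that the service existing only means the operator has created it; the user interface pod behind it may still be starting.

```shell
# Poll until the cdap-test-userinterface service exists.
# Usage: wait_for_ui_service [attempts] [seconds_between_attempts]
# Returns 0 once the service is found, 1 if all attempts are used up.
wait_for_ui_service() {
  attempts=${1:-60}
  while [ "$attempts" -gt 0 ]; do
    if kubectl get service cdap-test-userinterface >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${2:-5}"
  done
  return 1
}
```

For example, wait_for_ui_service && minikube service cdap-test-userinterface --url prints the URL as soon as the service is available.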

You can use the UI to monitor when CDAP becomes fully functional. Once it is, you can use CDAP just like before. The only exception is that you need to create a compute profile and use it to run your data applications, since running user applications in Kubernetes is not yet supported by CDAP.

Give it a try and let us know!

CDAP is a 100% open-source framework for building data analytics applications

Written by Terence Yim, Software Engineer. Passionate about distributed systems, big data, and open source software.
