Apache Kylin on Kubernetes

Gopi Kannedhara
Aug 2, 2020


Introduction:

Kylin is currently one of the leading open-source OLAP tools for big data. It shines when query results need to be returned within milliseconds. I am not going to talk about its features in this blog; if you are interested in Kylin's features, please check out our colleague's article: https://www.tigeranalytics.com/blog/apache-kylin-architecture

Problem with the existing cluster:

  1. When multiple cube builds run at the same time, the load on the job server increases and can cause it to break down
  2. Running the Hadoop, Spark, and HBase services on the same cluster as Kylin can cause node failures or slow query performance
  3. Making Kylin highly available against node failures is difficult and requires downtime
  4. Load balancing queries is difficult when multiple users access the query server, which can lead to slowness

Why Kubernetes:

  1. Auto-scaling the job pods reduces the load on the job server and improves performance.
  2. Kylin runs as a client process in Kubernetes, so there is less burden on the Hadoop cluster.
  3. With a replica set and a Memcached pod, downtime can be avoided.
  4. Auto-scaling the query servers balances the load from multiple users.
  5. Memcached keeps the query cache fast and reliable.

How it works:

In this architecture, the Kylin client is installed in Kubernetes, while the Hadoop, Hive, HBase, and Spark services run on a Hadoop cluster in distributed mode. When a Kylin job is submitted from a Kubernetes pod as a client process, the request goes to the Hadoop cluster and the job is executed there in distributed mode. After the job completes, the response is returned to the Kylin client process running in the Kubernetes pod.

Architecture:

Our Hadoop cluster is Google Dataproc and the Kubernetes cluster is GKE.

Hadoop Cluster:

We wanted our Hadoop cluster to be stateless, so GCS / S3 is used as the underlying storage for HBase.

The benefits of going with this approach:

1. To handle autoscaling when multiple jobs run in parallel.
— With HDFS as storage, autoscaling the data nodes can frequently push the cluster into the “NameNode Safe Mode” error.
2. To handle the failover scenario when one or more nodes go down.
— A replication factor of 3 can cope with a single node failure, but multiple node failures can still lead to data loss.

Build your own Kylin-Client Image:

Building a Docker image of the Kylin client is the key to the entire process. As part of the image build, the Hadoop, Hive, HBase, Spark, ZooKeeper, and Kylin clients should all be installed with mutually compatible versions. Without further ado, let's have a look at the Dockerfile.

FROM centos:7.3.1611

MAINTAINER Gopi

WORKDIR /tmp

# install jdk and other commands
RUN set -x \
&& yum install -y which \
java-1.8.0-openjdk \
java-1.8.0-openjdk-devel \
krb5-workstation \
&& yum clean all

# version variables
ENV HADOOP_VERSION=2.10.0
ENV HIVE_VERSION=2.3.7
ENV HBASE_VERSION=1.5.0
ENV SPARK_VERSION=2.4.5
ENV ZK_VERSION=3.4.14

ARG APACHE_HOME=/usr/lib

RUN set -x \
&& mkdir -p $APACHE_HOME

RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && curl -O http://apachemirror.wuchna.com/hive/hive-${HIVE_VERSION}/apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/hbase/${HBASE_VERSION}/hbase-${HBASE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/zookeeper/zookeeper-${ZK_VERSION}/zookeeper-${ZK_VERSION}.tar.gz)

ENV JAVA_HOME /etc/alternatives/jre

#install hive
ENV HIVE_HOME=$APACHE_HOME/hive
RUN (cd $APACHE_HOME && tar -zxvf apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && rm -r apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN set -x && ln -s $APACHE_HOME/apache-hive-${HIVE_VERSION}-bin $HIVE_HOME

# install hadoop
ENV HADOOP_HOME=$APACHE_HOME/hadoop
RUN (cd $APACHE_HOME && tar -zxvf hadoop-${HADOOP_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && rm -r hadoop-${HADOOP_VERSION}.tar.gz)
RUN set -x && ln -s $APACHE_HOME/hadoop-${HADOOP_VERSION} $HADOOP_HOME
RUN (rm $HADOOP_HOME/etc/hadoop/core-site.xml )
RUN (rm $HADOOP_HOME/etc/hadoop/hdfs-site.xml )
RUN (rm $HADOOP_HOME/etc/hadoop/yarn-site.xml )

#install hbase
ENV HBASE_HOME=$APACHE_HOME/hbase
RUN (cd $APACHE_HOME && tar -zxvf hbase-${HBASE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && rm -r hbase-${HBASE_VERSION}-bin.tar.gz)
RUN set -x && ln -s $APACHE_HOME/hbase-${HBASE_VERSION} $HBASE_HOME

#install spark
ENV SPARK_HOME=$APACHE_HOME/spark
RUN (cd $APACHE_HOME && tar -zxvf spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN (cd $APACHE_HOME && rm -r spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN set -x && ln -s $APACHE_HOME/spark-${SPARK_VERSION}-bin-hadoop2.7 $SPARK_HOME

#install zk
ENV ZK_HOME=$APACHE_HOME/zookeeper
RUN (cd $APACHE_HOME && tar -zxvf zookeeper-${ZK_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && rm -r zookeeper-${ZK_VERSION}.tar.gz)
RUN set -x && ln -s $APACHE_HOME/zookeeper-${ZK_VERSION} $ZK_HOME

ENV PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZK_HOME/bin

ARG USER=apache_kylin
ENV USER_HOME=/usr/${USER}
ENV KYLIN_VERSION=3.0.2
ENV HADOOP_CONF_HOME=$HADOOP_HOME/conf
ENV HIVE_CONF_HOME=$HIVE_HOME/conf
ENV HBASE_CONF_HOME=$HBASE_HOME/conf
ENV KYLIN_HOME=$USER_HOME/kylin
ENV KYLIN_HADOOP_CONF_HOME=$KYLIN_HOME/hadoop-conf

RUN set -x \
&& mkdir -p $KYLIN_HOME && mkdir -p $KYLIN_HADOOP_CONF_HOME

RUN (cd $KYLIN_HOME && curl -O https://archive.apache.org/dist/kylin/apache-kylin-${KYLIN_VERSION}/apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && tar -zxvf apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && rm -r apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && cp -r $KYLIN_HOME/apache-kylin-${KYLIN_VERSION}-bin-hbase1x/* .)
RUN (cd $KYLIN_HOME && rm -r $KYLIN_HOME/apache-kylin-${KYLIN_VERSION}-bin-hbase1x)

#Required jars for memcached functionality

RUN (cd $KYLIN_HOME/tomcat/lib && curl -O https://repo1.maven.org/maven2/de/javakaffee/msm/memcached-session-manager-tc7/2.1.1/memcached-session-manager-tc7-2.1.1.jar )
RUN (cd $KYLIN_HOME/tomcat/lib && curl -O https://repo1.maven.org/maven2/de/javakaffee/msm/memcached-session-manager/2.1.1/memcached-session-manager-2.1.1.jar )

# hadoop-gcs connector jar to connect to GCS from Hadoop; GCS is the underlying storage for hive tables
RUN curl -O https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
RUN cp gcs-connector-hadoop2-latest.jar $HADOOP_HOME/share/hadoop/common/lib/

#copy hbase*.jar to spark/lib
RUN ln -s $HBASE_HOME/lib/hbase* $SPARK_HOME/jars/

RUN ln -s $SPARK_HOME $KYLIN_HOME/spark

#add libsnappy.so native library, needed if the hadoop distribution doesn't have it by default
RUN (yes | yum install snappy snappy-devel)
RUN ln -s /usr/lib64/libsnappy.so $HADOOP_HOME/lib/native/libsnappy.so
RUN ln -s /usr/lib64/libsnappy.so.1 $HADOOP_HOME/lib/native/libsnappy.so.1

# install system tools
RUN set -x \
&& yum install -y openssh-clients \
cronie \
unzip \
sudo \
net-tools \
iftop \
tcpdump \
perf \
telnet \
bind-utils \
&& yum clean all

RUN set -x \
&& groupadd -r $USER \
&& useradd -r -m -g $USER $USER -d $USER_HOME \
&& echo "$USER ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

RUN chown -R $USER:$USER $KYLIN_HOME

RUN set -x \
&& unzip -qq $KYLIN_HOME/tomcat/webapps/kylin.war -d $KYLIN_HOME/tomcat/webapps/kylin \
&& chown -R $USER:$USER $KYLIN_HOME/tomcat/webapps/kylin \
&& rm $KYLIN_HOME/tomcat/webapps/kylin.war \
&& ln -s $HADOOP_CONF_HOME/core-site.xml $KYLIN_HADOOP_CONF_HOME/core-site.xml \
&& ln -s $HADOOP_CONF_HOME/hdfs-site.xml $KYLIN_HADOOP_CONF_HOME/hdfs-site.xml \
&& ln -s $HADOOP_CONF_HOME/yarn-site.xml $KYLIN_HADOOP_CONF_HOME/yarn-site.xml \
&& ln -s $HADOOP_CONF_HOME/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml \
&& ln -s $HADOOP_CONF_HOME/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml \
&& ln -s $HADOOP_CONF_HOME/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml \
&& ln -s $HADOOP_CONF_HOME/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml \
&& ln -s $HIVE_CONF_HOME/hive-site.xml $KYLIN_HADOOP_CONF_HOME/hive-site.xml \
&& ln -s $HBASE_CONF_HOME/hbase-site.xml $KYLIN_HADOOP_CONF_HOME/hbase-site.xml \
&& chown -R $USER:$USER $KYLIN_HADOOP_CONF_HOME

ENV TOOL_HOME=$USER_HOME/bin
RUN set -x \
&& mkdir -p $TOOL_HOME
COPY bin $TOOL_HOME
COPY crontab.txt /tmp/crontab.txt

RUN /usr/bin/crontab -u $USER /tmp/crontab.txt \
&& chmod 777 $TOOL_HOME/* && chmod 777 $KYLIN_HOME/*
EXPOSE 7070

# Cleanup
RUN rm -rf /tmp/*
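
Once the image builds cleanly, push it to a registry that the GKE cluster can pull from. A minimal sketch, assuming Container Registry; the project ID and tag below are placeholders, not the actual setup:

docker build -t gcr.io/<your-project>/kylin-client:3.0.2 .
docker push gcr.io/<your-project>/kylin-client:3.0.2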


Connection Between Hadoop Cluster-Kylin Client:

To make sure jobs get submitted to the Hadoop cluster, all of the cluster configuration files [core-site.xml, hdfs-site.xml, hive-site.xml, mapred-site.xml, hbase-site.xml, yarn-site.xml] should be copied to the respective conf folders: Hadoop's conf, Hive's conf, HBase's conf, and the $KYLIN_HOME/hadoop-conf folder.
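
One way to make these files available to the pods is through the ConfigMaps referenced in the deployment YAML further below. A rough sketch, assuming the config files have already been copied (for example with gcloud compute scp from the Dataproc master) into local hadoop-conf, hive-conf, and hbase-conf directories:

kubectl create namespace kylin
kubectl create configmap hadoop-conf -n kylin \
  --from-file=hadoop-conf/core-site.xml \
  --from-file=hadoop-conf/hdfs-site.xml \
  --from-file=hadoop-conf/yarn-site.xml \
  --from-file=hadoop-conf/mapred-site.xml
kubectl create configmap hive-conf -n kylin --from-file=hive-conf/hive-site.xml
kubectl create configmap hbase-conf -n kylin --from-file=hbase-conf/hbase-site.xml

The kylin-job-conf, tomcat-conf, and kylin-more-conf ConfigMaps used later can be created the same way from kylin.properties, server.xml, and the remaining Kylin config files.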

To establish the connection between the Hadoop cluster and the Kylin client pods, add the hostname and IP of every Hadoop cluster node to the /etc/hosts file of the pods. This can be done with hostAliases entries in the deployment YAML file.
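
For a Dataproc cluster, the node names and internal IPs needed for those hostAliases entries can be listed like this (assuming the gcloud CLI is configured; the name filter is a placeholder):

gcloud compute instances list \
  --filter="name ~ ^<dataproc-cluster-name>" \
  --format="table(networkInterfaces[0].networkIP, name)"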

kylin.properties:

Make your own kylin.properties file with a few details as below:

kylin.cache.memcached.hosts=memcached-service-name:11211 [otherwise Memcached will not be used]
kylin.query.cache-signature-enabled=true
kylin.query.lazy-query-enabled=true
kylin.metrics.memcached.enabled=true
kylin.query.segment-cache-enabled=true
kylin.env.hdfs-working-dir=hdfs://namenode/kylin [otherwise Kylin will try to store data in the pod's local storage]
kylin.server.mode=[job or query, based on your pod's functionality]
kylin.server.cluster-servers=kylin-job-service-name,kylin-query-service-name [otherwise metadata will not be in sync]
kylin.storage.hbase.compression-codec=snappy [by default the sequence file format is used, which can lead to performance issues]
kylin.env.hadoop-conf-dir=$KYLIN_HOME/hadoop-conf [needed to connect the Kylin client to the Hadoop cluster; it should be the same path inside the kylin-client image. Please keep an eye on this]
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster [client mode gives an error because the Hadoop cluster doesn't recognise the pod's hostname]
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec [otherwise there can be a conflict, as we have used snappy for kylin.storage.hbase.compression-codec]

Tomcat Server.xml:

<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
memcachedNodes="kylin-memcached-service-name:11211"
storageKeyPrefix="context"
requestUriIgnorePattern=".*\.(ico|png|gif|jpg|css|js)$"
/>

The above configuration also covers the failover scenario for Memcached pods: multiple Memcached nodes can be listed in memcachedNodes.
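
A minimal way to stand up that Memcached service in the kylin namespace; the image and object names here are assumptions, not the exact ones used in this setup:

kubectl create deployment memcached -n kylin --image=memcached
kubectl expose deployment memcached -n kylin --port=11211 --name=memcached-service-name

Whichever service name you choose must match kylin.cache.memcached.hosts and the memcachedNodes value in server.xml above.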

Core-site.xml:

<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/access.json</value>
</property>

The above parameters are different for AWS.
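
For GCP, the keyfile referenced above has to exist inside the pod; in the deployment below it is mounted from a Kubernetes secret. A hedged sketch of creating that secret (the secret name matches the deployment YAML, the local file path is a placeholder):

kubectl create secret generic secretofjsonfile -n kylin --from-file=access.json=./access.json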

Folder structure to follow:

Note: Templates for Kubernetes deployment can be found in Kylin’s GitHub repo

Sample Deployment file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kylin-job
  namespace: kylin
spec:
  serviceName: kylin-svc
  replicas: 1
  selector:
    matchLabels:
      app: kylin
      role: job
  template:
    metadata:
      labels:
        app: kylin
        role: job
    spec:
      hostAliases:
      - ip: cluster-master
        hostnames:
        - host-name
      - ip: cluster-worker-1
        hostnames:
        - host-name
      - ip: cluster-worker-2
        hostnames:
        - host-name
      - ip: cluster-worker-3
        hostnames:
        - host-name
      containers:
      - name: kylin
        image: kylin-client
        imagePullPolicy: Always
        command:
        - sh
        - -c
        args:
        - cp $KYLIN_HOME/tomcat-conf/* $KYLIN_HOME/tomcat/conf;
          cp $KYLIN_HOME/kylin-more-conf $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/classes/;
          $TOOL_HOME/bootstrap.sh server -d;
        ports:
        - containerPort: 7070
        - containerPort: 7443
        volumeMounts:
        - name: kylin-job-conf
          mountPath: /usr/apache_kylin/kylin/conf
        - name: tomcat-conf
          mountPath: /usr/apache_kylin/kylin/tomcat-conf
        - name: kylin-more-conf
          mountPath: /usr/apache_kylin/kylin/kylin-more-conf
        - name: hadoop-conf
          mountPath: /usr/lib/hadoop/conf
        - name: hive-conf
          mountPath: /usr/lib/hive/conf
        - name: hbase-conf
          mountPath: /usr/lib/hbase/conf
        - name: kylin-logs
          mountPath: /usr/apache_kylin/kylin/logs
        - name: tomcat-logs
          mountPath: /usr/apache_kylin/kylin/tomcat/logs
        - name: secretofjsonfile
          mountPath: secretpath
        resources:
          requests:
            memory: 5Gi
            cpu: 1
          limits:
            memory: 5Gi
            cpu: 1
      volumes:
      - name: secretofjsonfile
        secret:
          secretName: secretofjsonfile
      - name: kylin-logs
        emptyDir:
          sizeLimit: 20Gi
      - name: tomcat-logs
        emptyDir:
          sizeLimit: 10Gi
      - name: hadoop-conf
        configMap:
          name: hadoop-conf
      - name: hive-conf
        configMap:
          name: hive-conf
      - name: hbase-conf
        configMap:
          name: hbase-conf
      - name: kylin-job-conf
        configMap:
          name: kylin-job-conf
      - name: tomcat-conf
        configMap:
          name: tomcat-conf
      - name: kylin-more-conf
        configMap:
          name: kylin-more-conf
  updateStrategy:
    type: RollingUpdate

Sample service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kylin-svc
  namespace: kylin
spec:
  ports:
  - name: http
    port: 80
    targetPort: 7070
  - name: https
    port: 443
    targetPort: 7443
  selector:
    app: kylin
    role: job   # to connect to the kylin job pod
  type: LoadBalancer

The above load balancer is assigned an external IP once the deployment completes, and Kylin can then be accessed at http://<ExternalIP>/kylin
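
For reference, a hedged sketch of applying the manifests and watching for the external IP; the file names below are placeholders:

kubectl apply -n kylin -f kylin-job-statefulset.yaml -f kylin-job-service.yaml
kubectl get svc kylin-svc -n kylin -w   # wait for EXTERNAL-IP to change from <pending> to an address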

A query pod can be created by replacing job with query in the deployment and service YAML files.
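
If the query StatefulSet is named kylin-query, a rough sketch of enabling the query-server autoscaling mentioned earlier (this assumes metrics-server is available in the GKE cluster and that CPU requests are set as in the deployment above):

kubectl autoscale statefulset kylin-query -n kylin --cpu-percent=70 --min=1 --max=4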

Issues to look out for:

Version compatibility: the component versions in the Kylin client image and on the Hadoop cluster should be the same, otherwise compatibility issues may arise.

HBase jars dependency: the HBase dependency jars should be copied (or symlinked) to Spark's jars folder, otherwise dependency issues may arise. The command is added in the Dockerfile.

Snappy issue 1: when snappy compression is used, the Hadoop distribution should have libsnappy.so in $HADOOP_HOME/lib/native. The command is added in the Dockerfile.
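
To double-check that the Hadoop client inside a pod can actually see the native library, the standard checknative tool can be run in the container (the pod name below assumes the kylin-job StatefulSet with a single replica):

kubectl exec -it kylin-job-0 -n kylin -- hadoop checknative -a | grep -i snappy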

Snappy issue 2: when the cube is built with Spark, the output should be compressed with snappy. The following properties are required:

kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec

GCS connector issue: when GCS is used as the storage for Hive tables, the Hadoop-GCS connector is required. We have added the dependency under $HADOOP_HOME/share/hadoop/common/lib/. To access GCS from Hive, the job/query pod should have the access.json keyfile [GCP] or the secret_key and access_key [AWS] configured in core-site.xml.
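
A quick sanity check that the connector and credentials work from inside a pod (the bucket name is a placeholder):

kubectl exec -it kylin-job-0 -n kylin -- hadoop fs -ls gs://<your-bucket>/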

Hive intermediate table not found: when the cube build step runs on Spark, a “hive intermediate table not found” issue can occur. This is due to a bug in the Kylin code and is resolved in versions 3.0.2 and 2.6.0.

Copy the Spark folder under Kylin: copying (or symlinking) the entire Spark folder into $KYLIN_HOME is required, otherwise YARN can't find the dependency jars on the Hadoop cluster in Spark cluster deploy mode.

Metadata sync issues between the job pod and the query pod: when we built a segment in the job pod and immediately queried it from the query pod, the data was not available. This is due to lag in the metadata sync. To resolve it, list the job service name and the query service name in the cluster servers property:

kylin.server.cluster-servers=kylin-job-service-name,kylin-query-service-name


Happy to hear your feedback or questions!
