Apache Kylin on Kubernetes

Gopi Kannedhara
Aug 2, 2020


Introduction:

Kylin is currently one of the leading open-source OLAP tools for big data. It shines when query results need to be returned within milliseconds. I am not going to talk about its features in this blog; if you are interested in Kylin's features, please check out our colleague's article: https://www.tigeranalytics.com/blog/apache-kylin-architecture

Problem with the existing cluster:

  1. When multiple cube builds run at the same time, the load on the job server increases and can cause it to break down
  2. Running the Hadoop, Spark, and HBase services on the same cluster as Kylin can cause node failures or slow query performance
  3. Making Kylin highly available against node failures is difficult and requires downtime
  4. Load balancing queries is difficult when multiple users access the query server, which can lead to slowness

Why Kubernetes:

  1. Auto-scaling the job pods reduces the load on the job server and improves performance.
  2. Kylin runs as a client process in Kubernetes, so there is less burden on the Hadoop cluster.
  3. With a replica set and a Memcached pod, downtime can be avoided.
  4. Auto-scaling the query servers balances the load from multiple users.
  5. Memcached keeps the query cache fast and reliable.

How it works:

In this architecture, the Kylin client is installed in Kubernetes, while the Hadoop, Hive, HBase, and Spark services run on a Hadoop cluster in distributed mode. When a Kylin job is submitted from a Kubernetes pod as a client process, the request goes to the Hadoop cluster and the job is executed there in distributed mode. After the job completes, the response is returned to the Kylin client process running in the Kubernetes pod.

Architecture:

Our Hadoop cluster is Google Dataproc and the Kubernetes cluster is GKE.

Hadoop Cluster:

We wanted our Hadoop cluster to be stateless, so GCS / S3 is used as the underlying storage for HBase.

The benefits of going with this approach:

1. To handle autoscaling when multiple jobs run in parallel.
— With HDFS as storage, autoscaling the data nodes can frequently push the cluster into the “NameNode Safe Mode” error.
2. To handle the failover scenario when one or more nodes go down.
— A replication factor of 3 can cope with a single node failure, but multiple node failures can still lead to data loss.

Build your own Kylin-Client Image:

Building a Docker image of the Kylin client is the key to the entire process. As part of the image build, the Hadoop, Hive, HBase, Spark, ZooKeeper, and Kylin clients should all be installed with mutually compatible versions. Without further ado, let's have a look at the Dockerfile.

FROM centos:7.3.1611

MAINTAINER Gopi

WORKDIR /tmp

# install jdk and other commands
RUN set -x \
&& yum install -y which \
java-1.8.0-openjdk \
java-1.8.0-openjdk-devel \
krb5-workstation \
&& yum clean all

# version variables
ENV HADOOP_VERSION=2.10.0
ENV HIVE_VERSION=2.3.7
ENV HBASE_VERSION=1.5.0
ENV SPARK_VERSION=2.4.5
ENV ZK_VERSION=3.4.14

ARG APACHE_HOME=/usr/lib

RUN set -x \
&& mkdir -p $APACHE_HOME

RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && curl -O http://apachemirror.wuchna.com/hive/hive-${HIVE_VERSION}/apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/hbase/${HBASE_VERSION}/hbase-${HBASE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && curl -O https://archive.apache.org/dist/zookeeper/zookeeper-${ZK_VERSION}/zookeeper-${ZK_VERSION}.tar.gz)

ENV JAVA_HOME /etc/alternatives/jre

#install hive
ENV HIVE_HOME=$APACHE_HOME/hive
RUN (cd $APACHE_HOME && tar -zxvf apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && rm -r apache-hive-${HIVE_VERSION}-bin.tar.gz)
RUN set -x && ln -s $APACHE_HOME/apache-hive-${HIVE_VERSION}-bin $HIVE_HOME

# install hadoop
ENV HADOOP_HOME=$APACHE_HOME/hadoop
RUN (cd $APACHE_HOME && tar -zxvf hadoop-${HADOOP_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && rm -r hadoop-${HADOOP_VERSION}.tar.gz)
RUN set -x && ln -s $APACHE_HOME/hadoop-${HADOOP_VERSION} $HADOOP_HOME
RUN (rm $HADOOP_HOME/etc/hadoop/core-site.xml )
RUN (rm $HADOOP_HOME/etc/hadoop/hdfs-site.xml )
RUN (rm $HADOOP_HOME/etc/hadoop/yarn-site.xml )

#install hbase
ENV HBASE_HOME=$APACHE_HOME/hbase
RUN (cd $APACHE_HOME && tar -zxvf hbase-${HBASE_VERSION}-bin.tar.gz)
RUN (cd $APACHE_HOME && rm -r hbase-${HBASE_VERSION}-bin.tar.gz)
RUN set -x && ln -s $APACHE_HOME/hbase-${HBASE_VERSION} $HBASE_HOME

#install spark
ENV SPARK_HOME=$APACHE_HOME/spark
RUN (cd $APACHE_HOME && tar -zxvf spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN (cd $APACHE_HOME && rm -r spark-${SPARK_VERSION}-bin-hadoop2.7.tgz)
RUN set -x && ln -s $APACHE_HOME/spark-${SPARK_VERSION}-bin-hadoop2.7 $SPARK_HOME

#install zk
ENV ZK_HOME=$APACHE_HOME/zookeeper
RUN (cd $APACHE_HOME && tar -zxvf zookeeper-${ZK_VERSION}.tar.gz)
RUN (cd $APACHE_HOME && rm -r zookeeper-${ZK_VERSION}.tar.gz)
RUN set -x && ln -s $APACHE_HOME/zookeeper-${ZK_VERSION} $ZK_HOME

ENV PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZK_HOME/bin

ARG USER=apache_kylin
ENV USER_HOME=/usr/${USER}
ENV KYLIN_VERSION=3.0.2
ENV HADOOP_CONF_HOME=$HADOOP_HOME/conf
ENV HIVE_CONF_HOME=$HIVE_HOME/conf
ENV HBASE_CONF_HOME=$HBASE_HOME/conf
ENV KYLIN_HOME=$USER_HOME/kylin
ENV KYLIN_HADOOP_CONF_HOME=$KYLIN_HOME/hadoop-conf

RUN set -x \
&& mkdir -p $KYLIN_HOME && mkdir -p $KYLIN_HADOOP_CONF_HOME

RUN (cd $KYLIN_HOME && curl -O https://archive.apache.org/dist/kylin/apache-kylin-${KYLIN_VERSION}/apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && tar -zxvf apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && rm -r apache-kylin-${KYLIN_VERSION}-bin-hbase1x.tar.gz)
RUN (cd $KYLIN_HOME && cp -r $KYLIN_HOME/apache-kylin-${KYLIN_VERSION}-bin-hbase1x/* .)
RUN (cd $KYLIN_HOME && rm -r $KYLIN_HOME/apache-kylin-${KYLIN_VERSION}-bin-hbase1x)

#Required jars for memcached functionality

RUN (cd $KYLIN_HOME/tomcat/lib && curl -O https://repo1.maven.org/maven2/de/javakaffee/msm/memcached-session-manager-tc7/2.1.1/memcached-session-manager-tc7-2.1.1.jar )
RUN (cd $KYLIN_HOME/tomcat/lib && curl -O https://repo1.maven.org/maven2/de/javakaffee/msm/memcached-session-manager/2.1.1/memcached-session-manager-2.1.1.jar )

# hadoop-gcs connector jar to connect to GCS from Hadoop; GCS is the underlying storage for hive tables
RUN curl -O https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
RUN cp gcs-connector-hadoop2-latest.jar $HADOOP_HOME/share/hadoop/common/lib/

#copy hbase*.jar to spark/lib
RUN ln -s $HBASE_HOME/lib/hbase* $SPARK_HOME/jars/

RUN ln -s $SPARK_HOME $KYLIN_HOME/spark

#add libsnappy.so native library, needed if the hadoop distribution doesn't have it by default
RUN (yes | yum install snappy snappy-devel)
RUN ln -s /usr/lib64/libsnappy.so $HADOOP_HOME/lib/native/libsnappy.so
RUN ln -s /usr/lib64/libsnappy.so.1 $HADOOP_HOME/lib/native/libsnappy.so.1

# install system tools
RUN set -x \
&& yum install -y openssh-clients \
cronie \
unzip \
sudo \
net-tools \
iftop \
tcpdump \
perf \
telnet \
bind-utils \
&& yum clean all

RUN set -x \
&& groupadd -r $USER \
&& useradd -r -m -g $USER $USER -d $USER_HOME \
&& echo "$USER ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

RUN chown -R $USER:$USER $KYLIN_HOME

RUN set -x \
&& unzip -qq $KYLIN_HOME/tomcat/webapps/kylin.war -d $KYLIN_HOME/tomcat/webapps/kylin \
&& chown -R $USER:$USER $KYLIN_HOME/tomcat/webapps/kylin \
&& rm $KYLIN_HOME/tomcat/webapps/kylin.war \
&& ln -s $HADOOP_CONF_HOME/core-site.xml $KYLIN_HADOOP_CONF_HOME/core-site.xml \
&& ln -s $HADOOP_CONF_HOME/hdfs-site.xml $KYLIN_HADOOP_CONF_HOME/hdfs-site.xml \
&& ln -s $HADOOP_CONF_HOME/yarn-site.xml $KYLIN_HADOOP_CONF_HOME/yarn-site.xml \
&& ln -s $HADOOP_CONF_HOME/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml \
&& ln -s $HADOOP_CONF_HOME/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml \
&& ln -s $HADOOP_CONF_HOME/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml \
&& ln -s $HADOOP_CONF_HOME/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml \
&& ln -s $HIVE_CONF_HOME/hive-site.xml $KYLIN_HADOOP_CONF_HOME/hive-site.xml \
&& ln -s $HBASE_CONF_HOME/hbase-site.xml $KYLIN_HADOOP_CONF_HOME/hbase-site.xml \
&& chown -R $USER:$USER $KYLIN_HADOOP_CONF_HOME

ENV TOOL_HOME=$USER_HOME/bin
RUN set -x \
&& mkdir -p $TOOL_HOME
COPY bin $TOOL_HOME
COPY crontab.txt /tmp/crontab.txt

RUN /usr/bin/crontab -u $USER /tmp/crontab.txt \
&& chmod 777 $TOOL_HOME/* && chmod 777 $KYLIN_HOME/*
EXPOSE 7070

# Cleanup
RUN rm -rf /tmp/*
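
Once the image builds cleanly, push it to a registry that the GKE cluster can pull from. A minimal sketch, assuming Container Registry; the project ID and tag below are placeholders, not the actual setup:

docker build -t gcr.io/<your-project>/kylin-client:3.0.2 .
docker push gcr.io/<your-project>/kylin-client:3.0.2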


Connection Between Hadoop Cluster-Kylin Client:

To make sure jobs get submitted to the Hadoop cluster, all of the cluster configuration files [core-site.xml, hdfs-site.xml, hive-site.xml, mapred-site.xml, hbase-site.xml, yarn-site.xml] should be copied to the respective conf folders: Hadoop's conf, Hive's conf, HBase's conf, and the $KYLIN_HOME/hadoop-conf folder.
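
One way to make these files available to the pods is through the ConfigMaps referenced in the deployment YAML further below. A rough sketch, assuming the config files have already been copied (for example with gcloud compute scp from the Dataproc master) into local hadoop-conf, hive-conf, and hbase-conf directories:

kubectl create namespace kylin
kubectl create configmap hadoop-conf -n kylin \
  --from-file=hadoop-conf/core-site.xml \
  --from-file=hadoop-conf/hdfs-site.xml \
  --from-file=hadoop-conf/yarn-site.xml \
  --from-file=hadoop-conf/mapred-site.xml
kubectl create configmap hive-conf -n kylin --from-file=hive-conf/hive-site.xml
kubectl create configmap hbase-conf -n kylin --from-file=hbase-conf/hbase-site.xml

The kylin-job-conf, tomcat-conf, and kylin-more-conf ConfigMaps used later can be created the same way from kylin.properties, server.xml, and the remaining Kylin config files.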

To establish the connection between the Hadoop cluster and the Kylin client pods, add the hostname and IP of every Hadoop cluster node to the /etc/hosts file of the pods. This can be done with hostAliases entries in the deployment YAML file.
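
For a Dataproc cluster, the node names and internal IPs needed for those hostAliases entries can be listed like this (assuming the gcloud CLI is configured; the name filter is a placeholder):

gcloud compute instances list \
  --filter="name ~ ^<dataproc-cluster-name>" \
  --format="table(networkInterfaces[0].networkIP, name)"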

kylin.properties:

Make your own kylin.properties file with a few details as below:

kylin.cache.memcached.hosts=memcached-service-name:11211 [otherwise Memcached will not be used]
kylin.query.cache-signature-enabled=true
kylin.query.lazy-query-enabled=true
kylin.metrics.memcached.enabled=true
kylin.query.segment-cache-enabled=true
kylin.env.hdfs-working-dir=hdfs://namenode/kylin [otherwise Kylin will try to store data in the pod's local storage]
kylin.server.mode=[job or query, based on your pod's functionality]
kylin.server.cluster-servers=kylin-job-service-name,kylin-query-service-name [otherwise metadata will not be in sync]
kylin.storage.hbase.compression-codec=snappy [by default the sequence file format is used, which can lead to performance issues]
kylin.env.hadoop-conf-dir=$KYLIN_HOME/hadoop-conf [needed to connect the Kylin client to the Hadoop cluster; it should be the same path inside the kylin-client image. Please keep an eye on this]
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster [client mode gives an error because the Hadoop cluster doesn't recognise the pod's hostname]
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec [otherwise there can be a conflict, as we have used snappy for kylin.storage.hbase.compression-codec]

Tomcat Server.xml:

<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
memcachedNodes="kylin-memcached-service-name:11211"
storageKeyPrefix="context"
requestUriIgnorePattern=".*\.(ico|png|gif|jpg|css|js)$"
/>

The above configuration also covers the failover scenario for Memcached pods: multiple Memcached nodes can be listed in memcachedNodes.
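
A minimal way to stand up that Memcached service in the kylin namespace; the image and object names here are assumptions, not the exact ones used in this setup:

kubectl create deployment memcached -n kylin --image=memcached
kubectl expose deployment memcached -n kylin --port=11211 --name=memcached-service-name

Whichever service name you choose must match kylin.cache.memcached.hosts and the memcachedNodes value in server.xml above.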

Core-site.xml:

<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/access.json</value>
</property>

The above parameters are different for AWS.
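
For GCP, the keyfile referenced above has to exist inside the pod; in the deployment below it is mounted from a Kubernetes secret. A hedged sketch of creating that secret (the secret name matches the deployment YAML, the local file path is a placeholder):

kubectl create secret generic secretofjsonfile -n kylin --from-file=access.json=./access.json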

Folder structure to follow:

Note: Templates for Kubernetes deployment can be found in Kylin’s GitHub repo

Sample Deployment file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kylin-job
  namespace: kylin
spec:
  serviceName: kylin-svc
  replicas: 1
  selector:
    matchLabels:
      app: kylin
      role: job
  template:
    metadata:
      labels:
        app: kylin
        role: job
    spec:
      hostAliases:
      - ip: cluster-master
        hostnames:
        - host-name
      - ip: cluster-worker-1
        hostnames:
        - host-name
      - ip: cluster-worker-2
        hostnames:
        - host-name
      - ip: cluster-worker-3
        hostnames:
        - host-name
      containers:
      - name: kylin
        image: kylin-client
        imagePullPolicy: Always
        command:
        - sh
        - -c
        args:
        - cp $KYLIN_HOME/tomcat-conf/* $KYLIN_HOME/tomcat/conf;
          cp $KYLIN_HOME/kylin-more-conf $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/classes/;
          $TOOL_HOME/bootstrap.sh server -d;
        ports:
        - containerPort: 7070
        - containerPort: 7443
        volumeMounts:
        - name: kylin-job-conf
          mountPath: /usr/apache_kylin/kylin/conf
        - name: tomcat-conf
          mountPath: /usr/apache_kylin/kylin/tomcat-conf
        - name: kylin-more-conf
          mountPath: /usr/apache_kylin/kylin/kylin-more-conf
        - name: hadoop-conf
          mountPath: /usr/lib/hadoop/conf
        - name: hive-conf
          mountPath: /usr/lib/hive/conf
        - name: hbase-conf
          mountPath: /usr/lib/hbase/conf
        - name: kylin-logs
          mountPath: /usr/apache_kylin/kylin/logs
        - name: tomcat-logs
          mountPath: /usr/apache_kylin/kylin/tomcat/logs
        - name: secretofjsonfile
          mountPath: secretpath
        resources:
          requests:
            memory: 5Gi
            cpu: 1
          limits:
            memory: 5Gi
            cpu: 1
      volumes:
      - name: secretofjsonfile
        secret:
          secretName: secretofjsonfile
      - name: kylin-logs
        emptyDir:
          sizeLimit: 20Gi
      - name: tomcat-logs
        emptyDir:
          sizeLimit: 10Gi
      - name: hadoop-conf
        configMap:
          name: hadoop-conf
      - name: hive-conf
        configMap:
          name: hive-conf
      - name: hbase-conf
        configMap:
          name: hbase-conf
      - name: kylin-job-conf
        configMap:
          name: kylin-job-conf
      - name: tomcat-conf
        configMap:
          name: tomcat-conf
      - name: kylin-more-conf
        configMap:
          name: kylin-more-conf
  updateStrategy:
    type: RollingUpdate

Sample service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kylin-svc
  namespace: kylin
spec:
  ports:
  - name: http
    port: 80
    targetPort: 7070
  - name: https
    port: 443
    targetPort: 7443
  selector:
    app: kylin
    role: job   # to connect to the kylin job pod
  type: LoadBalancer

The above load balancer is assigned an external IP once the deployment completes, and Kylin can then be accessed at http://<ExternalIP>/kylin
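
For reference, a hedged sketch of applying the manifests and watching for the external IP; the file names below are placeholders:

kubectl apply -n kylin -f kylin-job-statefulset.yaml -f kylin-job-service.yaml
kubectl get svc kylin-svc -n kylin -w   # wait for EXTERNAL-IP to change from <pending> to an address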

A query pod can be created by replacing job with query in the deployment and service YAML files.
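
If the query StatefulSet is named kylin-query, a rough sketch of enabling the query-server autoscaling mentioned earlier (this assumes metrics-server is available in the GKE cluster and that CPU requests are set as in the deployment above):

kubectl autoscale statefulset kylin-query -n kylin --cpu-percent=70 --min=1 --max=4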

Issues to look out for:

Version compatibility: the component versions in the Kylin client image and on the Hadoop cluster should be the same, otherwise compatibility issues may arise.

HBase jars dependency: the HBase dependency jars should be copied (or symlinked) to Spark's jars folder, otherwise dependency issues may arise. The command is added in the Dockerfile.

Snappy issue 1: when snappy compression is used, the Hadoop distribution should have libsnappy.so in $HADOOP_HOME/lib/native. The command is added in the Dockerfile.
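
To double-check that the Hadoop client inside a pod can actually see the native library, the standard checknative tool can be run in the container (the pod name below assumes the kylin-job StatefulSet with a single replica):

kubectl exec -it kylin-job-0 -n kylin -- hadoop checknative -a | grep -i snappy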

Snappy issue 2: when the cube is built with Spark, the output should be compressed with snappy. The following properties are required:

kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec

GCS connector issue: when GCS is used as the storage for Hive tables, the Hadoop-GCS connector is required. We have added the dependency under $HADOOP_HOME/share/hadoop/common/lib/. To access GCS from Hive, the job/query pod should have the access.json keyfile [GCP] or the secret_key and access_key [AWS] configured in core-site.xml.
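
A quick sanity check that the connector and credentials work from inside a pod (the bucket name is a placeholder):

kubectl exec -it kylin-job-0 -n kylin -- hadoop fs -ls gs://<your-bucket>/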

Hive intermediate table not found: when the cube build step runs on Spark, a “hive intermediate table not found” issue can occur. This is due to a bug in the Kylin code and is resolved in versions 3.0.2 and 2.6.0.

Copy the Spark folder under Kylin: copying (or symlinking) the entire Spark folder into $KYLIN_HOME is required, otherwise YARN can't find the dependency jars on the Hadoop cluster in Spark cluster deploy mode.

Metadata sync issues between the job pod and the query pod: when we built a segment in the job pod and immediately queried it from the query pod, the data was not available. This is due to lag in the metadata sync. To resolve it, list the job service name and the query service name in the cluster servers property:

kylin.server.cluster-servers=kylin-job-service-name,kylin-query-service-name


Happy to hear your feedback or questions!
