NEXT GENERATION OLAP ANALYTICS - APACHE KYLIN

10 min read · Nov 20, 2019

OLAP Analytics on Big Data Distributed System

For most Business Intelligence consultants, we remember the old times of our data warehouses and BI tools by heart. With the new era of big data, however, data warehouses became our old raw disks and data lakes became our next-generation data storage. For that reason, I suspect everyone like me was wondering: how do we deal with our Hadoop cluster (AWS S3, Azure Data Lake, GCP Hadoop, etc.) now that these are the new distributed systems? And how do we exploit the data in an elegant style?

That is the main aim of this post: to show you a very fancy and nice tool called Apache Kylin 2.6.4, which runs on top of Spark and the Hadoop ecosystem, and lets you create your cube and apply concepts from Ralph Kimball's dimensional modeling to your current data lake. This makes me feel like the old granny in the picture. SO HAPPY!!!

Before going on, I want to introduce the main benefits that I cover in this post:

https://www.slideshare.net/Hadoop_Summit/apache-kylin-cubes-on-hadoop

The cube is totally transparent to the end user:

It covers all of the key points, from incremental refresh to monitoring your processes:

First, it offers much better query performance than Hive:

You have integration with Power BI and Tableau.

You can check the slides above for more detail on the benefits and advantages that Apache Kylin has in comparison with other big data tools.

Software requirements for Apache Kylin:

Apache HDFS 3.2.0

Apache HBase 2.2.0

Apache Hive 3.1.0

Apache Zookeeper

Apache Spark (Optional)

Apache Kafka (Optional)

Now that I have made an introduction to Apache Kylin, let's get to the hard work: installing and setting up our environment from scratch.

1. Installing Red Hat Enterprise Linux 7.7

You need Red Hat Enterprise Linux 7.7; the installer and documentation are here:

https://developers.redhat.com/products/rhel/download

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/installation_guide/sect-installation-source-x86

Once you have installed RHEL 7.7, we have to register the system with a developer subscription so that yum can install packages.

Getting a no-cost developer subscription:

https://developers.redhat.com/articles/getting-red-hat-developer-subscription-what-rhel-users-need-know/

1. You have to enable the repositories; for that reason, we have to accept the terms and conditions:

https://www.redhat.com/wapps/tnc/ackrequired?site=candlepin&event=attachSubscription

2. Join the Red Hat Developer program.
3. Subscribe using the following command:

subscription-manager register --username advinculacesar --password Ze15adv$ --auto-attach --force

yum install wget

rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

yum install htop

htop
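As a quick sanity check (assuming subscription-manager is on the PATH, which the registration step above implies), you can confirm that the subscription is attached before continuing:

subscription-manager status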

2. Installing Hadoop with Java Development Kit 8u101

Since the last Hadoop release (https://hadoop.apache.org/old/releases.html) dates from 8 Aug 2018, it works with JDK 8 (https://cwiki.apache.org/confluence/display/hadoop/HadoopJavaVersions).

rpm -ivh jdk-8u101-linux-x64.rpm

vi /etc/profile
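The profile edit itself is not shown here; a minimal sketch of the lines to append, assuming the RPM's default install path (the same JAVA_HOME is reused later in hadoop-env.sh):

export JAVA_HOME=/usr/java/jdk1.8.0_101
export PATH=$PATH:$JAVA_HOME/bin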

. /etc/profile

3. Sharing SSH Keys for Passwordless Authentication between Nodes of the Hadoop Cluster

It is important that your master node shares its SSH key with its slaves.

On the master node (apache1):

rm -rf /root/.ssh

ssh-keygen -t dsa

cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys

scp /root/.ssh/id_dsa.pub apache2:/root

On the slave node (apache2), generate its own key pair and authorize the master's copied key:

rm -rf /root/.ssh

ssh-keygen -t dsa

cat /root/id_dsa.pub >> /root/.ssh/authorized_keys

Testing
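A minimal test, assuming the hostnames apache1 and apache2 resolve: from the master, the following should print the slave's hostname without asking for a password.

ssh apache2 hostname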

4. Installation and Configuration of the Hadoop Cluster

Download the Hadoop tarball (linked from https://hadoop.apache.org/release/3.2.0.html):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

tar -xvf hadoop-3.2.0.tar.gz

vi /etc/profile

export HADOOP_HOME=/u01/hadoop-3.2.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

. /etc/profile

hadoop version

vi $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>fs.default.name</name>
  <value>hdfs://apache1:9000</value>
</property>

https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/ClusterSetup.html

vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/u01/hadoop-3.2.0/data/nameNode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/u01/hadoop-3.2.0/data/dataNode</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>apache1</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

vi $HADOOP_HOME/etc/hadoop/workers
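The contents of the workers file are not shown above; a sketch, assuming the two-node layout used by copyConfigFile.sh below, is one DataNode hostname per line (drop apache1 if the master should not also store data):

apache1
apache2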

scp /u01/hadoop-3.2.0.tar.gz apache2:/u01

Repeat this step in every slave node:

ssh root@apache2

cd /u01

ls

tar -xvf hadoop-3.2.0.tar.gz

exit

cd /u01/hadoop-3.2.0

vi copyConfigFile.sh

for node in apache1 apache2; do
  scp /u01/hadoop-3.2.0/etc/hadoop/* $node:/u01/hadoop-3.2.0/etc/hadoop/;
done

sh copyConfigFile.sh

Do the same on all nodes:

vi /u01/hadoop-3.2.0/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_101

hdfs namenode -format

$HADOOP_HOME/sbin/start-dfs.sh

http://apache1:9870/dfshealth.html#tab-overview

$HADOOP_HOME/sbin/start-yarn.sh

http://apache1:8088

mapred historyserver
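To verify that the daemons are up, jps (shipped with the JDK) lists the running Java processes on each node; on this layout you should see roughly NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer on apache1, and DataNode and NodeManager on apache2.

jps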

Installation of Hive

Apache Kylin 2.6.4 does not support Apache Hive 3.1.2. For that reason, release 2.3.6 is used here.

Reference: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallingHivefromaStableRelease

Download and extract the Hive 2.3.6 release (the 1.2.2 and 3.1.2 tarballs at https://www-us.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz and https://www-eu.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz follow the same steps, but 2.3.6 is the one used here):

wget https://www-us.apache.org/dist/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz

tar -xzf apache-hive-2.3.6-bin.tar.gz

mkdir $HADOOP_HOME/hive

mv apache-hive-2.3.6-bin/* $HADOOP_HOME/hive

ls -l $HADOOP_HOME/hive

nano /etc/profile
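Again, the profile edit is not shown; a minimal sketch, assuming Hive was moved under $HADOOP_HOME/hive as above:

export HIVE_HOME=$HADOOP_HOME/hive
export PATH=$PATH:$HIVE_HOME/bin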

. /etc/profile

hadoop fs -mkdir /tmp

hadoop fs -mkdir /user

hadoop fs -mkdir /user/hive

hadoop fs -mkdir /user/hive/warehouse

hadoop fs -chmod g+w /tmp

hadoop fs -chmod g+w /user/hive/warehouse

wget http://dev.mysql.com/get/mysql57-community-release-el7-8.noarch.rpm

sudo yum localinstall mysql57-community-release-el7-8.noarch.rpm

sudo yum install mysql-community-server

yum install mysql-connector-java

ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

systemctl stop mysqld

systemctl set-environment MYSQLD_OPTS="--skip-grant-tables"

systemctl start mysqld

mysql -u root

mysql> UPDATE mysql.user SET authentication_string = PASSWORD('MyNewPassword')
    -> WHERE User = 'root' AND Host = 'localhost';

mysql> FLUSH PRIVILEGES;

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'MyNewPass';

mysql> quit

systemctl stop mysqld

systemctl unset-environment MYSQLD_OPTS

systemctl start mysqld

mysql -u root -p

ALTER USER 'root'@'localhost' IDENTIFIED BY 'Ze15adv$';

CREATE DATABASE metastore;

USE metastore;

CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'Ze15adv$';

GRANT ALL ON *.* TO 'hiveuser'@localhost IDENTIFIED BY 'Ze15adv$';

FLUSH PRIVILEGES;


Create a new file hive-site.xml

vi $HIVE_HOME/conf/hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>user name for connecting to the MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Ze15adv$</value>
    <description>password for connecting to the MySQL server</description>
  </property>
</configuration>

cd $HIVE_HOME

schematool -initSchema -dbType mysql
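Optionally, the same tool can confirm the connection and schema version against the MySQL metastore configured above:

schematool -info -dbType mysql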

hive

show tables;

Edit:

This section is only needed for compatibility between Hive 3.1.0 and Apache Kylin 2.6.4; please check with your vendor for the real root cause of the issue.

mysql -u root -p

USE metastore;

ALTER TABLE TBLS ADD REWRITE_ENABLED BIT(1) NOT NULL;

UPDATE TBLS SET REWRITE_ENABLED=IS_REWRITE_ENABLED;

ALTER TABLE DBS ADD CATALOG_NAME varchar(256) not Null;

UPDATE DBS SET CATALOG_NAME=CTLG_NAME;

Testing HIVE

CREATE TABLE pokes (foo INT, bar STRING);

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

ALTER TABLE pokes ADD COLUMNS (new_col INT);

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

INSTALLATION OF ZOOKEEPER

Zookeeper 3.5.6 (18 Oct 2019)

wget https://www-eu.apache.org/dist/zookeeper/zookeeper-3.5.6/apache-zookeeper-3.5.6-bin.tar.gz

tar -xzf apache-zookeeper-3.5.6-bin.tar.gz

cd /u01/apache-zookeeper-3.5.6-bin/conf

vi zoo.cfg

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
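One step worth making explicit: zoo.cfg points dataDir at /var/lib/zookeeper, and that directory must exist before the server starts.

mkdir -p /var/lib/zookeeper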

/u01/apache-zookeeper-3.5.6-bin/bin/zkServer.sh start

lsof -i :2181
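Besides checking the port, the bundled script reports whether this instance is serving (it should say standalone in this single-node setup):

/u01/apache-zookeeper-3.5.6-bin/bin/zkServer.sh status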

INSTALLATION OF HBASE

References:

https://hbase.apache.org/book.html#_introduction

https://hbase.apache.org/downloads.html (latest version: 2.2.2)

https://hbase.apache.org/book.html#quickstart

wget https://www-eu.apache.org/dist/hbase/2.2.2/hbase-2.2.2-bin.tar.gz

tar -xzf hbase-2.2.2-bin.tar.gz

nano /u01/hbase-2.2.2/conf/hbase-site.xml

<property>
  <name>hbase.rootdir</name>
  <value>file:///u01/hbase-2.2.2</value>
</property>

<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/u01/hbase-2.2.2/zookeeper</value>
</property>

<property>
  <name>hbase.unsafe.stream.capability.enforce</name>
  <value>false</value>
  <description>
    Controls whether HBase will check for stream capabilities (hflush/hsync).
    Disable this if you intend to run on LocalFileSystem, denoted by a rootdir
    with the 'file://' scheme, but be mindful of the NOTE below.
    WARNING: Setting this to false blinds you to potential data loss and
    inconsistent system state in the event of process and/or node failures. If
    HBase is complaining of an inability to use hsync or hflush it's most
    likely not a false positive.
  </description>
</property>

Edit: whatever the issue, you can check the log at:

cat /u01/hbase-2.2.2/logs/hbase-root-master-apache1.out

nano /u01/hbase-2.2.2/bin/hbase

CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar:$HBASE_HOME/lib/*

/u01/hbase-2.2.2/bin/start-hbase.sh

/u01/hbase-2.2.2/bin/hbase shell

status

create 'test', 'cf'

put 'test', 'row1', 'cf:a', 'value1'

put 'test', 'row2', 'cf:b', 'value2'

scan 'test'
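You can also read a single row back to confirm the writes:

get 'test', 'row1'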

INSTALLATION OF APACHE KYLIN

yum install net-tools

wget https://www-eu.apache.org/dist/kylin/apache-kylin-2.6.4/apache-kylin-2.6.4-bin-hadoop3.tar.gz

tar -zxvf apache-kylin-2.6.4-bin-hadoop3.tar.gz

nano /etc/profile
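The profile lines are again not shown; a minimal sketch, assuming the tarball was extracted under /u01 like the other components:

export KYLIN_HOME=/u01/apache-kylin-2.6.4-bin-hadoop3
export PATH=$PATH:$KYLIN_HOME/bin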

. /etc/profile

echo $KYLIN_HOME

$KYLIN_HOME/bin/download-spark.sh

$KYLIN_HOME/bin/kylin.sh start

http://apache1:7070/kylin/login

user: ADMIN

password: KYLIN

http://kylin.apache.org/docs/tutorial/kylin_sample.html

  1. Run ${KYLIN_HOME}/bin/sample.sh, then restart the Kylin server to flush the caches.
  2. Log on to the Kylin web UI with the default user and password ADMIN/KYLIN, and select the project learn_kylin in the project dropdown list (upper left corner).
  3. Select the sample cube kylin_sales_cube, click "Actions" -> "Build", and pick a date later than 2014-01-01 (to cover all 10,000 sample records).
  4. Check the build progress in the "Monitor" tab until it reaches 100%.
  5. Execute SQL in the "Insight" tab, for example the query shown below.

$KYLIN_HOME/bin/sample.sh
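For step 5, a query in the spirit of the linked tutorial, aggregating the sample sales fact table (table and column names are those of the learn_kylin sample project):

SELECT part_dt, SUM(price) AS total_sold, COUNT(DISTINCT seller_id) AS sellers
FROM kylin_sales
GROUP BY part_dt
ORDER BY part_dt;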

