NEXT GENERATION OLAP ANALYTICS - APACHE KYLIN
OLAP Analytics on Big Data Distributed System
Most Business Intelligence consultants remember the old days of our data warehouses and BI tools by heart. With the new era of big data, however, the data warehouse has become our old raw disk, and data lakes have become our next-generation data storage. So, like everyone else, I was wondering: how do we deal with our Hadoop cluster (AWS S3, Azure Data Lake, GCP, etc.) now that these distributed systems are the norm, and how do we exploit the data in an elegant style?
That is the main aim of this post: to show you a very fancy and nice tool called Apache Kylin 2.6.4, which runs on top of the Spark and Hadoop ecosystem and lets you create your cube, applying concepts from Ralph Kimball's dimensional modeling to your current data lake. This makes me feel like the happy granny in the picture. SO HAPPY!
Before going further, I want to introduce the main benefits I took away from this deck:
https://www.slideshare.net/Hadoop_Summit/apache-kylin-cubes-on-hadoop
The cube is totally transparent to the end user.
It covers all the key points, from incremental refresh to monitoring your processes.
Its query performance is much better than querying Hive directly.
It integrates with Power BI and Tableau.
In the deck you can also check, in detail, more benefits and advantages that Apache Kylin has in comparison with other big data tools.
Software requirements for Apache Kylin:
Apache HDFS 3.2.0
Apache HBase 2.2.0
Apache Hive 3.1.0
Apache Zookeeper
Apache Spark (Optional)
Apache Kafka (Optional)
Now that I have introduced Apache Kylin, let's get to the hard work: installing and setting up our environment from scratch.
- Installing Red Hat Enterprise Linux 7.7
First, download Red Hat Enterprise Linux 7.7:
https://developers.redhat.com/products/rhel/download
Once you have installed RHEL 7.7, we need to register the system and install a few base utilities.
Getting a no-cost developer subscription:
- You have to enable the repos; for that, accept the terms and conditions:
https://www.redhat.com/wapps/tnc/ackrequired?site=candlepin&event=attachSubscription
- Join the Red Hat Developer program.
- Subscribe using the following command:
subscription-manager register --username advinculacesar --password Ze15adv$ --auto-attach --force
yum install wget
rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install htop
htop
- Installing Hadoop with Java Development Kit 8u101
Since the last Hadoop release (https://hadoop.apache.org/old/releases.html) dates from 8 Aug 2018, it works with JDK 8 (https://cwiki.apache.org/confluence/display/hadoop/HadoopJavaVersions).
rpm -ivh jdk-8u101-linux-x64.rpm
vi /etc/profile
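The profile contents are not shown here; a minimal sketch, assuming the RPM placed the JDK under /usr/java/jdk1.8.0_101 (the same path used later in hadoop-env.sh):
export JAVA_HOME=/usr/java/jdk1.8.0_101   # assumed install path from the JDK RPM
export PATH=$PATH:$JAVA_HOME/bin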
. /etc/profile
- Sharing SSH Keys for Passwordless Authentication between Nodes of the Hadoop Cluster
The important part is that your master node shares its SSH public key with its slaves.
On the master node (apache1):
rm -rf /root/.ssh
ssh-keygen -t dsa
cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys
scp /root/.ssh/id_dsa.pub apache2:/root
On the slave node (apache2), generate its own key pair and append the master's public key that was just copied over:
rm -rf /root/.ssh
ssh-keygen -t dsa
cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys
cat /root/id_dsa.pub >> /root/.ssh/authorized_keys
Testing
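A quick way to verify the passwordless login from the master (assuming the host names above):
ssh apache2 hostname   # should print the slave's host name without asking for a password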
- Installation and Configuration of the Hadoop Cluster
Download the Hadoop tarball:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -xvf hadoop-3.2.0.tar.gz
vi /etc/profile
export HADOOP_HOME=/u01/hadoop-3.2.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
. /etc/profile
hadoop version
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://apache1:9000</value>
</property>
https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/u01/hadoop-3.2.0/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/u01/hadoop-3.2.0/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>apache1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
vi $HADOOP_HOME/etc/hadoop/workers
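The contents of the workers file are not shown; assuming the two-node layout used throughout this post, it lists the hosts that run DataNode/NodeManager daemons:
apache1
apache2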
scp /u01/hadoop-3.2.0.tar.gz apache2:/u01
Repeat this step on every slave node:
ssh root@apache2
cd /u01
ls
tar -xvf hadoop-3.2.0.tar.gz
exit
cd /u01/hadoop-3.2.0
vi copyConfigFile.sh
for node in apache1 apache2; do
scp /u01/hadoop-3.2.0/etc/hadoop/* $node:/u01/hadoop-3.2.0/etc/hadoop/;
done
sh copyConfigFile.sh
Do the same on all nodes:
vi /u01/hadoop-3.2.0/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_101
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
http://apache1:9870/dfshealth.html#tab-overview
$HADOOP_HOME/sbin/start-yarn.sh
mapred historyserver
Installation of Hive
Apache Kylin 2.6.4 does not support Apache Hive 3.1.2, so release 2.3.6 is used instead.
wget https://www-us.apache.org/dist/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz
tar -xzf apache-hive-2.3.6-bin.tar.gz
mkdir $HADOOP_HOME/hive
mv apache-hive-2.3.6-bin/* $HADOOP_HOME/hive
ls -l $HADOOP_HOME/hive
nano /etc/profile
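Again, the exact profile change is not shown; a minimal sketch, assuming Hive was moved under $HADOOP_HOME/hive as above:
export HIVE_HOME=$HADOOP_HOME/hive   # assumed location from the mv command above
export PATH=$PATH:$HIVE_HOME/bin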
. /etc/profile
hadoop fs -mkdir /tmp
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hive
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
wget http://dev.mysql.com/get/mysql57-community-release-el7-8.noarch.rpm
sudo yum localinstall mysql57-community-release-el7-8.noarch.rpm
sudo yum install mysql-community-server
yum install mysql-connector-java
ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar
systemctl stop mysqld
systemctl set-environment MYSQLD_OPTS="--skip-grant-tables"
systemctl start mysqld
mysql -u root
mysql> UPDATE mysql.user SET authentication_string = PASSWORD('MyNewPassword')
    -> WHERE User = 'root' AND Host = 'localhost';
mysql> FLUSH PRIVILEGES;
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'MyNewPassword';
mysql> quit
systemctl stop mysqld
systemctl unset-environment MYSQLD_OPTS
systemctl start mysqld
mysql -u root -p
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Ze15adv$';
CREATE DATABASE metastore;
USE metastore;
CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'Ze15adv$';
GRANT ALL ON *.* TO 'hiveuser'@'%';
FLUSH PRIVILEGES;
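To confirm the new account can reach the metastore database, a hypothetical check (not in the original steps):
mysql -u hiveuser -p metastore -e "SELECT 1;"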
Create a new file hive-site.xml
vi $HIVE_HOME/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Ze15adv$</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
cd $HIVE_HOME
schematool -initSchema -dbType mysql
hive
show tables;
Edit:
This section is only needed for compatibility between Hive 3.1.0 and Apache Kylin 2.6.4; please check with your vendor for the real background of the issue.
mysql -u root -p
USE metastore;
ALTER TABLE TBLS ADD REWRITE_ENABLED BIT(1) NOT NULL;
UPDATE TBLS SET REWRITE_ENABLED=IS_REWRITE_ENABLED;
ALTER TABLE DBS ADD CATALOG_NAME varchar(256) not Null;
UPDATE DBS SET CATALOG_NAME=CTLG_NAME;
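A quick sanity check that the patch took effect (hypothetical, not part of the original workaround):
SHOW COLUMNS FROM TBLS LIKE 'REWRITE_ENABLED';
SHOW COLUMNS FROM DBS LIKE 'CATALOG_NAME';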
Testing HIVE
CREATE TABLE pokes (foo INT, bar STRING);
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
ALTER TABLE pokes ADD COLUMNS (new_col INT);
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
INSTALLATION ZOOKEEPER
Zookeeper 3.5.6 (18 Oct 2019)
wget https://www-eu.apache.org/dist/zookeeper/zookeeper-3.5.6/apache-zookeeper-3.5.6-bin.tar.gz
tar -xzf apache-zookeeper-3.5.6-bin.tar.gz
cd /u01/apache-zookeeper-3.5.6-bin/conf
vi zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
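Since zoo.cfg points dataDir at /var/lib/zookeeper, make sure that directory exists before starting the server:
mkdir -p /var/lib/zookeeper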
/u01/apache-zookeeper-3.5.6-bin/bin/zkServer.sh start
lsof -i :2181
INSTALLATION HBASE
References:
https://hbase.apache.org/book.html#_introduction
https://hbase.apache.org/downloads.html, latest version 2.2.2
https://hbase.apache.org/book.html#quickstart
wget https://www-eu.apache.org/dist/hbase/2.2.2/hbase-2.2.2-bin.tar.gz
tar -xzf hbase-2.2.2-bin.tar.gz
nano /u01/hbase-2.2.2/conf/hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>file:///u01/hbase-2.2.2</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/u01/hbase-2.2.2/zookeeper</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
<description>
Controls whether HBase will check for stream capabilities (hflush/hsync).
Disable this if you intend to run on LocalFileSystem, denoted by a rootdir
with the 'file://' scheme, but be mindful of the NOTE below.
WARNING: Setting this to false blinds you to potential data loss and
inconsistent system state in the event of process and/or node failures. If
HBase is complaining of an inability to use hsync or hflush it's most
likely not a false positive.
</description>
</property>
Edit: Whatever the issue, you can check the log at:
cat /u01/hbase-2.2.2/logs/hbase-root-master-apache1.out
nano /u01/hbase-2.2.2/bin/hbase
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar:$HBASE_HOME/lib/*
/u01/hbase-2.2.2/bin/start-hbase.sh
/u01/hbase-2.2.2/bin/hbase shell
status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
put 'test', 'row2', 'cf:b', 'value2'
scan 'test'
INSTALLATION APACHE KYLIN
yum install net-tools
wget https://www-eu.apache.org/dist/kylin/apache-kylin-2.6.4/apache-kylin-2.6.4-bin-hadoop3.tar.gz
tar -zxvf apache-kylin-2.6.4-bin-hadoop3.tar.gz
nano /etc/profile
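As before, the profile change is not shown; a minimal sketch, assuming the tarball was unpacked under /u01:
export KYLIN_HOME=/u01/apache-kylin-2.6.4-bin-hadoop3   # assumed from the tar command above
export PATH=$PATH:$KYLIN_HOME/bin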
. /etc/profile
echo $KYLIN_HOME
$KYLIN_HOME/bin/download-spark.sh
$KYLIN_HOME/bin/kylin.sh start
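If the web UI does not come up, check the server log in Kylin's standard log location:
tail -f $KYLIN_HOME/logs/kylin.log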
http://apache1:7070/kylin/login
user: ADMIN
password: KYLIN
http://kylin.apache.org/docs/tutorial/kylin_sample.html
- Run ${KYLIN_HOME}/bin/sample.sh; restart the Kylin server to flush the caches;
- Log on to the Kylin web UI with the default user and password ADMIN/KYLIN, and select project learn_kylin in the project dropdown list (upper left corner);
- Select the sample cube kylin_sales_cube, click "Actions" -> "Build", and pick a date later than 2014-01-01 (to cover all 10,000 sample records);
- Check the build progress in the "Monitor" tab until it reaches 100%;
- Execute SQL in the "Insight" tab, for example the query shown below.
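A query along the lines of the one in the Kylin sample tutorial (assuming the kylin_sales sample table):
SELECT part_dt, SUM(price) AS total_sold, COUNT(DISTINCT seller_id) AS sellers
FROM kylin_sales
GROUP BY part_dt
ORDER BY part_dt;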