Kafka, Hive, Scala, Spark, Pig Installation on Windows WSL 2 on Ubuntu 20.04 LTS
This is a follow-up to my previous post, where Hadoop was installed. If you have not done that yet, please follow the previous post first. For ZooKeeper, Sqoop, MySQL, and HBase installation, check my part 2 story.
Install Kafka:
wget https://downloads.apache.org/kafka/2.6.3/kafka_2.13-2.6.3.tgz
tar -xzf kafka_2.13-2.6.3.tgz
sudo mv kafka_2.13-2.6.3 kafka
sudo mv kafka /usr/local/kafka
Start ZooKeeper, which Kafka depends on (do not close this terminal):
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
Open another terminal in VS Code or the Ubuntu app and start the Kafka broker (do not close this terminal):
cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties
Open another terminal in VS Code or the Ubuntu app, create a topic, and list it (do not close this terminal):
cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TestKafka
bin/kafka-topics.sh --list --zookeeper localhost:2181
In the same terminal (PRODUCER TERMINAL), start a producer to send messages to the Kafka cluster (type as shown in the image below):
cd /usr/local/kafka
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TestKafka
Open another terminal (CONSUMER TERMINAL) in VS Code or the Ubuntu app and start a Kafka consumer (do not close this terminal; at this point nothing is shown):
cd /usr/local/kafka
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TestKafka --from-beginning
Go back to the PRODUCER TERMINAL, type some text, and check whether it appears in the CONSUMER TERMINAL.
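You can also exercise the same flow non-interactively, which is handy for scripting a quick check. This is a sketch that assumes the ZooKeeper and broker terminals from the steps above are still running:

```shell
cd /usr/local/kafka

# Send one message without an interactive prompt
echo "hello from the shell" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TestKafka

# Read the topic from the beginning, then exit after one message or 10 seconds
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TestKafka \
  --from-beginning --max-messages 1 --timeout-ms 10000
```

The --max-messages and --timeout-ms options make the console consumer terminate on its own instead of waiting forever.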
Hive Installation (Connect to MySQL on localhost):
In prompt type:
cd ~
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -xzf apache-hive-3.1.2-bin.tar.gz
sudo mv apache-hive-3.1.2-bin hive
sudo mv hive /usr/local
Open .bashrc file
code ~/.bashrc
OR
sudo nano ~/.bashrc
Add the following at the end (save and close):
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
Reload the profile:
source ~/.bashrc
Create the following folders in HDFS, grant group write permission, and check that they are accessible:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -ls /
If the /tmp folder creation fails, the Hadoop services may not be running. Try the two commands below and then the commands above again:
sudo service ssh restart
start-all.sh   # from hadoop sbin
Create the hive-site.xml file from its template and edit it:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
code hive-site.xml   # or: sudo nano hive-site.xml
Add the following entries at the beginning of the file, inside <configuration> (so there is no problem starting Hive later):
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
In the same file, check that the following settings look like this (save and close):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>hive.conf.validation</name>
<value>false</value>
<description>Enables type checking for registered Hive configurations</description>
</property>
</configuration>
Copy mysql-connector-java-5.1.48.jar to the Hive installation's lib folder:
cd ~
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar
sudo mv mysql-connector-java-5.1.48.jar $HIVE_HOME/lib
Run the following commands in sequence:
sudo $HIVE_HOME/bin/hive --service metastore
sudo $HIVE_HOME/bin/schematool -dbType mysql -initSchema
sudo $HIVE_HOME/bin/schematool -dbType mysql -info
Start MySQL and Hadoop before going into Hive:
sudo service mysql start
start-all.sh
Launch hive (should show hive prompt):
$HIVE_HOME/bin/hive
Try this command to create a sample table:
create table emp (id int, name string, country string, state string, salary int) row format delimited fields terminated by ',';
To quit Hive, type (maybe twice):
quit;
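As a quick sanity check, you can also run HiveQL from the shell with hive -e instead of the interactive prompt. This sketch assumes the emp table created above exists and that MySQL, Hadoop, and the metastore are running:

```shell
# Insert one sample row and read it back without entering the hive prompt
$HIVE_HOME/bin/hive -e "INSERT INTO emp VALUES (1, 'Asha', 'India', 'KA', 50000);"
$HIVE_HOME/bin/hive -e "SELECT name, salary FROM emp;"
```

If the SELECT returns the row, the warehouse directory and the MySQL-backed metastore are both working.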
Scala Installation
cd ~
sudo apt-get install scala
Spark Installation
wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
tar -xvzf spark-3.0.3-bin-hadoop3.2.tgz
sudo mv spark-3.0.3-bin-hadoop3.2 spark
sudo mv spark /usr/local
Pyspark Installation
Ubuntu usually comes with Python pre-installed. To check, in a command prompt type python -V or python3 -V. If it is missing, look up how to install Python for Ubuntu 20.04. To install PySpark:
sudo apt install -y python3-pip   # to install pip for python
pip3 install pyspark
Add the following lines to ~/.bashrc:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.48.jar
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
Give hadoop, hive access to spark:
sudo cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
sudo cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
sudo cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
To avoid a Hive SQL connection error from Spark, copy the JDBC connector from the Hive lib folder to the Spark jars folder:
sudo cp -r $HIVE_HOME/lib/mysql-connector-java-5.1.48.jar $SPARK_HOME/jars/
Edit the spark-env.sh file (code $SPARK_HOME/conf/spark-env.sh) and add the following lines:
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.48.jar
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
Create the Spark defaults file from its template and edit it:
sudo cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
code $SPARK_HOME/conf/spark-defaults.conf
and add following line:
spark.driver.host localhost
Source the bash profile again to pick up the new paths:
source ~/.bashrc
Start spark using spark-shell (Scala) or pyspark:
spark-shell
OR
pyspark
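To verify that Spark also sees the Hive metastore configured earlier, you can pipe a one-line Scala snippet into spark-shell. This is a sketch that assumes MySQL, Hadoop, and the metastore are running and that spark-shell is on your PATH:

```shell
# List Hive tables from Spark; the emp table from the Hive section should appear
echo 'spark.sql("SHOW TABLES").show()' | spark-shell --master local[1]
```

spark-shell reads the snippet from stdin, runs it, and exits, so this doubles as a non-interactive smoke test.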
Start Spark Web UI
1. From WSL, by first installing google-chrome in WSL:
sudo wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt install --fix-broken -y
sudo dpkg -i google-chrome-stable_current_amd64.deb
Start google-chrome to open in WSL command prompt:
google-chrome
Start Standalone Spark Master server:
$SPARK_HOME/sbin/start-master.sh
In Chrome, enter the address below to open the Spark Web UI:
http://127.0.0.1:8080/
2. Or in local Windows Chrome:
http://localhost:8080/
To stop Spark services:
$SPARK_HOME/sbin/stop-all.sh
To quit spark use:
:quit    # for spark-shell
exit()   # for pyspark shell
OR CTRL + D   # common way to exit
Install PIG
Download and install Pig :
cd ~
wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar -xvf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 pig
sudo mv pig /usr/local
Open .bashrc file:
code ~/.bashrc
Or
sudo nano ~/.bashrc
Add these lines at the end:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:/usr/local/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
Reload bash:
source ~/.bashrc
Start pig to operate on HDFS:
pig
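To check the Pig installation without touching HDFS at all, you can run a one-liner in Pig's local mode. The file name and data here are just examples:

```shell
# Create a small local input file and run a Pig Latin script against it
printf "alice,30\nbob,25\n" > /tmp/people.csv
pig -x local -e "people = LOAD '/tmp/people.csv' USING PigStorage(',') AS (name:chararray, age:int); DUMP people;"
```

-x local runs Pig against the local filesystem instead of HDFS, and -e executes the quoted commands directly, so this works even before Hadoop is up.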