Kafka, Hive, Scala, Spark, Pig Installation on Windows WSL 2 on Ubuntu 20.04 LTS
This is a follow-up to my previous post, where Hadoop was installed. If you have not done that yet, please follow the previous post first. For ZooKeeper, Sqoop, MySQL, and HBase installation, check my part 2 story.
Install Kafka:
wget https://downloads.apache.org/kafka/2.6.3/kafka_2.13-2.6.3.tgz
tar -xzf kafka_2.13-2.6.3.tgz
sudo mv kafka_2.13-2.6.3 kafka
sudo mv kafka /usr/local/kafka
Start ZooKeeper, which Kafka depends on (do not close this terminal):
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
Open another terminal in VS Code or the Ubuntu app and start the Kafka broker (do not close this terminal):
cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties
Open another terminal in VS Code or the Ubuntu app, create a topic, and list it (do not close this terminal):
cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TestKafka
bin/kafka-topics.sh --list --zookeeper localhost:2181
In the same terminal (PRODUCER TERMINAL), start a producer to send messages to the Kafka cluster (type as shown in the image below):
cd /usr/local/kafka
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TestKafka
Open another terminal (CONSUMER TERMINAL) in VS Code or the Ubuntu app and start a Kafka consumer (do not close this terminal; at this point nothing is shown):
cd /usr/local/kafka
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TestKafka --from-beginning
Go back to the PRODUCER TERMINAL, type some text, and check whether it appears in the CONSUMER TERMINAL.
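You can also exercise the same flow non-interactively, which is handy for scripting a quick check. This is a sketch that assumes the ZooKeeper and broker terminals from the steps above are still running:

```shell
cd /usr/local/kafka

# Send one message without an interactive prompt
echo "hello from the shell" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TestKafka

# Read the topic from the beginning, then exit after one message or 10 seconds
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TestKafka \
  --from-beginning --max-messages 1 --timeout-ms 10000
```

The --max-messages and --timeout-ms options make the console consumer terminate on its own instead of waiting forever.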
Hive Installation (Connect to MySQL on localhost):
In prompt type:
cd ~
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -xzf apache-hive-3.1.2-bin.tar.gz
sudo mv apache-hive-3.1.2-bin hive
sudo mv hive /usr/local
Open .bashrc file
code ~/.bashrc
OR
sudo nano ~/.bashrc
Add the following at the end (save and close):
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
Reload the profile:
source ~/.bashrc
Create the following folders in HDFS, grant group write permission, and check that they are accessible:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -ls /
If the /tmp folder creation fails, the Hadoop services may not be running. Try the two commands below and then the commands above again:
sudo service ssh restart
start-all.sh   # from hadoop sbin
Create the hive-site.xml file from its template and edit it:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
code hive-site.xml   # or: sudo nano hive-site.xml
Add the following entries at the beginning of the file, inside <configuration> (so there is no problem starting Hive later):
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
In the same file, check that the following settings look like this (save and close):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>hive.conf.validation</name>
<value>false</value>
<description>Enables type checking for registered Hive configurations</description>
</property>
</configuration>
Copy mysql-connector-java-5.1.48.jar to the Hive installation's lib folder:
cd ~
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar
sudo mv mysql-connector-java-5.1.48.jar $HIVE_HOME/lib
Run the following commands in sequence:
sudo $HIVE_HOME/bin/hive --service metastore
sudo $HIVE_HOME/bin/schematool -dbType mysql -initSchema
sudo $HIVE_HOME/bin/schematool -dbType mysql -info
Start MySQL and Hadoop before going into Hive:
sudo service mysql start
start-all.sh
Launch hive (should show hive prompt):
$HIVE_HOME/bin/hive
Try this command to create a sample table:
create table emp (id int, name string, country string, state string, salary int) row format delimited fields terminated by ',';
To quit Hive, type (maybe twice):
quit;
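As a quick sanity check, you can also run HiveQL from the shell with hive -e instead of the interactive prompt. This sketch assumes the emp table created above exists and that MySQL, Hadoop, and the metastore are running:

```shell
# Insert one sample row and read it back without entering the hive prompt
$HIVE_HOME/bin/hive -e "INSERT INTO emp VALUES (1, 'Asha', 'India', 'KA', 50000);"
$HIVE_HOME/bin/hive -e "SELECT name, salary FROM emp;"
```

If the SELECT returns the row, the warehouse directory and the MySQL-backed metastore are both working.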
Scala Installation
cd ~
sudo apt-get install scala
Spark Installation
wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
tar -xvzf spark-3.0.3-bin-hadoop3.2.tgz
sudo mv spark-3.0.3-bin-hadoop3.2 spark
sudo mv spark /usr/local
Pyspark Installation
Ubuntu usually comes with Python pre-installed. To check, in a command prompt type python -V or python3 -V. If it is missing, look up how to install Python for Ubuntu 20.04. To install PySpark:
sudo apt install -y python3-pip   # to install pip for python
pip3 install pyspark
Add the following lines to ~/.bashrc:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.48.jar
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
Give hadoop, hive access to spark:
sudo cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
sudo cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
sudo cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
To avoid a Hive SQL connection error from Spark, copy the JDBC connector from the Hive lib folder to the Spark jars folder:
sudo cp -r $HIVE_HOME/lib/mysql-connector-java-5.1.48.jar $SPARK_HOME/jars/
Edit the spark-env.sh file (code $SPARK_HOME/conf/spark-env.sh) and add the following lines:
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.48.jar
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
Create the Spark defaults file from its template and edit it:
sudo cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
code $SPARK_HOME/conf/spark-defaults.conf
and add following line:
spark.driver.host localhost
Source the bash profile again to pick up the new paths:
source ~/.bashrc
Start spark using spark-shell (Scala) or pyspark:
spark-shell
OR
pyspark
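To verify that Spark also sees the Hive metastore configured earlier, you can pipe a one-line Scala snippet into spark-shell. This is a sketch that assumes MySQL, Hadoop, and the metastore are running and that spark-shell is on your PATH:

```shell
# List Hive tables from Spark; the emp table from the Hive section should appear
echo 'spark.sql("SHOW TABLES").show()' | spark-shell --master local[1]
```

spark-shell reads the snippet from stdin, runs it, and exits, so this doubles as a non-interactive smoke test.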
Start Spark Web UI
1. From WSL, by first installing google-chrome in WSL:
sudo wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt install --fix-broken -y
sudo dpkg -i google-chrome-stable_current_amd64.deb
Start google-chrome to open in WSL command prompt:
google-chrome
Start Standalone Spark Master server:
$SPARK_HOME/sbin/start-master.sh
In Chrome, enter the address below to open the Spark Web UI:
http://127.0.0.1:8080/
2. Or in local Windows Chrome:
http://localhost:8080/
To stop Spark services:
$SPARK_HOME/sbin/stop-all.sh
To quit spark use:
:quit    # for spark-shell
exit()   # for pyspark shell
OR CTRL + D   # common way to exit
Install PIG
Download and install Pig :
cd ~
wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar -xvf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 pig
sudo mv pig /usr/local
Open .bashrc file:
code ~/.bashrc
Or
sudo nano ~/.bashrc
Add these lines at the end:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:/usr/local/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
Reload bash:
source ~/.bashrc
Start pig to operate on HDFS:
pig
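To check the Pig installation without touching HDFS at all, you can run a one-liner in Pig's local mode. The file name and data here are just examples:

```shell
# Create a small local input file and run a Pig Latin script against it
printf "alice,30\nbob,25\n" > /tmp/people.csv
pig -x local -e "people = LOAD '/tmp/people.csv' USING PigStorage(',') AS (name:chararray, age:int); DUMP people;"
```

-x local runs Pig against the local filesystem instead of HDFS, and -e executes the quoted commands directly, so this works even before Hadoop is up.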