How to Read Data from Hive & Write to MS SQL Table Using Spark-Shell

Amar Gurung
Zaloni Engineering
Jan 13, 2022 · 2 min read

Blog Overview:

This blog walks readers through the prerequisites and steps needed to read data from Hive and write it to an MS SQL Server table using spark-shell.

Pre-requisites for Use Case:

  1. scala
  2. spark-shell
  3. beeline
  4. hive 3.1.0
  5. spark 2.3.2
  6. mssql-cli v1.0.0 ( Refer https://github.com/dbcli/mssql-cli/blob/master/doc/installation/linux.md )
  7. JDBC driver sqljdbc4-2.0.jar
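The driver jar and the connection string used later in step 4 have to agree with each other: the driver class `com.microsoft.sqlserver.jdbc.SQLServerDriver` ships inside sqljdbc4-2.0.jar, and the URL is assembled from the SQL Server host, port, and database name. A minimal sketch of how those pieces fit together (the host name below is a placeholder, not a value from this cluster):

```scala
object JdbcUrl {
  // Driver class shipped in sqljdbc4-2.0.jar; Spark loads it via the "driver" option.
  val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

  // Assemble the SQL Server JDBC URL in the same form used in step 4.
  def build(host: String, port: Int, database: String): String =
    s"jdbc:sqlserver://$host:$port;databaseName=$database"
}
```

For example, `JdbcUrl.build("sql-server-host", 1433, "orgdb")` produces a URL of the same shape as the one passed to Spark in step 4.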

Use Case Steps:

  1. Create Hive Table using Beeline
create database organization_db;
use organization_db;
create table employee(id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

2. Create a CSV file & copy it to the HDFS path

[~]$ cat emp.csv
1,rohan
2,amar
3,jeet
[~]$ hadoop fs -put emp.csv hdfs://hdp314/warehouse/tablespace/managed/hive/organization_db.db/employee
[~]$ hadoop fs -cat hdfs://hdp314/warehouse/tablespace/managed/hive/organization_db.db/employee/emp.csv
1,rohan
2,amar
3,jeet
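Because the table is declared with `FIELDS TERMINATED BY ','` and `LINES TERMINATED BY '\n'`, Hive maps each line of emp.csv onto the `(id int, name string)` schema exactly as a plain comma split would. A hedged pure-Scala sketch of that mapping (an illustration only, not Hive's actual deserializer):

```scala
// Mimic how the ROW FORMAT DELIMITED declaration maps one CSV line
// onto the (id int, name string) schema of the employee table.
def parseEmployee(line: String): (Int, String) = {
  val Array(id, name) = line.split(",", 2) // FIELDS TERMINATED BY ','
  (id.trim.toInt, name.trim)
}

// The three lines of emp.csv from the step above.
val rows = Seq("1,rohan", "2,amar", "3,jeet").map(parseEmployee)
```

This is the row set that `spark.sql("select * from organization_db.employee")` returns in step 4.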

3. Create MS SQL Table

mssql-cli -U <username> -P <password> -S <sql server ip/hostname> -d <db_name>
db_name> create database orgdb;
Time: 1.005s (a second)
Commands completed successfully.
db_name> use orgdb;
Time: 0.502s
Commands completed successfully.
orgdb> create schema org;
Time: 0.453s
Commands completed successfully.
orgdb> create table org.employee(id int primary key, name varchar(1024));
Time: 0.453s
Commands completed successfully.

4. Scala Code

// Read the Hive table into a DataFrame
val hiveDf = spark.sql("select * from organization_db.employee")
hiveDf.show()

// Configure a JDBC writer; SaveMode.Overwrite drops and recreates the target table
val df = hiveDf.write.format("jdbc").mode(org.apache.spark.sql.SaveMode.Overwrite)
df.option("url", "jdbc:sqlserver://<sql-server-ip>:1433;databaseName=orgdb")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "org.employee")
  .option("user", "sa")
  .option("password", "xxxxx")
  .save()
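The same JDBC options can be gathered into a single map and passed to the writer's `options(...)` method, which keeps the connection details together in one place. A sketch under the same assumptions as step 4 (host and password are placeholders; the `save()` line is left as a comment because it needs a live SparkSession):

```scala
// Connection options for DataFrameWriter.options(...); values are placeholders.
val jdbcOptions = Map(
  "url"      -> "jdbc:sqlserver://<sql-server-ip>:1433;databaseName=orgdb",
  "driver"   -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "dbtable"  -> "org.employee",
  "user"     -> "sa",
  "password" -> "xxxxx"
)

// In spark-shell:
// hiveDf.write.format("jdbc").mode("overwrite").options(jdbcOptions).save()
```

In practice the credentials would come from a secure store rather than being hard-coded in the shell.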

5. Invoke spark-shell & paste the above code

spark-shell --master yarn --jars /<path-to-driver>/sqljdbc4-2.0.jar

6. Sample output of the spark-shell

[~]$ spark-shell --master yarn --jars <driver-path>/sqljdbc4-2.0.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xxxxx:4041
Spark context available as 'sc' (master = yarn, app id = application_1630517835182_0012).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.4.0-315
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)

scala> val hiveDf = spark.sql("select * from organization_db.employee")
hiveDf: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> hiveDf.show()
+---+-----+
| id| name|
+---+-----+
|  1|rohan|
|  2| amar|
|  3| jeet|
+---+-----+

scala> val df = hiveDf.write.format("jdbc").mode(org.apache.spark.sql.SaveMode.Overwrite)
df: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter@21838ea3

scala> df.option("url","jdbc:sqlserver://xxxxx:1433;databaseName=orgdb").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "org.employee").option("user", "sa").option("password","xxxx").save()

7. Validate the data in MS SQL Table

[amar@crrt-edge ~]$ mssql-cli -U <user_name> -P <password> -S <sql server ip/hostname> -d orgdb
orgdb> select * from org."employee";
Time: 4.674s (4 seconds)
+------+--------+
| id   | name   |
|------+--------|
| 3    | jeet   |
| 1    | rohan  |
| 2    | amar   |
+------+--------+
(3 rows affected)
