Reading Data from Spark or Hive Metastore and MySQL

shorya sharma
Data Engineering on Cloud
5 min read · Nov 28, 2021


In this article, we’ll learn to use Hive in a PySpark project and connect to a MySQL database from PySpark using Spark over JDBC.

Hello again! Whether a keen interest in PySpark brought you here, or a requirement at your workplace, or any other reason, I am glad you are here.

We will also look at some very useful functionalities as a bonus.

If you have not seen my article on setting up PySpark on Windows, I highly recommend reading it first as a prerequisite.

Let’s get started!

Import Libraries and Create a Spark Session

If you look at the code above, we have enabled Hive support, which lets us use Hive functionality from Spark.

Create Two Data Frames

Set Up the Required Hive Metastore Database and Tables

Create a database and tables to store these data frames in Hive.

Output of Creating Database
Output of Creating Tables
Validation of Created Database in Spark Warehouse
Validation of Hive Tables Created in Database

Note: If you get an error while running the above code, follow the steps below.

Step 1: Download the Hadoop zip file from the below URL, extract the files, and paste them into the ‘C:\spark-3.1.2-bin-hadoop3.2\bin’ folder.

Step 2: Restart the kernel and open a Jupyter notebook from the Anaconda Prompt as an administrator.

Transformations

Let us perform some useful transformations: JOIN, PIVOT, a custom UNION, and removing duplicate columns after a join.

Join Operation

Output of Inner Joining Tables (empDF and deptDF)

Pivot Operation or Transpose

Output of Transposing

Union of Data Frames with Different Columns

Removing Duplicate Column Names After Join

Output of Inner Join Operation
Output for Removing Duplicate Column Names After Join

I know you have been waiting for me to show you how to connect PySpark to MySQL. Here it is.

Download the JDBC Driver: https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz

Extract “mysql-connector-java-5.1.45.tar.gz” and place the JAR file inside ‘C:\spark-3.1.2-bin-hadoop3.2\jars’.

And finally:

Spark Over JDBC using MySQL
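A sketch of the JDBC read (the host, database name, table, and credentials are placeholders to replace with your own; com.mysql.jdbc.Driver is the driver class for Connector/J 5.1):

```python
def read_mysql_table(spark, table_name):
    """Read a MySQL table into a Spark DataFrame over JDBC.

    The connection URL and credentials below are placeholders;
    this requires the Connector/J JAR in Spark's jars folder.
    """
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/mydb")
        .option("driver", "com.mysql.jdbc.Driver")  # Connector/J 5.1 class
        .option("dbtable", table_name)
        .option("user", "root")
        .option("password", "password")
        .load()
    )

# Usage (requires a running MySQL server):
# df = read_mysql_table(spark, "employees")
# df.show()
```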

Voilà! We accomplished a lot today. Congratulations!
