Analytics Vidhya
Published in

Analytics Vidhya

Spark SQL Hive Job Conversion

Use Case

  • Use Facts and dimension model and do multiple join.
  • All the facts and dimension files are stored as parquet
  • Create a temporary table and join

Prerequisite

  • Azure account
  • Create Azure Databricks with Spark 3.0
  • Create a Blob storage to store files.
  • Azure keyvault
  • Copied parquet files for
  • Fact
  • dimState
  • dimCounty
  • dimRace
  • dimSex
  • All these were created from US population data (publicly available)
  • this is just simulating join so please your own dataset if needed.
  • i am using databricks notebook using scala
  • i created a container called uspopulationinput
  • Each fact and dinmentions were stored in corresponding folder.
  • Documentation to create the data set needed for this article. https://github.com/balakreshnan/Accenture/blob/master/cap/populationdataset.md

Steps to take a parquet file create view and build joins

  • First configure the blob storage information like account name and key
  • i am using keyvault to get the storage key
val accbbstorekey = dbutils.secrets.get(scope = "allsecrects", key = "accbbstore")spark.conf.set( "fs.azure.account.key.accbbstore.blob.core.windows.net", accbbstorekey)val factdata = spark.read.parquet("wasbs://containername@storagename.blob.core.windows.net/fact/*.parquet") 
display(factdata)
%sql 
DROP VIEW factdata
  • Now let’s create temporary view for the files
  • First is factdata — view
%sql 
CREATE TEMPORARY VIEW factdata USING org.apache.spark.sql.parquet OPTIONS ( path "wasbs://containername@storagename.blob.core.windows.net/fact/*.parquet" )
%sql
SELECT * FROM factdata;
%sql
CREATE TEMPORARY VIEW dimstate USING org.apache.spark.sql.parquet OPTIONS ( path "wasbs://containername@storagename.blob.core.windows.net/dimstate/*.parquet" )
%sql
CREATE TEMPORARY VIEW dimcounty USING org.apache.spark.sql.parquet OPTIONS ( path "wasbs://containername@storagename.blob.core.windows.net/dimcountryname/*.parquet" )
%sql
CREATE TEMPORARY VIEW dimrace USING org.apache.spark.sql.parquet OPTIONS ( path "wasbs://containername@storagename.blob.core.windows.net/dimRace/*.parquet" )
%sql
CREATE TEMPORARY VIEW dimsex USING org.apache.spark.sql.parquet OPTIONS ( path "wasbs://containername@storagename.blob.core.windows.net/dimsex/*.parquet" )
  • Now let’s join using HiveQL like query to see if can join
  • This exercise was part of trying to see if HiveQL (similar) can be ported to Spark 3.0
%sql 
Select factdata.*, dimstate.StateName as dimStateName, dimcounty.CountyName as dimCountyName, dimrace.Race as dimRace, dimsex.Sex as dimSex from factdata inner join dimstate on factdata.StateName = dimstate.StateName join dimcounty on factdata.countyName = dimcounty.CountyName join dimrace on factdata.race = dimrace.Race join dimsex on factdata.sex = dimsex.Sex
  • As you can see above we are joining Facts and Dimension and showing columns from both to confirm if it can join.
  • This is just to validate.
  • There might be dependencies and data type mismatch when moved from HIVE QL
  • Also the UDF (user defined function) which is another something to watch for.
  • This also only applies if you are using HIVEQL as ETL and want to port to Spark SQL.

Originally published at https://github.com.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store