Data Engineering with Azure Synapse Apache Spark Pools on IntelliJ (Scala), VS Code (PySpark)

Keshav Singh
6 min read · Oct 3, 2022


This blog is intended for beginners on the data engineering journey. We will learn to establish a data engineering environment with a high bar of engineering rigor and develop a batch data application leveraging Azure Synapse Spark pools for compute.

This blog is divided into two segments: the first covers Scala development on the IntelliJ IDE, and the second targets the PySpark development community on VS Code. Let's dive right in!

Azure Synapse Spark

Spark Pool (Cluster) and Config details

Azure Synapse Analytics is Microsoft's SaaS offering on Azure: a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. To learn more: https://azure.microsoft.com/en-us/products/synapse-analytics/#overview

It offers optimized and fully managed Apache Spark for big data compute at scale. Upon spinning up an instance on Azure, we provision a Spark pool named sparkpool.

Configuration For the Spark Pool

Part I — Scala Development on IDE IntelliJ

Scala (Scalable Language) is a general-purpose programming language offering both functional and object-oriented paradigms to data application developers. Spark itself is developed in Scala, a compiled and type-safe language. Scala and PySpark perform equally well for DataFrame operations. Plain Scala is robust and 10x to 20x faster than plain Python; however, that benchmark is largely irrelevant for PySpark, since PySpark DataFrame code is converted to Spark SQL and executed on the JVM. For more on performance comparisons, click.
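As a quick aside (not part of the blog's application), a tiny Scala snippet like the one below illustrates that blend of object-oriented and functional styles with compile-time type safety; the Customer case class here is purely a hypothetical example.

// An immutable case class: the object-oriented side
final case class Customer(customerId: Int, state: String)

object ScalaFlavour {
  def main(args: Array[String]): Unit = {
    val customers = List(Customer(1, "WA"), Customer(2, "CA"))

    // The functional side: filter and map are higher-order functions,
    // and the result type List[Int] is checked at compile time
    val washingtonIds: List[Int] = customers.filter(_.state == "WA").map(_.customerId)

    println(washingtonIds) // List(1)
  }
}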

This blog is for both developer communities: those intending to work with Scala on IntelliJ and those working with PySpark on VS Code. We will learn how seamlessly easy it is to make this happen with a Synapse Spark pool.

Install IntelliJ Community Edition for Windows: https://www.jetbrains.com/idea/download/#section=windows

We will need winutils, Java, and Python installed, and we must ensure each is set on the Path environment variable.

Winutils for Windows
Add To the Path Environment Variable

Now, in IntelliJ, we need to add the Azure Toolkit plugin and the Scala plugin.

Plugin Installations
Create a New Maven Project
Add configurations
Your Tools section should start showing an integrated Azure option; log in to your Azure account

Once logged in, click Run > Edit Configurations and configure the remote run to the Spark pool with the upload path for the artifacts.

Configuration for the remote run in cluster
Your ribbon should now show some additional options for the Spark pool

Let's look at the Customer data. This is the sample data we will be working with; we have it in Delta format in our ADLS Gen2 storage account. To learn more on Delta, click. The CustomerId column is the primary key for this data, and the DateModified column is the record watermark.

Customer Data Scenario
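To make the scenario concrete, here is a minimal sketch of how such a Delta table could be inspected from a Spark application; the storage account and container in the path are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object InspectCustomerDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InspectCustomerDelta").getOrCreate()

    // Hypothetical ADLS Gen2 path to the Customer Delta table
    val customerPath = "abfss://data@<storageaccount>.dfs.core.windows.net/customer"

    val customers = spark.read.format("delta").load(customerPath)

    customers.printSchema()                      // CustomerId is the primary key column
    customers.select(max("DateModified")).show() // DateModified serves as the record watermark
  }
}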

We write a simple application in Scala to read the Customer data, filter it, and write the Washington customers to a directory on the desired Azure Data Lake path. (It is intentional not to use a notebook for this, as we are trying to establish the grounds for an application developer experience.)

Washington Customer Transformation Code
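In case the screenshot above is hard to read, a minimal sketch of such a transformation could look like the following; the paths and the State column name are assumptions, not taken verbatim from the blog's code.

import org.apache.spark.sql.SparkSession

object WashingtonCustomerTransformation {
  def main(args: Array[String]): Unit = {
    // On a Synapse Spark pool the session is provided by the cluster runtime
    val spark = SparkSession.builder().appName("WashingtonCustomerTransformation").getOrCreate()

    // Hypothetical ADLS Gen2 paths - replace with your storage account and container
    val sourcePath = "abfss://data@<storageaccount>.dfs.core.windows.net/customer"
    val targetPath = "abfss://data@<storageaccount>.dfs.core.windows.net/customer_washington"

    // Read the Customer Delta table and keep only the Washington customers
    val customers = spark.read.format("delta").load(sourcePath)
    val washington = customers.filter(customers("State") === "WA")

    // Write the filtered result back to the lake in Delta format
    washington.write.format("delta").mode("overwrite").save(targetPath)
  }
}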

Click the Spark Run Job option in the right corner.

Right Corner of the IDE

Let's try to unpack the image below. The default_artifact.jar produced after a successful build is uploaded based on the remote run configuration, and soon after, the jar is submitted for execution. We can view the same by inspecting the actively running jobs on the configured sparkpool.

The Job is successful
Transformed Data is Available at Storage.

Create a Production Pipeline With the Developed Jar

Once tested, we can deploy the default_artifact.jar to an ADF pipeline for processing the data.

Create a Pipeline with the Developed JAR
Configure and Attach the Jar
Conducting a pipeline trial run.
Result Validation

In this section we have successfully established, created, and deployed a Scala data application. In the next section we will learn to do the same in PySpark with Visual Studio Code.

Part II — PySpark Development on Visual Studio Code

We will re-create the last application in PySpark. Before we begin, make sure the Path environment variable is set for the Python installation. Open your VS Code instance and add the Spark & Hive Tools plugin.

Spark & Hive Tools Plugin For VS Code

Once installed, click View > Command Palette > Azure Sign In

Sign In Azure

Make sure you are able to navigate your Azure resources and then set the default Spark pool.

Set Default Spark Pool
Setting Spark Pool

You may see a recommendation to install a virtual environment.

Go ahead and pip install virtualenv.

Following that, a settings.json will be automatically provisioned with the Spark pool configuration.

Settings.json
Develop the comparable code in PySpark (a sketch follows below)
Submit the job for a Dev/Test run
The code is then uploaded as a .py file to the Synapse ADLS container and submitted for data processing
Job Complete and Status
Job Status
Data at Storage
Data Validation
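For reference, a minimal PySpark sketch of the comparable transformation shown in the screenshots above could look like this; as in the Scala version, the paths and the State column name are assumptions.

from pyspark.sql import SparkSession

# On a Synapse Spark pool the session is provided by the cluster runtime
spark = SparkSession.builder.appName("WashingtonCustomerTransformation").getOrCreate()

# Hypothetical ADLS Gen2 paths - replace with your storage account and container
source_path = "abfss://data@<storageaccount>.dfs.core.windows.net/customer"
target_path = "abfss://data@<storageaccount>.dfs.core.windows.net/customer_washington"

# Read the Customer Delta table and keep only the Washington customers
customers = spark.read.format("delta").load(source_path)
washington = customers.filter(customers["State"] == "WA")

# Write the filtered result back to the lake in Delta format
washington.write.format("delta").mode("overwrite").save(target_path)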

This blog has been one of the most requested. It is intended to help data engineers early in their career get started with the environment setup and understand the Azure Synapse Spark offerings. I hope the Azure community of users and developers finds it beneficial.

Written by Keshav Singh

Principal Engineering Lead, Microsoft Purview Data Governance PG | Data ML Platform Architecture & Engineering | https://www.linkedin.com/in/keshavksingh