[Data Engineering] with Azure Synapse Apache Spark Pools on IntelliJ (Scala), VS Code (PySpark)
[This blog is intended for beginners on the data engineering journey. We will learn to establish a data engineering environment with a high bar of engineering rigor and develop a batch data application, leveraging Azure Synapse Spark pools for compute.]
This blog is divided into two segments: the first part covers Scala development in the IntelliJ IDE, and the second part is targeted at the PySpark development community on VS Code. Let's dive right in!
Azure Synapse Spark
Azure Synapse Analytics is Microsoft's managed Azure offering for limitless analytics, bringing together data integration, enterprise data warehousing, and big data analytics. To learn more: https://azure.microsoft.com/en-us/products/synapse-analytics/#overview
It offers optimized and fully managed Apache Spark for big data compute at scale. Upon spinning up a Synapse workspace on Azure, we provision a Spark pool, which we have named sparkpool.
Part I — Scala Development on the IntelliJ IDE
Scala (Scalable Language) is a general-purpose programming language offering both functional and object-oriented paradigms to data application developers. Spark itself is written in Scala, which is a compiled and type-safe language. Scala and PySpark perform comparably for DataFrame operations: plain Scala can be 10x to 20x faster than plain Python, but that benchmark is largely irrelevant to PySpark, since DataFrame operations are converted to Spark SQL plans and executed on the JVM. For more on performance comparisons, click here.
This blog is for developers intending to work with Scala on IntelliJ and with PySpark on VS Code. We will learn how seamless it is to make this happen with a Synapse Spark pool.
Install IntelliJ Community Edition for Windows: https://www.jetbrains.com/idea/download/#section=windows
We will need winutils, Java, and Python installed, and we need to ensure their locations are added to the Path environment variable.
Now, in IntelliJ, we need to add the Azure Toolkit plugin and the Scala plugin.
Once logged in, click Run > Edit Configurations and configure the remote run against the Spark pool, along with the upload path for the artifacts.
Let's look at the Customer data. This is the sample data we will be working with; we have it in Delta format in our ADLS Gen2 storage account. To learn more about Delta, click here. The CustomerId column is the primary key for this data, and the DateModified column is the record watermark.
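If you want to sanity-check the table before building the application, a minimal sketch like the one below can be run against the pool. The storage account, container, and folder names are placeholders for your own ADLS Gen2 path.

```scala
import org.apache.spark.sql.SparkSession

// Quick peek at the Customer Delta table. The ADLS Gen2 account, container and
// folder below are placeholders; substitute your own.
object CustomerDataPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CustomerDataPeek").getOrCreate()

    val customerPath = "abfss://data@<storageaccount>.dfs.core.windows.net/delta/customer"
    val customerDf = spark.read.format("delta").load(customerPath)

    customerDf.printSchema()              // expect CustomerId (primary key) and DateModified (watermark)
    customerDf.show(5, truncate = false)  // sample a few rows

    spark.stop()
  }
}
```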
We write a simple application in Scala to read the Customer data, filter it, and write the Washington customers to a directory on the desired Azure Data Lake path. [It is intentional not to use a notebook for this, as we are trying to establish the grounds for an application developer experience.]
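Below is a minimal sketch of such an application. The ADLS Gen2 paths are placeholders, and the State column used for the Washington filter is an assumption about the sample data (only CustomerId and DateModified are called out explicitly above); adjust the column name to match your schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch of the batch application: read the Customer Delta table, keep the
// Washington customers, and write them back to the lake in Delta format.
object CustomerWashingtonFilter {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("CustomerWashingtonFilter")
      .getOrCreate()

    // Source and destination paths on ADLS Gen2 (placeholders).
    val sourcePath = "abfss://data@<storageaccount>.dfs.core.windows.net/delta/customer"
    val targetPath = "abfss://data@<storageaccount>.dfs.core.windows.net/delta/customer_washington"

    // Read the Customer Delta table.
    val customerDf = spark.read.format("delta").load(sourcePath)

    // Keep only Washington customers (State column assumed).
    val waCustomersDf = customerDf.filter(col("State") === "Washington")

    // Write the filtered records to the target directory in Delta format.
    waCustomersDf.write
      .format("delta")
      .mode("overwrite")
      .save(targetPath)

    spark.stop()
  }
}
```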
Click the run button in the top-right corner to run the Spark job.
Let's unpack the image below. The default_artifact.jar produced after a successful build is uploaded according to the remote run configuration, and soon after, the jar is submitted for execution. We can verify this by viewing the actively running jobs on the configured sparkpool.
Create a Production Pipeline with the Developed Jar
Once tested, we can deploy the default_artifact.jar to an ADF pipeline for processing the data.
In this section we have successfully established, built, and deployed a Scala data application. In the next section we will learn to do the same in PySpark with Visual Studio Code.
Part II — PySpark Development on Visual Studio Code
We will re-create the previous application in PySpark; a sketch of it follows the setup steps below. Before we begin, make sure the Path environment variable is set for the Python installation. Open your VS Code instance and add the Spark & Hive Tools plugin.
Once installed, click View > Command Palette and run Azure: Sign In.
Make sure you are able to navigate your Azure resources, and then set the default Spark pool.
You may see a recommendation to install a virtual environment.
Go ahead and pip install virtualenv
Following this, a settings.json will be automatically provisioned with the Spark pool configuration.
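With the environment ready, here is a minimal PySpark sketch of the same application we built in Scala, which can then be submitted to the configured default Spark pool from VS Code. As before, the ADLS Gen2 paths are placeholders and the State column is an assumed part of the sample schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Sketch of the same batch application in PySpark: read the Customer Delta table,
# keep the Washington customers, and write them back to the lake in Delta format.
spark = SparkSession.builder.appName("CustomerWashingtonFilter").getOrCreate()

# Source and destination paths on ADLS Gen2 (placeholders).
source_path = "abfss://data@<storageaccount>.dfs.core.windows.net/delta/customer"
target_path = "abfss://data@<storageaccount>.dfs.core.windows.net/delta/customer_washington"

# Read the Customer Delta table.
customer_df = spark.read.format("delta").load(source_path)

# Keep only Washington customers (State column assumed).
wa_customers_df = customer_df.filter(col("State") == "Washington")

# Write the filtered records to the target directory in Delta format.
wa_customers_df.write.format("delta").mode("overwrite").save(target_path)

spark.stop()
```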
This blog has been one of the most requested. It is intended to help data engineers early in their career get started with the environment setup and to help them understand and learn about the Azure Synapse Spark offerings. I hope the Azure community of users and developers finds it beneficial.