Apache Spark™ for Dummies

Kevin Sames
Microsoft Data Platform Community Hamburg
Nov 21, 2022 · 4 min read

Modern data architectures rely on the separation of storage and compute. Each cloud provider has its own name for its storage service: S3 on AWS, GCS on GCP, or ADLS on Azure. Whatever it is called, the important features are the same: low read, write, and storage costs.

When choosing a compute engine, there is no way around Spark. But where does Spark come from and why is it so popular?

A quick Wikipedia query yields the following description:

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Simply put, Spark tries to distribute the work among as many workers as possible.
Imagine you are reading a book (Lord of the Rings would of course be the generic choice here) and want to know which characters are mentioned most often in which chapters. If you counted alone, you would have to read the whole book and add a tally mark for every name you come across.

I think it should be clear by now that it’s better to get help.
So you recruit one friend per chapter, have each of them count the names in their chapter, and at the end you simply add up all the lists.

This is exactly what Spark’s great-grandfather (MapReduce) does.

Spark takes this concept one step further: depending on how many tasks a brain can perform simultaneously (in my case, unfortunately, only one), each chapter is divided up even more. If I could count the even-numbered words with my left hand and the odd-numbered words with my right hand at the same time, Spark would parallelize the chapter accordingly and cut the total time roughly in half.
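To make this concrete, here is a minimal PySpark sketch of the counting analogy. The tiny in-memory dataset and the column names chapter and character are made up purely for illustration; in practice the mentions would come from parsing the book.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("book-count").getOrCreate()

# Hypothetical data: one row per character mention, together with its chapter.
mentions = spark.createDataFrame(
    [(1, "Frodo"), (1, "Gandalf"), (1, "Frodo"), (2, "Sam"), (2, "Frodo")],
    ["chapter", "character"],
)

# Each worker counts the mentions in its own partition (its "chapters"),
# and Spark combines the partial counts into the final result.
counts = mentions.groupBy("chapter", "character").count()
counts.orderBy("chapter", F.desc("count")).show()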

The best part is that Spark handles the parallelization for us; we only have to manage the cluster and figure out how many workers we need for our tasks.

Of course, you could also come up with the idea of assigning a worker to every single word, but that would not really make you faster. Spark would need a long time to divide up the work (how is my 3,456th friend supposed to know which word is theirs to count?) and to gather the results back together at the end. In this example, combining the results would take just as long as reading everything yourself. So the engineer has to think about how to size the cluster.
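How finely the work is split is something you can influence yourself. A small sketch, reusing the mentions DataFrame from above, of the two usual knobs: repartition on a DataFrame and the spark.sql.shuffle.partitions setting. The numbers are purely illustrative.

# Split the data into 8 partitions (roughly: 8 "friends"); the number is illustrative.
mentions_8 = mentions.repartition(8)
print(mentions_8.rdd.getNumPartitions())  # -> 8

# How many partitions Spark creates after shuffles such as groupBy (default: 200).
spark.conf.set("spark.sql.shuffle.partitions", "8")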

We won’t go too deep here, but there are some great experiments from the original creators of Spark (the founders of Databricks) that go into more detail about optimizing a cluster.

Spark has come a long way and is now at a point where its unified interface, in particular, is brilliant for data teams.

You can work with Python, SQL, Scala, Java, C#, R…

Once data has been read in, Spark represents it as a DataFrame, and by convention we like to call the object df.

# read.load defaults to the Parquet format
df = spark.read.load("path/to/my/file.parquet")
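Once loaded, the DataFrame can be inspected and transformed right away; a small follow-up sketch (the file path above is just a placeholder):

df.printSchema()   # show the schema read from the Parquet metadata
df.show(5)         # preview the first five rows
print(df.count())  # number of rows in the file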

Of course, we can also read the data with SQL by creating a table over the file:

CREATE TABLE myTable
USING parquet
LOCATION "path/to/my/file.parquet"
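Both worlds meet in the notebook: the table registered above can be queried from Python with spark.sql, which returns an ordinary DataFrame. A brief sketch:

# Query the table created above; the result comes back as a DataFrame.
top_rows = spark.sql("SELECT * FROM myTable LIMIT 10")
top_rows.show()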

From here on, there are (almost) no limits to what you can do with SQL and Python.

To be honest, setting up and maintaining a Spark cluster is not a pleasant task. Besides, you certainly need more than just storage and an engine to build a usable data architecture.
To let engineers, admins, analysts, and scientists focus on the real work and collaborate, there are Spark-as-a-platform solutions. Among the best known are Google Dataproc, Amazon EMR, Azure Synapse, and Databricks.

All of them certainly have advantages and disadvantages. We’ll look at Azure Synapse for this article.

Azure Synapse combines Spark pools, dedicated SQL pools, and serverless SQL pools under one roof. This is complemented by seamless integration with Power BI, Purview, and Data Factory, among others.

Tables created with an Apache Spark pool (cluster) can be queried and passed on via the serverless SQL endpoint.
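For example, a DataFrame written as a Parquet-backed Spark table becomes visible to the serverless SQL pool through Synapse’s shared metadata model. A minimal sketch, assuming the df from before and hypothetical database and table names:

# Persist df as a Parquet-backed table; Synapse synchronizes the metadata,
# so the table can afterwards be queried from the serverless SQL endpoint.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
df.write.mode("overwrite").saveAsTable("demo_db.characters")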

With mssparkutils, Synapse provides a Python package that lets the Spark pool interact with the storage, for example to mount a container:

from notebookutils import mssparkutils

# Mount an ADLS Gen2 container under /test, authenticating via a linked service
mssparkutils.fs.mount(
    "abfss://mycontainer@<accountname>.dfs.core.windows.net",
    "/test",
    {"linkedService": "mygen2account"}
)
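After a successful mount, the Spark pool can read files through the mount point; a sketch assuming a hypothetical file.parquet in the container (the synfs path is scoped to the current job ID):

# Read a file via the mount; getJobId() supplies the scope for the synfs scheme.
job_id = mssparkutils.env.getJobId()
df = spark.read.load(f"synfs:/{job_id}/test/file.parquet")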

Hopefully this high-level introduction has inspired you to experiment with Spark yourself.

I am always grateful for feedback and additions.

If you are interested in more content about the Microsoft Data Platform, you are welcome to visit our meetups in Hamburg, Germany!

https://www.meetup.com/de-DE/hamburger-ms-sql-server-usergroup-by-pass-deutschland-e-v/
