Getting Started with Apache Spark on Databricks

M Haseeb Asif
Big Data Processing
6 min read · Feb 24, 2022


This is the first post in a series about getting started with Apache Spark on Databricks.

Apache Spark

Apache Spark is a third-generation data analytics engine and a unified engine for big data processing. Unified means it offers both batch and stream processing. Apache Spark overcame the challenges of Hadoop by parallelizing work in memory, delivering high performance for distributed processing.
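To make the unified model concrete, here is a minimal PySpark sketch that processes the same kind of data first as a batch DataFrame and then as a stream; the /data/events/ path and the JSON format are made-up placeholders.

from pyspark.sql import SparkSession

# Create (or reuse) the entry point to Spark.
spark = SparkSession.builder.appName("unified-example").getOrCreate()

# Batch processing: read a static set of JSON files (placeholder path).
batch_df = spark.read.json("/data/events/")
batch_df.groupBy("event_type").count().show()

# Stream processing: the same DataFrame API, but the source is unbounded.
# Streaming file sources need an explicit schema, reused from the batch read here.
stream_df = (
    spark.readStream
         .schema(batch_df.schema)
         .json("/data/events/")
)
query = (
    stream_df.groupBy("event_type").count()
             .writeStream
             .outputMode("complete")
             .format("console")
             .start()
)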

Some of the key features offered by Spark are:

  • Analytics and big data processing
  • Built-in machine learning capabilities (MLlib)
  • A huge open-source community and contributor base
  • A distributed framework for general-purpose data processing
  • Support for Java, Scala, Python, R, and SQL
  • Integration with libraries for streaming and graph processing

Fig. 1. Apache Spark system overview

Fig. 1 shows the high-level picture of Apache Spark. At the base layer, we have a cluster. Since Spark is a distributed system, it uses a cluster, or group of machines, for data processing. Usually, commodity hardware is used for clusters, reducing the cost of running Spark.

Spark uses distributed storage, so data is spread across the nodes of the underlying cluster. Spark can read data from any of the supported storage systems, such as S3, HDFS, or the local filesystem. A resource manager is responsible for allocating resources and managing the jobs being executed on Apache Spark; Spark supports several resource managers: YARN, Mesos, Kubernetes, or Spark Standalone. Finally, the Spark core is responsible for the core processing of the data. Several libraries can be used on top of the Spark core. For example, Spark SQL is responsible for SQL query execution, Spark Streaming is used for stream processing, MLlib is used for machine learning, and GraphX is used for graph processing.
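As a small, self-contained illustration of the libraries sitting on top of the Spark core, the sketch below uses Spark SQL to run a query over a tiny DataFrame built inline (the sales data is invented for the example).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Build a small DataFrame inline (hypothetical sales records).
sales = spark.createDataFrame(
    [("books", 12.50), ("books", 7.99), ("games", 59.90)],
    ["category", "amount"],
)

# Register it as a temporary view so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# Spark SQL executes the query on top of the Spark core engine.
spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()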

Apache Spark Architecture

A Spark application consists of a driver process and a set of executor processes.

A driver process is the heart of the application and sits on a node of the cluster. It runs in its own Java virtual machine (JVM). It executes the main() or entry function and controls the Spark application’s execution. It is responsible for:

  • Maintaining information about the Spark application
  • Responding to a user’s program or input
  • Analyzing, distributing, and scheduling work across the executors

The driver process also hosts the SparkSession or SparkContext, which controls the Spark application and lets you interact with Spark programmatically. It is the main entry point to Spark functionality. Once you submit your code to Spark for execution, it creates a DAG (directed acyclic graph) of the computation, which is later converted into an actual physical execution plan.
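As a rough sketch of what that looks like in code: the SparkSession is created first, the transformations only extend the DAG, and nothing executes until an action is called. The column expressions below are illustrative only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# SparkSession: the entry point that the driver process hosts.
spark = SparkSession.builder.appName("dag-example").getOrCreate()

# Transformations are lazy: these lines only extend the DAG.
df = spark.range(1_000_000)                        # a single column named "id"
filtered = df.where(F.col("id") % 2 == 0)
doubled = filtered.withColumn("double_id", F.col("id") * 2)

# Inspect the logical and physical plans Spark derived from the DAG.
doubled.explain()

# An action triggers the driver to schedule the plan on the executors.
print(doubled.count())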

Fig. 2. Apache Spark architecture

An executor process is where the application code runs and where all the computation in the cluster happens. An executor process is responsible for two things (see the sketch after this list):

  • Executing code assigned to it by the driver
  • Reporting the state of the computation on that executor back to the driver
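A minimal sketch of that division of labour, using made-up numbers: the driver defines the work, the function passed to the transformation runs on the executors, and the action brings the results back to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-example").getOrCreate()
sc = spark.sparkContext

# The driver defines the work: an RDD split into 4 partitions.
rdd = sc.parallelize(range(100), numSlices=4)

# The squaring function is shipped to the executors and runs there, one task per partition.
squares = rdd.map(lambda x: x * x)

# collect() is an action: the executors report their results back to the driver.
result = squares.collect()
print(len(result), result[:5])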

Databricks

Databricks is a platform for data engineering, data science, and machine learning. It was founded by the creators of Apache Spark; hence it uses Spark as its core runtime engine. The Databricks platform abstracts infrastructure management away from users and automates cluster management. It uses IPython-style notebooks to create, write, and submit applications.

It runs on all the major cloud platforms (AWS, Azure, GCP) and also has its own offering. However, Azure Databricks is the most common among the offerings, and it is optimized to run with Microsoft Azure services.

Nowadays, Databricks has become more than Spark, as it has introduced a new ecosystem that overcomes the challenges of legacy data processing systems. Databricks offers different environments: SQL, Machine Learning, and Data Science & Engineering.

The Data Science & Engineering environment allows data engineers, data scientists, and ML engineers to collaborate while working on the same data at the same time. Some of the key concepts are as follows.

Workspace

A workspace is similar to a development project workspace: all the data, computational resources, and assets for a project are grouped there. A notebook is a web-based interface to documents that contain runnable commands, visualizations, and narrative text. Finally, a Repo is a folder whose contents are co-versioned together by syncing them to a remote Git repository.

Databricks uses the Databricks File System (DBFS), a distributed file system within Databricks. When using Azure Databricks, the data is stored in Azure object storage, but all of that complexity is abstracted away and you work with ordinary directory and file semantics.
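For example, inside a Databricks notebook (where spark, dbutils, and display are available by default) DBFS can be browsed and read with plain path semantics; the CSV path below is a hypothetical example.

# List the root of DBFS using the Databricks notebook utilities.
display(dbutils.fs.ls("/"))

# Read a CSV file stored in DBFS with ordinary path semantics
# (the path is a made-up example).
people = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("dbfs:/FileStore/data/people.csv")
)
display(people)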

Data can also be stored in database tables. There are two types of tables: global tables and local tables. Global tables are available across all clusters. Local tables are only available for the duration of a session and hence are also known as temporary views.
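A minimal notebook sketch of the two table types (the table and view names are arbitrary): saveAsTable registers a global table in the metastore, while createOrReplaceTempView creates a local table that lives only for the current session.

# A small DataFrame to persist (hypothetical people records).
people = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])

# Global table: persisted to the metastore, visible from all clusters.
people.write.mode("overwrite").saveAsTable("default.people_global")

# Local table (temporary view): only exists for the lifetime of this session.
people.createOrReplaceTempView("people_temp")

spark.sql("SELECT * FROM default.people_global").show()
spark.sql("SELECT * FROM people_temp").show()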

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job.

  • You create an all-purpose cluster using the UI, CLI, or REST API (a minimal REST API sketch follows this list). You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative, interactive analysis.
  • The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
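The REST API route from the first bullet can be sketched in a few lines of Python; the workspace URL, token, runtime version, and node type below are placeholders you would replace with values from your own workspace.

import os
import requests

# Placeholders: set these for your own Azure Databricks workspace.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]   # a personal access token

cluster_spec = {
    "cluster_name": "demo-all-purpose",
    "spark_version": "10.4.x-scala2.12",   # example runtime; pick one your workspace offers
    "node_type_id": "Standard_DS3_v2",      # example Azure VM type
    "num_workers": 2,
    "autotermination_minutes": 60,
}

# Create an all-purpose cluster via the Clusters API.
resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # contains the new cluster_id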

Job

A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis (a Jobs API sketch follows the list below). Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).

  • Data engineering An (automated) workload runs on a job cluster which the Azure Databricks job scheduler creates for each workload.
  • Data analytics An (interactive) workload runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
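As a hedged sketch of the data engineering (job) path, the Jobs API can create a scheduled notebook job that runs on its own job cluster; the notebook path, cluster settings, and cron expression are made-up placeholders.

import os
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "nightly-demo-job",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every night at 02:00
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {"notebook_path": "/Users/me@example.com/my-notebook"},
            "new_cluster": {                        # a job cluster created just for this run
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}

# Create the scheduled job via the Jobs API.
resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())   # contains the new job_id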

Pool

A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.

Databricks Runtime

The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers several types of runtimes:

  • Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
  • Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
  • Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic and biomedical data.
  • Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.

Now that you know about Spark and Databricks, please follow the guide to create your cluster and run your first Spark job on it.

NB: The definitions are taken from the Microsoft Azure Databricks concepts page [1].

References

  1. https://docs.microsoft.com/en-us/azure/databricks/getting-started/concepts
  2. https://app.pluralsight.com/library/courses/getting-started-apache-spark-databricks/table-of-contents
