Notes for Databricks CRT020 Exam Prep Part 1

Lackshu Balasubramaniam
7 min read · Jan 19, 2020

As I walk through the Databricks exam prep for Apache Spark 2.4 with Python 3, I’m collating notes based on the knowledge expectations of the exam. This is a snapshot of my review of the materials. It’s a work in progress…

This post includes the following:

  • Spark Architecture Components
  • Spark Execution
  • Spark Concepts

My references were:

Spark: The Definitive Guide

Spark In Action

Databricks website

Databricks Engineering blog

Apache Spark 2.4 documentation

Various fantastic developer blogs; too many to call out individually.

Summary

The exam validates knowledge of the core components of the DataFrames API and confirms understanding of the Spark architecture.

The exam lasts 180 minutes and consists of:

  • multiple-choice questions
  • coding challenges

Note:

  • Coding challenges are implemented within the Databricks platform
  • Access to the programming guide and API docs is available during the exam

Spark Architecture Components

Expectation of Knowledge:

  • Driver
  • Executor
  • Cores/Slots/Thread
  • Partitions

Spark Architecture

Figure 1: Logical Architecture

Descriptions of the architectural elements are in the subsequent pages.

Note:

The SparkContext represents the connection to the Spark cluster, which is why it’s shown in the logical architecture diagram.

SparkSession can be used interchangeably with SparkContext in the diagram above; since Spark 2.0, the session is the unified entry point of a Spark application, combining SparkContext and SQLContext among others.
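
As a quick illustration, here is a minimal sketch of creating a session and reaching the underlying context (the application name is arbitrary, and on Databricks a session named spark is already provided in notebooks):

from pyspark.sql import SparkSession

# Build (or reuse) the unified entry point introduced in Spark 2.0
spark = SparkSession.builder \
    .appName("crt020-prep") \
    .getOrCreate()

# The older entry point is still reachable from the session
sc = spark.sparkContext
print(sc.master)      # the cluster this context is connected to
print(spark.version)  # e.g. 2.4.x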

Cluster

A cluster is a grouping of machines whose resources are pooled so that their cumulative capacity can be leveraged as one.

Cluster manager: The service in the master node that controls the physical machines. Applications acquire resources on the cluster via the cluster manager.

Worker node: Any node in the cluster that can run application code.

Application

A user program built on Spark to perform a set of operations. It consists of a driver program and executors on the cluster. Each application has its own executors.

Descriptions of the elements of an application are in subsequent pages.

Driver

The JVM process that controls the execution and maintains the state of a Spark application. The roles of the driver are to:

  • create the SparkContext
  • respond to the user’s program or input
  • distribute and schedule work across the executors
  • maintain the state of the executors and their tasks
  • pipeline transformations

Note:

Some of these concepts will be shared in subsequent pages.

Executor

A JVM process launched on a worker node that runs tasks assigned by the driver.

  • performs the computations
  • keeps data in memory or on disk
  • reports the state of its computations back to the driver

Cores/Slots/Threads

Figure 2: Driver and Slots in Executors

Each executor has slots, which are logical execution cores for running tasks in parallel. In the diagram above, each executor has 4 slots available.

The driver sends tasks to vacant slots on the executors when there is work to be executed. This is how execution is distributed and parallelized across the cluster.

Note:

Data frames (which are built on top of RDDs under the covers) are distributed by nature, which is what allows tasks to be distributed and parallelized.
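
The number of slots is driven by the executor core setting. As a rough sketch (the values are illustrative only, and on Databricks these are normally set through the cluster configuration rather than in code), the relevant Spark properties are:

from pyspark.sql import SparkSession

# Illustrative sizing only: 8 executors with 4 slots (cores) each
spark = SparkSession.builder \
    .appName("slots-example") \
    .config("spark.executor.instances", "8") \
    .config("spark.executor.cores", "4") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()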

Partitions

Resilient Distributed Datasets (RDDs) are collections of data objects. They have been the fundamental data structure of Spark since its inception.

These objects are stored in partitions. A partition is a collection of rows that sits on one worker node in the cluster. Thus an RDD is distributed across the participating worker nodes.

When performing computations on RDDs, the partitions can be operated on in parallel.

Note:

Data frames are more commonly used in PySpark these days and abstract away the complexity of working with RDDs. Note that the RDD API is still powerful and can be used for low-level manipulations.
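
A minimal sketch of inspecting partitions from PySpark (assuming a SparkSession named spark; the data is just a generated range):

# Create a small DataFrame and check how it is partitioned
df = spark.range(0, 1000)

# Data frames expose partition information through their underlying RDD
print(df.rdd.getNumPartitions())

# The low-level RDD API is still available for fine-grained control
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())  # 8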

Spark Execution

Expectation of Knowledge:

  • Jobs
  • Stages
  • Tasks

Job

A job is a parallel computation consisting of multiple tasks that gets spawned across executors in response to a Spark action, for example save or collect.

Jobs can be scheduled to execute either on a pre-existing cluster or on their own cluster (job cluster).
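
For example (a sketch; the output path is illustrative), each action below spawns its own job, which can be observed in the Jobs tab of the Spark UI:

df = spark.range(0, 100000)

# Each action triggers a separate job
total = df.count()       # job 1
first_rows = df.take(5)  # job 2
df.write.mode("overwrite").parquet("/tmp/crt020_jobs_demo")  # job 3; path is illustrative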

Note:

Actions always return results.

Some of these concepts will be shared in subsequent pages.

Stage

Each job gets decomposed into sets of tasks called stages that depend on each other.

A stage represents a group of tasks that can be executed together to perform the same operation across multiple executors in parallel.

A stage is a combination of transformations that do not cause any shuffling of data across nodes.

Spark starts a new stage when shuffling is required in the job.
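
A sketch of a job that Spark splits into two stages, with the shuffle introduced by groupBy forming the boundary (the bucket column is made up for the example):

from pyspark.sql import functions as F

df = spark.range(0, 100000).withColumn("bucket", F.col("id") % 10)

# Stage 1: narrow work (range, withColumn, filter) pipelined together
# Stage 2: begins after the shuffle required by groupBy
result = (df.filter(F.col("id") > 100)
            .groupBy("bucket")
            .count())

result.collect()  # the action that actually runs both stages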

Task

A task is a unit of work that will be sent to one executor.

Each task is a combination of blocks of data and a set of transformations that will run on a single executor.

Spark Concepts

Expectation of Knowledge:

  • Partitioning
  • Shuffling
  • Wide vs. Narrow Transformations
  • Dataframe Transformations vs. Actions vs. Operations
  • Caching
  • High-level Cluster Configuration

Partitioning

An RDD is partitioned into smaller chunks across worker nodes, as mentioned earlier.

Every worker node in the cluster contains one or more partitions. The number of partitions across executors is configurable.

If the number is too small, it reduces concurrency and can cause data skewing.

If the number is too high, the overhead of scheduling and managing many small tasks can outweigh the actual execution time.

The default value is the total number of cores across all worker nodes.

A partition does not span multiple worker nodes; tuples (rows) in the same partition are on the same node.

Spark assigns one task per partition, which determines the level of parallelism. Each slot (core) processes one task at a time.

Partitioning Scheme:

  • Hash partitioning: even distribution
  • Range partitioning: key-based ordering; tuples with the same key reside on the same node
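
A minimal sketch of checking partition counts and the settings that drive parallelism (the values in the comments assume the Spark defaults):

# Default parallelism roughly reflects the total cores across the executors
print(spark.sparkContext.defaultParallelism)

# Number of partitions produced after a shuffle (defaults to 200)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# One task is scheduled per partition of this DataFrame
df = spark.range(0, 1000000)
print(df.rdd.getNumPartitions())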

Syntax:

coalesce: Collapses partitions on the same worker to avoid a full shuffle.

df1.coalesce(n).rdd.getNumPartitions()

The getNumPartitions() call forces the partition count to be evaluated immediately, i.e. eager execution; coalesce on its own is lazy.

repartition: Repartitions the data frame to increase or decrease the number of partitions. Causes a shuffle.

df2 = df1.repartition(n)

df2 = df1.repartition(n, "column name")

repartitionByRange: Repartitions the data frame to increase or decrease the number of partitions. Causes a shuffle, and the resulting DataFrame is range partitioned.

df2 = df1.repartitionByRange(n, "column name")

df2 = df1.repartitionByRange("column name")

Note:

These repartitioning operations are lazy.

Shuffling

Shuffling is a process of redistributing data across partitions to fulfill an operation.

In essence, it’s physical movement of data between partitions across executors or over the wire between executors on different worker nodes.

Shuffling occurs to perform data transfer between stages. It occurs when data from multiple partitions needs to be combined in order to build partitions for a new RDD i.e. a map-reduce operation.

An example is bringing all of the data with the same key to the same executor, as happens in a groupBy.
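
For instance (a sketch with made-up column names), a groupBy forces all of the rows for a given key onto the same partition; the shuffle shows up as an Exchange in the physical plan:

from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("cust1", 10.0), ("cust2", 25.0), ("cust1", 5.0)],
    ["customer_id", "amount"],
)

totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# The physical plan contains an Exchange (shuffle) before the final aggregation
totals.explain()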

Shuffle Manager:

  • hash-based shuffle manager
  • sort-based shuffle manager

Note: Details of the shuffle manager operation are in the blog the link points to.

Operation

Spark operations consist of transformations and actions.

Transformation

Transformations are operations that create a new data frame out of existing ones. Transformations are not executed at the time you run the line of code; they only get executed once you call an action.

This is the basis of lazy evaluation and the pipelining of transformations.

An example of a transformation might be to convert an integer into a float or to filter the data frame by a set of values.

Examples of transformations:

  • select
  • sum
  • groupBy
  • orderBy
  • filter
  • limit
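
A sketch chaining the transformations listed above (the data and column names are made up); nothing executes at this point:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("north", 100), ("south", 200), ("north", 50)],
    ["region", "amount"],
)

# Pure transformations: each returns a new DataFrame, and nothing runs yet
pipeline = (sales.select("region", "amount")
                 .filter(F.col("amount") > 60)
                 .groupBy("region")
                 .sum("amount")
                 .orderBy("region")
                 .limit(10))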

Action

Actions initiate computations or convert dataframe(s) to native language types. Actions are commands that are computed by Spark right at the time of their execution.

It’s like the just-in-time concept in logistics. Actions consist of running all of the transformations in order to get back the required result.

An action will be executed by the executors in parallel where possible.

Examples of actions:

  • show
  • count
  • collect
  • save
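
A separate sketch of the actions listed above (the output path is illustrative); each call triggers execution of the pending transformations:

df = spark.range(0, 100)

# Actions force execution and return results to the driver (or write them out)
df.show(5)           # display the first rows
print(df.count())    # 100
rows = df.collect()  # list of Row objects on the driver
df.write.mode("overwrite").parquet("/tmp/crt020_actions_demo")  # save; path is illustrative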

Wide vs. Narrow Transformations

Figure 3: Narrow Transform

Transformations consisting of narrow dependencies (narrow transformations) are those where each input partition in the source data frame contributes to only one output partition in the target data frame.

Spark automatically performs an optimization called pipelining on narrow dependencies: if multiple filters are specified on a source data frame, they can all be performed in memory in a single pass.

Figure 4: Wide Transform

Wide dependencies (or wide transformations) will have input partitions contributing to many output partitions. This will result in a shuffle where Spark will exchange partitions between executors.
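
A small illustration of pipelining (the value column is hypothetical): two chained filters are fused into a single step in the physical plan, and no Exchange (shuffle) appears:

from pyspark.sql import functions as F

df = spark.range(0, 1000).withColumn("value", F.col("id") * 2)

# Two narrow transformations chained together
narrow = df.filter(F.col("id") > 10).filter(F.col("value") < 500)

# explain() shows the filters combined into one step, with no Exchange (shuffle)
narrow.explain()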

Lazy Evaluation

The concept of lazy evaluation means that Spark will wait until required to execute the graph of computation instructions.

It leverages the concept of a pipeline, where entire steps (transformations) are chained from start to finish, i.e. a sequence of transformations on the data.

The execution plan is built on the pipeline rather than individual transforms.

Caching

Caching is used in applications where the same data sets are reused across operations.

The data sets are stored in temporary storage across the executors for quicker access. Storage can be memory or disk based. Caching is a lazy operation, so it only occurs when the data is first accessed.

Caching comes with serialization, deserialization, and storage costs, so it should be used sparingly.

You must manually specify the tables and queries to cache.

Syntax:

To cache a DataFrame:

df1.cache()
df1.count()

The count() action causes the cache to be materialized immediately, i.e. eager execution.
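
In addition to cache(), persist() lets you choose the storage level explicitly, and unpersist() releases the cached data. A minimal sketch (the DataFrame is just a generated range):

from pyspark import StorageLevel

# persist() takes an explicit storage level (memory, disk, or both)
df2 = spark.range(0, 1000000).persist(StorageLevel.MEMORY_AND_DISK)
df2.count()  # the first action materializes the cache

# Release the cached data when it is no longer needed
df2.unpersist()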

High-level Cluster Configuration

Cluster configuration options:

  • Cluster/application sizing and sharing: how you want to share resources at the cluster or application level
      • Standard cluster vs. high concurrency cluster
      • Preemption
      • Fault isolation
      • Enable table access control
  • Dynamic allocation: dynamically adjust the resources an application occupies based on its workload
      • Autoscaling: standard vs. optimized
  • Network configuration
      • SSH access to the cluster
Note:

Refer to the Databricks documentation on cluster creation for more information.

More to come…
