Databricks Certified Associate Developer for Apache Spark — tips to get prepared for the exam

Thiago Cordon
Published in Data Arena
8 min read · May 31, 2021
Photo by Nguyen Dang Hoang Nhu on Unsplash

Databricks, founded by the creators of Apache Spark, has been widely adopted by companies as a unified analytics engine for big data and machine learning. Gartner classified Databricks as a Leader in its latest Magic Quadrant for Data Science and Machine Learning Platforms.

Magic Quadrant for Data Science and Machine Learning Platforms — Gartner (March 2021).

As many companies use Apache Spark, there is high demand for professionals skilled in this framework but a scarce pool of candidates to fill these positions. As we can see below, the Big Data market is forecast to keep growing, meaning that demand for these professionals will remain high.

Big Data market size revenue forecast worldwide from 2011 to 2027 (in billion U.S. dollars) — https://www.statista.com/statistics/254266/global-big-data-market-forecast/

Being prepared is crucial to meet these market needs, and one way to do that is to get certified. There are plenty of Spark resources on the internet, and even if you already work with Spark, a certification is a good way to prove your knowledge. Studying for it is also a good way to measure how much you know and to identify the topics you find most difficult and should study further.

In this post, I’ll share some important tips that I’ve followed to get certified. I hope it can be useful for you too.

About the exam

Image from Databricks — https://academy.databricks.com/learning-paths

As the image above shows, the Apache Spark Associate Developer certification applies to both the Data Engineer and Data Scientist learning paths. The exam assesses your knowledge of Spark architecture and your ability to use the Spark DataFrame API to manipulate data.

General information ☑️

  • Exam length: The exam consists of 60 multiple-choice questions and you'll have 120 minutes to complete it. That sounds like plenty of time but, believe me, it's not.
  • Programming language: you can choose between Python or Scala.
  • Spark version: Although the exam for Spark 2.4 is still available, it's better to take the test on the most recent version of Spark, which is 3.0.
  • Exam language: English.
  • Pass score: The minimum passing score is 70%, which means you need to answer at least 42 questions correctly.
  • Cost: 200.00 USD. No free retakes.
  • Resources available in the test environment: You will not be able to run code during the exam. A limited PDF version of the Spark API documentation for your chosen language will be available. It doesn't help much because it's extensive and the PDF has no search feature. You'll also have a notepad available for taking notes.

Scheduling the exam 📆

It's recommended to schedule your exam well in advance, because some times and days of the week are in higher demand. To schedule your exam:

1. Access the Databricks Academy and choose the desired exam.

Databricks certification page — Image from https://academy.databricks.com/category/certifications

2. Read the exam details and click on register.

3. You will be redirected to the Kryterion Webassessor, where you'll have to register. Once registered, you can choose an exam, schedule the date and time, and proceed to checkout. Note that this is where you choose between Python and Scala as the test language.

Kryterion Webassessor exam registration.

Security checks 👮

This is an online proctored exam, so there are some security requirements to be aware of.

Before your exam day, you'll have to download and install an application called Sentinel (which works only on Windows), provided by Kryterion, the company responsible for administering the test. After installing the application, you'll need to register with facial recognition. It's an easy process, and you'll see the instructions on the Kryterion Webassessor page after you schedule your exam.

👉 On the test day, the sponsor may require you to complete some security checks before starting your exam:

  • ID confirmation
  • A 360-degree video review of your test environment

👉 Other important requirements to be noticed about the test environment:

  • There is only one active computer, one active monitor, one keyboard, and one mouse.
  • You are not wearing a lanyard, badge, hat, watch, or jewelry. (Remove them before the exam starts.)
  • You may not interact with anyone — aside from online support staff — during your exam.
  • You may not use dual monitors.
  • Breaks during an exam are only allowed when pre-approved by your test sponsor. If you interrupt your exam for a break, we will inform your test sponsor.
  • Do not lean out of the camera view during your exam. A proctor must be able to see you at all times.
  • Cell phones are not permitted in the testing area.
  • Reading the exam aloud is prohibited.

What is covered by the exam? 📋

Although the exam covers data manipulation, the SQL language itself is not assessed: all data-manipulation questions must be solved using the Spark DataFrame API. Spark Streaming is another topic the exam doesn't cover.

👉 The exam questions are distributed into three categories:

Exam questions categories.

Spark DataFrame API questions make up most of the exam, so that should be your study focus if it's where you struggle.

👉 Here is a list of topics assessed in the exam, by category. Use it to identify the topics you find most difficult and prioritize them in your study plan.

Spark Architecture — Conceptual

  • Cluster architecture: nodes, drivers, workers, executors, slots, etc.
  • Spark execution hierarchy: applications, jobs, stages, tasks, etc.
  • Shuffling
  • Partitioning
  • Lazy evaluation
  • Transformations vs Actions
  • Narrow vs Wide transformations

Spark Architecture — Applied

  • Execution deployment modes
  • Stability
  • Storage levels
  • Repartitioning
  • Coalescing
  • Broadcasting
  • DataFrames

Spark DataFrame API

  • Subsetting DataFrames (select, filter, etc.)
  • Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
  • String manipulation (Splitting strings, regex)
  • Performance-based operations (repartitioning, shuffle partitions, caching)
  • Combining DataFrames (joins, broadcasting, unions, etc)
  • Reading/writing DataFrames (schemas, overwriting)
  • Working with dates (extraction, formatting, etc)
  • Aggregations
  • Miscellaneous (sorting, missing values, typed UDFs, value extraction, sampling)

Preparation 👨‍🎓👩‍🎓

According to the Databricks academy page, the minimally qualified candidate should:

  • Have a basic understanding of Spark Architecture, including Adaptive Query Execution.
  • Be able to apply the Spark DataFrame API to complete individual data manipulation tasks, including:

➡ selecting, renaming and manipulating columns

➡ filtering, dropping, sorting and aggregating rows

➡ joining, reading, writing and partitioning DataFrames

➡ working with UDFs and Spark SQL functions

  • Although not explicitly tested in the exam, the candidate must have a working knowledge of either Python or Scala, depending on the language chosen for the test.

👉 Study resources

I listed here some resources you can use to get prepared for the exam.

➡ Training — Apache Spark Programming with Databricks → this training is recommended to learn how to work with DataFrame API — remember that more than 70% of the exam questions are related to DataFrame API practice.

➡ Training — Quick Reference: Spark Architecture → this is recommended to learn the concepts of Spark architecture and distributed computing. It is one of the trainings in the self-paced course bundle that Databricks sells.

➡ Training — Databricks Certified Developer for Spark 3.0 Practice Exams → this is a well-rated Udemy course with a comprehensive set of questions for the certification exam that you can practice. Most questions come with detailed explanations, giving you a chance to learn from your mistakes. There are also links to the Spark documentation and web contents to help you in your study.

➡ Book — Spark: The Definitive Guide: Big Data Processing Made Simple → this book covers Spark Architecture and DataFrame API usage. The recommended sections to study are:

  • I. Gentle Overview of Big Data and Spark
  • II. Structured APIs — DataFrames, SQL, and Datasets
  • IV. Production Applications

➡ Book — Learning Spark, 2nd Edition → this is another book that covers Spark Architecture and DataFrame API usage. It’s lighter than the book “Spark: The Definitive Guide” and covers the exam topics. The recommended sections to study are:

  • 1. Introduction to Apache Spark: A Unified Analytics Engine
  • 2. Downloading Apache Spark and Getting Started
  • 3. Apache Spark’s Structured APIs
  • 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources (excluding the Spark SQL topics)
  • 5. Spark SQL and DataFrames: Interacting with External Data Sources (excluding the Spark SQL topics)
  • 7. Optimizing and Tuning Spark Applications

Spark documentation — Python API → this is the documentation available as a PDF in the exam if you chose Python. I recommend becoming familiar with it, especially the pyspark.sql module and the pyspark package. Familiarity with this documentation will help you quickly browse the PDF on exam day if you need it.

Spark documentation — Scala API → this is the documentation available as a PDF in the exam if you chose Scala. The same recommendation applies: become familiar with it so you can browse it quickly. I didn't take the Scala exam, but I would say the following packages are important for the test:

  • org.apache.spark.rdd
  • org.apache.spark.sql

For the Databricks trainings mentioned above, one tip: check whether your company has a partnership with Databricks. If so, depending on the partnership level, Databricks may offer discount vouchers or even completely free training.

Final considerations

I hope this article helps you plan and prepare for the exam or, at least, helps you study Spark 😄.

If you already have this certification, share your thoughts and how you got prepared.

Know someone who is preparing for this exam? Share this content.

Thanks for reading and best of luck in your exam! 🤞
