A Roadmap to Acing the Databricks Certified Data Engineer Professional Exam

Gaurav Sohaliya
6 min read · Jul 15, 2023

--

Databricks Data Engineer Professional Badge

Having recently passed the Databricks Certified Data Engineer Professional Exam, I am excited to share my preparation roadmap that led to my success in passing the exam. This certification is a significant milestone for data engineers seeking validation of their skills and expertise in utilizing the Databricks ecosystem.

In this post, I will outline the roadmap to learn fundamental concepts to advanced topics, tips I discovered along the way and provide valuable insights into the resources and techniques. So, let’s dive in and unlock the secrets to passing the Databricks Certified Data Engineer Professional Exam on your first attempt.

All About the Exam

We won’t delve into exam registration, course syllabus, cost, or exam structure in this discussion. For specific information, you can refer to the official Databricks page here. Additionally, Databricks has released a Certification Overview video that offers a more detailed explanation.

This certification is recommended for data engineers with two or more years of experience in the Databricks ecosystem, and the following pathway is provided as a guideline. It is not mandatory to follow, but it will place you in a favorable position if you intend to take the Professional certification exam, as it serves as the last milestone in the Databricks data engineering learning path.

Databricks Data Engineer learning pathway

It is absolutely possible to ace this certification exam even if you choose to directly pursue the final step of the pathway, as I did. I had less than two years of experience as a data engineer and no prior background in Lakehouse architecture. Therefore, it is perfectly acceptable to prepare for this certification without obtaining the base-level certifications or having prior experience.

However, it’s important to acknowledge that this exam was challenging. It covers a wide range of topics and requires you to remember numerous details. The questions not only test your conceptual understanding but also assess your knowledge on specific subjects.

How to Start?

The learning path for this certification may vary from person to person, but I would like to share how I personally prepared. Instead of starting with the Associate certification, I jumped directly into the Professional certification, so I began by understanding the fundamentals of Databricks, specifically Delta Lake and the Lakehouse architecture.

Once I had a solid grasp of the core concepts, I found the following resources immensely valuable for delving deeper into the Databricks ecosystem. If you have already obtained the Associate Data Engineer certification, you can skip this step:

  1. Databricks Certified Data Engineer Associate — Udemy
  2. apache-spark-programming-with-databricks.dbc
  3. Databricks Certified Associate Developer — YouTube

I recommend trying the Data Engineer Associate Free Mock Test if you haven’t already; it can be a helpful practice exercise. Additionally, Databricks offers a free video course specifically designed for the professional certification, which you should definitely go through. You can find the course here. Once you complete this course, congratulations! You will be halfway through your preparation journey. At this point, you just need to cover the remaining topics listed below, and you will be ready to attempt your first mock test.

Topics You Should Not Miss

There are 60 multiple-choice questions on the certification exam. The questions will be distributed by high-level topic in the following way:

  • Databricks Tooling — 20% (12/60)
  • Data Processing — 30% (18/60)
  • Data Modeling — 20% (12/60)
  • Security and Governance — 10% (6/60)
  • Monitoring and Logging — 10% (6/60)
  • Testing and Deployment — 10% (6/60)

Data Processing:

Delta Lake

  1. If you are not familiar with the working mechanism of Delta Lake, start with the book — Delta Lake: The Definitive Guide
  2. Databricks tech talks by Denny Lee
    - Tech Talk | Delta Lake Part 1: Unpacking the Transaction Log
    - Tech Talk | Delta Lake Part 2: Enforcing and Evolving the Schema
    - Tech Talk | Delta Lake Part 3: How do DELETE, UPDATE, and MERGE
    - Diving into Delta Lake 2.0
  3. All the topics mentioned in the Databricks docs for Delta Lake are very important. Frequently asked topics include Vacuum, data skipping, Z-ordering, CDC, column mapping, table constraints, clone, schema evolution, etc.
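To build intuition for the transaction-log questions, here is a toy plain-Python model of the idea behind Denny Lee’s “Unpacking the Transaction Log” talk: table state is just the result of replaying ordered commit actions. All file names and the action shapes are simplified for illustration; this is not the real Delta Lake implementation.

```python
# Toy sketch of how Delta Lake's transaction log determines table state:
# each commit holds an ordered list of JSON-like actions, and replaying the
# "add" / "remove" actions yields the set of data files that currently make
# up the table. Illustrative only -- not the real Delta Lake protocol.

commits = [
    [{"add": {"path": "part-000.parquet"}}],    # commit 0
    [{"add": {"path": "part-001.parquet"}}],    # commit 1
    [{"remove": {"path": "part-000.parquet"}},  # commit 2 (e.g. a rewrite)
     {"add": {"path": "part-002.parquet"}}],
]

def active_files(commits, version=None):
    """Replay the log up to `version` (inclusive) -- the basis of time travel."""
    last = len(commits) if version is None else version + 1
    files = set()
    for commit in commits[:last]:
        for action in commit:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(active_files(commits))             # current table state
print(active_files(commits, version=1))  # "time travel" to version 1
```

This replay model also makes VACUUM easier to reason about: files that were `remove`d are no longer part of the table but still sit on storage until vacuumed, which is why vacuuming limits how far back you can time travel.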

Structured Streaming

  1. I found this course by Prashant Pandey (Learning Journal) very helpful for understanding Structured Streaming from scratch. You can refer to Chapters 3, 5, and 6.
  2. Official documentation: I suggest going through only the topics below:
    - Window Operations on Event Time
    - Join Operations
    - Starting Streaming Queries
  3. Spark Performance Tuning: From this topic, you can expect questions on join optimization and Spark SQL parameter tuning for better performance.
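The event-time windowing and watermarking topics above can be sketched without Spark at all. The following plain-Python model shows the two ideas the exam probes: bucketing events into tumbling windows by event time, and discarding events that arrive behind the watermark (max event time seen minus the lateness threshold). The window size, threshold, and event values are invented for the example.

```python
# Plain-Python sketch of Spark Structured Streaming's event-time windowing
# with a watermark. Illustrative only; real Spark uses withWatermark() and
# window() on a streaming DataFrame.

WINDOW = 600      # 10-minute tumbling windows, in seconds
LATENESS = 300    # watermark threshold: tolerate events up to 5 minutes late

def window_start(event_time):
    """Bucket an event time into its tumbling-window start."""
    return event_time - (event_time % WINDOW)

def aggregate(events):
    """Count events per window, dropping those behind the watermark,
    the way Spark discards data older than max(event_time) - threshold."""
    max_seen = 0
    counts = {}
    for t in events:
        max_seen = max(max_seen, t)
        watermark = max_seen - LATENESS
        if t < watermark:
            continue  # too late: Spark would drop this event
        counts[window_start(t)] = counts.get(window_start(t), 0) + 1
    return counts

events = [0, 100, 650, 1300, 200]   # the final event arrives far too late
print(aggregate(events))            # the t=200 event is dropped
```

Note that the watermark only advances as new data arrives, which is also why a window cannot be finalized (and its state purged) until the watermark passes its end.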

Databricks Tooling:

  1. Auto Loader:
    - Schema inference and evolution
    - File deletion mode
  2. Clusters: If you know the basics of clusters, these are the topics to go through — create and manage clusters, cluster access modes, initialization scripts, single-node and GPU-enabled clusters, cluster logs, etc.
  3. Notebooks: Create, manage, schedule, and run notebooks; widgets; unit testing; running a notebook from another notebook.
  4. Workflows: Create and run jobs, view and manage job runs, and repair job runs are the topics to get hands-on with.
  5. Other topics:
    - Jobs API 2.0
    - dbutils
    - COPY INTO load data
    - MLFlow
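For the Jobs API 2.0 topic, it helps to have seen the shape of a job specification. Below is a sketch of a create-job request body for `POST /api/2.0/jobs/create`; the cluster sizes, notebook path, and job name are placeholders I made up, so adapt them to your workspace.

```python
import json

# Sketch of a Jobs API 2.0 create-job request body.
# All values below are placeholders for illustration.
job_spec = {
    "name": "nightly-etl",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {
        "notebook_path": "/Repos/etl/main",
        "base_parameters": {"env": "prod"},
    },
    "max_retries": 1,
}

payload = json.dumps(job_spec)
# In practice you would POST this with a bearer token, e.g.:
#   requests.post(f"{host}/api/2.0/jobs/create",
#                 headers={"Authorization": f"Bearer {token}"}, data=payload)
print(payload)
```

Knowing which settings live at the job level (name, retries, schedule) versus inside the task (notebook path, parameters) is exactly the kind of detail the exam asks about.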

Data Modeling:

  1. You can expect a few questions on the Data Lakehouse paradigm and the medallion architecture: the different stages of the paradigm, what business value can be derived, and which Spark operations are applicable on each layer of the architecture.
  2. Slowly Changing Dimensions: Questions can ask you to choose the appropriate SCD type for a given use case.
  3. SQL: Table constraints, table properties, partitioning and partition keys.
  4. OPTIMIZE, Z-ordering, and data skipping
  5. Delta Merge
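As a mental model for the SCD and Delta Merge questions, here is a minimal Type 2 upsert in plain Python: changed rows are expired and a new current version is appended. The column names (`key`, `value`, `start`, `end`, `current`) are invented for the example; on Databricks you would express the same logic as a Delta `MERGE` with matched/not-matched clauses.

```python
# Minimal SCD Type 2 upsert in plain Python -- a mental model for what a
# Delta MERGE does on a dimension table. Column names are invented.

def scd2_upsert(dim, updates, load_date):
    """Expire changed current rows and append new versions."""
    for upd in updates:
        match = next(
            (r for r in dim if r["key"] == upd["key"] and r["current"]), None
        )
        if match is None:
            # brand-new key: insert as the current version
            dim.append({**upd, "start": load_date, "end": None, "current": True})
        elif match["value"] != upd["value"]:
            match["end"] = load_date          # expire the old version
            match["current"] = False
            dim.append({**upd, "start": load_date, "end": None, "current": True})
        # unchanged rows are left alone
    return dim

dim = [{"key": 1, "value": "NY", "start": "2023-01-01", "end": None, "current": True}]
scd2_upsert(dim, [{"key": 1, "value": "CA"}, {"key": 2, "value": "TX"}], "2023-07-01")
print(dim)  # three rows: one expired NY version, two current rows
```

Contrast this with Type 1, where the matched branch would simply overwrite `value` in place and no history rows would accumulate; exam questions often hinge on that distinction.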

Security and Governance:

  1. Security & Compliance:
    - Notebook Permissions
    - Cluster Permissions
    - Job Permissions
    - Secret Management & Access Control
  2. Unity Catalog: Create and manage catalogs, schemas, tables, and views (including dynamic views).
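For the dynamic-views topic, the core idea is that the view's output depends on who is querying it. Here is a plain-Python sketch of that behavior; the group name and column are invented, and in actual Unity Catalog SQL you would use a `CASE WHEN is_account_group_member(...)` expression inside the view definition.

```python
# Sketch of what a dynamic view does: mask a column based on the querying
# user's group membership. Group and column names are made up for the example.

def masked_view(rows, user_groups):
    """Return rows with the email column redacted unless the caller
    belongs to the (hypothetical) pii_readers group."""
    can_see_pii = "pii_readers" in user_groups
    return [
        {**r, "email": r["email"] if can_see_pii else "REDACTED"}
        for r in rows
    ]

rows = [{"id": 1, "email": "a@example.com"}]
print(masked_view(rows, {"analysts"}))       # email redacted
print(masked_view(rows, {"pii_readers"}))    # email visible
```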

Monitoring and Logging:

  1. Spark UI: Questions revolve around cluster logs, Spark application logs, performance metrics, memory skew/spill, etc.
  2. Debugging with the Apache Spark UI
  3. Web UI
  4. SparkListeners
  5. Memory skew/spill techniques
  6. Spark UI Simulator

Testing and Deployment:

  1. In this section, a few questions can be asked on libraries, Repos, DBFS, etc.
  2. Importing Python libraries, notebook-scoped libraries
  3. Test Notebook
  4. Repair Job Failure
  5. Supported Git and CICD Operations

Mock Tests

There are two available mock tests. You can choose between the following options:

  1. Databricks Certified Data Engineer Professional — Mock Exams
  2. Practice Exams: Databricks Data Engineer Professional.

Based on my personal experience, I highly recommend purchasing the second option, the Practice Exams. It offers an excellent question bank and was instrumental in my success on the exam.

The Practice Exams package includes two tests. I suggest starting with the first test and aiming for a score above 80%. After completing the mock test, carefully review all the explanations and accompanying links provided with each answer. To aid in your last-minute revision, I have attached a helpful cheat sheet. Additionally, I have included some extra links that can further enhance your understanding of Databricks concepts.

Final Tips

In this article, I have included helpful resources such as relevant documents, blog posts, and additional links. Make sure to explore all these resources thoroughly. They proved invaluable in my exam preparation journey.

I found that the questions are quite lengthy, requiring careful attention to grasp their context. To excel in the exam, take your time and truly comprehend what each question is asking. The exam creators try to trick you with seemingly correct answers that are, in fact, incorrect in the given context.

I hope that by following this roadmap and utilizing the provided resources, you will achieve success in the Databricks Certified Data Engineer Professional Exam. Best of luck on your journey!


Gaurav Sohaliya

Sr. Data Engineer @ Target Corporation | Ex @BankofAmerica @Jio | Love to Code and Build Algos