How to prepare for the Databricks Certified Data Engineer Associate examination

Moniruddin Ahammed
5 min read · Jul 3, 2022


Databricks Certified Data Engineer Associate

What is Databricks Data Engineer Associate Certification?

The Databricks Certified Data Engineer Associate examination assesses an individual's understanding of the Databricks Lakehouse platform, the different components of the Lakehouse, and the real-life use cases you can solve easily with it. A candidate should be comfortable writing code in SQL and Python/Scala. The Lakehouse platform runs on top of Apache Spark, so during preparation you will learn a lot about Spark; knowledge of Spark is essential. A candidate should be able to build basic ETL pipelines using Spark Structured Streaming and batch jobs, create dashboards, and implement security best practices.

Format of the exam:

Number of questions: 45

Type of questions: Multiple-choice, with exactly one correct option.

Duration: 90 minutes.

Passing score: 70%

Exam fee: USD 200. You can retake the exam as many times as you want, but each attempt costs USD 200.

Result: You'll see the result (PASS/FAIL) on your computer screen immediately after the exam. If you pass, you will receive your digital badge within 24 hours.

Expiration: 2 years

Where to register for the certification: https://www.webassessor.com/databricks

Learning Pathway:

If you do not have prior knowledge of the Databricks Lakehouse platform, you are highly encouraged to first complete the "Databricks Lakehouse Fundamentals Accreditation". After completing this fundamentals course, you will have a broader view of the capabilities of the Lakehouse platform and the common data engineering pain points it solves.

The important topics covered in the exam are:

1. Databricks Workspace and Services:

- Creating and managing clusters; different types of clusters for specific workloads.

https://docs.databricks.com/clusters/basics.html

- Understanding basic Python notebooks and how they work.

- Basic dbutils, %fs, and %sh commands.

2. Delta Lake:

- Structure of the Delta Lake transaction log and what information it contains.

- Managing Delta Lake tables: managed vs. external tables, table properties.

- Optimization: ZORDER, Bloom filter indexes, vacuuming.

- Versioning: the concept of versioning (time travel) and how to retrieve historical data.

- CTAS, the MERGE SQL command, views (temporary vs. global temporary).

- Knowledge of the Parquet file structure.
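As a hedged sketch, the SQL behind several of these Delta Lake topics looks like this (all table, column, and path names are made up for illustration):

```sql
-- Managed table: Databricks controls where the data is stored.
CREATE TABLE sales_managed (id INT, amount DOUBLE, region STRING);

-- External table: the data lives at a storage path you manage.
CREATE TABLE sales_external (id INT, amount DOUBLE, region STRING)
LOCATION 's3://my-bucket/sales';

-- CTAS: create a table from the result of a query.
CREATE TABLE high_value_sales AS
SELECT * FROM sales_managed WHERE amount > 1000;

-- MERGE: upsert incoming changes into a target table.
MERGE INTO sales_managed AS t
USING sales_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Optimization: compact small files and co-locate data by a column.
OPTIMIZE sales_managed ZORDER BY (region);

-- Vacuuming: remove data files no longer referenced by the table.
VACUUM sales_managed RETAIN 168 HOURS;

-- Versioning / time travel: inspect history and query an older version.
DESCRIBE HISTORY sales_managed;
SELECT * FROM sales_managed VERSION AS OF 3;

-- Temporary view (session-scoped) vs. global temporary view (cluster-scoped,
-- read from the global_temp schema).
CREATE TEMP VIEW recent_sales AS SELECT * FROM sales_managed;
CREATE GLOBAL TEMPORARY VIEW all_sales AS SELECT * FROM sales_managed;
```

For the exam, it helps to remember that dropping a managed table deletes its data, while dropping an external table leaves the files at the LOCATION path intact.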

3. Incremental Data Processing:

- Auto Loader: how it works and how it stores state information (e.g., in RocksDB).

https://docs.databricks.com/ingestion/auto-loader/index.html

- cloudFiles: loading files from cloud storage; pros and cons of directory listing vs. file notification mode.

https://docs.databricks.com/ingestion/auto-loader/file-detection-modes.html

- Creating SQL UDFs.
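A minimal sketch of creating and using a SQL UDF (the function name and logic are made up for illustration):

```sql
-- A simple SQL UDF that applies a discount rate to a price.
CREATE OR REPLACE FUNCTION apply_discount(price DOUBLE, rate DOUBLE)
RETURNS DOUBLE
RETURN price * (1 - rate);

-- Use it like any built-in function:
SELECT apply_discount(100.0, 0.1) AS discounted_price;

-- Inspect the function's definition, owner, and metadata:
DESCRIBE FUNCTION EXTENDED apply_discount;
```

The exam tends to test that SQL UDFs are persisted in the metastore and governed by the same access-control model as tables and views.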

4. Spark Structured Streaming:

- Checkpointing.

- Different types of triggers and trigger intervals.

- Monitoring streaming queries.

- Purpose of watermarks and windows in streaming queries.

https://docs.databricks.com/spark/latest/structured-streaming/production.html

- Multi-hop (medallion) architecture and the business reasons for using Bronze, Silver, and Gold tables.

5. Delta Live Tables:

- Benefits and features.

- Creating data pipelines using SQL.

- Migrating a SQL pipeline to Delta Live Tables.
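As an illustrative sketch of both the medallion architecture and a Delta Live Tables pipeline in SQL (table names, the source path, and the expectation are assumptions, not from the course):

```sql
-- Bronze: ingest raw JSON incrementally with Auto Loader (cloud_files).
CREATE OR REFRESH STREAMING LIVE TABLE orders_bronze
AS SELECT * FROM cloud_files('/mnt/raw/orders', 'json');

-- Silver: clean and validate; rows violating the expectation are dropped.
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
FROM STREAM(LIVE.orders_bronze);

-- Gold: business-level aggregate for reporting and dashboards.
CREATE OR REFRESH LIVE TABLE orders_gold
AS SELECT customer_id, SUM(amount) AS total_spend
FROM LIVE.orders_silver
GROUP BY customer_id;
```

Note the pattern: tables in the same pipeline are referenced through the LIVE schema, and streaming reads from another pipeline table use STREAM().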

6. Data Governance:

- Managing permissions (ACLs) for databases, tables, and views.

- Configuring privileges on production data and derived tables for individual users and teams (groups).

- Unity Catalog.

- How secrets work.
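The permission topics above map to SQL GRANT/REVOKE statements; a minimal sketch (the database, table, group, and user names are made up):

```sql
-- Give a group read access to a database and one of its tables.
GRANT USAGE ON DATABASE sales_db TO `analysts`;
GRANT SELECT ON TABLE sales_db.orders TO `analysts`;

-- Give an individual user write access to a derived table.
GRANT MODIFY ON TABLE sales_db.orders_derived TO `user@example.com`;

-- Review existing grants, and revoke one.
SHOW GRANTS ON TABLE sales_db.orders;
REVOKE SELECT ON TABLE sales_db.orders FROM `analysts`;
```

A common exam pattern is granting SELECT on a view without granting access to its underlying production tables, so consumers see only the derived data.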

7. Databricks Repos:

- Connecting to common Git providers such as GitHub, Bitbucket, AWS CodeCommit, and GitLab.

- Branching and merging of code.

- The CI/CD process.

8. Databricks SQL & Dashboards:

- SQL endpoints.

- Scheduling SQL queries and configuring alerts.

- Creating SQL dashboards.
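Dashboards and alerts are built on saved queries. As an illustrative sketch, an alert might watch a query like the one below and fire when the returned value crosses a threshold you configure in the UI (the table, columns, and threshold are made up):

```sql
-- Count of failed orders in the last hour; an alert can be configured
-- on this query's result (e.g., notify when failed_count > 100).
SELECT COUNT(*) AS failed_count
FROM sales_db.orders
WHERE status = 'FAILED'
  AND order_ts >= current_timestamp() - INTERVAL 1 HOUR;
```

For the exam, know that such queries run on a SQL endpoint and can be scheduled to refresh on an interval, with alerts evaluated against the refreshed result.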

9. Multi-task jobs:

- Job scheduling.

- Task orchestration.

- Job clusters: benefits and how to troubleshoot common issues.

Useful courses, labs, and documents for preparation:

- "Data Engineering with Databricks", available on Databricks Academy (https://databricks.com/learn/training/home). A good preparatory course for the exam.

- Practice with the data engineering notebooks (https://github.com/databricks-academy/data-engineering-with-databricks). You can create a trial Databricks account (it's free for 14 days) and use any public cloud (AWS, GCP, Azure) to create a cluster for learning. This is a must for beginners.

(Please note: you have to bear the associated public-cloud infrastructure cost during this period. For AWS, the cost mainly includes EC2 instances for the cluster, the NAT gateway, and S3 storage. For learning purposes, I would suggest a single-node cluster with an m5a.large instance (the smallest available). Terminate or delete the cluster when it is not in use, and understand the difference between terminating and deleting a cluster.)

- Read Databricks documentation and blogs.

- Sample practice test: https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DataEngineerAssociate.pdf? For incorrectly answered questions, revisit those concepts and practice.

Additional resources:

  1. Data Engineer Associate slides.
  2. Delta Live Tables video.
  3. Data engineering demo video.

For the latest information and official guidelines about the Databricks Certified Data Engineer Associate examination, always refer to the link below:

FAQ- https://files.training.databricks.com/lms/docebo/databricks-academy-faq.pdf?

Useful Tips:

1) Use a Databricks free trial account (Premium subscription with a 14-day full trial) for learning.

2) Databricks is available on the AWS, GCP, and Azure public clouds. Use your preferred cloud platform: if you are already comfortable with a particular cloud service provider (CSP), choose that platform. Do not spend much time learning a new cloud service provider's platform.

3) During the trial period, you don't have to pay subscription fees to Databricks, but you do have to pay the associated infrastructure costs of your chosen cloud platform (AWS, Azure, or GCP).

4) To minimize infrastructure cost during learning, you may choose to delete your cluster when you are not using it; this saves compute cost. You may also delete your workspace, which saves NAT gateway charges. Make sure there is no Elastic IP (public IP) left unassociated in your account (i.e., not attached to any EC2 instance).

5) If you delete everything (to save cost when not using the platform), it usually takes around 10-12 minutes to create the workspace and cluster again.

6) For learning purposes and to be frugal, use a single-node cluster with a small EC2 instance such as m5a.large; AMD-based instances are cost-effective.

7) While creating a workspace, you may choose to use the Quickstart. It creates all the required infrastructure and services, including a separate VPC, and saves a lot of time.

8) If you are very comfortable with networking concepts such as VPCs and route tables, you may choose to deploy into your own VPC, but for beginners I would suggest using the Quickstart deployment.

9) For SQL endpoints, also use the smallest cluster size during learning.

Happy Learning. Best of luck!
