Databricks Machine Learning Associate Certification: A Comprehensive Study Guide

What, Why, and How, with a Detailed Study Guide

Dipendu Chanda
6 min read · Jul 29, 2023

The global stage is being reshaped by the deep-seated effects of technology, specifically the tidal wave of artificial intelligence (AI) and machine learning (ML). Databricks has positioned itself as a premier platform for training these advanced models, growing in popularity due to its generative AI and large language model (LLM) capabilities. Databricks’ acquisition of MosaicML (link) has only expanded its capabilities, enabling customers to train their own LLMs easily and cost-effectively.

Given these developments, it’s clear why our machine learning certification is in high demand. Both organizations hunting for qualified professionals and individuals eager to highlight their credentials are increasingly recognizing its value.

Why should you trust my guidance? Great question! I hold both the ML Associate and ML Professional certifications from Databricks, along with several others, including Data Engineering and Data Analyst. You can verify my credentials here (link).

Why this blog, and what’s in it for me? Growing my Databricks RSU value! :-D That will grow regardless; I am not concerned about that. All jokes aside, my true motivation lies in helping others on their learning journey. As a seasoned Databricks veteran, I recall the hurdles I faced in the early days due to inadequate guidance. Hence, this blog serves as a streamlined, step-by-step map to certification readiness. I hope it helps a few curious souls out there navigate the sea of preparation.

Enough small talk, it’s time to dive straight into the certification process. Trust me, it’s an exciting journey!

Certification Overview:

How do you pass this cert? You’ll need to answer 45 multiple-choice questions within 90 minutes with an accuracy of over 70%. The questions are segmented into four pillars:

  1. Databricks Machine Learning — 29% (13/45)
  2. ML Workflows — 29% (13/45)
  3. Spark ML — 33% (15/45)
  4. Scaling ML Models — 9% (4/45)

Source: databricks.com

Mock Exams:

Free source (1 Set): LINK

Now let’s dig into each topic and subtopic.

Note: I am not allowed to share actual exam questions or papers, but I can share the topics and study materials to lead you onto the right track.

Pillar 1: Databricks Machine Learning — 29% (13/45)

This section delves into various specifics of Databricks, focusing on the application of Databricks ML and the Databricks Runtime for Machine Learning. The topic areas include:

A. Databricks Machine Learning (clusters, Repos, Jobs)

Clusters — Read from Cluster Config documentation.

  • Databricks cluster types, including when to use one type over another
  • Driver node vs. worker node
  • Cluster access modes

Repos — Read from THIS and THIS documentation.

  • Manage branches
  • Edit Repo Notebooks
  • Commit Repo changes to Git
  • See the changes visually

Jobs — Read from Job Creation documentation.

  • Try creating a Job once and see the various options there.

B. Databricks Runtime for Machine Learning (basics, libraries)

Basics — Read from the Databricks Runtime for Machine Learning documentation.

  • Read about Databricks ML runtimes vs. non-ML runtimes, and check the differences

Libraries — Read from Cluster libraries.

  • Look into the commonly used libraries and packages in the Databricks ML runtimes
  • If you want to make library changes, think through the multiple ways to do that
  • Look into how library setup affects collaboration with your teammates

C. AutoML (classification, regression, forecasting)

Read from AutoML documentation.

  • Look into the evaluation metrics
  • Default settings
  • Best generated model: how to find and modify it
  • Generated notebooks
  • The AutoML APIs
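To illustrate the API bullet above, here is a sketch of the AutoML Python API. It runs only inside a Databricks ML runtime (treat it as pseudocode locally), and `train_df` is an assumed Spark DataFrame you would have prepared earlier:

```python
# Sketch only: Databricks-runtime-only API, not runnable locally
from databricks import automl

summary = automl.classify(
    dataset=train_df,        # assumed: a Spark DataFrame with a "label" column
    target_col="label",
    timeout_minutes=30,      # one of the default settings worth knowing
)

# The summary points you to the best trial and its generated notebook,
# which you can open and modify
print(summary.best_trial)
```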

D. Feature Store (basics)

Read from Feature Store documentation.

  • Read the basics on when and why to use it
  • Look into the Feature Store client API
  • Write a few lines of code to create (and write to) a feature table, then append to it
  • Then use that feature table to train an ML model; if you can do all this, you should be good on this main topic!
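The create-and-append exercise above can be sketched as follows. This is Databricks-runtime-only code (treat it as pseudocode locally); the table name and the DataFrames `features_df` / `new_features_df` are made up for illustration:

```python
# Sketch only: requires a Databricks ML runtime
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Create a feature table from a Spark DataFrame of computed features
fs.create_table(
    name="ml.user_features",       # hypothetical database.table name
    primary_keys=["user_id"],
    df=features_df,                # assumed: a Spark DataFrame you built earlier
    description="Per-user aggregate features",
)

# Later, append/update rows with merge semantics keyed on the primary key
fs.write_table(name="ml.user_features", df=new_features_df, mode="merge")
```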

E. MLflow (Tracking, Models, Model Registry)

Read from MLflow Models and Model Registry documentation.

  • Check the components of Databricks Managed MLflow
  • Look into the MLflow client API and find the best runs
  • Learn to log metrics, and see if you can auto-log them
  • Do some coding to nest a few runs and learn the pattern
  • Look into the Model Registry UI and how models are arranged; identify the best one based on your tracked metrics (e.g., R²)
  • Learn the various ways of transitioning a model’s stage, and know which stages exist (None, Staging, Production, Archived)

Pillar 2: ML Workflows — 29% (13/45)

A. Exploratory data analysis (summary statistics, outlier removal)

Summary Statistics — Read THIS doc and THIS.

  • Learn to get summary statistics from your DataFrames: mean, median, standard deviation, etc.
  • Try both the describe and summary methods, and note how their outputs differ

Outlier removal — This is a general ML concept.

  • Also, practice the Python code for filtering outliers out of your data
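One common filtering approach is the IQR rule; here is a small pandas sketch with made-up data (the 1.5 multiplier is the conventional choice, not a requirement):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 300]})  # 300 is an obvious outlier

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)  # the 300 row is filtered out
```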

B. Feature engineering (missing value imputation, one-hot-encoding)

Missing Value Imputation

  • Choose the imputation method based on the column type
  • Think about the business context of a missing value; could there be a reason for it, or a source of bias?
  • Learn mean, median, and mode imputation
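The column-type distinction above can be sketched in pandas (the columns and values are made up): median for a numeric column, mode for a categorical one:

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 35, None, 40],
    "city": ["NY", "SF", None, "NY", "NY"],
})

# Numeric column: median is robust to outliers (mean is the other common choice)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```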

One-Hot-Encoding — Read from THIS documentation.

  • Learn when to use it and when not to
  • Its implications for tree-based models
  • Dense vectors vs. sparse vectors
  • StringIndexer

C. Tuning (hyperparameter basics, hyperparameter parallelization)

Hyperparameter — Read from THIS and THIS documentation.

  • Learn the difference between hyperparameters and parameters
  • Learn the ways to find the best ones
  • Read more on Hyperopt
  • Grid search, random search, etc., and their impact on performance and compute requirements

Hyperparameter Parallelization — Read a subset from THIS documentation.

  • Hyperopt with MLlib
  • Read article shared above — Parallelize hyperparameter tuning with scikit-learn and MLflow

D. Evaluation and selection (cross-validation, evaluation metrics)

Cross-validation

  • Know the difference, when to use each, the order of use, and the potential impact of the ordering of estimator, pipeline, and cross-validator
  • Learn to set the number of folds for CV
  • Learn about data leakage
  • Learn about compute complexity

Evaluation Metrics — Read from here or choose from any place you like.

  • Read about the various evaluation metrics for regression, classification, and forecasting, e.g., R², MAE, RMSE, F1 score, recall, precision, AUC, …
  • For classification, know which metric you would use given the business need
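The core formulas are worth knowing cold; here is a plain-Python sketch with made-up confusion-matrix counts and predictions:

```python
import math

# Classification: derived from a confusion matrix (counts are made up)
tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Regression: MAE penalizes errors linearly; RMSE penalizes large errors more
y_true, y_pred = [3.0, 5.0, 8.0], [2.0, 7.0, 8.0]
errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(precision, recall, f1, mae, rmse)
```

For the business-need bullet: prefer recall when false negatives are costly (e.g., disease screening), precision when false positives are costly.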

Pillar 3: Spark ML — 33% (15/45)

A. Distributed ML Concepts

Read a subset from THIS, THIS, and THIS documentation.

  • Hyperopt with MLlib
  • Read article shared above — Parallelize hyperparameter tuning with scikit-learn and MLflow
  • Know which models can be distributed and which can’t be by default
  • Pandas, Scikit Learn, MLlib, Spark ML

B. Spark ML Modeling APIs (data splitting, training, evaluation, estimators vs. transformers, pipelines)

Read from here

C. Hyperopt

Read from THIS documentation.

  • When to use and when not to
  • Parameters that you can change

D. Pandas API on Spark

Read from THIS documentation.

  • Understand how it works
  • When to use pandas API on Spark vs. pandas vs. Spark
  • Think of scenarios from your own projects where each one would apply

E. Pandas UDFs and Pandas Function APIs

Read from Pandas UDFs and Pandas Function APIs documentation.

Pillar 4: Scaling ML Models — 9% (4/45)

A. Distributed Linear Regression and Decision Trees

Read from THIS and THIS documentation.

  • Learn how Spark distributes the training of these models
  • Examine the example code and execute it once to try it out

B. Ensembling Methods (bagging, boosting)

Ensemble learning combines several diverse models so that their combined predictive strength exceeds that of any individual model.

  • Know the different ways small models can be combined: sequentially (boosting) or in parallel (bagging)
  • How ensembling helps minimize overfitting
  • Which method suits data with a significant number of outliers
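As a toy illustration of bagging's combination step (the predictions are made up): several weak classifiers vote, and the majority wins. Boosting, by contrast, trains models sequentially, each focusing on the previous one's errors, as in gradient-boosted trees:

```python
# Predictions from three weak classifiers on four examples (made up)
model_preds = [
    [1, 0, 1, 1],  # model A
    [1, 1, 0, 1],  # model B
    [0, 0, 1, 1],  # model C
]

# Bagging combines them by majority vote over each example (column)
ensemble = [1 if sum(votes) >= 2 else 0 for votes in zip(*model_preds)]
print(ensemble)
```

Averaging over independently trained models is also why bagging (e.g., random forests) tends to reduce overfitting relative to a single deep tree.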

And with that, you’ve successfully covered the entire syllabus. Well done! It’s now time to venture ahead, either diving straight into the exam or starting with a few mock exams first. You’ve totally got this!

Do share your feedback here once you take the exam. I will be waiting!

And follow me here and on LinkedIn for more content. Adios for now!


Dipendu Chanda

Senior Architect at Databricks. Skilled in everything from software to AI/ML, from FAANG to startups. Keen on holistic learning and sharing: https://www.linkedin.com/in/dchanda