Databricks Machine Learning Associate Certification: A Comprehensive Study Guide

What, Why, and How, with a Detailed Study Guide

Dipendu Chanda
6 min read · Jul 29, 2023

The global stage is being reshaped by the deep-seated effects of technology, specifically the tidal wave of artificial intelligence (AI) and machine learning (ML). Databricks has positioned itself as a premier platform for training these advanced models, growing in popularity due to its generative AI and large language model (LLM) capabilities. Databricks’ acquisition of MosaicML (link) has only expanded its capabilities, enabling customers to train their own LLMs easily and cost-effectively.

Given these developments, it’s clear why our machine learning certification is in high demand. Both organizations hunting for qualified professionals and individuals eager to highlight their credentials are increasingly recognizing its value.

Why should you trust my guidance? Great question! I hold both the ML Associate and ML Professional certifications from Databricks, along with several others, including Data Engineering and Data Analyst. You can verify my credentials here (link).

Why this blog, and what’s in it for me? Growing my Databricks RSU value! :-D That will grow regardless; I am not concerned about that. All jokes aside, my true motivation lies in helping others on their learning journey. As a seasoned Databricks veteran, I recall the hurdles I faced in the early days due to inadequate guidance. Hence, this blog serves as a streamlined, step-by-step map to certification readiness. I hope it helps a few curious souls out there navigate the sea of preparation.

Enough small talk, it’s time to dive straight into the certification process. Trust me, it’s an exciting journey!

Certification Overview:

How do you pass this cert? You’ll need to answer 45 multiple-choice questions within 90 minutes with an accuracy of over 70%. The questions are segmented into four pillars:

  1. Databricks Machine Learning — 29% (13/45)
  2. ML Workflows — 29% (13/45)
  3. Spark ML — 33% (15/45)
  4. Scaling ML Models — 9% (4/45)

Source: databricks.com

Mock Exams:

Free source (1 Set): LINK

Now let’s dig into each topic and subtopic.

Note: I am not allowed to share actual exam questions or papers, but I can share the topics and study materials to lead you onto the right track.

Pillar 1: Databricks Machine Learning — 29% (13/45)

This section delves into various specifics of Databricks, focusing on the application of Databricks ML and the Databricks Runtime for Machine Learning. The topic areas include:

A. Databricks Machine Learning (clusters, Repos, Jobs)

Clusters — Read from Cluster Config documentation.

  • Databricks cluster types, including when to use one type over another
  • Driver node vs. worker node
  • Cluster access modes

Repos — Read from THIS and THIS documentation.

  • Manage branches
  • Edit Repo Notebooks
  • Commit Repo changes to Git
  • See the changes visually

Jobs — Read from Job Creation documentation.

  • Try creating a Job once and see the various options there.

B. Databricks Runtime for Machine Learning (basics, libraries)

Basics — Read from the Databricks Runtime for Machine Learning documentation.

  • Read about Databricks ML runtimes vs. non-ML runtimes, and check the differences

Libraries — Read from Cluster libraries.

  • Look into the commonly used libraries and packages in the Databricks ML runtimes
  • If you want to make library changes, think through the multiple ways to do that
  • Look into how library setup affects collaboration with your teammates

C. AutoML (classification, regression, forecasting)

Read from AutoML documentation.

  • Look into the evaluation metrics
  • Default settings
  • Best generated model: how to find and modify it
  • Generated notebooks
  • The AutoML APIs
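To illustrate the API bullet above, here is a sketch of the AutoML Python API. It runs only inside a Databricks ML runtime (treat it as pseudocode locally), and `train_df` is an assumed Spark DataFrame you would have prepared earlier:

```python
# Sketch only: Databricks-runtime-only API, not runnable locally
from databricks import automl

summary = automl.classify(
    dataset=train_df,        # assumed: a Spark DataFrame with a "label" column
    target_col="label",
    timeout_minutes=30,      # one of the default settings worth knowing
)

# The summary points you to the best trial and its generated notebook,
# which you can open and modify
print(summary.best_trial)
```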

D. Feature Store (basics)

Read from Feature Store documentation.

  • Read the basics on when and why to use it
  • Look into the Feature Store client API
  • Write a few lines of code to create (and write to) a feature table, then append to it
  • Then use that feature table to train an ML model; if you can do all this, you should be good on this main topic!
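The create-and-append exercise above can be sketched as follows. This is Databricks-runtime-only code (treat it as pseudocode locally); the table name and the DataFrames `features_df` / `new_features_df` are made up for illustration:

```python
# Sketch only: requires a Databricks ML runtime
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Create a feature table from a Spark DataFrame of computed features
fs.create_table(
    name="ml.user_features",       # hypothetical database.table name
    primary_keys=["user_id"],
    df=features_df,                # assumed: a Spark DataFrame you built earlier
    description="Per-user aggregate features",
)

# Later, append/update rows with merge semantics keyed on the primary key
fs.write_table(name="ml.user_features", df=new_features_df, mode="merge")
```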

E. MLflow (Tracking, Models, Model Registry)

Read from MLflow Models and Model Registry documentation.

  • Check the components of Databricks Managed MLflow
  • Look into the MLflow client API and find the best runs
  • Learn to log metrics, and see if you can auto-log them
  • Do some coding to nest a few runs and learn the pattern
  • Look into the Model Registry UI and how models are arranged; identify the best one based on your tracked metrics (e.g., R²)
  • Learn the various ways of transitioning a model’s stage, and know which stages exist (None, Staging, Production, Archived)

Pillar 2: ML Workflows — 29% (13/45)

A. Exploratory data analysis (summary statistics, outlier removal)

Summary Statistics — Read THIS doc and THIS.

  • Learn to get summary statistics from your DataFrames: mean, median, standard deviation, etc.
  • Try both the describe and summary methods, and note how their outputs differ

Outlier removal — This is a general ML concept.

  • Also, practice the Python code for filtering outliers out of your data
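One common filtering approach is the IQR rule; here is a small pandas sketch with made-up data (the 1.5 multiplier is the conventional choice, not a requirement):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 300]})  # 300 is an obvious outlier

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)  # the 300 row is filtered out
```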

B. Feature engineering (missing value imputation, one-hot-encoding)

Missing Value Imputation

  • Choose the imputation method based on the column type
  • Think about the business context of a missing value; could there be a reason for it, or a source of bias?
  • Learn mean, median, and mode imputation
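The column-type distinction above can be sketched in pandas (the columns and values are made up): median for a numeric column, mode for a categorical one:

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 35, None, 40],
    "city": ["NY", "SF", None, "NY", "NY"],
})

# Numeric column: median is robust to outliers (mean is the other common choice)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```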

One-Hot-Encoding — Read from THIS documentation.

  • Learn when to use it and when not to
  • Its implications for tree-based models
  • Dense vectors vs. sparse vectors
  • StringIndexer

C. Tuning (hyperparameter basics, hyperparameter parallelization)

Hyperparameter — Read from THIS and THIS documentation.

  • Learn the difference between hyperparameters and parameters
  • Learn the ways to find the best ones
  • Read more on Hyperopt
  • Grid search, random search, etc., and their impact on performance and compute requirements

Hyperparameter Parallelization — Read a subset from THIS documentation.

  • Hyperopt with MLlib
  • Read article shared above — Parallelize hyperparameter tuning with scikit-learn and MLflow

D. Evaluation and selection (cross-validation, evaluation metrics)

Cross-validation

  • Know the difference, when to use each, the order of use, and the potential impact of the ordering of estimator, pipeline, and cross-validator
  • Learn to set the number of folds for CV
  • Learn about data leakage
  • Learn about compute complexity

Evaluation Metrics — Read from here or choose from any place you like.

  • Read about the various evaluation metrics for regression, classification, and forecasting, e.g., R², MAE, RMSE, F1 score, recall, precision, AUC, …
  • For classification, know which metric you would use given the business need
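The core formulas are worth knowing cold; here is a plain-Python sketch with made-up confusion-matrix counts and predictions:

```python
import math

# Classification: derived from a confusion matrix (counts are made up)
tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Regression: MAE penalizes errors linearly; RMSE penalizes large errors more
y_true, y_pred = [3.0, 5.0, 8.0], [2.0, 7.0, 8.0]
errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(precision, recall, f1, mae, rmse)
```

For the business-need bullet: prefer recall when false negatives are costly (e.g., disease screening), precision when false positives are costly.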

Pillar 3: Spark ML — 33% (15/45)

A. Distributed ML Concepts

Read a subset from THIS, THIS, and THIS documentation.

  • Hyperopt with MLlib
  • Read article shared above — Parallelize hyperparameter tuning with scikit-learn and MLflow
  • Know which models can be distributed and which can’t be by default
  • Pandas, Scikit Learn, MLlib, Spark ML

B. Spark ML Modeling APIs (data splitting, training, evaluation, estimators vs. transformers, pipelines)

Read from here

C. Hyperopt

Read from THIS documentation.

  • When to use and when not to
  • Parameters that you can change

D. Pandas API on Spark

Read from THIS documentation.

  • Understand how it works
  • When to use pandas API on Spark vs. pandas vs. Spark
  • Think of scenarios from your own projects where each one would apply

E. Pandas UDFs and Pandas Function APIs

Read from Pandas UDFs and Pandas Function APIs documentation.

Pillar 4: Scaling ML Models — 9% (4/45)

A. Distributed Linear Regression and Decision Trees

Read from THIS and THIS documentation.

  • Learn how Spark distributes the training of these models
  • Examine the example code and execute it once to try it out

B. Ensembling Methods (bagging, boosting)

Ensemble learning combines several diverse models so that their combined predictive strength exceeds that of any individual model.

  • Know the different ways small models can be combined: sequentially (boosting) or in parallel (bagging)
  • How ensembling helps minimize overfitting
  • Which method suits data with a significant number of outliers
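As a toy illustration of bagging's combination step (the predictions are made up): several weak classifiers vote, and the majority wins. Boosting, by contrast, trains models sequentially, each focusing on the previous one's errors, as in gradient-boosted trees:

```python
# Predictions from three weak classifiers on four examples (made up)
model_preds = [
    [1, 0, 1, 1],  # model A
    [1, 1, 0, 1],  # model B
    [0, 0, 1, 1],  # model C
]

# Bagging combines them by majority vote over each example (column)
ensemble = [1 if sum(votes) >= 2 else 0 for votes in zip(*model_preds)]
print(ensemble)
```

Averaging over independently trained models is also why bagging (e.g., random forests) tends to reduce overfitting relative to a single deep tree.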

And with that, you’ve successfully covered the entire syllabus. Well done! It’s now time to venture ahead, either diving straight into the exam or starting with a few mock exams first. You’ve totally got this!

Do share your feedback here once you take the exam. I will be waiting!

And follow me here and on LinkedIn for more content. Adios for now!


Dipendu Chanda

Senior Architect at Databricks. Skilled in everything from software to AI/ML, from FAANG to startups. Keen on holistic learning and sharing: https://www.linkedin.com/in/dchanda