How to Ace the Azure DP-100-Azure Data Scientist Associate Exam

Ahmed Hassen
10 min read · Jun 26, 2022


Everything you need to know to get started with Microsoft Azure

Introduction

Data Science and MLOps are two of the most in-demand roles in the tech industry, with a projected average annual growth rate of 195% for both roles through 2023. If you are looking to start a career in data science or artificial intelligence, earning a certification in this area can help you stand out from other applicants.

This blog is intended to give you everything you need to know about taking the Microsoft DP-100 certification exam, reflecting the recent updates for 2022. I will share general tips and hints for the exam, along with my notes as a bonus for readers. If you're interested in starting your certification journey, keep reading!

My Journey with the Certification

Instead of giving you a long list of details on the exam content and its specifications (the usual information found in other articles), I would rather walk you through the path I followed to pass the certification.

Free Azure Account

One of the main keys to getting certified is to get your hands dirty in the Azure portal. Start by creating a free account and you will get $200 in free credits to familiarize yourself with all the resources.

Note: you can create multiple free accounts as long as you use a different email address, phone number, and credit card for each. Otherwise, you will need to switch to the "pay-as-you-go" option to keep using the Azure portal.

Coursera Preparation Course

This is one of those "win-win situations" you may encounter. On one hand, you get a course that helps you prepare for the certification; on the other, it grants you a voucher for 50% off the exam. It's like having the best of both worlds! Here is the link for the course.

Certification Dumps

Practicing exam questions before taking the certification exam is a mandatory step to get yourself ready! Among the various resources out there, I will start with:

  • Whizlabs, which I consider one of the best training platforms. If you want a user-friendly interface with a thorough explanation of every answer, this is your perfect spot! This dump comes with 4 timed practice tests and 140 questions taken directly from previous exams, which will help you get familiar with the question types and manage your timing during the exam. Take a step forward here.
  • Next is the Udemy platform, which provides some useful resources to add to your wishlist. But haven't we had enough courses already?! As a matter of fact, we have, but the Coursera preparation path lacks two important topics, the Azure Python SDK and the Azure CLI, which is why I recommend an additional course covering both. Mastering the Azure portal interface and its tools is a great skill; that said, being able to build solutions through the Python SDK and Azure CLI commands is a whole different story, and you need that kind of expertise to get certified. This is the course I recommend.
  • Last but not least is the famous ExamTopics platform. This dump has been avoided by fellow Azure candidates because of the misleading answers and wrong explanations it contains. Used wisely, however, this pain point can turn into a real advantage. How? Get into the discussion under each question, share your thoughts, receive feedback, and filter the upvoted responses to arrive at the correct answer; it is good practice for building up your skills. That said, watch out for deprecated questions, as some tests may not be up to date.

Exam content: All you need to know

The DP-100 certification exam tests your ability to create a data science solution that meets a specific business need. You can take the exam whether you are a data scientist or an analyst, but your solution will be judged on its fit for a given scenario. In this section, I will not go into the 5 question types that you will face during the exam (which, by the way, are covered in this awesome story by Seth Billiau). Instead, I prefer to take a deep dive into the main concepts and their related questions.

Manage Azure resources for ML

This part tests your expertise in choosing the Azure resources that best fit your data science solution. We will focus on three important parts:

  1. Managing compute for experiments

You should be able to determine the appropriate compute specifications for your workload. Here are some hints to help you decide (a minimal SDK sketch follows the list):

  • Azure compute instance: a single VM that doesn't scale to zero when idle or scale up; you can't share it, and it must be stopped or deleted to stop incurring charges. It can be used with AutoML but is not available in the designer
  • Compute clusters: a multi-node compute type that can scale up and down; you can share it for training and create it within your workspace region. Suitable for AutoML and training experiments
  • Azure Kubernetes Service: an AKS cluster is used for high-scale production deployments such as real-time inference. Available in the designer and can be used in dev/test mode for larger models
  • Azure Container Instances: ACI is suitable for small models (<1 GB in size, requiring <48 GB of RAM). Used for real-time inference in dev/test mode, scales in a serverless way, and is available in the designer
  • Azure Databricks: can be used as a training compute for local runs and machine learning pipelines, but not as a remote target for other training
  • Azure HDInsight: capable of running pipelines but not suitable for AutoML and not supported in the designer. Can be used as an attached compute for training, but not for inference
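To make this concrete, here is a minimal sketch of creating a compute cluster with the Python SDK (v1, azureml-core); the cluster name and VM size below are just examples, not requirements:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Assumes a config.json for your workspace is available locally
ws = Workspace.from_config()

cluster_name = "cpu-cluster"  # hypothetical name

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    # Otherwise provision a cluster that scales between 0 and 4 nodes
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2",
        min_nodes=0,   # scales to zero when idle, unlike a compute instance
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```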

2. Implement Security and Access control

Creating custom roles and being familiar with the built-in roles is a must-know skill. Follow this guidance (a short authentication sketch follows the list):

  • You can assign roles in four different ways: the Azure portal, PowerShell, CLI commands, and the REST API
  • RBAC: a tool to manage and control the scope of access granted to an identity over a specific resource
  • Managed identity: an authentication method to access an Azure resource. A system-assigned identity is tied directly to a single resource, whereas a user-assigned identity can be shared across multiple resources
  • Service principal: an identity for automated processes to access the ML workspace without interactive sign-in. Useful for connecting to other resources such as Data Lake Storage or Azure SQL
  • Owner built-in role: grants full access to all resources, including assigning roles in Azure RBAC
  • Contributor role: same as Owner, but cannot assign roles in RBAC, cannot manage assignments in Azure Blueprints, and cannot share image galleries
  • Reader role: view all resources, but not allowed to make any changes
  • User Access Administrator: manages user access to resources and assigns roles
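To illustrate the service principal bullet, here is a minimal sketch of authenticating to a workspace non-interactively with the v1 SDK; all IDs and names are placeholders you would replace with your own:

```python
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Placeholder values -- in practice read these from environment variables
# or a key vault rather than hard-coding them.
sp_auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>",
)

ws = Workspace.get(
    name="my-workspace",              # hypothetical workspace name
    subscription_id="<subscription-id>",
    resource_group="my-rg",           # hypothetical resource group
    auth=sp_auth,
)
```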

3. Set up Azure Databricks workspace

  • Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and scikit-learn
  • You can link your Databricks cluster to an Azure ML workspace (see the sketch below)
  • Once linked, logs and model artifacts are stored in both the Azure ML and Azure Databricks workspaces
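Attaching an existing Databricks workspace as a compute target looks roughly like this with the v1 SDK; the resource group, workspace name, and access token are placeholders:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

# Placeholder values describing the existing Databricks workspace
attach_config = DatabricksCompute.attach_configuration(
    resource_group="my-rg",
    workspace_name="my-databricks-workspace",
    access_token="<databricks-personal-access-token>",
)

db_compute = ComputeTarget.attach(ws, "databricks-compute", attach_config)
db_compute.wait_for_completion(show_output=True)
```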

Run experiments and train models

In this part, we will look at what AutoML is, how to create optimal models and tune their hyperparameters, and how to troubleshoot failing experiments like an expert.

  1. Use Automated Machine Learning
  • Feature engineering: when enabled, all the pre-processing transformations are applied automatically via ML Studio
  • Control the cost of your AutoML experiments by setting exit criteria such as a maximum training job time and a metric score threshold
  • To use custom featurization, create a FeaturizationConfig object, customize the transformations with methods like add_transformer_params or drop_columns, and pass it to your AutoMLConfig
  • Only the training data is mandatory; a validation dataset can be supplied, or a split specified through validation_size. Otherwise, Azure ML applies cross-validation on the training data (a minimal AutoMLConfig sketch follows the list)
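Tying these points together, here is a minimal AutoMLConfig sketch; the dataset name, label column, and compute target are assumptions for illustration:

```python
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_ds = Dataset.get_by_name(ws, "my-training-dataset")  # hypothetical dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_ds,
    label_column_name="label",           # assumed target column
    primary_metric="AUC_weighted",
    featurization="auto",                # automatic feature engineering
    validation_size=0.2,                 # or omit and rely on cross-validation
    experiment_timeout_hours=1,          # exit criterion: max training time
    experiment_exit_score=0.95,          # exit criterion: metric threshold
    compute_target="cpu-cluster",        # assumed compute cluster name
)

run = Experiment(ws, "automl-demo").submit(automl_config)
```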

2. Tune Hyperparameters with Azure ML

Choosing the sampling method along with the early termination policy is an important part of tuning hyperparameters (a short HyperDrive sketch follows the list):

  • Grid Sampling: applied only with discrete values and used to try every possible combination
  • Random Sampling: Randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values
  • Bayesian Sampling: supports discrete and continuous values, learns from previous runs, does not support early termination techniques, and can only use the choice, uniform and quniform distributions (not the normal distribution)
  • Median stopping policy: terminates the execution when the performance metric proves to be worse than the median of the running averages.
  • Bandit policy: a run is terminated when its primary performance metric underperforms the best run so far by more than the specified slack. Not to be used with Bayesian sampling
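As an illustration, here is a minimal HyperDrive sketch combining random sampling with a Bandit early termination policy; the training script, argument names, and metric name are assumptions:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import (
    RandomParameterSampling, BanditPolicy, HyperDriveConfig,
    PrimaryMetricGoal, choice, uniform,
)

# Assumed training script and compute target
script_config = ScriptRunConfig(source_directory=".", script="train.py",
                                compute_target="cpu-cluster")

# Search space mixing discrete and continuous hyperparameters
param_sampling = RandomParameterSampling({
    "--batch_size": choice(16, 32, 64),
    "--learning_rate": uniform(0.001, 0.1),
})

# Terminate runs that fall outside a 10% slack of the best run so far
early_termination = BanditPolicy(evaluation_interval=1, slack_factor=0.1)

hd_config = HyperDriveConfig(
    run_config=script_config,
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="AUC",           # must match the metric logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```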

3. Troubleshooting experiments

  • Use logs: call the get_details_with_logs() method of the run object to display the experiment run logs, or use Azure ML Studio
  • Debug locally: deploying to a local web service makes it easier to troubleshoot and debug problems (see the sketch below)
  • HTTP status code 502: the application threw an exception or crashed in the score.py script
  • HTTP status code 503: large spikes in requests; decrease the utilization level at which autoscaling creates new replicas (default value 70%)
  • HTTP status code 504: the request timed out; increase the timeout and remove unnecessary calls from the score.py script
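A quick sketch of the two debugging aids above, fetching run logs and preparing a local deployment configuration; the experiment name is a placeholder:

```python
from azureml.core import Workspace, Experiment
from azureml.core.webservice import LocalWebservice

ws = Workspace.from_config()

# Print the details and driver logs of the most recent run of an experiment
run = next(Experiment(ws, "automl-demo").get_runs())
print(run.get_details_with_logs())

# Deployment configuration for a local web service, handy for debugging
# the entry script before pushing to ACI or AKS
local_config = LocalWebservice.deploy_configuration(port=8890)
```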

Deploy and operationalize ML solutions

In this part, we will focus on pipelines, what you need to know about experiments, and how to use MLflow for tracking.

  1. Deploy a model as a real-time inference service

After training a model, it's important to know how to deploy it as a real-time inference service. Follow this guidance to do so (a minimal deployment sketch follows the list):

  • Register the model: can be done from a local path or by referencing a run. Previous model versions are not overwritten; a new version is added
  • Inference config: first define the entry script (with init() and run() functions), next set the environment (from a specification file, a pre-built environment, or by specifying packages), and finally combine both in an InferenceConfig
  • Deployment config: start by choosing the compute, define the deployment configuration (AksWebservice or AciWebservice), and then deploy your model to an endpoint through the Model class
  • ACI services: authentication is disabled by default for an ACI-deployed service, but key-based authentication can be enabled manually
  • AKS services: key-based authentication is enabled by default. Optionally, you can configure an AKS service to use token-based authentication
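Putting the three steps together, a minimal ACI deployment might look like the sketch below; the model path, environment file, and service name are assumptions:

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# 1. Register (or re-register) the model -- registering again adds a new version
model = Model.register(workspace=ws,
                       model_path="outputs/model.pkl",   # assumed local path
                       model_name="my-model")

# 2. Inference config: entry script (init()/run()) plus an environment
env = Environment.from_conda_specification("deploy-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# 3. Deployment config and deployment to an ACI endpoint
deployment_config = AciWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=1,
    auth_enabled=True,   # key-based auth is off by default on ACI
)
service = Model.deploy(ws, "my-aci-service", [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```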

2. Pipelines and Experiments in Azure ML

Here are some general points on how you should configure your pipelines and experiments in Azure (a pipeline sketch follows the list):

  • When running a pipeline with multiple steps, use a PipelineData object to pass the output folder between steps
  • Publishing a pipeline: first call the publish method on a specific run, which assigns an endpoint to it; you can then initiate the published pipeline by making an HTTP request to its REST endpoint
  • To parameterize your pipeline, you must create PipelineParameter objects for the pipeline before publishing it!
  • Use the ParallelRunStep class to perform parallel batch inferencing. All the results can be stored in the parallel_run_step.txt file
  • Schedule a pipeline: publish the pipeline to get its ID, then create the recurrence, and finally define the schedule with the Schedule.create method
  • Trigger a pipeline whenever data changes: create a Schedule that monitors a specified path on a datastore
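Here is a compact sketch of those pieces, a PipelineData output shared between two steps, publishing, and a weekly schedule; the script names, compute target, and experiment name are assumptions:

```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline, PipelineData, Schedule, ScheduleRecurrence
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Intermediate data passed between steps
prepped_data = PipelineData("prepped_data", datastore=datastore)

prep_step = PythonScriptStep(name="prep", script_name="prep.py",
                             arguments=["--out", prepped_data],
                             outputs=[prepped_data],
                             compute_target="cpu-cluster")   # assumed cluster
train_step = PythonScriptStep(name="train", script_name="train.py",
                              arguments=["--in", prepped_data],
                              inputs=[prepped_data],
                              compute_target="cpu-cluster")

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
run = Experiment(ws, "pipeline-demo").submit(pipeline)
run.wait_for_completion()

# Publish the pipeline, then schedule a weekly run against the published ID
published = run.publish_pipeline(name="training-pipeline",
                                 description="demo", version="1.0")
recurrence = ScheduleRecurrence(frequency="Week", interval=1)
Schedule.create(ws, name="weekly-training",
                pipeline_id=published.id,
                experiment_name="pipeline-demo",
                recurrence=recurrence)
```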

3. What is MLflow good for?!

One of the things that will set you apart from other candidates is your knowledge of the MLflow module. Here are some hints for the tool (a minimal tracking sketch follows the list):

  • Logging: log_param, log_metric, log_image are logging methods to store important values inside MLflow
  • Log a model with the flavor matching the package it was trained with; for example, a model trained with Spark is logged with mlflow.spark.log_model(model, "model")
  • To set the MLflow logging target, call set_tracking_uri with the URI obtained from the workspace's get_mlflow_tracking_uri method
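A minimal sketch of pointing MLflow at the workspace and logging from a run; the experiment name and logged values are arbitrary examples (assumes the azureml-mlflow package is installed):

```python
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

# Point MLflow at the Azure ML workspace tracking server
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("mlflow-demo")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)
    # For a trained model, use the flavor matching its framework,
    # e.g. mlflow.sklearn.log_model(model, "model")
```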

Implement responsible ML models

Here comes the last part. We will discuss some interesting aspects such as fairness, privacy and model explainers

  1. Model explainers and feature importance

Model explainers can be local or global depending on your needs. If you want to understand the importance of your features with regard to all predictions, go with global; otherwise choose local for a specific prediction (an explainer sketch follows the list).

  • Mimic Explainer: works as both a global and a local explainer. Note that the surrogate model should have the same kind of architecture as the trained model.
  • Tabular Explainer: same as Mimic, but it chooses the explainer architecture by itself. The explainer acts as a wrapper around various algorithms
  • PFI Explainer: doesn't support local feature importance explanations. It requires the actual labels (ground truth) for the test data
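For illustration, here is a minimal TabularExplainer sketch using the interpret-community package on a toy scikit-learn model (assuming azureml-interpret or interpret-community is installed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from interpret.ext.blackbox import TabularExplainer

# Toy model purely for illustration
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
model = RandomForestClassifier().fit(X_train, y_train)

explainer = TabularExplainer(model, X_train,
                             features=data.feature_names,
                             classes=["malignant", "benign"])

# Global importance: across all predictions
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())

# Local importance: for a single prediction
local_explanation = explainer.explain_local(X_test[0:1])
```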

2. Detect and Mitigate unfairness

There are three techniques to mention for mitigating unfairness (a Fairlearn sketch follows the list):

  • Exponentiated Gradient: applies a cost-minimization approach to learn the trade-off between predictive performance and fairness disparity
  • Grid Search technique: same as Exponentiated Gradient, but works efficiently with a small number of constraints
  • Threshold Optimizer: a post-processing technique that applies a constraint to the classifier, transforming the predictions as appropriate
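These three techniques come from the Fairlearn library; here is a minimal Exponentiated Gradient sketch on toy data, with Demographic Parity chosen purely for illustration:

```python
import numpy as np
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.linear_model import LogisticRegression

# Toy data: X features, y labels, and a binary sensitive feature
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.randint(0, 2, 200)
sensitive = rng.randint(0, 2, 200)

# Reduce unfairness (here: demographic parity) while minimizing cost
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=sensitive)
fair_predictions = mitigator.predict(X)
```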

3. Data Drift in dataset

Define a dataset monitor to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold (a minimal sketch follows the list):

  • DataDriftDetector: create this monitor from datasets, specifying the baseline and target datasets
  • Backfill: compares the baseline dataset to the target dataset over a given time span, which backfills the monitor
  • AlertConfiguration: set an operator email address to receive notifications if the defined threshold is exceeded
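A minimal sketch of creating and backfilling a drift monitor with the azureml-datadrift package; the dataset names, compute target, and email address are assumptions (the target dataset must have a timestamp column):

```python
from datetime import datetime, timedelta
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector, AlertConfiguration

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "baseline-dataset")   # hypothetical
target = Dataset.get_by_name(ws, "target-dataset")       # hypothetical time-series dataset

monitor = DataDriftDetector.create_from_datasets(
    ws, "drift-monitor", baseline, target,
    compute_target="cpu-cluster",          # assumed compute cluster
    frequency="Week",
    drift_threshold=0.3,
    alert_config=AlertConfiguration(["operator@example.com"]),
)

# Compare baseline and target over a past time span to backfill the monitor
backfill_run = monitor.backfill(datetime.now() - timedelta(weeks=6),
                                datetime.now())
```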

Conclusion

Congratulations on making it this far! It is not an easy journey, as you have noticed, but it is worth the effort, believe me! I can't express enough the excitement and gratitude I felt while sharing my experience on the subject. If you want to dig deeper into the exam content, I have prepared a document (DP-100) in my GitHub that goes further into each section. You can also find a list of study materials for Azure DP-100 in this amazing article by Shivam.

The DP-100 exam lasts 120 minutes, with between 40 and 60 questions, delivered either through a computer-based testing system (Pearson VUE) or on-site. If you take it online, make sure to use your personal laptop, as you may need to disable firewall and proxy settings during the exam. Choose a quiet place and don't let anyone enter the room. At the end of the exam, you will be given a score and a breakdown of your performance by section. You will not be able to go back and change answers after you submit the exam, and you must wait until the end to see your score.


Ahmed Hassen

MLOps Engineer with a solid background in Python development and big data. Azure certified (DP-100) and Google certified (Professional Data Engineer on GCP)