Data Science with Microsoft Azure

Using Azure Machine Learning Services

Sean Mirchi
grandcentrix
17 min read · Aug 12, 2019


Photo by 贝莉儿 NG on Unsplash

Azure is Microsoft’s well-known cloud platform, competing with Google Cloud and Amazon Web Services. Microsoft Azure gives us the freedom to build, manage, and deploy applications on a massive global network using our favourite tools and frameworks. Azure provides over 100 services that enable us to do everything from running our existing applications on virtual machines to exploring new software paradigms such as intelligent bots and mixed reality. It also provides storage solutions that dynamically grow to accommodate massive amounts of data. Azure services enable solutions that are simply not feasible without the power of the cloud. Here we are going to focus mainly on the Azure Machine Learning services.

Azure Machine Learning services is a relatively new addition to Microsoft Azure, released publicly in December 2018. It contains many advanced capabilities designed to simplify and accelerate the process of building, training, and deploying machine learning models. Support for popular open-source frameworks such as PyTorch, TensorFlow, and scikit-learn allows data scientists to use the tools of their choice.

Azure ML offers several services, and we will look at a few of them here: Notebook VMs, which provide a Jupyter notebook interface for experimenting; the Visual Interface, which provides a simple drag-and-drop environment suitable for rapid prototyping; and finally the Visual Studio Code extension and the Python SDK, which we will use to create experiments and pipelines. We will explain, step by step, how the ML services are used in an actual project.

Workspace

We start by creating a workspace, which is a place in Azure for us to store our work. We can create a workspace in the Azure portal or via Python code. Creating one through the Azure portal is easy: open All services, select AI + Machine Learning and then Machine Learning service workspaces. To create it, we need to provide a name for our workspace, choose the right subscription, choose a resource group or create a new one, and choose a location for our workspace. The location is an important factor, as everything created in this workspace will live in the same location, and this directly affects the cost of our experiments, since some services, such as virtual machines, are cheaper or more expensive in different regions. Here I created a workspace named ws-test and a new resource group called rg-test. For the location, I chose East US, as it can be cheaper in some cases; for example, a virtual machine of size Standard_DS3_v2 is almost 18% cheaper in East US than in West US.

Creating a workspace
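Since the article mentions that a workspace can also be created from Python, here is a minimal sketch using the azureml-core SDK; the subscription ID is a placeholder you would replace with your own:

```
from azureml.core import Workspace

# Create a workspace named ws-test in a new resource group rg-test in East US.
# The subscription ID below is a placeholder.
ws = Workspace.create(name="ws-test",
                      subscription_id="<your-subscription-id>",
                      resource_group="rg-test",
                      create_resource_group=True,
                      location="eastus")

# Persist the workspace details so later scripts can call Workspace.from_config().
ws.write_config()
```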

After creating our workspace, we see the overview page where we can select different services and start working.

Azure Notebook VMs

Let’s begin with Azure Notebook VMs, which provide a Jupyter Notebook running on a virtual machine. To create a notebook we need to provide a name and select a size for its virtual machine.

Creating a notebook

It will take some time for the virtual machine to start, but when it’s done it will provide a URL to open the notebook. Notebook VMs provide a familiar Jupyter Notebook interface and are a great tool for experimenting with the data.

Azure Notebook VM’s interface

Jupyter Notebook is a very broadly used tool in the field of data science, e.g. for exploratory data analysis. Since the interface is familiar, it is very easy to jump right in and start coding. It allows us to easily create and share documents that contain live code, equations, visualizations and explanatory text. We can also install any required packages by running a pip command.
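For instance, installing an extra package from a notebook cell takes a single line (pandas here is just an arbitrary example):

```
# Run inside a notebook cell; the leading "!" executes a shell command on the VM.
!pip install pandas
```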

Keep in mind to always stop Notebook VMs when you’re not using them, as they will cost you money just by keeping them running. You can easily stop or start any notebook through the Azure portal.

Always stop your notebooks

Overall, Notebook VMs are a great tool for developers and data scientists to develop solutions, analyze data, experiment, and share code and findings. They also support multiple programming languages such as Python, R, Scala and even Julia.

Azure Visual Interface

If you have worked with Azure Machine Learning Studio before, you are already familiar with the Azure Visual Interface. It is basically the same service, but at the moment it is missing some features that ML Studio provides, perhaps because it is still in preview.

The Azure Visual Interface is designed for simplicity and productivity. It is all about drag and drop: there are multiple modules available, from reading and writing data to machine learning models, that we can simply drop into our experiment and connect.

Azure Visual Interface

To run an experiment, we need to define a compute unit. A compute unit is a cloud or local resource that is used to run our experiment; we will talk about this in more detail in the next sections.

Currently, the Visual Interface is quite limited. For example, it provides a module to run Python scripts, but such a script can only have two inputs and can only return data frames. This makes it unsuitable for actual development, but a good fit for prototyping, experimenting or presenting an idea. As Microsoft itself puts it, the Visual Interface is tailored for:

  • Data scientists who are more familiar with visual tools than with coding.
  • Users who are new to machine learning and want to learn it intuitively.
  • Machine learning experts who are interested in rapid prototyping.

Python SDK and Extension

Microsoft also provides a Python SDK to work with the Azure ML services from an IDE of your choice. There is also an extension for Visual Studio Code that simplifies working with the ML services. Here we will talk about this extension, but later we will also see some usage of the actual Python SDK.

To install the Azure Machine Learning extension in Visual Studio Code, navigate to the Extensions tab and search for its name.

Extension for Visual Studio Code

When installed, we will see a new tab called “Azure”. Here we can see all of our workspaces in the Azure ML services, alongside all of our experiments, compute targets and more.

Azure tab in Visual Studio Code

To begin our work, we first need to create an experiment. An experiment object is created within the workspace to store information about the runs for the models that we train and test, or any other experimentation that we do; the concept is quite similar to a project. We can have multiple experiment objects in a workspace. To create an experiment via the extension, we first have to select a folder for our project in VS Code (just create a folder and select it as your project folder), then right-click on the Experiment branch in the Azure tab and click Create Experiment. Now, if we navigate to the Experiments section of the Azure portal, we should also see the newly created experiment there.

A newly created experiment in the Azure portal
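The extension handles this for us, but for reference the same experiment can also be created (or fetched, if it already exists) from Python; a minimal sketch, assuming the config.json file described later in this article is already in the project folder:

```
from azureml.core import Workspace, Experiment

# Load the workspace from the downloaded config.json.
ws = Workspace.from_config()

# Creates the experiment if it does not exist yet, otherwise returns the existing one.
experiment = Experiment(workspace=ws, name="test-experiment")
```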

Now let’s add some simple code and run the experiment. Create a Python file in your project folder and just put a print command inside.

Just one line for now!
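The script in the screenshot contains nothing more than a single print statement, presumably something along these lines (the file name is just an example):

```
# hello.py - the simplest possible experiment script
print("Hello from Azure ML!")
```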

To run this we need to provide a compute target. To create one, we can either use the extension or the Azure portal. The portal makes it easier to compare different compute sizes, but if you already know exactly which size you’d like to use, you can just as easily create one via the extension.

To create a new compute target, navigate to the compute section of the Azure ML services. Then we have to define multiple parameters for it.

Create a compute

A compute target uses the same region as our workspace, so if we created our workspace in “East US”, for example, our compute would also run in “East US”. Selecting the right size is important here, as it directly affects the costs. The number of nodes determines how many jobs can run in parallel; selecting 0 as the minimum number of nodes ensures that when there is no job to run, the compute scales down to 0 and costs nothing. Another important parameter is the idle seconds, which controls how long the compute stays idle before it scales down; reducing this number, again, prevents extra costs.
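If you already know which size you want, the same compute target can also be created from the SDK; a sketch with the cost-relevant parameters discussed above (the cluster name and sizes are just examples):

```
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# min_nodes=0 lets the cluster scale down to zero when there is nothing to run;
# idle_seconds_before_scaledown controls how long it waits before doing so.
config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2",
                                               min_nodes=0,
                                               max_nodes=2,
                                               idle_seconds_before_scaledown=600)

compute_target = ComputeTarget.create(ws, "cpu-cluster", config)
compute_target.wait_for_completion(show_output=True)
```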

When we are done creating a compute, we can go back to VS Code and run our experiment. To do so, we right-click in our code and select the “Azure ML: Run as experiment in Azure” option. It will then ask us to select a compute target, and we should see our newly created compute in the list.

Selecting a compute target

Then we have to select a configuration for our run. A configuration contains all the packages that are required to run our Python code.

Selecting a configuration

We can easily add any required packages into the list and Azure will install any missing packages before running our code.

Configuration file: adding the required packages

Obviously, for this simple example, we don’t need any specific packages and we can just run it with the “Generic training” configuration file.

Now if we go to the Azure portal and navigate to the Experiments tab and click on our experiment, we can see our initiated run.

Experiment details

At first, it will be in “preparing” mode, as the compute is prepared for its first run. The compute will also resize from 0 to 1 node to run our job. If we navigate to the compute tab, we should see that it is resizing.

Our compute target is resizing

When resizing is done, it will start our submitted run. After the run has finished, we can navigate to its details and the Logs tab to see our printed output.

The output of our print command

Azure provides more logging options for an experiment than just printing to the standard output, so let’s look at some logging possibilities.

Logging in an Experiment

Azure provides multiple logging options for an experiment. For example, we can log single values, log an array as a chart, or log plots directly by using Matplotlib. To do so we use the Python SDK, so make sure to have the azureml-core package installed. Now let’s add some logging to our simple Python code.

Do some logging

Each log is associated with a run, so to send a log we first have to get a reference to the current run, which we obtain by calling the “Run.get_context()” function. We can then use this reference to log different values. To log a metric, which is just a single value, we use the simple “log” function that accepts a name and a value. To log plots generated by Matplotlib, we use the “log_image” function and pass the created plot directly. As mentioned, we can also log an array, using the “log_list” function, which accepts a list of values.
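Put together, the logging code looks roughly like this; the metric names and values are only illustrative:

```
import numpy as np
import matplotlib
matplotlib.use("Agg")          # no display available on the compute target
import matplotlib.pyplot as plt
from azureml.core import Run

# Reference to the run this script is executing in.
run = Run.get_context()

# Log a single metric value.
run.log("accuracy", 0.91)

# Log a list of values; Azure renders it as a chart in the run details.
run.log_list("squares", [x ** 2 for x in range(10)])

# Log a matplotlib plot as an image.
xs = np.linspace(0, 10, 100)
plt.plot(xs, np.sin(xs))
run.log_image("sine wave", plot=plt)
```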

To run this, we can use the “Repeat run with last configuration” option to reuse the configuration of the previous run. However, running this code with the last configuration would fail now, as that configuration does not include matplotlib and NumPy as required packages. So we have to run with a new configuration that includes these packages.

Adding packages to the configuration
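If you build the run configuration from the SDK rather than through the extension’s configuration file, the same packages can be declared with the CondaDependencies class; a sketch:

```
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Run configuration with the packages needed by the logging script.
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "matplotlib", "numpy"]
)
```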

Running with a new configuration can take more time than usual, as the new packages are installed in our environment. When our submitted run is complete, we can navigate to its details and see our logged data.

Run’s detail

As you can see, these options can be very useful to track different aspects of an experiment or to provide an output, and since logging has a very simple interface in Azure, we can easily add it to existing code.

Experiments are quite flexible and are perfect for analyzing and developing solutions and providing results, but what if you want to reuse them? What about parameterizing an experiment and making it even more flexible? If this is what you have in mind, then we have to create a pipeline.

Pipelines

We can use the Azure Machine Learning SDK for Python to create ML pipelines, as well as to submit and track individual pipeline runs. With pipelines, you can optimize your workflow with simplicity, speed, portability, and reuse. Using distinct steps makes it possible to rerun only the steps you need, as you tweak and test your workflow.

To create a pipeline via the Python SDK, we need a reference to our workspace in the Azure ML services and access to it. For this, we first download the workspace configuration file from its overview page and put it in our project folder.

Download config.json from workspace overview tab

Then we need to provide authentication for the Python SDK to access our workspace. To do this we create a Service Principal through the “Azure Active Directory -> App registrations” tab. This provides us with a tenant ID, a service principal ID and a service principal secret that we can use for access, as seen in the code below.

Authentication

We use this “authenticate” function to create a Service Principal Authentication and then get access to our workspace by calling the “from_config()” function. Remember to put the downloaded configuration file (a JSON file) in your project folder, as this function uses that file to identify your workspace.
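A minimal sketch of such an authenticate function, assuming the three values from the App registration are available as environment variables (the variable names are just a convention chosen here):

```
import os
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

def authenticate():
    # Tenant ID, service principal ID and secret come from the App registration.
    return ServicePrincipalAuthentication(
        tenant_id=os.environ["AML_TENANT_ID"],
        service_principal_id=os.environ["AML_PRINCIPAL_ID"],
        service_principal_password=os.environ["AML_PRINCIPAL_SECRET"],
    )

# config.json (downloaded from the workspace overview) must be in the project folder.
ws = Workspace.from_config(auth=authenticate())
```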

We can define multiple steps for our pipeline, and Azure provides several options to use as a step: a step that reads from Azure Data Lake, a step that runs a Databricks notebook, or a step that runs a Python script. We are going to focus on the Python script step and create a pipeline with three steps.

Define a step

Here we are defining a “PythonScriptStep”, but to define a step we first need a few other things. Every step can have inputs and outputs; to provide these, every workspace has a default datastore, either Blob or file storage, that we can access by calling the “get_default_datastore()” function on the workspace reference. We then define an input or output by creating a “PipelineData” object and giving it a name.
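A sketch of one such step; the script name, data name and compute name are placeholders:

```
from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Default datastore of the workspace (Blob or file storage).
datastore = ws.get_default_datastore()

# Output of this step, which a later step can consume as input.
raw_data = PipelineData("raw_data", datastore=datastore)

# Run configuration built from the environment.yml file mentioned below.
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = \
    CondaDependencies(conda_dependencies_file_path="environment.yml")

read_data_step = PythonScriptStep(
    name="read_data",
    script_name="read_data.py",
    outputs=[raw_data],
    compute_target="cpu-cluster",     # name of an existing compute target
    source_directory=".",
    runconfig=run_config,
)
```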

Each step also needs a compute target; we can either get an existing compute by its name or define one via the SDK (we will see this later). Besides this, we also need to provide a run configuration; for this, we define a YAML file named “environment.yml” that contains all the required packages for our environment.

Each step’s requirements are quite similar to those of an experiment, and we need to provide most of the parameters that an experiment needs in order to run. This makes sense, as our pipeline is basically a controller that runs multiple experiments one after another.

Adding a parameter to our pipeline

As mentioned before, we can also add parameters to the pipeline to make it even more flexible. After publishing the pipeline, we can provide a value for such a parameter before running it. To parameterize our pipeline we define a parameter using “PipelineParameter”. Here we define a parameter named “chart_type” and pass it as an argument to one of our steps.
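A sketch of how such a parameter could be defined and handed to a step; the step and script names are placeholders:

```
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Pipeline parameter with a default value; it can be overridden at submission time.
chart_type = PipelineParameter(name="chart_type", default_value="line")

visualize_step = PythonScriptStep(
    name="visualize",
    script_name="visualize.py",
    arguments=["--chart_type", chart_type],   # the script reads it via argparse
    compute_target="cpu-cluster",
    source_directory=".",
)
```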

Submit and publish our pipeline

Now that we have defined all of this, we can submit and publish our pipeline. We create an array with all of our pipeline steps and then define a pipeline using the “Pipeline” class. We run this pipeline creation code as an experiment, an experiment that creates and submits the pipeline. Finally, we publish the pipeline with a name, a description and a version. Publishing a pipeline makes it appear in the Pipelines tab of our workspace, so we can reuse it afterwards.
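In SDK terms, the submission and publishing part boils down to something like the following sketch, assuming the steps from the earlier snippets; the names are placeholders:

```
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()

# Steps defined earlier: read_data_step, pre_process_step, visualize_step.
pipeline = Pipeline(workspace=ws,
                    steps=[read_data_step, pre_process_step, visualize_step])

# Submitting runs the pipeline once as part of an experiment.
pipeline_run = Experiment(ws, "pipeline-experiment").submit(pipeline)
pipeline_run.wait_for_completion()

# Publishing makes the pipeline reusable from the Pipelines tab (or via REST).
published = pipeline.publish(name="test-pipeline",
                             description="Sample three-step pipeline",
                             version="1.0")
```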

The whole code

This is the whole code for the pipeline creation. You can see in this code that we used the Python SDK to create a compute target when it is not available (Line #50–52), and also defined all three of our pipeline steps. You can also take a look at the whole sample project here: Test_Azure_ML_Pipeline

Now if we open the Azure portal and navigate to the experiment, we can see that our current run initiates another run, which is the actual pipeline run.

Pipeline running

We can see every step of our pipeline in its detail view.

Pipeline detail

Azure also generates a graph of our pipeline steps, showing each input and output.

Graph tab

This also shows how our pipeline will run. Currently, each step depends on the output of the previous step, so the steps run one by one. But if, for example, the pre_process step had no input from the read_data step, those two steps would run in parallel.

Since we also published our pipeline, if we navigate to the “Pipelines” tab in our workspace, we should be able to see our pipeline.

Azure ML pipelines

We can rerun our pipeline from here. We can also provide a value for our defined parameter.

Providing value to the pipeline’s parameter

Our defined parameter was chart_type, so if we run this pipeline with a different value, we can get a different plot as an output.

Running our pipeline with different parameters
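Besides the portal, a published pipeline can also be triggered from Python with a parameter value; a sketch, assuming the pipeline was published under the name used above:

```
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()

# Find the published pipeline by name (assumes it was published as "test-pipeline").
published = next(p for p in PublishedPipeline.list(ws) if p.name == "test-pipeline")

# Run it with a different value for the chart_type parameter.
run = published.submit(ws,
                       experiment_name="pipeline-experiment",
                       pipeline_parameters={"chart_type": "bar"})
```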

Pipelines can be useful in endless scenarios: since we can easily create steps that run Python code, the flexibility is as great as that of Python itself, and a parameterized pipeline is even more flexible and can be used under multiple conditions. Now let’s look at a real use case, a project that we recently did at grandcentrix.

Temperature prediction with an IoT devkit

To try out the Azure ML services, we recently defined a project to predict the office temperature of the next 24 hours based on past temperatures measured by a sensor in the office. We used the MXChip IoT DevKit from Microsoft, a small devkit with multiple sensors and a wireless chip on board. It is programmable with the Arduino IDE and delivers out-of-the-box IoT Hub support. We connected the DevKit to the Azure IoT Hub and stored the temperature and humidity data coming from the chip in a PostgreSQL database. We then read the data on the Python side using the psycopg2 library to access our database, did some pre-processing by filling NaN values, and resampled the data to hourly values.

To predict the temperature of the next 24 hours, we created features consisting of 24 hours of data, so each row contains all the data of the past 24 hours. This way, we can feed a 24-hour set of data to the model and receive a prediction for the next 24 hours.
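A rough sketch of that feature creation with pandas, assuming the resampled hourly data lives in a DataFrame with a temperature column (column and function names are illustrative):

```
import pandas as pd

def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row per hour containing the previous 24 hourly temperatures."""
    features = pd.DataFrame(index=df.index)
    for lag in range(1, 25):
        features[f"temp_lag_{lag}"] = df["temperature"].shift(lag)
    # Drop the first rows that do not yet have a full 24-hour history.
    return features.dropna()
```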

Our resulting pipeline

Our pipeline consists of 7 steps, from fetching all of the data to feature creation, training/prediction, and finally evaluation and visualization of the results.

Temperature predictions against actual values
Histogram with Gaussian fit

These two plots are the output of our pipeline. The first plot shows the differences between the actual and predicted temperature. In the second plot, we create a histogram of the differences between those values (actual − predicted), and we also fit a Gaussian to the results so that we can use the mean and sigma as our evaluation metrics. In most instances we are quite close to the actual values, but sometimes we predict about 1 degree higher or lower. This is visible in both plots; for example, in the histogram, the mean value of -0.64 shows that our predictions are overall a little higher than the actual values, and the sigma value of 1.63 shows the spread of these differences.

By creating a pipeline we can easily test, develop and reuse each step. As the steps are separated, we can easily add one of them to another pipeline; for example, in our projects we reuse the evaluation step to create the corresponding plots in different pipelines. This makes the whole process faster and more flexible. It helped us run multiple experiments with different models and compare all of the results to find the optimal solution. And in the end, we have all of our experiments with their outputs documented in Azure for future reference.

Beware of the costs

For the last part of this blog, I would like to take a look at Azure Cost Management. It is always good practice to check your resources via cost management, as you can easily find out if a resource is costing too much and perhaps needs a different configuration. For example, as mentioned before, if you keep Notebook VMs running, they will cost you money whether you are using them or not, and by checking this graph you can find out if there is a notebook running in one of your workspaces.

Group the plot by resources or resource groups

At first, the plot might not look that intuitive, but to make it more understandable I recommend grouping all the costs by resources or resource groups; this way you can investigate and compare the costs of each resource separately.

The Azure ML services actually won’t cost that much if we use cheaper compute targets with the right configuration, for example by setting their minimum number of nodes to zero. Another strategy to save costs is to try the code on a local machine and only run it as an experiment in Azure when everything is working as expected. As you can see in the image, it cost us less than 2 euros to create and experiment with that sample pipeline.

Summary

The Azure Machine Learning services provide a wide array of capabilities, and we only talked about a handful of them here. They include services that are very helpful for experimenting or rapid prototyping, alongside services that are suitable for large-scale projects, such as pipelines. Using the Azure ML services helps us in many ways: we can easily document all of our experiments and evaluate different parameters with each change, we can create flexible pipelines and reuse them under different conditions, and, last but not least, we can use the power of compute targets to move code execution from local machines to the cloud, which speeds up the whole process.
