Azure for AI/ML

Microsoft AI Workshop

Allen Manoj

Published in

Analytics Vidhya

4 min read · Sep 27, 2020

Microsoft AI Platform

Microsoft AI is a robust framework for developing AI solutions in machine learning, data science, robotics, IoT, and more.

Workspace

The top-level resource for the Machine Learning service. It serves as a hub for building and deploying models, stores the experiment objects required for each model you create, and keeps track of your compute targets.

Datastore

An abstraction over an Azure Storage account. Each workspace has a registered default datastore.

Pipeline

This is a tool for creating and managing workflows, which typically include data manipulation, model training and testing, and deployment.

Azure ML Service

Brings the power of containerization and automation.

Fun fact: all of us use Jupyter notebooks, but how did Jupyter get its name?
The name is a reference to Julia (Ju), Python (Py), and R, the core languages the notebooks were originally built to support.

Typical Workflow of ML Studio Classic

This platform offers a visual drag-and-drop interface for building experiments.

Data sources

Supported data sources include Azure Blob Storage, a web URL over HTTP, Hadoop via HiveQL, Azure Table Storage, Azure SQL Database, SQL Server on an Azure VM, an on-premises SQL Server database via the Data Manager, and OData.

Data Formats

.csv: Comma-separated values with a header
.nh.csv: Comma-separated values with no header
.tsv: Tab-separated values with a header
.nh.tsv: Tab-separated values with no header
.txt: Plain text
.svmlight: SVMlight
.arff: Attribute Relation File Format
.zip: Zip archive
.RData: R object or workspace
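
To make the header and no-header variants concrete, here is a minimal sketch using only Python's standard library. The function name `read_delimited` and the placeholder column names `Col1`, `Col2`, ... are assumptions made for illustration; this is not how ML Studio itself parses files.

```python
import csv
import io


def read_delimited(text, delimiter=",", has_header=True):
    """Parse delimited text into (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        # First line holds the column names (.csv / .tsv)
        return rows[0], rows[1:]
    # No header (.nh.csv / .nh.tsv): generate placeholder names (an assumption)
    names = [f"Col{i + 1}" for i in range(len(rows[0]))]
    return names, rows


csv_text = "age,income\n34,52000\n29,48000\n"
nh_tsv_text = "34\t52000\n29\t48000\n"

print(read_delimited(csv_text))
print(read_delimited(nh_tsv_text, delimiter="\t", has_header=False))
```

Note that every value comes back as a string, so numeric columns still need converting before analysis.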

Explore, Create Summaries

Things to keep in mind:

Develop an understanding of the data.
Which features show dependent and independent behavior?
Do the features contain outliers?
Are there features that would only add noise if used for training the model?
Are there trends, patterns, or biases?
Why do the attributes have missing values?
Which values are rare, and why?
Can you see any unusual patterns? What might explain them?
How are the observations within each cluster similar to each other?
How are the observations within separate clusters different from each other?
Identify the missing values.
Find the minimum and maximum values.
Draw a correlation plot.
Use a box plot or scatterplot to spot skewness and outliers.
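
A few of the checks above (missing values, minimum and maximum, correlation) can be sketched in plain Python. The column names and numbers below are made up for illustration, and `None` is assumed to mark a missing value; in practice you would more likely reach for a data-frame library or the built-in Studio modules.

```python
import math


def summarize(column):
    """Count missing values and report min, max, and mean of the rest."""
    present = [v for v in column if v is not None]
    return {
        "missing": len(column) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
    }


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (spread_x * spread_y)


# Hypothetical data: None marks a missing value
age = [25, 32, None, 41, 38]
income = [30, 42, 39, None, 50]

print(summarize(age))
print(pearson(age, income))
```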

Bar Graph: Distribution of categorical variable
This is useful in plotting discrete values.

Histogram: Distribution of a continuous variable
- Negative skew: the longer tail is on the left
- Positive skew: the longer tail is on the right
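
As a rough illustration of the two kinds of skew, the sketch below computes the population skewness (the third standardized moment) for two made-up samples. The sign of the result tells you which side the long tail is on.

```python
def skewness(values):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]         # long tail on the right
left_tailed = [-20, -4, -3, -3, -3, -2, -2, -1]  # long tail on the left

print(skewness(right_tailed))  # positive value
print(skewness(left_tailed))   # negative value
```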

Prepare and Clean Data

Options for handling missing values:

Replace using MICE.
Replace using Probabilistic PCA.
Use a custom substitution value.
Replace with the mean, median, or mode.
Remove the entire row or column.

MICE: Multiple Imputation by Chained Equations
Each variable with missing data is modeled conditionally using other variables in the data before filling in missing values.

PCA: Principal Components Analysis
Replaces the missing values by using a linear model that analyzes correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed.
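
The simpler strategies from the list above (mean, median, or mode substitution, and row removal) are easy to sketch by hand. The helper names `impute` and `drop_incomplete_rows` are hypothetical, and `None` is assumed to mark a missing value. MICE and Probabilistic PCA are considerably more involved and are not shown here.

```python
import statistics


def impute(column, strategy="mean"):
    """Fill missing values (None) with the mean, median, or mode."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def drop_incomplete_rows(rows):
    """Remove every row that contains at least one missing value."""
    return [row for row in rows if None not in row]


# Hypothetical column with two missing entries
scores = [1, None, 4, 4, None, 11]

print(impute(scores, "mean"))
print(impute(scores, "median"))
print(impute(scores, "mode"))
print(drop_incomplete_rows([[1, 2], [None, 3], [4, 5]]))
```

Mean substitution is sensitive to outliers (the 11 pulls the fill value up here), which is one reason to compare it against the median before choosing.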

Preprocessing

With standard or advanced preprocessing, the data is automatically scaled or normalized to help the algorithms perform well.

Typical steps: drop high-cardinality or no-variance features, impute missing values, generate additional features, transform and encode, word embeddings, and cluster distance.
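
Two of these steps, dropping no-variance features and scaling, can be sketched as follows. The table layout (a dict of column name to values) and the function names are assumptions for illustration, not what the service runs internally.

```python
import statistics


def drop_constant_columns(table):
    """Drop features with no variance (every value identical)."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}


def min_max_scale(column):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    # Assumes hi > lo, which holds once constant columns are dropped
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


# Hypothetical table: "country" carries no information because it never varies
table = {"height": [150, 160, 170], "country": ["IN", "IN", "IN"]}
kept = drop_constant_columns(table)

print(list(kept))
print(min_max_scale(kept["height"]))
print(standardize(kept["height"]))
```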

Cross-Validation

Uses more of the data for testing, since every record ends up in a test fold once.
Evaluates the dataset as well as the model.
Helps estimate how well the model generalizes to new datasets.
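
To show how every record gets a turn in the test set, here is a minimal k-fold split in plain Python. It is a sketch only: the function name is hypothetical and no shuffling is done, which you would normally want before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

The model is trained once per fold on `train` and scored on `test`; averaging the scores gives a steadier estimate than a single train/test split.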

Model Deployment

Deploy the model for consumption!

The supported deployment targets are:

Docker image
Azure Container Instance
Azure Kubernetes Service
Azure IoT Edge
Field Programmable Gate Array (FPGA)

For deployment, you need:

An environment file that specifies the package dependencies.
A configuration file that requests the required resources for the container.
A scoring script that tells the deployed service how to load and call the model.
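
A scoring script for Azure ML conventionally exposes two functions: `init()`, called once when the container starts, and `run(raw_data)`, called for each request. The sketch below follows that shape but makes several assumptions: the model is a tiny linear model stored in a hypothetical `model.json` file, the request body is JSON with a `data` key, and a hard-coded fallback model is used so the example runs locally without a registered model.

```python
import json
import os

model = None  # populated by init()


def init():
    """Called once at startup: load the model into memory."""
    global model
    # AZUREML_MODEL_DIR is set by the service; "." is a local fallback
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model_path = os.path.join(model_dir, "model.json")  # hypothetical file name
    if os.path.exists(model_path):
        with open(model_path) as f:
            model = json.load(f)
    else:
        # Made-up weights so the sketch runs without a real model
        model = {"weights": [0.5, -0.25], "bias": 1.0}


def run(raw_data):
    """Called per request: parse the JSON input and return predictions."""
    try:
        rows = json.loads(raw_data)["data"]
        predictions = [
            sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
            for row in rows
        ]
        return {"predictions": predictions}
    except Exception as exc:
        # Return the error instead of crashing the service
        return {"error": str(exc)}


init()
print(run('{"data": [[2, 4]]}'))
```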

Automated ML

Model creation is typically time-consuming.
The following questions come up whenever we train a model:

Will this solve the data science challenge?
How do I set the perfect hyperparameter values?
Will this model be the best model to solve the problem?
How do I speed up the model and hyperparameter value selection process?

Automated ML Techniques

Automated ML creates a number of models in parallel pipelines that try different algorithms and parameters. It stops once it hits the exit criteria defined in the experiment.
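
The idea can be sketched as a loop over algorithm and parameter combinations that stops on an exit criterion. Everything here is a stand-in: `validation_score` returns made-up numbers instead of training anything, the two exit criteria (an iteration cap and a target score) are assumptions, and the candidates are tried one after another rather than in parallel.

```python
import itertools


def validation_score(algorithm, learning_rate):
    """Stand-in for 'train and validate'; returns a made-up score."""
    base = {"linear": 0.70, "tree": 0.80, "boosting": 0.85}[algorithm]
    return base + 0.1 - abs(learning_rate - 0.1)


def automated_search(max_iterations, target_score):
    """Try candidates until an exit criterion is hit; return (best, tried)."""
    candidates = itertools.product(["linear", "tree", "boosting"], [0.01, 0.1, 0.5])
    best = None
    tried = 0
    for algorithm, learning_rate in candidates:
        if tried >= max_iterations:  # exit criterion 1: iteration budget used up
            break
        score = validation_score(algorithm, learning_rate)
        tried += 1
        if best is None or score > best[0]:
            best = (score, algorithm, learning_rate)
        if score >= target_score:  # exit criterion 2: good enough, stop early
            break
    return best, tried


best, tried = automated_search(max_iterations=9, target_score=0.88)
print(best, tried)
```

A real run would also track the metric per pipeline and could stop on elapsed time; the loop structure stays the same.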