Course Review — Data Science and Machine Learning Essentials

Manas Ranjan Kar · Published in NLP Wave · Nov 2, 2015

I recently completed an EdX course sponsored by Microsoft — Data Science and Machine Learning Essentials. It dealt with the basics of machine learning and with creating machine learning web services using Azure ML. I have been working in the area of NLP/ML for quite some time now, and I had completed multiple courses by Andrew Ng, Yaser S. Abu-Mostafa and Christopher Manning when I started out in the field.

I was mainly interested in seeing how Azure ML could help me create a scalable web service, so I tried to pay extra attention to the Azure labs to understand what can and cannot be done on the platform. A brief overview of Azure ML:

What is Azure Machine Learning?

Azure Machine Learning is a fully managed service that you can use to create, test, operate, and manage predictive analytics solutions in the cloud. With only a browser, you can sign in, upload data, and immediately start machine learning experiments. Drag-and-drop predictive modeling, a large palette of modules, and a library of starting templates make common machine learning tasks simple and quick. For more information, see the Azure Machine Learning service overview. For a machine learning introduction covering key terminology and concepts, see Introduction to Azure Machine Learning.

What is Machine Learning Studio?

Machine Learning Studio is a workbench environment you access through a web browser. Machine Learning Studio hosts a palette of modules with a visual composition interface that enables you to build an end-to-end data-science workflow in the form of an experiment.

For more information about Machine Learning Studio, see What is Machine Learning Studio?

What is the Machine Learning API service?

The Machine Learning API service enables you to deploy predictive models built in Machine Learning Studio as scalable, fault-tolerant web services. The web services created by the Machine Learning API service are REST APIs that provide an interface for communication between external applications and your predictive analytics models.

See Connect to a Machine Learning web service for more information.
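
To make that concrete, calling a published web service from Python looks roughly like the sketch below. The endpoint URL, API key, column names and payload layout are placeholders; Azure ML generates the exact sample request code for each published service.

```python
import requests

# Placeholders -- copy the real endpoint URL and API key from the
# web service dashboard in Machine Learning Studio.
ENDPOINT_URL = "https://<region>.services.azureml.net/.../execute?api-version=2.0"
API_KEY = "<your-api-key>"

# The request body depends on the inputs your experiment expects;
# the structure below is only illustrative.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["feature_1", "feature_2"],
            "Values": [[1.0, 2.0]],
        }
    },
    "GlobalParameters": {},
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + API_KEY,
}

response = requests.post(ENDPOINT_URL, json=payload, headers=headers)
response.raise_for_status()
print(response.json())
```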

The machine learning concepts were beginner level and were well explained by the instructors. The major learning points from the course were as follows:

1. You can quantize a variable to reduce the number of categories in a categorical variable, to balance the number of cases in each category of a categorical variable, or to transform a numeric variable into a categorical one (see the binning sketch after this list).

2. Visualizing a dataset helps you understand the relationship between the features and the label: which features are likely to be predictive of the label and should be used in training, which features are redundant or collinear with other features and should be eliminated before training, and which features are unlikely to be predictive of the label and should be removed from the dataset.

3. When features have a nonlinear relationship with the label, you can engineer new features by converting the original features to polynomials, logarithms, or exponentials in order to find a more linear relationship that will work better in a linear model (sketched after this list).

4. You should split the dataset into three independently selected, non-overlapping subsets to ensure that the data used to train, test, and validate the machine learning model are independent, preventing unintentional bias (a scikit-learn version of this split is sketched after this list).

5. You should prune features that are collinear or co-dependent on other features in the dataset to prevent bias and instability when training the model. You should prune features that increase model error during training and testing to improve model performance. You should prune features that have little impact on model performance during training and testing to reduce model complexity and improve generalization (a correlation-based check is sketched after this list).

6. Conditioning allows you to view three or more dimensions of a dataset on a two-dimensional projection by grouping or sub-setting the data by one or more conditioning variables, and to understand relationships between three or more variables in your dataset.

7. The residuals (errors) of a good regression model should exhibit a random distribution with no particular structure with respect to the values of the label or the features. The randomness and lack of structure indicate that the model fits the data well and that the information in the features has been exploited by the model (a residual plot is sketched after this list).

8. A model which generalizes well should show consistent performance metrics (accuracy, precision, recall, F1) across the folds in cross-validation. The consistency indicates that the model generalizes well, since the performance is insensitive to the test data in each of the folds. One measure of this consistency is a standard deviation that is significantly smaller than the mean for each metric. Finally, the mean performance metrics should exhibit acceptable values (see the cross-validation sketch after this list).

9. Features for two-class classification should exhibit separation in their values or categories between the two label categories. This separation is what allows the model to separate the label categories, so these features should be retained in the dataset. Features which exhibit poor separation between the two categories of the label are unlikely to aid in classification and should be removed from the dataset, as they can only add noise or result in poor generalization of the model (a quick separation check is sketched after this list).

10. The principal component projections of the cluster ellipses summarize the properties of the clusters. The projected ellipses should show distinct properties. When clusters exhibit good separation, the projections of the ellipses for the first two principal components of the clusters will have distinct directions of the major axes, and the lengths of the major and minor axes will be distinctly different in each ellipse.

11. The goal of pruning is to eliminate features from the dataset which either reduce model performance or have no impact on model performance but increase complexity and may reduce how well the model generalizes. Feature importance can be a useful guide to finding pruning candidates (see the sketch after this list). Once a feature has been pruned from a dataset, the performance of the model must be measured to assess the actual effect.

12. To split data for recommendation, use the Recommender Split splitting mode of the Split module. This evenly distributes user-item pairs into the training and test sets.

13. When using the Item Recommendation mode, the Evaluate Recommender module displays the NDCG (Normalized Discounted Cumulative Gain) metric for the scored model, with 1 representing a perfect model.

14. The Project Columns module allows you to exclude unneeded columns. In this case, a user will provide all of the columns required for the Web Service Input. They only require the scored or predicted value as output. Placing the Project Columns module between the Score module and the Web Service Output module allows you to provide only the one required column as a response to a service request.

15. To access a published web service, the client application must specify the endpoint URL and secure access key for the web service.

16. Lower recall means more false negatives (FNs); lower precision means more false positives (FPs). A small worked example follows this list.

17. You should handle the outliers first, or the extreme values in the column will affect how the numeric values are scaled (points 17-20 are sketched together after this list).

18. Removing rows containing an outlier, interpolating or imputing a new value to replace the outlier, or substituting a fixed value are the options for treating outliers.

19. Missing values can cause errors in some machine learning model calculations.

20. Scaling features helps prevent numeric features with large values from dominating the training of the machine learning model.

21. Azure ML includes modules for running R, Python, and SQL scripts.
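
A few of these points are easier to see in code, so here are some short sketches that use pandas and scikit-learn as stand-ins for the corresponding Azure ML modules; all column names and datasets below are invented or sample data. First, point 1: quantizing (binning) a numeric variable with pandas.

```python
import pandas as pd

# Hypothetical numeric column used for illustration.
df = pd.DataFrame({"age": [22, 35, 47, 51, 63, 78]})

# Quantize (bin) the numeric variable into a small set of categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120],
                        labels=["young", "middle", "senior", "elderly"])
print(df)
```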
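
For point 3, a minimal sketch of engineering transformed versions of a feature with NumPy and pandas; the income column is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with a curved relationship to the label.
df = pd.DataFrame({"income": [20000, 45000, 80000, 150000, 300000]})

# Candidate transformed features; a linear model can then use whichever
# version has the most linear relationship with the label.
df["income_log"] = np.log(df["income"])
df["income_squared"] = df["income"] ** 2
df["income_sqrt"] = np.sqrt(df["income"])
print(df)
```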
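
Point 4 in scikit-learn terms (the course itself uses the Split module in Azure ML; this is just the equivalent idea on toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real feature matrix and label vector.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Carve off a held-out test set first, then split the remainder into
# training and validation sets: three non-overlapping subsets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```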
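
A correlation-based check for point 5, on a made-up dataset where one column is nearly a copy of another:

```python
import numpy as np
import pandas as pd

# Toy dataset where "b" is almost a copy of "a".
rng = np.random.RandomState(0)
a = rng.rand(200)
df = pd.DataFrame({"a": a,
                   "b": 2 * a + rng.normal(0, 0.01, 200),
                   "c": rng.rand(200)})

# Flag feature pairs whose absolute correlation exceeds a threshold;
# one member of each such pair is a pruning candidate.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b']
```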
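
A residual plot for point 7, using a synthetic regression problem and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data and a simple linear fit.
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# For a good fit the points form a structureless band around zero;
# curvature or a funnel shape suggests the model is missing something.
plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="grey")
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```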
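
For point 8, checking fold-to-fold consistency with scikit-learn's cross-validation on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Consistent fold-to-fold scores (std much smaller than the mean)
# suggest the model generalizes rather than fitting one lucky split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print("per-fold:", np.round(scores, 3))
print("mean: %.3f  std: %.3f" % (scores.mean(), scores.std()))
```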
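
A quick numeric check of separation for point 9, using scikit-learn's breast cancer sample dataset (any two-class dataset would do):

```python
from sklearn.datasets import load_breast_cancer

# Compare one feature's distribution across the two label categories.
df = load_breast_cancer(as_frame=True).frame
print(df.groupby("target")["mean radius"].describe())

# Clearly shifted means/quartiles between the classes suggest a useful
# feature; near-identical distributions suggest a pruning candidate.
```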
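
For point 11, one way to get a feature-importance ranking (here from a random forest on synthetic data); remember to re-measure performance after actually dropping anything:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# Rank features by importance; low-importance features are pruning candidates.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_,
                        index=[f"f{i}" for i in range(X.shape[1])])
print(importances.sort_values(ascending=False))
```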
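
The small worked example promised in point 16, with invented confusion-matrix counts:

```python
# 100 actual positives: the model finds 70 (TP) and misses 30 (FN),
# while also flagging 20 negatives as positive (FP).
TP, FN, FP = 70, 30, 20

recall = TP / (TP + FN)     # 0.70 -- more false negatives pushes this down
precision = TP / (TP + FP)  # ~0.78 -- more false positives pushes this down
print("recall = %.2f, precision = %.2f" % (recall, precision))
```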
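
Points 17-20 together, as a pandas/scikit-learn sketch on an invented column: treat the outlier, impute the missing value, then scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy column with a missing value and an extreme outlier.
df = pd.DataFrame({"spend": [10.0, 12.0, 11.0, np.nan, 9.0, 500.0]})

# 1. Treat the outlier first (here by clipping to a cap) so it does not
#    distort the scaling below; dropping or imputing are alternatives.
df["spend"] = df["spend"].clip(upper=50.0)

# 2. Impute the missing value so downstream steps don't error out.
df["spend"] = df["spend"].fillna(df["spend"].median())

# 3. Scale so large-valued features don't dominate model training.
df["spend_scaled"] = MinMaxScaler().fit_transform(df[["spend"]]).ravel()
print(df)
```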

FINAL VERDICT

Extremely useful for machine learning newbies; it explains the basics in a clear and concise manner. Azure ML has a lot of functions related to data cleaning, integration with Python/R/SQL, and creating a web service from the model. The best part is its capability to execute custom Python scripts, which allows me to add or create custom features. An example of binary classification for sentiment analysis is here. This is a critical element for feature engineering — well done, Microsoft!
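
Custom Python runs through the Execute Python Script module, whose entry point looks roughly like the sketch below; the review_text column and the length feature are made up for illustration.

```python
import pandas as pd

# Entry point expected by the Execute Python Script module: it receives
# up to two DataFrames and must return a tuple containing a DataFrame.
def azureml_main(dataframe1=None, dataframe2=None):
    # Hypothetical custom feature: length of a text column.
    dataframe1["review_length"] = dataframe1["review_text"].str.len()
    return dataframe1,
```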

Currently, I am working on an ML API using Azure to reproduce a few of my models. It will be interesting to see if I can re-create some of the complex features I created for some use cases like topic classification, sentiment analysis and gender detection.

But how does AWS ML fare compared to Azure ML? That goes on my to-do list now!
