
Udacity’s Machine Learning Engineer Nanodegree (MLND) — Term 1

Ann-Kristin Juschka · Published in NEW IT Engineering · 7 min read · Feb 20, 2019


About me

After earning my Ph.D. in pure mathematics from Heidelberg University, I joined Accenture Technology as a Software Engineer in September 2018. During my studies, I chose computer science and economics as minors. Hence, I met Udacity’s requirements for this Nanodegree quite well: intermediate knowledge of statistics, calculus, linear algebra, and Python programming.

Why Udacity’s MLND?

Machine learning is a hot topic for math and computer science majors, so it made sense for me to gain deeper knowledge in this area. As a first step, I researched MOOCs and took a closer look at the two most popular ones: Coursera’s Machine Learning from Stanford University and Udacity’s Intro to Machine Learning. Personally, I liked Udacity’s video quality and exercises much better. In addition, I prefer Python over Octave/MATLAB, the former also being the most popular programming language in the machine learning field. Finally, Udacity’s MLND schedule provides a complete course in machine learning: from supervised and unsupervised machine learning algorithms in Term 1 to deep learning and reinforcement learning in Term 2. As I could make time for a machine learning course one day per week, I gladly took the opportunity to commit myself to Udacity’s MLND for the next six months.

How is the first term structured and which services does Udacity provide?

The first term of the Nanodegree starts by giving an overview of its three sections:

  1. Machine Learning Foundations plus Model Evaluation and Validation — Project: Predicting Boston Housing Prices
  2. Supervised Learning — Project: Finding Donors for CharityML
  3. Unsupervised Learning — Project: Creating Customer Segments

Next, Udacity lists the different ways it supports Nanodegree students: detailed reviews after the projects, mentored classrooms, a “Knowledge” platform similar to StackOverflow, and career services. While I found the classrooms with their chat function a bit convoluted, the code reviews are very helpful and often provide extra study material. I didn’t use Knowledge or the career services.

Section 1 — Machine Learning Foundations plus Model Evaluation and Validation

The first part motivates machine learning: As humans learn from experience, we can similarly use previous data to train computers and let them make predictions. This leads to exciting applications like image and voice recognition, spam detection, fraud detection, teaching a computer to play chess, and self-driving cars. Udacity continues with videos explaining the following machine learning algorithms (a short scikit-learn sketch follows the list):

  • Decision Trees: This algorithm asks a series of questions about the data features. Every question forms a node, which for each possible answer points to one child node. The resulting hierarchy is encoded as a tree.
  • Naïve Bayes: This classifier is based on Bayes’ Theorem and conditional probability: That is, the probability that an event will happen given that another event has happened. Usually, the Naïve Bayes classifier combines this model with a decision rule: Pick the most probable hypothesis. The model is called “naïve” because it assumes that every pair of features is conditionally independent.
  • Linear Regression: This algorithm defines a linear relationship between the feature(s) and the output variable.
  • Support Vector Machines: Using hyperplanes, an SVM divides the data set into classes such that the hyperplane has the largest distance to the nearest training data points of any class.
  • K-Means Clustering: This unsupervised learning algorithm iteratively assigns every data point to the nearest of K centroids, which define the K clusters. In each iteration, every centroid is then moved to the center of its own cluster.
  • Hierarchical clustering: This unsupervised learning algorithm repeatedly merges the two nearest/most similar clusters into one cluster, resulting in a dendrogram that shows the hierarchical relationship between the clusters.
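
To make these descriptions concrete, here is a minimal scikit-learn sketch that trains two of the listed algorithms. The bundled Iris data set and the max_depth value are my own illustrative choices, not taken from the course.

    # Train a decision tree and a naïve Bayes classifier on the Iris data set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for model in (DecisionTreeClassifier(max_depth=3), GaussianNB()):
        model.fit(X_train, y_train)  # learn from the training data
        print(type(model).__name__, model.score(X_test, y_test))  # held-out accuracy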

After introducing these algorithms, metrics for evaluating machine learning models are presented. For instance, in a medical model that predicts whether a patient is sick, a high “true positive rate” a.k.a. recall is important, since all sick patients should receive therapy. Conversely, in a spam classification model a high “true negative rate” a.k.a. specificity is important, as we do not want to send non-spam emails to the spam folder. Other metrics such as accuracy, F1 score, F-beta score, ROC curve, and R2 score are also explained.
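
As a rough illustration, the following sketch computes several of these metrics with scikit-learn; the label vectors are hypothetical and made up for this example.

    # Compute the evaluation metrics named above on hypothetical labels.
    from sklearn.metrics import (accuracy_score, f1_score, fbeta_score,
                                 precision_score, recall_score)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = sick/spam, 0 = healthy/non-spam
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))  # true positive rate
    print("precision:", precision_score(y_true, y_pred))
    print("F1:       ", f1_score(y_true, y_pred))
    print("F0.5:     ", fbeta_score(y_true, y_pred, beta=0.5))  # favors precision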

In the next lesson, the concept of cross-validation is introduced for making decisions about our model. By plotting the cross-validation and training scores of several models in a complexity graph, we see whether a model is too general and “underfits,” or whether it memorizes the training data and “overfits.” Similarly, the learning curve, given by the cross-validation and training scores of a fixed model as a function of the number of training points, helps to detect over- and underfitting.
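
A learning curve can be computed directly in scikit-learn; the sketch below uses the Iris data set and a depth-limited decision tree purely as illustrative stand-ins.

    # Training vs. cross-validation score as a function of training set size.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import learning_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    sizes, train_scores, cv_scores = learning_curve(
        DecisionTreeClassifier(max_depth=3), X, y, cv=5,
        train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])

    # A high training score paired with a much lower cross-validation score
    # indicates overfitting; two low scores indicate underfitting.
    print(train_scores.mean(axis=1))
    print(cv_scores.mean(axis=1))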

Finally, grid search trains the model with all combinations of the chosen parameter values and selects the combination that yields the best score on the cross-validation data set.
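
In scikit-learn this is provided by GridSearchCV. The following sketch tunes a decision tree regressor with the R2 score, roughly in the spirit of the first project, though the synthetic data and the parameter grid are my own illustrative choices.

    # Grid search over max_depth, scored by R2 on cross-validation folds.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
    grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                        {"max_depth": [2, 4, 6, 8, 10]},
                        scoring="r2", cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)  # best parameters and their CV score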

These various techniques from Section 1 are used in the first project, “Predicting Boston Housing Prices,” to train the “best” decision tree regressor via grid search with the R2 score.

Section 2 — Supervised Learning

In this section, the above supervised learning algorithms are explored in more detail in individual lessons. Each lesson provides an exercise applying the algorithm in scikit-learn.

In addition, the following algorithms are explained:

  • Perceptron Algorithm: The perceptron algorithm is the building block of neural networks. For n features x_1, …, x_n, the perceptron algorithm finds an (n-1)-dimensional hyperplane given by the equation w_1x_1+…+w_nx_n+b = 0 that classifies the data.
  • Random Forests: To avoid overfitting, one trains different decision trees on random subsets of the training data and combines their predictions by voting (in case of classification problems) or averaging (in case of regression).
  • Further Ensemble Methods: This lesson explains two ensemble methods: bagging (bootstrap aggregating) and boosting. Bagging combines the predictions from “weak” learning algorithms on subsamples of the data set by either averaging (in case of regression) or voting (in case of classification). AdaBoost is an adaptive ensemble method in the sense that each subsequent “weak” learning algorithm focuses on correctly classifying the data points that were misclassified by previous weak learners. Finally, AdaBoost combines the weak learners by weighting their output according to their accuracy (see the comparison sketch after this list).
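
The following sketch compares a single decision tree with the ensemble methods on a synthetic data set; all model parameters are scikit-learn defaults, and the data is generated only for illustration.

    # Compare a single tree with bagging, random forest, and AdaBoost.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "bagging": BaggingClassifier(random_state=0),
        "random forest": RandomForestClassifier(random_state=0),
        "AdaBoost": AdaBoostClassifier(random_state=0),
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())  # mean CV accuracy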

This section closes with the project “Finding Donors for CharityML,” for which I chose three of the supervised learning algorithms to predict whether an individual makes more than $50,000 annually, as this can help a non-profit organization better understand how large a donation to request, or whether to reach out at all. After evaluating the three models, I selected the random forest, which had the best scores, and fine-tuned its parameters using grid search with the F0.5 score for high precision.
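
For the fine-tuning step, scikit-learn’s make_scorer turns the F-beta score into a grid search scorer. The sketch below shows the idea on synthetic data, since the project’s census data set is not reproduced here, and the parameter grid is my own illustrative choice.

    # Tune a random forest for high precision via an F0.5 scorer.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import fbeta_score, make_scorer
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)
    f05 = make_scorer(fbeta_score, beta=0.5)  # beta < 1 weights precision higher
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        {"n_estimators": [50, 100], "max_depth": [5, 10, None]},
                        scoring=f05, cv=5)
    grid.fit(X, y)
    print(grid.best_params_)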

Section 3 — Unsupervised Learning

The last section of Term 1 starts by examining the two algorithms K-Means Clustering and Hierarchical Clustering in more detail. In addition, density-based spatial clustering of applications with noise (DBSCAN) is introduced, which is robust to noise: points with a minimum of n points in their epsilon-neighborhood are classified as “core” points of a cluster, the remaining cluster points as “border” points, and all other points as noise.

In the next lesson, a fourth clustering algorithm, Gaussian Mixture Models (GMM), is explained, which uses mixtures of (multivariate) Gaussian distributions to model the input data via an expectation-maximization approach. The lesson finishes with different indices for validating the clustering result, such as the silhouette score.
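
Here is a rough sketch that runs all four clustering algorithms on synthetic blobs and validates each result with the silhouette score; the data and the parameter values are illustrative choices of mine.

    # Cluster synthetic blobs with four algorithms and compare silhouette scores.
    from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = {
        "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
        "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
        "DBSCAN": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
        "GMM": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    }
    for name, l in labels.items():
        print(name, silhouette_score(X, l))  # closer to 1 means better separation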

The following lesson deals with feature scaling, which is an important pre-processing step for unsupervised learning algorithms when analyzing various features together. In scikit-learn, this re-scaling of features is provided by the MinMaxScaler.
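
A minimal MinMaxScaler sketch, with a tiny made-up data set, shows how each feature is rescaled to the range [0, 1].

    # Rescale two features with very different ranges to [0, 1].
    from sklearn.preprocessing import MinMaxScaler

    data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
    print(MinMaxScaler().fit_transform(data))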

Another pre-processing step is feature selection and dimensionality reduction. The next lesson covers principal component analysis (PCA), which projects the data onto the directions of maximum variance and thereby transforms correlated features into linearly uncorrelated components.
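
A short PCA sketch under the same caveats, with random data standing in for a real feature matrix:

    # Reduce scaled data to two principal components with PCA.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler

    X = np.random.default_rng(0).normal(size=(100, 6))  # hypothetical feature matrix
    X_scaled = MinMaxScaler().fit_transform(X)

    pca = PCA(n_components=2).fit(X_scaled)
    X_reduced = pca.transform(X_scaled)   # projection onto the top two directions
    print(pca.explained_variance_ratio_)  # variance captured by each component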

The final lesson of the third section deals with two other methods for feature selection, namely Random Projection and Independent Component Analysis (ICA).
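
Both methods are also available in scikit-learn; here is a brief sketch on random data, just to show the interfaces (the component counts are arbitrary).

    # Reduce 20 features to 5 with a random projection and with FastICA.
    import numpy as np
    from sklearn.decomposition import FastICA
    from sklearn.random_projection import GaussianRandomProjection

    X = np.random.default_rng(0).normal(size=(100, 20))
    X_rp = GaussianRandomProjection(n_components=5).fit_transform(X)
    X_ica = FastICA(n_components=5, random_state=0).fit_transform(X)
    print(X_rp.shape, X_ica.shape)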

Term 1 finishes with the unsupervised learning project “Creating Customer Segments,” where a wholesale customer data set is clustered into two segments after removing outliers and applying PCA to reduce the data to two principal components.

So, should you subscribe to Term 1 of MLND?

In my opinion, Udacity’s Machine Learning Engineer Nanodegree is an excellent introductory machine learning course. The short videos are fun to watch, and every algorithm is explained with examples in simple terms. As a math major, I wish that the mathematical side of the algorithms were explained in greater detail, or that references for further reading were given more often. Moreover, I had to resubmit the last project twice, and each review provided further helpful visual tools for analyzing the data; it would be nice to see such additional tools even after a successful first submission. I finished Term 1 in two months instead of three, so I had time to read more about the material. The only real downside is the high price of 999 USD. Overall, I learned many useful machine learning algorithms, and I recommend Term 1 to everyone interested in machine learning. There are free courses in Python, Probability and Statistics, and Linear Algebra to learn the prerequisites.

Ann-Kristin Juschka
Software Engineer and math Ph.D. working at Accenture Technology