Cheat Sheet for Machine Learning Models

Alexander Chang · Published in Analytics Vidhya · Jan 29, 2020 · 15 min read

Understanding how to utilize algorithms ranging from random forest to TextRank for different use cases

Machine Learning has been popularized in recent years due to its scalability and accuracy in recognizing patterns and predicting outcomes. It has already been widely adopted in various industries, ranging from detecting fraudulent transactions to predicting customer churn.

There are so many readily available Machine Learning models out there that it can get quite overwhelming to learn, especially for those who are relatively new to data science. As I learn more about Machine Learning through my courses and work experiences, I’ve decided to write my first Medium article by covering some of the most popular ML algorithms at a very high level.

This guide is not meant to be a comprehensive laundry list of all the important ideas. Instead, it serves as an overview of how you can utilize ML algorithms for different purposes. I will introduce the models by use cases, such as predicting numeric values, binary classification, clustering, etc. I will focus on simplifying how each model works, analyzing model performance using metrics, and some other relevant concepts.

Predicting Numeric Values

For example, predicting sales based on advertisement spending.

Multiple Linear Regression

Multiple Linear Regression Model: y = B0 + B1x1 + B2x2 + … + Bpxp + ε
  • Estimate feature weights (coefficients) through least squares: minimizing the sum of the squared residuals, where residual = actual y value − predicted y value.
  • Evaluate each feature (coefficient) by its associated p-value, where the Null Hypothesis is that there is no relationship between the particular feature (x) and the target (y).
  • Assess overall model utility with the F-Statistic (if it is large, we reject the Null Hypothesis and conclude that at least one feature has a relationship with the target).
  • We can assess model accuracy with R squared, the proportion of variability in the target that can be explained by the features in our regression model (ideally we want R squared to be high, i.e. close to 1). Keep in mind that adding more predictors (features) will always improve the fit on the training data, thus increasing R squared, so be careful of overfitting: when a model fits the training data so well that it loses some ability to generalize to unseen data.
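As a minimal sketch of the ideas above, here is how a multiple linear regression could be fit with statsmodels; the advertising-style data and column names (tv_spend, web_spend, sales) are synthetic and made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic "sales vs. advertisement spending" data (made up for illustration)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tv_spend": rng.uniform(0, 100, n),
    "web_spend": rng.uniform(0, 50, n),
})
df["sales"] = 5 + 0.4 * df["tv_spend"] + 0.2 * df["web_spend"] + rng.normal(0, 3, n)

X = sm.add_constant(df[["tv_spend", "web_spend"]])  # add the intercept term
model = sm.OLS(df["sales"], X).fit()                # least squares fit

print(model.params)    # estimated coefficients
print(model.pvalues)   # p-value per coefficient
print(model.fvalue)    # overall F-statistic
print(model.rsquared)  # R squared
```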

Decision Tree Regression

Decision Trees can be used for both regression and classification. Here we focus on using a Decision Tree to predict some numeric value.

  • Decision Trees split the dataset into subsets at a specific node using standard deviation reduction. A sample of data points with similar values (homogenous) should have a very small standard deviation.
  • Mean and Standard Deviation are then used to compute the Coefficient of Variation (CV), which is used to decide when to stop branching:

Mean: x̄ = Σx / n

Standard Deviation: S = √( Σ(x − x̄)² / n )

Coefficient of Variation: CV = (S / x̄) × 100%

When we use decision trees for regression, we are concerned with finding the attributes that return the largest standard deviation reduction.

We choose the attribute with the largest SDR as the splitting node. Iteratively repeat this process until some stopping criterion is met: for example, when the CV for a branch becomes smaller than a threshold of 8%, stop splitting the node because the branch is “homogenous enough”.

Example: Building a Decision Tree for Regression

Suppose we are using 3 predictors:

  • Weather (rainy, sunny, or overcast),
  • Temperature (hot, mild, or cold), and
  • Humidity (high or normal)

to predict the number of hours spent outdoors.

  1. Compute the standard deviation of the target (# hours spent outdoors).
  2. Split the data on each of the 3 attributes (weather, temperature, humidity) and compute the standard deviation of each branch after the split. This result is subtracted from the standard deviation before the split.
  3. Calculate the Standard Deviation Reduction (SDR), where SDR(T, X) = S(T) − S(T, X) = S(# hours outdoors) − S(# hours outdoors, feature X)
  4. The attribute with the largest standard deviation reduction is used as the splitting node (in our case, weather is chosen as the root node).
  5. Divide data based on the values of the selected attribute (weather) until some stopping criteria (i.e. when CV for a branch becomes smaller than a threshold of 8%):

a) If the CV at a particular node is more than the threshold, the branch is not homogenous enough and we split it further by choosing the feature with the largest SDR.

b) If the CV at a particular node is less than the threshold, it becomes a leaf node as the branch is “homogenous enough”.
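A minimal sketch of the SDR calculation for one categorical feature, assuming pandas is available; the toy data, column names, and helper function are hypothetical.

```python
import pandas as pd

def sdr(df, feature, target):
    """Standard Deviation Reduction: S(T) - S(T, X)."""
    s_before = df[target].std(ddof=0)                   # S(T), population std
    weights = df[feature].value_counts(normalize=True)  # |T_v| / |T| per branch
    s_after = sum(
        w * df.loc[df[feature] == value, target].std(ddof=0)
        for value, w in weights.items()
    )                                                   # weighted S(T, X)
    return s_before - s_after

# Toy example: hours spent outdoors vs. weather (values made up)
df = pd.DataFrame({
    "weather": ["sunny", "sunny", "rainy", "overcast", "rainy", "overcast"],
    "hours_outdoors": [40, 48, 25, 52, 23, 46],
})
print(sdr(df, "weather", "hours_outdoors"))
```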

Improving Model Accuracy for Decision Trees

By creating a diverse set of different decision trees, we can build a random forest, which improves model accuracy and helps prevent overfitting. The specifics of random forests are discussed in more detail below.

Note that if we use decision trees for classification, we can largely replicate the same process as above. Instead of SDR, we use the Gini Impurity score as the metric for splitting nodes: it measures the likelihood of an incorrect classification if an observation were randomly classified according to the distribution of classes, so we want it to be low.

Note

There are many other techniques you can use for regression problems, such as neural networks and polynomial regression.

Binary Classification

For example, classifying whether an email is spam or not.

Logistic Regression

Logistic Regression aims to predict the probability that an observation belongs to a certain class (a value between 0 and 1).

The main idea behind Logistic Regression is the logit function:

log( p / (1 − p) ) = B0 + B1X

A linear combination of the inputs (the right-hand side) is mapped to the log of the odds, the ratio of the success and failure probabilities.

Let z = B0 + B1X denote this linear combination (an intermediate input). We then pass z into the sigmoid function, sigmoid(z) = 1 / (1 + e^(−z)), which returns a probability value between 0 and 1.

  • Estimate regression coefficients through maximum likelihood estimation: choosing the set of coefficients for which the predicted probability is as close as possible to the observed values.
  • We can assess model performance using Deviance. There are 2 types —
  1. Null Deviance: the response is predicted by a model with only an intercept term;
  2. Residual Deviance: the response is predicted by a model with an intercept term and additional predictors.

If the residual deviance is much smaller than the null deviance, we can conclude that including the additional predictors significantly improved the model fit.
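A minimal sketch of fitting a logistic regression and reading off the two deviances, using statsmodels on synthetic data; the features and coefficients are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary-classification data (spam vs. not spam is just the framing)
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                              # two made-up features
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0] - X[:, 1])))   # true probabilities
y = rng.binomial(1, p)

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()

print(model.params)          # coefficients estimated via maximum likelihood
print(model.null_deviance)   # deviance of the intercept-only model
print(model.deviance)        # residual deviance of the fitted model
```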

Naïve Bayes Classifier

This model is based on Bayes Theorem, with the key assumptions that:

  1. Each feature is conditionally independent of other features;
  2. Each feature has the same weight.
Bayes Theorem: P(c|x) = P(x|c) × P(c) / P(x). Source: https://www.saedsayad.com/naive_bayesian.htm

If we wanted to classify an email as spam (1) or not (0):

  • P(c|x) = P(y=1 | email x)
  • P(x|c) * P(c) = P(email x | y=1) * P(y=1)
  • P(x) = P(email x | y=1) * P(y=1) + P(email x | y=0) * P(y=0)
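A minimal sketch of a Naïve Bayes spam classifier using scikit-learn's MultinomialNB; the tiny corpus and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus for illustration
emails = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for tomorrow", "lunch with the team today",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)           # word counts are the features x
model = MultinomialNB().fit(X, labels)

# Columns are P(y=0 | email x) and P(y=1 | email x) for a new email
print(model.predict_proba(vec.transform(["free prize for the team"])))
```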

Assessing the Performances of Binary Classifiers

Confusion Matrix (Source: MathWorks)

Example: predicting whether a customer will return (positive) or not (negative).

We can use metrics derived from the confusion matrix, such as:

  • Accuracy = (TP + TN) / (TP + FN + FP + TN). Out of all possible data points, what proportion was correctly classified by our model?
  • Specificity = TN / (FP + TN). Out of all the customers that didn’t return (negative), what proportion was correctly classified by our model? (model accuracy in predicting negatives)
  • Sensitivity or Recall = TP / (TP + FN). Out of all the customers that returned (positive), what proportion was correctly classified by our model? (model accuracy in predicting positives)
  • Precision = TP / (TP + FP). Out of all the customers that were predicted to return (positive), what proportion actually returned?

In our example, we would prioritize how accurately our model classifies whether a customer will return. Thus, it makes sense for us to optimize the Precision and Recall scores, but there is a tradeoff between the two metrics.
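A minimal sketch of computing these metrics with scikit-learn; the labels below are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = customer returned, 0 = did not return
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:   ", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:  ", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:     ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("Specificity:", tn / (tn + fp))                   # TN / (TN + FP)
```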

Precision — Recall Tradeoff

ROC Curve

Based on some of the metrics mentioned above, we can derive the Receiver Operating Characteristic (ROC) curve as an evaluation tool.

The ROC curve visualizes how well the binary classifier distinguishes between the two classes. AUC simply means the area under the ROC curve: the higher the AUC (closer to 1), the better the diagnostic ability of our model.

Below are the three key metrics used to derive a ROC curve:

  • True Positive Rate / Recall / Sensitivity: out of all data points in the positive class, what proportion was correctly predicted?
  • True Negative Rate / Specificity: out of all data points in the negative class, what proportion was correctly predicted?
  • False Positive Rate = 1-True Negative Rate
ROC Curve. Source: researchgate.net
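A minimal sketch of deriving an ROC curve and its AUC with scikit-learn; the labels and predicted probabilities are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities from a binary classifier
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))      # closer to 1 is better
```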

Multi-Class Classification

For example, predicting if an image is a muffin, chihuahua, or cat.

Support Vector Machines (SVM)

  • The goal of SVM is to separate different classes with a hyperplane by maximizing the margin, which is the shortest distance between observations and the threshold (separation line for classification).
  • This is how SVM works in simple terms:
  1. We begin with datasets in a relatively low dimension.
  2. Transform data into higher dimension.
  3. Kernel functions are used to find classifiers that separate higher dimension data into different classes (by systematically computing the relationships between each pair of observations).
  • Support vector classifiers use soft margins, meaning some misclassifications are allowed (introducing some bias so the threshold is less sensitive to outliers) in exchange for better predictions on new data (lower variance). This is the bias-variance tradeoff.
  • SVM has several key parameters:
  1. C (regularization parameter) specifies how much you want to avoid misclassifying each example. High values of C mean a smaller margin will be used if it yields higher training accuracy. Lower values of C encourage a bigger margin while sacrificing some training accuracy.
  2. Gamma defines how far the influence of a single observation reaches. High gamma values mean only nearby points are considered, while low gamma values mean far-away points are also considered.
  3. Margin: the distance from the line of separation to the closest class points. A good margin means the separating hyperplane is roughly equidistant from all of the classes and as far from them as possible.
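A minimal sketch of an SVM classifier with scikit-learn, using the built-in Iris dataset as a stand-in multi-class problem; the chosen C and gamma values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Iris is a convenient 3-class stand-in for any multi-class problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space;
# C and gamma are the key parameters discussed above
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(model.score(X_test, y_test))
```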

Random Forest

A random forest operates on the basis of Ensemble Learning, which means using multiple decision trees with different subsets of features in order to improve model accuracy and help prevent overfitting. There are two important concepts in Ensemble Learning that are relevant here:

  • Bagging (bootstrap aggregation): reduce variance by repeatedly taking random samples from the dataset. Train on all bootstrapped datasets, obtain a prediction for each set, and average the predictions.
  • Boosting: trees are grown sequentially using information from previously grown trees. Start with weak learners (which fit simple models) and analyze their errors: when an input is misclassified, increase its weight so that later trees focus on it. (Random forests themselves rely on bagging; boosting underlies methods such as gradient boosting.)

A Random Forest can be constructed with the following steps:

  1. Create a bootstrapped dataset (estimate sampling distributions with random sampling with replacement).
  2. Select m (m < total number of features) random features in the bootstrapped dataset to build a decision tree at each step (node).
  3. Repeat steps 1 and 2 which results in a wide variety of decision trees.
  4. Since each decision tree casts a vote for classifying the input, we take the votes from all trees and assign the input the class with the maximum votes.
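A minimal sketch of a random forest with scikit-learn, using the built-in breast cancer dataset; the hyperparameter values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrapped sample and considers a random subset
# of features (max_features) at every split; predictions are aggregated by
# majority vote across trees
model = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```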

Unsupervised Learning — Clustering

For example, segmenting enterprise customers based on monthly recurring revenue and the monthly growth rate of company size.

K-Means Clustering

In K-Means, we try to maximize the Dunn Index, which is defined as min(inter-cluster distance) / max(intra-cluster distance).

  • Larger numerator values represent clusters that are far apart from each other.
  • Smaller denominator values mean that each cluster is relatively compact. In other words, points within a cluster are close to the cluster centroid.

We can build a K-Means Clustering model by using the following steps:

  1. Choose the number of clusters (k) using the Elbow Method: graph various values of k (x-axis) against the mean of squared distances from each data point to the centroid of its cluster (y-axis), and pick the value of k at which this distance stops decreasing significantly (the “elbow”). Then, randomly initialize the centroids of all k clusters.
  2. Then, the algorithm iterates between:
  • Data Assignment Step: each point is assigned to the nearest centroid by minimizing its squared distance to the centroid. In other words, we classify an observation to be in the cluster whose centroid is the closest to it.
  • Centroid Update Step: recompute the cluster centroids by taking the average of all observations in that cluster.

…until some stopping criteria (i.e. max number of iterations reached).

  • Key Advantage: computing distances between observations and centroids has a linear time complexity, O(n).
  • Key Disadvantage: results might lack consistency since the algorithm begins with randomly initialized cluster centers.
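A minimal sketch of K-Means and the Elbow Method with scikit-learn; the segmentation-style data is synthetic and made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customer-segmentation-style data (two made-up features)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[10, 1], scale=1.0, size=(50, 2)),
    rng.normal(loc=[50, 5], scale=2.0, size=(50, 2)),
    rng.normal(loc=[90, 2], scale=1.5, size=(50, 2)),
])

# Elbow method: inertia_ is the sum of squared distances to the nearest centroid
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Fit the chosen k and assign each point to a cluster
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```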

K Nearest Neighbors (kNN)

kNN can be used for both classification and regression.

  1. Choose the number of nearest neighbors (k) to be considered when classifying an observation (e.g. k = 11: out of the 11 nearest points to an unknown point P, 8 are green and 3 are red, so P is classified as green).
  2. Compute the distances between point P (test data) and each row of training data.
  3. Sort the distances in ascending order and pick the first k smallest entries.
  4. Obtain the labels of the selected k entries:
  • If classification: return the mode of the k labels.
  • If regression: return the mean of the k labels.
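A minimal sketch of kNN classification with scikit-learn on the built-in Iris dataset; k = 11 follows the example above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classify each test point by the majority label among its k nearest neighbors
model = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)
print(model.score(X_test, y_test))

# For regression, KNeighborsRegressor would return the mean of the k neighbors' values
```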

Hierarchical Clustering

  • The most common type is called agglomerative clustering, which is constructed in a bottom-up, hierarchical approach.
  1. This algorithm starts by treating each data point as its own cluster. Then, select a dissimilarity metric that measures the distance between clusters. For example, we can use average linkage, which is the average distance between data points in the 1st cluster and those in the 2nd cluster (compute all pairwise distances between observations in cluster 1 and observations in cluster 2, then average the dissimilarities).
  2. Combine the most similar pair of clusters in every iteration (choose the 2 clusters with the smallest average linkage).
  3. Repeat step 2 until everything becomes one single cluster containing all of the data points (we can also choose an arbitrary stopping criterion, e.g. stop combining clusters when only 2 clusters remain).
Source: kdnuggets.com
  • Key Advantages:
  1. Does not require human input to initially specify the number of clusters.
  2. Less sensitive to the choice of distance metric.
  3. Useful for datasets that have hierarchical structures.
  • Disadvantage: lower efficiency, O(n³), unlike K-Means Clustering which has linear complexity.
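A minimal sketch of agglomerative clustering with average linkage, assuming SciPy is available; the Iris dataset is just a convenient stand-in.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Bottom-up (agglomerative) clustering with average linkage
Z = linkage(X, method="average")

# Cut the tree into a chosen number of clusters (here 3)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])

# scipy.cluster.hierarchy.dendrogram(Z) would draw the full merge hierarchy
```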

Unsupervised Learning — Dimensionality Reduction

Principal Component Analysis (PCA)

  • If you have too many features to analyze, PCA can reduce the dimensionality of the feature space such that Principal Component 1 (PC1) accounts for the most variation in the dataset, PC2 accounts for the second most, and so on.
  • Here’s how PCA works in simple terms:
  1. Find a “line of best fit”: after centering the data, consider candidate lines through the origin and choose the one that minimizes the sum of squared distances from the original data points to their projections on the line. This is equivalent to choosing the line with the largest sum of squared distances (SS) from the projected data points (which now lie on the best fit line) to the origin.
  2. The best fit line that yields the largest SS from the projected points to the origin is known as Principal Component 1. Repeat this process for the other Principal Components (if you want a 2D PCA plot, simply take PC2 to be perpendicular to PC1).
  3. Transform the original data to align with these directions (the Principal Components), which compresses the feature space (dropping the components that do not account for much variation in the dataset) while preserving as much of the original variation as possible.
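A minimal sketch of PCA with scikit-learn on the built-in Iris dataset; the choice of two components is for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so that no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)        # project onto PC1 and PC2
print(pca.explained_variance_ratio_)      # variation explained by each PC
```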

Linear Discriminant Analysis (LDA)

LDA can be used for both classification and dimensionality reduction.

  • LDA creates a new axis from the dataset by maximizing the separability among the categories.
  • LDA generates this new axis by:
  1. Maximizing the distance among the means (of various categories);
  2. Minimizing the variation within each category.
  • If we have a 2D dataset (gene 1 and gene 2), LDA simply uses both genes to create a new axis and projects the original data onto this new axis to maximize the separation of the 2 categories (our data is now in 1D).
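A minimal sketch of LDA with scikit-learn on the built-in Iris dataset (three classes, so at most two discriminant axes).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project onto the axes that best separate the classes (at most n_classes - 1)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda[:5])

# The same fitted model can also classify new observations
print(lda.predict(X[:5]))
```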

PCA vs. LDA

  • Both models try to reduce dimensions and rank the new axes in order of importance.
  • PCA creates PC1 which accounts for the most variation in our data, while LDA creates LD1 which accounts for the most variation between categories.
  • LDA is suitable for small datasets and can be used to predict 3 or more classes, while PCA is useful for visualizing higher dimensional data or dropping the variables that do not account for much variation in our data.

Processing Text Data

For example, summarizing the main points of an article.

Bag-of-Words

A bag-of-words (BoW) is a representation of text that specifies the frequencies of words within a document. The main idea behind BoW is quite intuitive: build a vocabulary of known words and count the occurrences of those words in a given document.

This “bag” of words ignores the locations, structure, or the order of words in a document; the model only considers the occurrences of known words in the document.
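A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer; the two sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)        # document-term count matrix
print(vec.get_feature_names_out())      # the learned vocabulary
print(counts.toarray())                 # word frequencies per document
```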

TF-IDF (Term Frequency — Inverse Document Frequency)

This model is used to reflect how important a word is to a document.

  • Term Frequency = the word’s frequency in a given document / total number of words in that document
  • Inverse Document Frequency = log(total # of documents / # of documents containing that word)
  • TF-IDF score = TF × IDF

We can use TF-IDF to find the meanings of documents by following these steps:

  1. Data Cleaning: remove special characters, change all words to their lower-cased root words.
  2. Split into individual words and frequencies.
  3. Compute Term Frequency * Inverse Document Frequency for each word.
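A minimal sketch of these steps with scikit-learn's TfidfVectorizer (note that its IDF uses a slightly smoothed variant of the formula above); the documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning predicts outcomes",
    "deep learning is a subset of machine learning",
    "statistics underpins machine learning",
]

# Lowercasing and tokenization are handled by the vectorizer; stemming or
# lemmatization would need an extra library such as NLTK or spaCy
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)

# TF-IDF weight of each vocabulary word in the first document
print(dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0])))
```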

TextRank

TextRank can be a powerful tool for text summarization.

  • Data Cleaning: remove special characters and punctuation, and get rid of stop words (is, am, the, of, etc.).
  • Vector Representation: can use imported word embeddings, bag-of-words, or TF-IDF to represent sentences.
  • Construct the Similarity Matrix using cosine similarity. Calculate the cosine of the angle between two sentence vectors to determine whether they point in a similar direction (if so, the angle is narrow and the two sentences are similar).
  • Convert the Similarity Matrix into a graph, where the nodes represent sentences and the edges represent the similarity scores between the sentences.
  • Lastly, extract the top n sentences based on their rankings.
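A minimal sketch of extractive summarization in the TextRank spirit, assuming scikit-learn and networkx are available; the sentences and the choice of TF-IDF vectors (rather than word embeddings) are for illustration.

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Machine learning recognizes patterns in data.",
    "It is widely used to predict customer churn.",
    "Pattern recognition in data drives many predictions.",
    "The weather was pleasant yesterday.",
]

# Represent each sentence as a TF-IDF vector (word embeddings would also work)
vectors = TfidfVectorizer().fit_transform(sentences)

# Similarity matrix -> weighted graph -> PageRank score per sentence
sim = cosine_similarity(vectors)
np.fill_diagonal(sim, 0)                      # drop self-similarity
scores = nx.pagerank(nx.from_numpy_array(sim))

# Extract the top 2 sentences, presented in their original order
top = sorted(scores, key=scores.get, reverse=True)[:2]
print([sentences[i] for i in sorted(top)])
```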

Ranking Algorithms

For example, ranking products to users based on relevance.

Learning to Rank (LTR)

  • LTR ranks items by training a model to predict the probability of a particular item ranking over another item (“assign a score” to each item such that those with higher rankings also have higher scores)
P(i ranks over j) = 1 / (1 + e^−(s_i − s_j)), the Sigmoid Function, where s = the score of an item
  • Using the Sigmoid Function, a large difference in scores results in a larger probability of item i ranking over item j, while equal scores result in a probability of 0.5.
Source: https://mlexplained.com/2019/05/27/learning-to-rank-explained-with-code/
  • Define a loss function J (which maps events onto a real number as their associated “cost”), which is optimized using gradient descent (if you were trying to get down from a barely visible mountain, analyze the slope at your current position and then proceed in the direction of steepest descent, i.e. downhill)
Loss = −log P(i ranks over j), the negative log likelihood of the probability that i ranks higher than j
Source: https://mlexplained.com/2019/05/27/learning-to-rank-explained-with-code/
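A minimal sketch of pairwise learning to rank with a linear scoring model and the loss above, using plain NumPy; the feature vectors and preference pairs are hypothetical.

```python
import numpy as np

# Hypothetical data: each row of X is a feature vector for one item, and
# `pairs` lists (i, j) meaning item i should rank above item j
X = np.array([[1.0, 0.2], [0.3, 0.8], [0.9, 0.5], [0.1, 0.1]])
pairs = [(0, 1), (2, 3), (0, 3)]

w = np.zeros(X.shape[1])  # linear scoring model: score(x) = w . x
lr = 0.1                  # learning rate for gradient descent

for _ in range(200):
    grad = np.zeros_like(w)
    for i, j in pairs:
        s_diff = X[i] @ w - X[j] @ w
        p = 1 / (1 + np.exp(-s_diff))     # P(i ranks over j)
        grad += (p - 1) * (X[i] - X[j])   # gradient of -log(p) w.r.t. w
    w -= lr * grad / len(pairs)

scores = X @ w
print(np.argsort(-scores))  # items ordered by predicted score
```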

Hopefully this article can serve as a practical introduction to the diverse, powerful set of Machine Learning tools you can deploy for drastically different use cases.
