# My ML Glossary: Part 1

As I familiarize myself with machine learning, I am keeping notes on the concepts I come across. Someone else in the same position may benefit from them, so here goes the first part of the glossary.

Confusion matrix: A confusion matrix for a classification model is a table that summarizes its performance. It displays how many times the model predicted each class correctly and how many times it got it wrong. The horizontal axis shows actual labels and the vertical axis shows predictions. As an example, let's say we have a model that filters spam messages. There can be 4 possible outcomes:

1. When an email is actually spam and the model also identifies it as spam, the case is a True Positive (TP).
2. When the model marks a legitimate email as spam, the case is a False Positive (FP).
3. When the model labels an actual spam email as not spam, the case is a False Negative (FN).
4. When the model correctly labels a non-spam message as not spam, the case is a True Negative (TN).

These cases are depicted in the following diagram. This representation of model performance is called the confusion matrix.

Type 1 error: False positive errors are called Type 1 errors.

Type 2 error: False Negative errors are called Type 2 errors.
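A minimal sketch of how the four outcomes can be counted, assuming labels are encoded as 1 for spam and 0 for not spam. The two label lists and the helper name `confusion_counts` are made up for illustration.

```python
# Hypothetical data: 1 = spam, 0 = not spam.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # spam, flagged as spam
        elif a == 0 and p == 1:
            fp += 1          # legitimate, flagged as spam (Type 1 error)
        elif a == 1 and p == 0:
            fn += 1          # spam, missed by the model (Type 2 error)
        else:
            tn += 1          # legitimate, left alone
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=5
```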

Accuracy: It is a metric to evaluate a classification model. It is the fraction of predictions the model got right. So, Accuracy = Total number of correct predictions/Total number of predictions.

A model with an Accuracy of 95% classified 95 out of every 100 samples correctly.

The formula for accuracy is:

Accuracy = (TP+TN)/(TP+FP+TN+FN)
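The formula translates directly into code. This is a small sketch with made-up counts chosen so that the result matches the 95% example above.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total

# Hypothetical counts for 100 predictions.
print(accuracy(tp=40, fp=3, tn=55, fn=2))  # 0.95
```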

## Precision & Recall:

I am putting these two concepts under the same label as they are easier to understand together. As metrics they usually pull in opposite directions: tuning a model to improve one often lowers the other.

Precision: It is the metric that tells us, out of all the samples that the model labeled as positive (TP and FP), what fraction was correct. The mathematical formula for this is:

Precision = TP/(TP+FP)

Recall: It is the metric that tells us, out of all the actual positive samples (TP and FN), what fraction was identified by the model. The mathematical formula for this is:

Recall = TP/(TP+FN)

I will explain both of them using the example of a spam email filter. Let’s say we have a model that needs to identify spam emails. Precision for this model tells us what fraction of the emails marked as spam by the model were actually spam. Recall tells us what fraction of the actual spam messages were correctly marked as spam by the model.
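Here is the spam filter example as a short sketch. The counts are invented: the filter flags 50 emails, of which 40 really are spam, and it misses 20 spam emails.

```python
def precision(tp, fp):
    """Of everything flagged as spam, how much really was spam?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all the real spam, how much did the filter catch?"""
    return tp / (tp + fn)

# Hypothetical counts: TP=40, FP=10, FN=20.
p = precision(tp=40, fp=10)
r = recall(tp=40, fn=20)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.67
```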

F1-score: It is the harmonic mean of Precision and Recall. The mathematical formula for this is:

F1 = 2 * (Precision * Recall)/(Precision+Recall)

We want our model to have a high F1 score.

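F1 score is a special case of a more general weighted measure, written here as F-beta. The formula is:

F-beta = 1/(beta * (1/Precision) + (1-beta) * (1/Recall))

Here beta is a weight between 0 and 1. The higher the value of beta, the more importance is given to Precision. For example, a spam filter for which flagging a legitimate email is costly needs high precision, so beta will be high. A cancer screening model, where missing an actual case is usually the costlier mistake, leans towards Recall, so beta will be low.

F1 score is the case beta = 0.5, where equal importance is given to both Precision and Recall and the F-beta equation reduces to the formula for F1 score. Note that this weighted form uses a different parameterization from the more common F-beta = (1+beta^2) * (Precision * Recall)/(beta^2 * Precision + Recall), in which beta = 1 gives the F1 score and a higher beta favors Recall.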
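A quick numerical check of the weighted formula as it is written in this post, where beta is the weight on Precision. The function names and the precision and recall values are made up for illustration.

```python
def f_weighted(precision, recall, beta=0.5):
    """Weighted harmonic mean from this post; beta is the weight on precision."""
    return 1.0 / (beta / precision + (1.0 - beta) / recall)

def f1(precision, recall):
    """Plain harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p, r = 0.8, 0.5  # hypothetical precision and recall

# With beta = 0.5 the weighted form matches the F1 formula.
print(f_weighted(p, r, beta=0.5), f1(p, r))

# A higher beta pulls the score towards precision, a lower one towards recall.
print(f_weighted(p, r, beta=0.9), f_weighted(p, r, beta=0.1))
```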

Sensitivity: It is the metric that tells us, out of all the actual positive samples, how many were identified by the model. This is the True Positive Rate (TPR). Does this sound familiar? Yes, it is another word for Recall. As expected, the mathematical formula for this is:

Sensitivity = TP/(TP+FN)

1 - Specificity: Specificity is a metric that tells us, out of all the actual negative samples (False Positives and True Negatives), how many the model correctly predicted as negative (True Negatives). So the formula is:

Specificity = TN/(FP+TN)

From this we can deduce what `1-Specificity` will be.

1 - Specificity = FP/(FP+TN)

`1-Specificity` is the False Positive rate (FPR).
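The three rates can be checked against each other with a few lines of code. The counts below are hypothetical.

```python
def sensitivity(tp, fn):
    """True Positive Rate, identical to recall."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: share of actual negatives predicted as negative."""
    return tn / (fp + tn)

def false_positive_rate(tn, fp):
    """Share of actual negatives wrongly predicted as positive."""
    return fp / (fp + tn)

# Hypothetical counts.
tp, fn, tn, fp = 40, 10, 45, 5
print(sensitivity(tp, fn))          # 0.8
print(specificity(tn, fp))          # 0.9
print(false_positive_rate(tn, fp))  # 0.1
```

Specificity and the False Positive Rate always add up to 1, because they split the same group (the actual negatives) into two parts.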

Undersampling: Quite often our data set is imbalanced. For example, let's say that in a binary classification model our training data has a target label named `class` with the values 1 and 0. If we have many more rows where `class=1` than rows where `class=0`, our model will tend to be biased towards `class=1`. Undersampling is a process used to solve this kind of problem. In undersampling we randomly select a subset of the rows with label 1, shrinking their number so that it is the same as, or proportionate to, the row count for label 0.
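A minimal sketch of random undersampling using only the standard library. It assumes each row is a dict with a `class` key, as in the example above. The helper name `undersample`, the toy rows and the fixed seed are all assumptions for illustration.

```python
import random

def undersample(rows, label_key="class", seed=0):
    """Randomly drop rows from larger classes until every class has the same count."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Keep a random subset of each class, sized like the smallest class.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced data: 90 rows with class=1, 10 rows with class=0.
rows = [{"x": i, "class": 1} for i in range(90)] + [{"x": i, "class": 0} for i in range(10)]
balanced = undersample(rows)
print(len(balanced))  # 20
```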

Oversampling: Oversampling is another method of restoring balance in an imbalanced data set. From the definition of undersampling above, it is clear that some data is lost as we cut down the row count. Oversampling avoids this data loss. It is quite simply the opposite of undersampling: the number of samples in the minority class is increased until it equals the number of samples in the majority class. There are 2 ways to achieve oversampling: Random oversampling and SMOTE oversampling.

Random Oversampling: Iteratively duplicates a random sample from the minority class until the sizes of both classes become similar.
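Random oversampling can be sketched in a few lines. The function name `random_oversample` and the toy samples are hypothetical, and the sketch assumes the goal is an exactly equal class size.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    missing = len(majority) - len(minority)
    duplicates = [rng.choice(minority) for _ in range(missing)]
    return minority + duplicates

# Hypothetical samples.
majority = list(range(100, 110))   # 10 majority samples
minority = ["a", "b", "c"]         # 3 minority samples
oversampled = random_oversample(majority, minority)
print(len(oversampled))  # 10
```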

SMOTE Oversampling: SMOTE stands for Synthetic Minority Oversampling Technique. For each sample of the minority class, SMOTE finds its k nearest neighbors within the minority class in feature space and connects the sample to them. For example, for k=3, each sample is connected to its 3 nearest minority neighbors. Synthetic samples are then generated by selecting points that lie on the connecting line segments. In this way SMOTE adds new information to a sparsely populated minority class by adding synthetic data points rather than exact duplicates.
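The idea behind SMOTE can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes numeric feature tuples, uses Euclidean distance, and the name `smote_sketch` and the sample points are made up. For real work a tested library implementation would be the safer choice.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so it stays inside the region the minority class already occupies.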

Loss function: An ML model’s job is to predict the target value. When it predicts a value that does not equal the actual target value, it incurs a penalty. A loss function quantifies this penalty as a single value.

Optimization Technique: An optimization technique seeks to minimize the loss. Stochastic Gradient Descent (SGD) is an optimization technique. SGD makes sequential passes over the training data, and during each pass, updates feature weights one example at a time with the aim of approaching the optimal weights that minimize the loss.

Amazon ML uses the following learning algorithms:

• For binary classification, Amazon ML uses logistic regression (logistic loss function + SGD).
• For multiclass classification, Amazon ML uses multinomial logistic regression (multinomial logistic loss + SGD).
• For regression, Amazon ML uses linear regression (squared loss function + SGD).
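The last combination in this list, squared loss plus SGD, can be sketched for a single input feature. This is only an illustration of the technique, not Amazon ML's actual implementation. The function name, the learning rate, the number of passes and the toy data are all assumptions.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.05, passes=500, seed=0):
    """Fit y = w * x + b by updating the weights one example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(passes):
        rng.shuffle(indices)  # visit the examples in random order on each pass
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the squared loss 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```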

Hyperparameter: Hyperparameters are training parameters that are set before training, rather than learned from the data, and are tuned to improve model performance. Each ML algorithm has a different set of hyperparameters.

Learning Rate: The learning rate is a constant value used in the Stochastic Gradient Descent (SGD) algorithm. It determines how fast the algorithm reaches, or converges to, the optimal weights. The linear model weights get updated for every data example the algorithm sees, and the size of each update is determined by the learning rate. If the value is too large, the weights may overshoot and never reach the optimal solution. If the value is too small, the algorithm may need many passes to reach the optimal weights.
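The effect of the learning rate is easy to see on a toy problem. The sketch below uses plain gradient descent on a made-up one-weight loss, (w - 3)^2, rather than SGD on real data, so the optimal weight is known to be 3.

```python
def descend(learning_rate, steps=50, start=0.0):
    """Run gradient descent on loss(w) = (w - 3)**2 and return the final weight."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3.0)  # derivative of (w - 3)**2
        w -= learning_rate * gradient
    return w

print(descend(0.1))    # ends very close to the optimum 3
print(descend(0.001))  # too small: still far from 3 after 50 steps
print(descend(1.1))    # too large: each step overshoots further and diverges
```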

Model Size: Model size is determined by the number of input features. If there are too many features, there will be many patterns in the data as well, and this will increase the model size. As the model size goes up, so does the amount of RAM required to train the model and use it for prediction. To reduce model size we can use L1 regularization or set a maximum size. Reducing the model size too much may reduce its performance.

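Overfitting: When a model memorizes patterns in the training data instead of generalizing from them, so it performs well on the training data but poorly on unseen data.

Underfitting: When a model has not learned the patterns in the training data well enough, so it performs poorly even on the training data.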

Regularization: It’s a process of penalizing extreme weight values to reduce overfitting. L1 regularization pushes the weights of features that would otherwise have small weights to zero. This results in a sparse model and less noise. L2 regularization reduces the overall weight values and stabilizes the weights of features that are highly correlated. The type and amount of regularization can be set using the `Regularization Type` and `Regularization amount` parameters. Too large an amount will result in all features having zero weight and the model not learning any patterns.
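To see why L1 produces zeros while L2 only shrinks, here is a sketch of the shrinkage step each penalty corresponds to (the proximal step for `amount * |w|` and for `amount * w**2`). This is one common way to illustrate the difference and is an assumption of mine, not necessarily how any particular service applies regularization. The weights are made up.

```python
def l1_shrink(weights, amount):
    """Soft-thresholding: move each weight towards zero by `amount`, stopping at zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - amount, 0.0)
        shrunk.append(magnitude if w >= 0 else -magnitude)
    return shrunk

def l2_shrink(weights, amount):
    """Scale every weight down by the same factor; none becomes exactly zero."""
    return [w / (1.0 + 2.0 * amount) for w in weights]

weights = [0.05, -0.3, 2.0]  # hypothetical feature weights

print(l1_shrink(weights, 0.1))  # the small weight 0.05 is set to zero
print(l2_shrink(weights, 0.1))  # all weights shrink but stay non-zero
print(l1_shrink(weights, 5.0))  # a very large amount zeroes every weight
```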

Text Feature Engineering: Feature engineering is the process of extracting useful information out of data. The feature engineering process for text data is called text feature engineering. For example, the line of text `45 Collins st, VIC 3000, Australia` may not have any meaning to the algorithm as it stands. But when we break the text on delimiters (like space and comma), we find important information such as street number, street name, state, postcode and country. Bag of Words, N-gram and Tf-idf are some text feature engineering approaches.

Bag of words: Splits sentences by white space into single words and represents the text by those words and their counts, ignoring word order.

N-gram: It’s an extension of bag of words. An n-gram is a contiguous sequence of n items from a given sample of text or speech.
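Both ideas fit in a few lines of Python. This sketch assumes simple lowercasing and whitespace splitting, and the helper names `tokenize` and `ngrams` are made up for illustration.

```python
from collections import Counter

def tokenize(text):
    """Split a sentence into lowercase words on white space."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The cat sat on the mat")
bag = Counter(tokens)       # bag of words: word -> count, order ignored
bigrams = ngrams(tokens, 2)

print(bag["the"])  # 2
print(bigrams)     # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```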

Tf-idf: Tf-idf stands for term frequency-inverse document frequency. This is a statistical measure that is used to evaluate how important a term is to a document in a collection or corpus.

Term Frequency: It measures how frequently a term occurs in a document. Since document lengths can vary, the count is normalized by dividing it by the length of the document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Inverse Document Frequency: It measures how important a term is. When we calculate TF, all terms are given equal importance. However, we know that words like `the, of, is` may occur frequently but are of little value for finding relevant documents. So we need to weigh down the frequent terms and scale up the rare ones.

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
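The two formulas can be combined into a small sketch. The three-document corpus is made up, and the code assumes the term occurs in at least one document, since otherwise the IDF formula divides by zero.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with the natural logarithm."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

# Hypothetical corpus of three already-tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
]

# "the" appears in every document, so its IDF (and Tf-idf) is zero.
print(tf("the", corpus[0]) * idf("the", corpus))  # 0.0
# "cat" appears in only one document, so it gets a positive weight.
print(tf("cat", corpus[0]) * idf("cat", corpus))
```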

Normalization: Rescales values so that the max value becomes 1 and the min becomes 0. The rest of the values are calculated using the following equation:

x' = (x-min(x))/(max(x)-min(x))

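Outliers can distort normalization: a single extreme value becomes the min or the max and squeezes all the other values into a narrow part of the 0 to 1 range.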
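A short sketch of min-max normalization, including what happens when one outlier is present. The values are made up, and the function assumes the input contains at least two distinct values (otherwise max - min is zero).

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 40, 50]))
# [0.0, 0.25, 0.5, 0.75, 1.0]

# With an outlier, the ordinary values are squeezed close to 0.
with_outlier = min_max_normalize([10, 20, 30, 40, 1000])
print([round(v, 3) for v in with_outlier])
```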

Standardization: Rescales values so that the average becomes zero and the standard deviation becomes one. Each value is replaced by its z-score, which is calculated from the average and the standard deviation of the values.

z = (x-mean)/sd

Here, mean = mean of x values and sd = standard deviation of x values.
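The z-score formula as a small sketch using the standard library. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly different numbers. The values are made up.

```python
import statistics

def standardize(values):
    """Replace each value with its z-score: (x - mean) / sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population sd 2
z = standardize(values)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```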

To be continued…
