Here is my reference summary of random important concepts related to classification models. Credits: various sources online.
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”)
This is a list of rates that are often computed from a confusion matrix for a binary classifier:
- Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
- Misclassification Rate or “Error Rate”: Overall, how often is it wrong?
equivalent to 1 minus Accuracy (FP+FN)/total = (10+5)/165 = 0.09
- True Positive Rate also known as “Sensitivity” or “Recall”: When it’s actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
- Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
F Score: This is a weighted average of the true positive rate (recall) and precision
ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class.
- True Positive Rate: When it’s actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95 also known as “Sensitivity” or “Recall”
- False Positive Rate: When it’s actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
This type of graph is called a Receiver Operating Characteristic curve (or ROC curve.) It is a plot of the true positive rate against the false positive rate for the different possible cutpoints of a diagnostic test.
An ROC curve demonstrates several things:
- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. You can check this out on the graph above. Recall that the LR for T4 < 5 is 52. This corresponds to the far left, steep portion of the curve. The LR for T4 > 9 is 0.2. This corresponds to the far right, nearly horizontal portion of the curve.
- The area under the curve is a measure of text accuracy.
Decision tree split methods:
Two common methods on deciding splits in decision tree: Gini impurity and information gain.
Lets talk about information gain first:
In information theory, systems are composed of three elements, a transmitter, a channel, and a receiver, and their measurable ability to convey a message. In this context, entropy (more specifically, Shannon entropy) is the expected value of the information contained in each message. In a more technical sense, there are reasons to define information as the negative logarithm of the probability distribution of possible events or messages. The amount of information of every event forms a random variable whose expected value, is called, Shannon entropy. Entropy is zero when one outcome is certain. Generally, entropy refers to disorder or uncertainty.
Entropy is defined as below
where p1, p2, .. are fractions that add up to 1 and represent the percentage of each class present in the child node that results from a split in the tree.
Information Gain = Entropy(parent) — Weighted Sum of Entropy(Children)
More simplified explanation of Information Gain:
Look at the image below and think which node can be described easily. I am sure, your answer is C because it requires less information as all values are similar. On the other hand, B requires more information to describe it and A requires the maximum information. In other words, we can say that C is a Pure node, B is less Impure and A is more impure.
Now, we can build a conclusion that less impure node requires less information to describe it. And, more impure node requires more information. Information theory is a measure to define this degree of disorganization in a system known as Entropy. If the sample is completely homogeneous, then the entropy is zero and if the sample is an equally divided (50% — 50%), it has entropy of one.
Entropy can be calculated using formula:-
Here p and q is probability of success and failure respectively in that node. Entropy is also used with categorical target variable. It chooses the split which has lowest entropy compared to parent node and other splits. The lesser the entropy, the better it is.
Steps to calculate entropy for a split:
- Calculate entropy of parent node
- Calculate entropy of each individual node of split and calculate weighted average of all sub-nodes available in split.
Example: Let’s use this method to identify best split for student example.
- Entropy for parent node = -(15/30) log2 (15/30) — (15/30) log2 (15/30) = 1. Here 1 shows that it is a impure node.
- Entropy for Female node = -(2/10) log2 (2/10) — (8/10) log2 (8/10) = 0.72 and for male node, -(13/20) log2 (13/20) — (7/20) log2 (7/20) = 0.93
- Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86
- Entropy for Class IX node, -(6/14) log2 (6/14) — (8/14) log2 (8/14) = 0.99 and for Class X node, -(9/16) log2 (9/16) — (7/16) log2 (7/16) = 0.99.
- Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above, you can see that entropy for Split on Gender is the lowest among all, so the tree will split on Gender. We can derive information gain from entropy as 1- Entropy.
Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability pi of an item with label i being chosen times the probability 1 - pi of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
Gini index says, if we select two items from a population at random then they must be of same class and probability for this is 1 if population is pure.
It works with categorical target variable “Success” or “Failure”.
It performs only Binary splits
Higher the value of Gini higher the homogeneity.
CART (Classification and Regression Tree) uses Gini method to create binary splits.
Steps to Calculate Gini for a split
Calculate Gini for sub-nodes, using formula sum of square of probability for success and failure (p²+q²).
Calculate Gini for split using weighted Gini score of each node of that split
Example: — Referring to example used above, where we want to segregate the students based on target variable ( playing cricket or not ). In the snapshot below, we split the population using two input variables Gender and Class. Now, I want to identify which split is producing more homogeneous sub-nodes using Gini index.
Split on Gender:
Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68
Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55
Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Similar for Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51
Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51
Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51
Above, you can see that Gini score for Split on Gender is higher than Split on Class, hence, the node split will take place on Gender.
Bias Variance Tradeoff:
In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
- The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Central limit theorem (CLT): is a statistical theory that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
What is Bagging? How does it work?
Bagging is a technique used to reduce the variance of our predictions by combining the result of multiple classifiers modeled on different sub-samples of the same data set. The following figure will make it clearer:
The steps followed in bagging are:
Create Multiple DataSets:
- Sampling is done with replacement on the original data and new datasets are formed.
- The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model
- Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting
Build Multiple Classifiers:
- Classifiers are built on each data set.
- Generally the same classifier is modeled on each data set and predictions are made.
- The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
- The combined values are generally more robust than a single model.
Note that, here the number of models built is not a hyper-parameters. Higher number of models are always better or may give similar performance than lower numbers. It can be theoretically shown that the variance of the combined predictions are reduced to 1/n (n: number of classifiers) of the original variance, under some assumptions.
There are various implementations of bagging models. Random forest is one of them and we’ll discuss it next.
Random forest is a tree-based algorithm which involves building several trees (decision trees), then combining their output to improve generalization ability of the model. The method of combining trees is known as an ensemble method. Ensembling is nothing but a combination of weak learners (individual trees) to produce a strong learner.
Random forests does not overfit. You can run as many trees as you want. It is fast.
How random forests work: The training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance.
After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.
The out-of-bag (oob) error estimate
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.
In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m.
If the values of this score from tree to tree are independent, then the standard error can be computed by a standard computation. The correlations of these scores between trees have been computed for a number of data sets and proved to be quite low, therefore we compute standard errors in the classical way, divide the raw score by its standard error to get a z-score, ands assign a significance level to the z-score assuming normality.
If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run.
Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
Outliers: Outliers are generally defined as cases that are removed from the main body of the data. Translate this as: outliers are cases whose proximities to all other cases in the data are generally small. A useful revision is to define outliers relative to their class. Thus, an outlier in class j is a case whose proximities to all other class j cases are small.
Balancing prediction error
In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.
The user can detect the imbalance by outputs the error rates for the individual classes. To illustrate 20 dimensional synthetic data is used. Class 1 occurs in one spherical Gaussian, class 2 on another. A training set of 1000 class 1’s and 50 class 2’s is generated, together with a test set of 5000 class 1’s and 250 class 2's.
The final output of a forest of 500 trees on this data is: (num sample, total error, class A error, class B error): 500 3.7 0.0 78.4. There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.
The error can balancing can be done by setting different weights for the classes. The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is: 500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again, getting: 500 4.3 4.2 5.2. This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.
Note that in getting this balance, the overall error rate went up. This is the usual result — to get better balance, the overall error rate will be increased.
Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say opposite process of splitting.
Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.
Step 1: The base learner takes all the distributions and assign equal weight or attention to each observation.
Step 2: If there is any prediction error caused by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting pays higher focus on examples which are mis-classiﬁed or have higher errors by preceding weak rules.
There are many boosting algorithms which impart additional boost to model’s accuracy. Two most commonly used algorithms i.e. Gradient Boosting (GBM) and XGboost.
The general idea of the method is additive training. At each iteration, a new tree learns the gradients of the residuals between the target values and the current predicted values, and then the algorithm conducts gradient descent based on the learned gradients. We can see that this is a sequential algorithm. Therefore, we can’t parallelize the algorithm like Random Forest. We can only parallelize the algorithm in the tree building step. Therefore, the problem reduces to parallel decision tree building.
Advantages of Xgboost over GBM:
- Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
- In fact, XGBoost is also known as ‘regularized boosting‘ technique.
- XGBoost implements parallel processing and is blazingly faster as compared to GBM.
- But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I’m coming from. Check this link out to explore further. Basic idea is features are distributed across nodes and for each tree building we are working across data split between nodes to parallelize.
- XGBoost allow users to define custom optimization objectives and evaluation criteria.
- This adds a whole new dimension to the model and there is no limit to what we can do.
Handling Missing Values
- XGBoost has an in-built routine to handle missing values.
- User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
- A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
- XGBoost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
- Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
- XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
- This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
Continue on Existing Model
- User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
- GBM implementation of sklearn also has this feature so they are even on this point.
Boosting v/s RF
Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating the output from many models).
On the other hand, Random Forest uses as you said fully grown decision trees (low bias, high variance). It tackles the error reduction task in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unprunned trees, so that the bias is initially as low as possible.
SVM: Support Vector Machine
Selecting the hyperplane: Maximizing the distances between nearest data point (either class) and hyper-plane will help us to decide the right hyper-plane. This distance is called as Margin. Hyperlane C
SVM selects the hyper-plane which classifies the classes accurately prior to maximizing margin. hyper-plane A (Linear regression would have picked B as RMSE is lower with B)
SVM has a feature to ignore outliers and find the hyper-plane that has maximum margin. Hence, we can say, SVM is robust to outliers.
SVM has a technique called the kernel trick. These are functions which takes low dimensional input space and transform it to a higher dimensional space i.e. it converts not separable problem to separable problem, these functions are called kernels. It is mostly useful in non-linear separation problem. we have various options available with kernel like, “linear”, “rbf”,”poly” and others (default value is “rbf”).
Pros and Cons associated with SVM
- It works really well with clear margin of separation
- It is effective in high dimensional spaces.
- It is effective in cases where number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- It doesn’t perform well, when we have large data set because the required training time is higher
- It also doesn’t perform very well, when the data set has more noise i.e. target classes are overlapping
- SVM doesn’t directly provide probability estimates, these are calculated using an expensive five-fold cross-validation. It is related SVC method of Python scikit-learn library.
Classifying text with bag-of-words
TF-IDF stands for “term frequency / inverse document frequency” and is a method for emphasizing words that occur frequently in a given document, while at the same time de-emphasising words that occur frequently in many documents.
Stopwords and N-grams:
we’d like to try n-grams, and for n-grams we better leave all the words in place. We covered n-grams before, they are combinations of n sequential words, starting with bigrams (two words): “cat ate”, “ate my”, “my precious”, “precious homework”. Trigrams consist of three words: “cat ate my”, “ate my homework”, “my precious homework”; 4-grams of four, and so on.
Why do n-grams work? Consider this phrase: “movie not good”. It has obviously negative sentiment, however if you take each word in separation you won’t detect this. On the opposite, the model will probably learn that “good” is a positive sentiment word, which doesn’t help at all here.
On the other hand, bigrams will do the trick: the model will probably learn that “not good” has a negative sentiment.
To use a more complicated example from Stanford’s sentiment analysis page:
This movie was actually neither that funny, nor super witty.
For this, bigrams will fail with “that funny” and “super witty”. We’d need at least trigrams to catch “neither that funny” and “nor super witty”, however these phrases don’t seem to be too common, so if we’re using a restricted number of features, or regularization, they might not make it into the model. Hence the motivation for a more sophisticated model like a neural network, but we digress.
If computing n-grams sounds a little complicated, scikit-learn vectorizers can do it automatically.
Each word is a feature: whether it’s present in the document or not (0/1), or how many times it appears (an integer >= 0). We started with the original dimensionality from the tutorial, 5000. This makes sense for a random forest, which as a highly non-linear / expressive / high-variance classifier needs a relatively high ratio of examples to dimensionality. Linear models are less exacting in this respect, they can even work with d >> n.
We found out that if we don’t constrain the dimensionality, we run out of memory, even with such a small dataset. We could afford roughly 40k features on a machine with 12 GB of RAM. More caused swapping.
For starters, we tried 20k features. The logistic regression scores 94.2% (before TF-IDF and n-grams), vs 92.9% with 5k features. More is even better: 96.0 with 30k, 96.3 with 40k (after TF-IDF and ngrams).
To deal with memory issues we could use the hashing vectorizer. However it only scores 93.2% vs 96.3% before, partly because it doesn’t support TF-IDF.
Mesnil, Mikolov,Ranzato and Bengio have a paper on sentiment classification: Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews (code). They found that a linear model using n-grams outperformed both a recurrent neural network and a linear model using sentence vectors.
However, the dataset they use, the Stanford Large Movie Review Dataset, is small — it has 25000 training examples. Alec Radford says that RNN start to outperform linear models when the number of examples is larger, roughly from 100k to 1M.
Topic modelling provides us with methods to organize, understand and summarize large collections of textual information. It helps in:
- Discovering hidden topical patterns that are present across the collection
- Annotating documents according to these topics
- Using these annotations to organize, search and summarize texts
Latent Dirichlet Allocation (LDA)
In the LDA model, each document is viewed as a mixture of topics that are present in the corpus. The model proposes that each word in the document is attributable to one of the document’s topics.
Collapsed Gibbs sampling is one way the LDA learns the topics and the topic representations of each document. The procedure is as follows:
- Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand)
- This random assignment gives topic representations of all documents and word distributions of all the topics, albeit not very good ones
- So, to improve upon them:
1. For each document d, go through each word w and compute:
2. p(topic t | document d): proportion of words in document d that are assigned to topic t
3. p(word w| topic t): proportion of assignments to topic t, over all documents d, that come from word w
- Reassign word w a new topic t’, where we choose topic t’ with probability
p(topic t’ | document d) * p(word w | topic t’)
This generative model predicts the probability that topic t’ generated word w
On repeating the last step a large number of times, we reach a steady state where topic assignments are pretty good. These assignments are then used to determine the topic mixtures of each document.
Latent Dirichlet Allocation for Topic Modeling
There are many approaches for obtaining topics from a text such as — Term Frequency and Inverse Document Frequency. NonNegative Matrix Factorization techniques. Latent Dirichlet Allocation is the most popular topic modeling technique and in this article, we will discuss the same.
LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
LDA is a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. The following matrix shows a corpus of N documents D1, D2, D3 … Dn and vocabulary size of M words W1,W2 .. Wn. The value of i,j cell gives the frequency count of word Wj in Document Di.
LDA converts this Document-Term Matrix into two lower dimensional matrices — M1 and M2.
M1 is a document-topics matrix and M2 is a topic — terms matrix with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.
Notice that these two matrices already provides topic word and document topic distributions, However, these distribution needs to be improved, which is the main aim of LDA. LDA makes use of sampling techniques in order to improve these matrices.
It Iterates through each word “w” for each document “d” and tries to adjust the current topic — word assignment with a new assignment. A new topic “k” is assigned to word “w” with a probability P which is a product of two probabilities p1 and p2.
For every topic, two probabilities p1 and p2 are calculated. P1 — p(topic t / document d) = the proportion of words in document d that are currently assigned to topic t. P2 — p(word w / topic t) = the proportion of assignments to topic t over all documents that come from this word w.
The current topic — word assignment is updated with a new topic with the probability, product of p1 and p2 . In this step, the model assumes that all the existing word — topic assignments except the current word are correct. This is essentially the probability that topic t generated word w, so it makes sense to adjust the current word’s topic with new probability.
After a number of iterations, a steady state is achieved where the document topic and topic term distributions are fairly good. This is the convergence point of LDA.
Parameters of LDA
Alpha and Beta Hyperparameters — alpha represents document-topic density and Beta represents topic-word density. Higher the value of alpha, documents are composed of more topics and lower the value of alpha, documents contain fewer topics. On the other hand, higher the beta, topics are composed of a large number of words in the corpus, and with the lower value of beta, they are composed of few words.
Number of Topics — Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics by using Kullback Leibler Divergence Score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to this original paper on the use of KL divergence.
Number of Topic Terms — Number of terms composed in a single topic. It is generally decided according to the requirement. If the problem statement talks about extracting themes or concepts, it is recommended to choose a higher number, if problem statement talks about extracting features or terms, a low number is recommended.
Number of Iterations / passes — Maximum number of iterations allowed to LDA algorithm for convergence.