Comparative Study on Classic Machine Learning Algorithms, Part 2

Danny Varghese
9 min read · Dec 11, 2018


A quick summary of classic ML algorithms

In the previous story, I covered Linear Regression, Logistic Regression, KNN and Decision Trees. In this story, I will explain Support Vector Machines, Random Forest and the Naive Bayes algorithm. Comparisons already covered in the previous story won't be repeated here. Let's start phase 2.

5. Support Vector Machine

The Support Vector Machine is an ML technique that can be used for both classification and regression. It has two major variants, for linear and non-linear problems. Linear SVM uses no kernel and finds a maximum-margin linear solution to the problem. SVMs with kernels are used when the data is not linearly separable.

Basic Theory:

The Support Vector Machine is a supervised learning technique extensively used in text classification, image classification, bioinformatics, etc.

In Linear SVM, the problem space must be linearly separable. The model derives a hyperplane that maximizes the classification margin. If there are N features, the hyperplane is an (N−1)-dimensional subspace. The boundary points in the feature space are called support vectors. Based on their relative positions, the maximum margin is derived and the optimal hyperplane is drawn at its midpoint.

credits : https://www.hackerearth.com/blog/machine-learning/simple-tutorial-svm-parameter-tuning-python-r/

The margin (m) is inversely proportional to ||w||, where w is the weight vector. To maximize the margin, we have to minimize ||w||. The optimization problem becomes:
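
For a training set of points xi with labels yi ∈ {−1, +1}, the standard hard-margin form is:

minimize (1/2)||w||², subject to yi(w·xi + b) ≥ 1 for every training point i.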

The above optimization works well only when the data is fully linearly separable. To handle outliers, we need a slack term as below; the second term uses the hinge loss to obtain the slack variable.
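
With slack, the standard soft-margin objective (the form referred to as Eqn. 1 below) is:

minimize (1/2)||w||² + C · Σi max(0, 1 − yi(w·xi + b))

where the hinge loss max(0, 1 − yi(w·xi + b)) acts as the slack variable for point i.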

C is the regularization parameter that balances the misclassification penalty against the margin width. As the full mathematical derivation is well beyond the scope of this story, I will not explain it in depth.

The basic logic is that, to minimize the cost function, w is forced towards the maximum margin between the classes. The value of C decides the level of regularization applied to the data, i.e. whether the margin is soft or hard. In short, C controls how much the model is allowed to ignore outliers.

Non-linear SVM is used when the dataset is not linearly separable. A kernel function maps all the training data into a new space. In that space the labels become linearly separable, and a linear boundary classifies them. When this classification boundary is projected back onto the original feature space, we get a non-linear solution.

credits : https://www.hackerearth.com/blog/machine-learning/simple-tutorial-svm-parameter-tuning-python-r/

The only change in the equation is the introduction of a kernel function. The new equation looks like:
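
minimize (1/2)||w||² + C · Σi max(0, 1 − yi(w·ϕ(xi) + b))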

Each xi is replaced by ϕ(xi), which maps the data point into the new space.
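
In practice, ϕ is never computed explicitly: the dual form of the optimization only needs inner products between training points, and those are supplied by a kernel function K(xi, xj) = ϕ(xi)·ϕ(xj). This is the kernel trick mentioned in the advantages below.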

Loss function:

The loss function in Eqn. 1 can be divided into two parts as below:

The first term tries to minimize the w parameters and achieve a wide margin. The second term corresponds to the hinge loss; it calculates the slack variable for each data point. Any point that falls inside the margin or on the wrong side of the boundary is penalized by the hinge loss.

Minimizing the first term decreases w and widens the margin. Minimizing the second term pushes towards a narrower margin to reduce the hinge loss. Based on the value of C, we finally settle on a stable margin; the value of C decides whether the resulting margin is soft or hard.

credits : https://datascience.stackexchange.com/questions/25977

The diagram above makes the effect of C on the resulting margin evident.

For non-linear problems, we can use the Gaussian (RBF) kernel, polynomial kernel, sigmoid kernel, Laplacian RBF kernel, etc.
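
As a quick illustrative sketch (the dataset and parameter values are arbitrary choices for demonstration), a linear kernel fails on data that is not linearly separable, while an RBF kernel handles it:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: a classic dataset that no linear boundary can separate.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))
# Expect the linear kernel near chance level and the RBF kernel close to 1.0.
```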

Advantages:

  • SVM uses the kernel trick to solve complex, non-linear problems.
  • SVM solves a convex optimization problem, so the global minimum is always attainable.
  • The maximum-margin (hinge loss) objective tends to generalize well.
  • Outliers can be handled well using the soft-margin constant C.

Disadvantages:

  • The hinge loss leads to sparsity.
  • Hyperparameters and kernels have to be tuned carefully to achieve sufficient accuracy.
  • Training time is long for large datasets.

Hyperparameters:

  • Soft Margin Constant (C) : a hyperparameter that decides the level of penalty on outliers; it acts as the inverse of a regularization strength. When C is large, outliers are penalized heavily and a hard margin is formed. When C is small, outliers are largely ignored and the margin is wide.
  • Degree of the Polynomial Kernel (d) : when d = 1, the polynomial kernel is equivalent to a linear kernel. When d is higher, the kernel is flexible enough to distinguish complex patterns by projecting them into a higher-dimensional space.
  • Width Parameter of the Gaussian Kernel (γ) : gamma controls the width of the Gaussian. In the usual parameterization K(x, x′) = exp(−γ||x − x′||²), a larger gamma means a narrower Gaussian and a more complex decision boundary.
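
A minimal scikit-learn sketch of where these hyperparameters appear (the values are illustrative, not tuned):

```python
from sklearn.svm import SVC

# C: soft-margin constant; larger C penalizes margin violations more heavily (harder margin).
harder_margin = SVC(kernel="rbf", C=100.0, gamma="scale")
softer_margin = SVC(kernel="rbf", C=0.01, gamma="scale")

# degree: used only by the polynomial kernel; degree=1 behaves like a linear kernel.
poly_svm = SVC(kernel="poly", degree=3, C=1.0)

# gamma: width parameter of the Gaussian (RBF) kernel;
# larger gamma -> narrower Gaussians -> more complex decision boundary.
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
```

In practice these would be tuned jointly, for example with GridSearchCV over the kernel, C, degree and gamma.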

Comparison with other Models :

SVM vs Random Forest :

  • Random Forest natively supports multi-class classification, whereas SVM needs multiple models (e.g. one-vs-rest) for the same.
  • Random Forest can give a probability for its prediction, whereas a plain SVM cannot.
  • Random Forest handles categorical data better than SVM.

SVM vs Naive Bayes :

  • Both perform well with small amounts of training data and large feature sets.
  • If the features are mutually dependent, SVM outperforms Naive Bayes.
  • SVM is a discriminative model, whereas NB is a generative model.

SVM vs Neural Networks :

  • SVM has a convex optimization objective, whereas an NN may get stuck in local minima.
  • SVM can perform better than an NN when training data is limited and there are many features; an NN needs a lot of training data for sufficient accuracy.
  • Multi-class classification requires multiple models with SVM, whereas an NN can do it with a single model.

6. Random Forest

Random Forest is an ensemble model in which multiple decision trees are combined into a stronger model. The resulting model is more robust and accurate, and it handles overfitting better than its constituent trees.

Basic Theory :

Random Forest ensembles a set of decision trees with the bagging method to obtain classification and regression outputs. In classification, the output is calculated by majority voting across the trees, whereas in regression the mean of the tree outputs is taken, as sketched below.
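
A tiny sketch of the aggregation step only, with made-up per-tree outputs for a single sample:

```python
import numpy as np

# Hypothetical outputs of 5 trees for one sample.
tree_class_preds = np.array([1, 0, 1, 1, 0])            # class labels
tree_reg_preds = np.array([3.2, 2.9, 3.5, 3.0, 3.1])    # numeric predictions

# Classification: majority vote across the trees.
labels, counts = np.unique(tree_class_preds, return_counts=True)
print("classification output:", labels[np.argmax(counts)])  # -> 1

# Regression: mean of the tree outputs.
print("regression output:", tree_reg_preds.mean())          # -> 3.14
```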

Please refer to my previous story to find out more about decision trees.

Random Forest produces a robust, accurate model that can handle a large variety of input data with binary, categorical and continuous features.

credits : https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

Loss function :

We use the entropy or Gini score to calculate the loss at each split, exactly as in individual decision trees. I explained this in the decision tree section of the previous story.

Advantages :

  • Accurate and powerful model.
  • Handles overfitting efficiently.
  • Supports implicit feature selection and derives feature importance.

Disadvantages :

  • Computationally complex and slow when the forest becomes large.
  • Not very descriptive about individual predictions; it is hard to explain why a particular output was produced.

Hyperparameters :

  • n_estimators : the number of trees in the forest. A larger number of trees gives higher accuracy but also higher computational cost.
  • max_features : the maximum number of features considered when looking for the best split.
  • min_samples_leaf : the minimum number of samples required at a leaf node.
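
A minimal scikit-learn sketch using these hyperparameters (the dataset and values are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # features considered when splitting a node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    random_state=0,
)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
print("top feature importances:", rf.feature_importances_[:5])  # implicit feature importance
```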

Comparison with other models :

The Random Forest comparisons are largely the same as the decision tree comparisons. Please refer to Part 1 of this series for more information.

Random Forest vs Naive Bayes :

  • Random Forest is a large, complex model, whereas Naive Bayes is relatively small.
  • Naive Bayes performs better with small training sets, whereas RF needs a larger amount of training data.

Random Forest vs Neural Networks :

  • Both are very powerful, high-accuracy algorithms.
  • Both model feature interactions internally and are less explainable.
  • Random Forest needs no feature scaling, whereas an NN needs features to be scaled.
  • An ensemble of both models can be powerful.

7. Naive Bayes

Naive Bayes is a generative probabilistic model used for classification problems. It is a prime choice for text classification, where the feature set is very large. It is extensively used for sentiment analysis, spam filtering, etc.

Basic Theory:

The Naive Bayes model is based on Bayes' rule, named after Thomas Bayes. Bayes' rule can be stated as:
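
P(A|B) = P(B|A) · P(A) / P(B)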

In the above equation,

P(A|B) : (posterior probability) the probability of event A given that event B has occurred.

P(A), P(B) : the individual probabilities of events A and B.

P(B|A) : (likelihood) the probability of event B given that event A has occurred.

The basic logic is to derive the probability of the output label Y given the input X from the individual probabilities of each feature Xi given Y, estimated from the training data.
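
For a class C and features X1, …, Xn, this is usually written as:

P(C|X1, …, Xn) ∝ P(C) · P(X1|C) · P(X2|C) · … · P(Xn|C)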

Notice in the above equation that Naive Bayes assumes all the features to be independent; the word "naive" itself is a reminder of this. In the case of multiple class labels, P(Ci|X) is calculated for each label and the label with the maximum probability is chosen as the output.

There are a few variants of Naive Bayes, as listed below:

  • Gaussian : Gaussian Naive Bayes assumes continuous features follow a Gaussian distribution.
  • Multinomial : Multinomial Naive Bayes is used when the features are counts assumed to follow a multinomial distribution (e.g. word counts in text).
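
A minimal scikit-learn sketch of both variants (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Continuous features -> Gaussian Naive Bayes.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
print("GaussianNB accuracy:", gnb.score(X_test, y_test))

# Non-negative count features (e.g. word counts) -> Multinomial Naive Bayes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=(500, 10))              # synthetic "word counts"
labels = (counts[:, 0] + counts[:, 1] > 6).astype(int)     # synthetic labels
mnb = MultinomialNB().fit(counts[:400], labels[:400])
print("MultinomialNB accuracy:", mnb.score(counts[400:], labels[400:]))
```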

Assumption :

The major assumption of Naive Bayes is that all features are mutually independent given the class. In real scenarios, this may not be true.

i.e., P(X1|X2, X3, X4, …, Xn, C) = P(X1|C)

In the above equation, feature X1 is assumed to be independent of the other features given the class, so they can be dropped from the conditioning. Xi is the i-th feature and C is the class label.

Advantages :

  • Works well with little training data.
  • If the conditional-independence assumption holds, it converges faster than discriminative models such as logistic regression.
  • Handles irrelevant features reasonably well.
  • Supports binary and multi-class classification problems.

Disadvantages :

  • Expects the features to be strictly independent of each other, which rarely holds in real-life scenarios.
  • If a feature value never occurs with a class in the training sample, P(X=feature|Y) is estimated as zero and the whole posterior probability collapses to zero. This happens when the sample does not represent the population properly.
  • Continuous variables have to be binned into discrete values; this must be done carefully to avoid information loss.

When to Use :

  • Naive Bayes is preferred when the features are (roughly) mutually independent and training data is limited.
  • Naive Bayes is widely used in text classification, spam filtering, recommender systems, etc.

Conclusion :

Across these two stories, we learned about various classical machine learning algorithms and their general properties. No algorithm works best in all cases (No Free Lunch), which means the distribution of the training data is the key criterion for selecting a suitable algorithm. Still, some general choices can be made based on training-set size, type and number of features, and computational and space complexity. Once we try various ML models with different hyperparameters on the data, we get a much clearer view of the algorithms. I hope this story helped you form an idea of the various ML algorithms and their comparisons.

Please comment your corrections, concerns and opinions below.

Thank You all :)
