Machine Learning Algorithms: A comparison of different algorithms and when to use them

Taniya
9 min read · May 26, 2018


There are many different types of algorithms in Machine Learning, and we often apply them without thinking about when to use which one. Once we understand the differences between these algorithms, it becomes easier to optimize their application and to know which one to use in a given scenario.

There are certain parameters on which these algorithms differ. Understanding these parameters and their application can help us attain better accuracy and minimize the chances of error in our predictions.

Regularisation and Overfitting

Both RF (Random Forest) and GBM (Gradient Boosting Machine) are ensemble methods, meaning you build a classifier out of a large number of smaller classifiers. RF uses decision trees, which are very prone to overfitting. To achieve higher accuracy, RF creates a large number of them based on bagging. The basic idea is to resample the data over and over and, for each sample, train a new classifier. Different classifiers overfit the data in different ways, and through voting those differences are averaged out. Averaging or combining the results of different decision trees helps to overcome the problem of overfitting. A random forest also has less variance than a single decision tree, meaning it works correctly over a wider range of data than a single decision tree would.
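
To make the bagging idea concrete, here is a minimal sketch using scikit-learn's RandomForestClassifier; the synthetic dataset and parameter values are purely illustrative.

```python
# A minimal sketch of the bagging idea with scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 200 trees is trained on a bootstrap resample of the training data;
# their votes are averaged, which smooths out the overfitting of individual trees.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```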

GBM is a boosting method, which builds on weak classifiers. The idea is to add one classifier at a time, so that the next classifier is trained to improve the already trained ensemble. Notice that in RF, by contrast, each classifier is trained independently of the rest.

Hence GBM has no built-in provision for regularisation, whereas the RF algorithm, because of its repeated resampling of the data, is more regularised.

A standard GBM implementation has no regularisation of the kind XGBoost offers; XGBoost's built-in regularisation helps to reduce overfitting, and in fact XGBoost is also known as a ‘regularized boosting’ technique. Even so, boosting algorithms can overfit if the number of trees is very large.
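
As a hedged illustration, XGBoost's scikit-learn wrapper exposes L1 and L2 penalty terms on the leaf weights (reg_alpha and reg_lambda); the values below are arbitrary examples, not tuned recommendations.

```python
# Illustrative use of XGBoost's regularisation parameters (values are arbitrary).
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    reg_alpha=0.1,   # L1 regularisation on leaf weights
    reg_lambda=1.0,  # L2 regularisation on leaf weights
)
# model.fit(X_train, y_train)  # using X_train, y_train from the earlier sketch
```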

SVM performs similarly to logistic regression when the data are linearly separable, and it performs well with non-linear boundaries depending on the kernel used. SVM is susceptible to overfitting depending on the kernel: a more complex kernel can overfit the model.
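
A small sketch of how the kernel choice controls SVM complexity in scikit-learn; the C and gamma values are illustrative only.

```python
# Sketch: the choice of kernel controls model complexity in an SVM.
# A linear kernel gives a simple boundary; an RBF kernel with a large gamma
# can fit a very wiggly boundary and overfit.
from sklearn.svm import SVC

svm_linear = SVC(kernel="linear", C=1.0)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma=10.0)  # large gamma -> more complex boundary
# svm_linear.fit(X_train, y_train); svm_rbf.fit(X_train, y_train)
```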

Missing Values

XGBoost is designed to handle missing values internally. The missing values are treated in such a manner that if there is any trend in the missing values, it is captured by the model. The user is required to supply a value different from the other observations and pass that as a parameter. XGBoost tries different directions when it encounters a missing value at each node and learns which path to take for missing values in the future.
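
A minimal sketch of this behaviour with the native xgboost API, assuming missing entries are encoded as np.nan; the tiny dataset is purely illustrative.

```python
# Sketch: XGBoost can handle missing values natively. Here np.nan marks the
# missing entries and is passed via the `missing` parameter.
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.5], [4.0, 2.0]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                    dtrain, num_boost_round=5)
```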

Random Forest is good for data sets that have missing values in their attributes.

SVM does not perform well with missing data. It is always better to impute the missing values before running SVM.
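
One common way to do this is an imputation step in front of the SVM; here is a minimal scikit-learn sketch, assuming mean imputation is acceptable for the data at hand.

```python
# Sketch: impute missing values before running an SVM (strategy is illustrative).
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm_with_imputation = make_pipeline(
    SimpleImputer(strategy="mean"),  # replace NaNs with the column mean
    SVC(kernel="rbf"),
)
# svm_with_imputation.fit(X, y)  # X may contain np.nan values
```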

Naive Bayes does not perform well with data scarcity. For any possible value of a feature, you need to estimate a likelihood value by a frequentist approach. This can result in probabilities going towards 0 or 1, which in turn leads to numerical instabilities and worse results.

Loss Function

Regression Loss Functions:

L1 and L2 are two loss functions in machine learning which are used to minimize the error. The L1 loss function stands for Least Absolute Deviations, also known as LAD. The L2 loss function stands for Least Square Errors, also known as LS.

L1 Loss Function

The L1 loss function minimizes the error defined as the sum of all the absolute differences between the true values and the predicted values.

L2 Loss Function

The L2 loss function minimizes the error defined as the sum of all the squared differences between the true values and the predicted values.

Huber (‘huber’): Another robust loss function that combines least squares and least absolute deviation; use alpha to control the sensitivity with regard to outliers.
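
A minimal numpy sketch of these three regression losses; the Huber threshold delta below is an illustrative stand-in for the alpha parameter mentioned above.

```python
# Minimal numpy sketches of the L1, L2 and Huber regression losses.
import numpy as np

def l1_loss(y_true, y_pred):
    # Least Absolute Deviations: sum of absolute differences
    return np.sum(np.abs(y_true - y_pred))

def l2_loss(y_true, y_pred):
    # Least Square Errors: sum of squared differences
    return np.sum((y_true - y_pred) ** 2)

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for small residuals, linear for large ones (delta is illustrative).
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return np.sum(np.where(residual <= delta, quadratic, linear))
```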

Classification Loss Functions:

Binomial deviance (‘deviance’): The negative binomial log-likelihood loss function for binary classification (provides probability estimates). The initial model is given by the log odds-ratio.

Multinomial deviance (‘deviance’): The negative multinomial log-likelihood loss function for multi-class classification with n_classes mutually exclusive classes. It provides probability estimates. The initial model is given by the prior probability of each class. At each iteration n_classes regression trees have to be constructed which makes GBRT rather inefficient for data sets with a large number of classes.

Exponential loss (‘exponential’): The same loss function as AdaBoostClassifier. Less robust to mislabeled examples than ‘deviance’; can only be used for binary classification.

Support Vector Machine

The loss function in SVM is the hinge loss, max(0, 1 − y·f(x)).

For the soft margin SVM the loss function is

(1/2)‖w‖² + C · Σᵢ max(0, 1 − yᵢ(w·xᵢ + b)),

and for the hard margin SVM the loss function is

(1/2)‖w‖², subject to yᵢ(w·xᵢ + b) ≥ 1 for all i.

In both Random Forest and XGBoost the loss function can be customised.

Tree Pruning

In GBM, a node stops splitting as soon as a split with negative loss reduction is encountered, which makes it a greedy approach.

XGBoost, on the other hand, grows the tree up to max_depth and then prunes backward, removing splits whose improvement in the loss function is below a threshold.
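
A hedged illustration with the XGBoost API: max_depth bounds how far each tree is grown, and gamma is the minimum loss reduction a split must achieve to survive pruning; the values are arbitrary.

```python
# Illustrative XGBoost pruning-related parameters (values are arbitrary).
from xgboost import XGBRegressor

model = XGBRegressor(
    max_depth=6,   # grow each tree up to this depth first
    gamma=1.0,     # splits whose loss reduction is below gamma are pruned away
    n_estimators=100,
)
```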

Variance and Bias

Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating the output from many models).

On the other hand, Random Forest uses fully grown decision trees (low bias, high variance). It tackles the error reduction task in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unpruned trees, so that the bias is initially as low as possible.

The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
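
A quick sketch of that trade-off with scikit-learn's KNeighborsClassifier; the k values are arbitrary.

```python
# Sketch: k controls the bias/variance trade-off in KNN (values are arbitrary).
from sklearn.neighbors import KNeighborsClassifier

knn_flexible = KNeighborsClassifier(n_neighbors=1)   # low bias, high variance
knn_smooth = KNeighborsClassifier(n_neighbors=50)    # higher bias, lower variance
# Fit both on the same training data to compare their decision boundaries.
```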

The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter, which influences the number of margin violations allowed in the training data; this increases the bias but decreases the variance.

In Linear Regression we fit a straight line through the points. This line, which of course represents a set of predicted values, has very low statistical variance. But the bias is (probably) high; that is, it doesn't fit the data very well. Next, suppose we model the data with a high-degree polynomial spline and increase the polynomial degree until the fit improves (and it will, to arbitrary precision, in fact). Now we have a situation where the bias tends to zero but the variance is very high. Higher-degree polynomials are more complex, with higher variance and lower bias. Logistic Regression has low variance and high bias.

Flexibility

XGBoost allows users to define custom optimisation objectives and evaluation criteria. This adds a whole new dimension to the model, and there is no limit to what we can do.
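
A minimal sketch of a custom objective with the native xgboost API: the objective returns the gradient and hessian of the loss for each prediction. The squared-error objective and synthetic data below are purely illustrative.

```python
# Sketch: a custom squared-error objective for xgb.train (purely illustrative).
import numpy as np
import xgboost as xgb

def squared_error_objective(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels        # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)   # second derivative
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_objective)
```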

Cross Validation

XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run.

This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
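
A minimal sketch of this with xgb.cv and early stopping; the synthetic dataset and parameter values are illustrative.

```python
# Sketch: xgb.cv with early stopping to find a good number of boosting rounds.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

cv_results = xgb.cv(
    {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1},
    dtrain,
    num_boost_round=500,
    nfold=5,
    early_stopping_rounds=20,
    metrics="logloss",
)
print(len(cv_results))  # number of boosting rounds kept after early stopping
```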

By default, random forest picks about two-thirds of the data (via bootstrap sampling) to train each tree and leaves the rest out for validation in regression, and close to 70% of the data for training with the rest left out in classification. In principle, since it randomises the variable selection at each tree split, it is not prone to overfitting in the way other models are. Out-of-bag (OOB) estimates can also be used for model validation.
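
A small sketch of OOB validation with scikit-learn, on a synthetic dataset used purely for illustration.

```python
# Sketch: out-of-bag (OOB) validation with scikit-learn's random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy estimated on the samples left out of each tree's bootstrap
```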

Computation Power

XGBoost implements parallel processing and is blazingly fast compared to GBM. Boosting is a sequential process, but it can still be parallelised: each tree can be built only after the previous one, yet each individual tree is built using all cores. This makes XGBoost a very fast algorithm.

The main disadvantage of random forests is their complexity. They are much harder and more time-consuming to construct than single decision trees. They also require more computational resources and are less intuitive: when you have a large collection of decision trees, it is hard to get an intuitive grasp of the relationships in the input data. The prediction process using random forests is also more time-consuming than for other algorithms.

Naive Bayes is computationally fast and simple to implement. It also performs well in multi-class prediction.

For KNN, the computation cost is quite high because we need to compute the distance from each query instance to all training samples. Some indexing (e.g. a K-D tree) can be used to reduce this cost. As a lazy learning algorithm, KNN does most of its computation during testing rather than during training, which can be an issue for large datasets.
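
A small sketch, assuming scikit-learn's KNN implementation: asking it to index the training data with a K-D tree.

```python
# Sketch: using a K-D tree index for KNN in scikit-learn (parameters illustrative).
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
# knn.fit(X_train, y_train)          # training just stores and indexes the data
# predictions = knn.predict(X_test)  # distance computations happen here, at query time
```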

Dataset Features

SVM doesn't perform well when we have a large data set, because the required training time is higher. It also doesn't perform very well when the data set has more noise, i.e. the target classes overlap. It is effective in cases where the number of dimensions is greater than the number of samples.

Naive Bayes works well with high dimensions. Its biggest disadvantage is that it makes a very strong assumption about the shape of your data distribution, namely that any two features are independent given the output class. It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption). If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation. When the assumption of independence holds, a Naive Bayes classifier performs better than other models such as logistic regression, and it needs less training data.
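
A minimal sketch of Laplace smoothing with scikit-learn's MultinomialNB; alpha=1.0 is classic add-one smoothing, and X_counts is a hypothetical matrix of non-negative count features.

```python
# Sketch: Laplace (add-one) smoothing in scikit-learn's multinomial Naive Bayes.
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)  # alpha=1.0 keeps unseen categories from getting zero probability
# nb.fit(X_counts, y)  # X_counts: hypothetical matrix of non-negative count features
```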

RF decorrelates trees (relative to bagged trees), which is important when dealing with multiple features that may be correlated. It can handle qualitative (categorical) features. Random forests can perform better on small data sets. They also do not require much preparation of the input data: you do not have to scale the data. Random Forest also maintains accuracy even when a large proportion of the data is missing.

All boosting algorithms can easily handle qualitative (categorical) features. Boosted trees are data hungry and require large datasets.

Logistic Regression works well with diagonal (feature) decision boundaries.

In the case of many dimensions, inputs can commonly be “close” to many data points. This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness and similarity. One workaround is dimensionality reduction, which reduces the number of working dimensions (but can lose variable trends in the process). Since k-NN gets all of its information from the input's neighbours, localised anomalies affect its outcomes significantly, more than they would for an algorithm that takes a generalised view of the data. KNN can handle both continuous and discrete data, but an appropriate distance metric must be selected.
