# The Most Common Machine Learning Classification Algorithms for Data Science and Their Code

*A roundup of the most common classification algorithms, along with their Python and R code:*

Decision Tree, Naive Bayes, Gaussian Naive Bayes, Bernoulli Naive Bayes, Multinomial Naive Bayes, K Nearest Neighbours (KNN), Support Vector Machine (SVM), Linear Support Vector Classifier (SVC), Stochastic Gradient Descent (SGD) Classifier, Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Fisher’s Linear Discriminant….

Classification algorithms can be applied to a variety of data, both structured and unstructured. Classification is a technique in which we divide the data into a given number of classes. The main goal of a classification problem is to identify the category or class into which new data will fall.

Important terminology encountered in machine learning classification algorithms:

- **Classifier**: An algorithm that maps the input data to a specific category.
- **Classification model**: A model that draws conclusions from the input data given for training; it predicts class labels or categories for new data.
- **Binary classification**: A classification task with two possible outcomes, e.g. gender classification (male / female).
- **Multi-class classification**: Classification with more than two classes; each sample is assigned to one and only one target label, e.g. an animal can be a cat or a dog but not both at the same time.
- **Multi-label classification**: A classification task where each sample is mapped to a set of target labels (more than one class), e.g. a news article may be about a sport, a person, and a location at the same time.

These classification algorithms are used to build a model that predicts the class or category for the samples in a given dataset. The data can come from different platforms. Depending on the dimensionality of the dataset, the attribute types, missing values, and so on, one algorithm can give you better accuracy than another. Let's get started.

# 1. Decision Tree

Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification. For example, if you wanted to build a decision tree to classify an animal you come across while on a hike, you might construct the one shown in the figure.

Decision tree classification models can easily handle qualitative independent variables without the need to create dummy variables. Missing values are not a problem either. Interestingly, decision tree algorithms can be used for regression models as well. The same library that you used to build a classification model can also be used to build a regression model after changing some of the parameters.

Although decision-tree-based classification models are easy to interpret, they are not robust. One major problem with decision trees is their high variance and low bias: one small change in the training dataset can produce an entirely different decision tree model.
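As a quick sketch of the above, a decision tree classifier can be fit in a few lines with scikit-learn (the iris dataset and the `max_depth=3` cap are illustrative choices, not from the article; limiting depth is one common way to tame the high variance just mentioned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris: 150 samples, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Capping max_depth curbs the tree's tendency to overfit (high variance)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Swapping `DecisionTreeClassifier` for `DecisionTreeRegressor` gives the regression variant mentioned above.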

# 2. Naive Bayes

Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem.

Naive Bayes Classifier is based on the Bayes Theorem.

The Bayes Theorem says the conditional probability of an outcome can be computed using the conditional probability of the cause of the outcome.

The probability of an event *x* occurring, given that event *C* has occurred, is the *prior probability*. It is the knowledge that something has already happened. Using the prior probability, we can compute the *posterior probability*, which is the probability that event *C* will occur given that *x* has occurred. The Naive Bayes classifier uses the input variables to choose the class with the highest posterior probability.
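In symbols, with class *C* and observed input *x*, Bayes' Theorem reads:

```latex
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
```

Here $P(C \mid x)$ is the posterior probability of the class given the input, $P(x \mid C)$ is the likelihood of the input given the class, $P(C)$ is the prior probability of the class, and $P(x)$ is the evidence, which is the same for every class and can be ignored when comparing them.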

The algorithm is called naive because it makes an assumption about the distribution of the data. The distribution can be Gaussian, Bernoulli, or Multinomial. Another drawback of Naive Bayes is that continuous variables may have to be preprocessed and discretized by *binning*, which can discard useful information.

# 3. Gaussian Naive Bayes

The Gaussian Naive Bayes algorithm assumes that all the features have a Gaussian (Normal / Bell Curve) distribution. This is suitable for continuous data eg: daily temperature, height.

The Gaussian distribution has 68% of the data within 1 standard deviation of the mean and 95% within 2 standard deviations. Data that is not normally distributed produces low accuracy when used in a Gaussian Naive Bayes classifier; in that case, a Naive Bayes classifier with a different distribution should be used.
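A minimal sketch with scikit-learn's `GaussianNB`, using the iris dataset as an illustrative example of continuous features (measurements in centimeters):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris features (sepal/petal measurements) are continuous,
# so a per-feature Gaussian likelihood is a reasonable assumption
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```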

# 4. Bernoulli Naive Bayes

The Bernoulli Distribution is used for binary variables — variables that can have 1 of 2 values. It denotes the probability of each of the variables occurring. A Bernoulli Naive Bayes classifier is appropriate for binary variables, like Gender or Deceased.
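A toy sketch with scikit-learn's `BernoulliNB`; the synthetic 0/1 features and the rule tying the label to the first feature are invented purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))  # five binary (0/1) features per sample
y = X[:, 0]                            # label simply copies the first feature

clf = BernoulliNB()
clf.fit(X, y)
accuracy = clf.score(X, y)  # the informative first feature dominates
```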

# 5. Multinomial Naive Bayes

The Multinomial Naive Bayes classifier uses the multinomial distribution, which is the generalization of the binomial distribution. In other words, the multinomial distribution models the probabilities of the outcomes of rolling a *k*-sided die *n* times.

Multinomial Naive Bayes is used frequently in text analytics because it makes a bag-of-words assumption, under which the position of the words doesn't matter. It also makes an independence assumption: that the features are all independent.
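A bag-of-words sketch using scikit-learn's `CountVectorizer` and `MultinomialNB`; the tiny spam/ham corpus and its labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "free prize money now",            # spam
    "win money free entry",            # spam
    "meeting agenda attached",         # ham
    "project review meeting notes",    # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)  # word counts only; word order is discarded
clf = MultinomialNB()
clf.fit(X, labels)

# "free" and "money" appear only in spam, so this should score as spam
pred = clf.predict(vec.transform(["free money offer"]))
```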

# 6. K Nearest Neighbours (KNN)

K Nearest Neighbors is one of the simplest machine learning algorithms. The idea is to memorize the entire dataset and classify a point based on the classes of its *K* nearest neighbors.

The figure from *Understanding Machine Learning* by Shai Shalev-Shwartz and Shai Ben-David shows the boundaries within which an unlabeled point will be predicted to have the same class as the point already inside the boundary. This is 1-Nearest-Neighbor classification: the class of only the single nearest neighbor is used.

KNN is simple and makes no assumptions, but the drawbacks of the algorithm are that it is slow and can become weak as the number of features increases. It is also difficult to determine the optimal value of *K*, which is the number of neighbors used.
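A minimal sketch with scikit-learn's `KNeighborsClassifier`; the iris dataset and `n_neighbors=5` (a common default, not a tuned value of *K*) are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# n_neighbors is the K discussed above
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)   # "training" essentially just stores the dataset
accuracy = clf.score(X_test, y_test)
```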

# 7. Support Vector Machine (SVM)

An SVM is a classification and regression algorithm. It works by identifying a *hyperplane* that separates the classes in the data. A *hyperplane* is a geometric entity whose dimension is one less than that of its surrounding (ambient) space.

If an SVM is asked to classify a two-dimensional dataset, it will do so with a one-dimensional hyperplane (a line); classes in 3D data will be separated by a 2D plane, and N-dimensional data will be separated by an (N-1)-dimensional hyperplane.

SVM is also called a margin classifier because it draws a *margin *between classes.

The images shown here have classes that are *linearly separable*. However, sometimes classes cannot be separated by a straight line in the present dimension. An SVM is capable of mapping the data into a higher dimension such that it becomes separable by a margin.

Support Vector Machines are powerful in situations where the number of features (columns) exceeds the number of samples (rows). They are also effective in high dimensions (such as images), and memory efficient because the model uses only a subset of the dataset, the support vectors.
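A sketch of the higher-dimensional mapping idea using scikit-learn's `SVC` with an RBF kernel; the concentric-circles toy dataset is an illustrative choice of data that is not linearly separable in 2D:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line in 2D separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space where a separating hyperplane (with a margin) exists
clf = SVC(kernel="rbf")
clf.fit(X, y)
accuracy = clf.score(X, y)
n_support = clf.support_vectors_.shape[0]  # only these points define the margin
```

That `n_support` is smaller than the dataset is what makes the model memory efficient, as noted above.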

# 8. Linear Support Vector Classifier (SVC)

A Linear SVC uses a boundary of degree one (a straight line) to classify data. A Linear SVC has much less complexity than a non-linear classifier and is only appropriate for data that is approximately linearly separable; more complex datasets will require a non-linear classifier.
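A minimal sketch with scikit-learn's `LinearSVC`; the two well-separated blobs (centers chosen here for illustration) are exactly the kind of data a straight-line boundary handles:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated clusters: a linear boundary is sufficient
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]], random_state=0)

clf = LinearSVC()
clf.fit(X, y)
accuracy = clf.score(X, y)
```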

# 9. Stochastic Gradient Descent (SGD) Classifier

SGD is a linear classifier that minimizes the cost function by computing the gradient at each iteration and updating the model with a decreasing learning rate. It is an umbrella term for many types of classifiers, such as Logistic Regression or SVM, that use the SGD technique for optimization.
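A sketch with scikit-learn's `SGDClassifier`; the breast cancer dataset and the scaling pipeline are illustrative choices (SGD is sensitive to feature scale, so standardization is usually recommended; the default hinge loss makes this a linear SVM trained by SGD):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize features, then train a linear model by stochastic gradient descent
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X, y)
accuracy = clf.score(X, y)
```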

# 10. Logistic Regression

Logistic regression estimates the relationship between a categorical dependent variable and the independent variables; for instance, predicting whether an email is spam or whether a tumor is malignant.

If we used linear regression for this problem, we would need to set up a threshold for classification, which generates inaccurate results. Besides this, linear regression is unbounded, which brings us to the idea of logistic regression.

Unlike linear regression, logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach. MLE is a likelihood-maximization method, while OLS is a distance-minimizing approximation method. Maximizing the likelihood function determines the parameters that are most likely to produce the observed data.

Logistic regression transforms its output using the sigmoid function in the case of binary logistic regression. As you can see in the figure below, as *t* goes to infinity, *Y* (predicted) approaches 1, and as *t* goes to negative infinity, *Y* (predicted) approaches 0. The output of the function is an estimated probability, which is used to infer how confident the predicted value can be, compared with the actual value, for a given input *X*.

There are several types of logistic regression:

- **Binary logistic regression**: two categories, e.g. Spam (1) vs. Not-Spam (0).
- **Multinomial logistic regression**: three or more categories without ordering, e.g. predicting which food is preferred: veg, non-veg, or vegan.
- **Ordinal logistic regression**: three or more categories with ordering, e.g. book ratings from 1 to 5.
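A binary logistic regression sketch with scikit-learn; the breast cancer dataset (malignant vs. benign) is an illustrative choice, and `max_iter` is raised only so the solver converges on unscaled features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary task: malignant vs. benign tumors
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# The sigmoid keeps every output in [0, 1]: an estimated probability
proba = clf.predict_proba(X_test)[:, 1]
accuracy = clf.score(X_test, y_test)
```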

# 11. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is performed by starting with 2 classes and generalizing to more. The idea is to find a direction, defined by a vector, such that when the two classes are projected onto it, they are separated as much as possible.
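A sketch with scikit-learn's `LinearDiscriminantAnalysis`, which can both classify and project the data onto the discriminant directions; the iris dataset and the 2-component projection are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the 4-D iris data onto 2 discriminant directions,
# chosen so that the projected classes are well separated
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)
accuracy = lda.score(X, y)
```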

# 12. Quadratic Discriminant Analysis (QDA)

QDA follows the same concept as LDA; the only difference is that we do not assume the classes share a common covariance matrix. Therefore, a different covariance matrix has to be estimated for each class, which increases the computational cost because there are more parameters to estimate, but it can fit the data better than LDA.
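A side-by-side sketch of the two in scikit-learn; the iris dataset and training-set accuracy are purely illustrative, and on such a small, well-behaved dataset both fit well:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)

# LDA: one shared covariance matrix -> linear decision boundaries
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# QDA: one covariance matrix per class -> quadratic decision boundaries
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
```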

# 13. Fisher’s Linear Discriminant

Fisher’s Linear Discriminant improves upon LDA by maximizing the ratio of the between-class variance to the within-class variance. This reduces the loss of information caused by overlapping classes in LDA.

# Links For Machine Learning Classification Algorithms Tutorial And Their Codes

## 1. Decision tree algorithms

**Python Tutorial**

**R Tutorial**

## 2. Naive Bayes

**Python Tutorial**

**R Tutorial**

## 3. Gaussian Naive Bayes

**Python Tutorial**

**R Tutorial**

## 4. Bernoulli Naive Bayes

**Python Tutorial**

**R Tutorial**

## 5. Multinomial Naive Bayes

**Python Tutorial**

**R Tutorial**

## 6. K Nearest Neighbours (KNN)

**Python Tutorial**

**R Tutorial**

## 7. Support Vector Machine (SVM)

**Python Tutorial**

**R Tutorial**

## 8. Linear Support Vector Classifier (SVC)

**Python Tutorial**

**R Tutorial**

## 9. Stochastic Gradient Descent (SGD) Classifier

**Python Tutorial**

**R Tutorial**

## 10. Logistic Regression

**Python Tutorial**

**R Tutorial**

## 11. Linear Discriminant Analysis (LDA)

**Python Tutorial**

**R Tutorial**

## 12. Quadratic Discriminant Analysis (QDA)

**Python Tutorial**

**R Tutorial**

## 13. Fisher’s Linear Discriminant

**Python Tutorial**

**R Tutorial**

**Thanks for reading.**