Classification Methods in Machine Learning

Jorge Leonel
5 min read · Oct 9, 2019


Classification is a supervised machine learning approach, in which the algorithm learns from the data input provided to it — and then uses this learning to classify new observations.

In other words, the training dataset is used to learn boundary conditions that separate the target classes; once those boundaries are determined, the next task is to predict the target class of new observations.

Binary classifiers work with only two classes or possible outcomes (for example: positive or negative sentiment; whether a lender will repay a loan or not), while multiclass classifiers work with multiple classes (for example: which country a flag belongs to, or whether an image shows an apple, a banana, or an orange). Multiclass classification assumes that each sample is assigned to one and only one label.

One of the first popular algorithms for classification in machine learning was Naive Bayes, a probabilistic classifier inspired by Bayes' theorem (which allows us to reason about events in the real world based on prior knowledge of observations that may imply them). The name ("naive") derives from the fact that the algorithm assumes attributes are conditionally independent given the class.

The algorithm is simple to implement and usually represents a reasonable way to kickstart classification efforts. It scales easily to larger datasets (it takes linear time, versus the iterative approximation used by many other types of classifiers, which is more expensive computationally) and requires only a small amount of training data.

However, Naive Bayes can suffer from a problem known as the "zero probability problem": when the conditional probability for a particular attribute is zero, the model fails to produce a valid prediction. One solution is to apply a smoothing procedure (e.g. Laplace smoothing).

Bayes' theorem gives a way to compute the posterior probability of a class from the prior and the likelihood:

P(c|x) = P(x|c) * P(c) / P(x)

P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes). P(c) is the prior probability of the class. P(x|c) is the likelihood, i.e. the probability of the predictor given the class, and P(x) is the prior probability of the predictor.

The first step of the algorithm is to compute the prior probability for each class label. Next, the likelihood is computed for each attribute, for each class. These values are then plugged into Bayes' formula to calculate the posterior probability, and the input is assigned to the class with the highest posterior.
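To make these steps concrete, here is a minimal sketch in plain Python (the tiny weather dataset is made up purely for illustration) that computes the prior, the per-attribute likelihoods, and the unnormalized posterior for each class:

```python
from collections import Counter

# Hypothetical toy dataset, for illustration only:
# each row is (outlook, temperature, play?)
data = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("sunny", "mild", "yes"),
    ("overcast", "hot", "yes"),
    ("rain", "cool", "yes"),
    ("overcast", "cool", "yes"),
]

class_counts = Counter(label for *_, label in data)
total = sum(class_counts.values())

def unnormalized_posterior(sample, label):
    """P(c) * P(x|c); P(x) is the same for every class, so it can be skipped."""
    prior = class_counts[label] / total                     # step 1: prior P(c)
    rows = [row for row in data if row[-1] == label]
    likelihood = 1.0
    for i, value in enumerate(sample):                      # step 2: P(x_i|c) per attribute
        matches = sum(1 for row in rows if row[i] == value)
        likelihood *= matches / len(rows)
    return prior * likelihood                               # step 3: combine via Bayes

sample = ("sunny", "cool")
scores = {label: unnormalized_posterior(sample, label) for label in class_counts}
print(scores)                          # e.g. {'no': 0.0, 'yes': ~0.083}
print(max(scores, key=scores.get))     # predicted class = highest posterior
```

Note that for class "no" the value "cool" never appears in the training rows, so that likelihood collapses to zero: a small-scale instance of the zero probability problem mentioned above, which Laplace smoothing would address by adding a pseudo-count to every attribute value.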

It is rather straightforward to implement Naive Bayes in Python by leveraging the scikit-learn library. There are actually three types of Naive Bayes model under the scikit-learn library: (a) Gaussian (assumes features follow a bell-shaped, normal distribution), (b) Multinomial (used for discrete counts, i.e. the number of times an outcome is observed across x trials), and (c) Bernoulli (useful for binary feature vectors; a popular use-case is text classification).
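As a minimal sketch (assuming scikit-learn is installed), the Gaussian variant can be used like this on the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small bundled dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gaussian Naive Bayes: assumes each feature is normally distributed per class
model = GaussianNB()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))
```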

Another popular mechanism is the Decision Tree. Given data of attributes together with their classes, the tree produces a sequence of rules that can be used to classify the data. The algorithm splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators among the input variables. To choose a differentiator (predictor), the algorithm considers all features and does a binary split on each of them (for categorical data, split by category; for continuous data, pick a cut-off threshold). It then chooses the split with the least cost (i.e. highest accuracy) and repeats recursively until it successfully splits the data in all leaves (or reaches the maximum depth).
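Here is a short scikit-learn sketch (again on the Iris dataset; the feature names are just readable labels I've added for the printout) that fits a shallow tree and shows the resulting sequence of rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
iris_feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# Limit depth to keep the tree small and reduce overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree is literally a sequence of if/else rules over the features
print(export_text(tree, feature_names=iris_feature_names))
```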

Decision Trees are in general simple to understand and visualize, requiring little data prep. This method can also handle both numerical and categorical data. On the other hand, complex trees do not generalize well ("overfitting"), and decision trees can be somewhat unstable because small variations in the data might result in a completely different tree being generated.

A method for classification derived from decision trees is the Random Forest, essentially a "meta-estimator" that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement.
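A minimal sketch with scikit-learn's RandomForestClassifier (the number of trees and the use of cross-validation here are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap sample (drawn with replacement)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation accuracy of the whole ensemble
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```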

Random Forests tend to exhibit a higher degree of robustness to overfitting (and to noise in the data), with efficient execution time even on larger datasets. They are, however, more sensitive to unbalanced datasets, a bit harder to interpret, and more demanding in terms of computational resources.

Another popular classifier in ML is Logistic Regression, where the probabilities describing the possible outcomes of a single trial are modeled using a logistic function (it is a classification method despite the name).

Here’s what the logistic equation looks like, in log-odds form, with p the probability of the positive class, coefficients b0 and b1, and input x:

ln(p / (1 - p)) = b0 + b1*x

Taking e (exponent) on both sides of the equation results in:

p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Logistic Regression is most useful for understanding the influence of several independent variables on a single outcome variable. It is focused on binary classification (for problems with multiple classes, we use extensions of logistic regression such as multinomial and ordinal logistic regression). Logistic Regression is popular across use-cases such as credit analysis and propensity to respond/buy.
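A minimal binary-classification sketch with scikit-learn's LogisticRegression (using the bundled breast-cancer dataset; max_iter is raised only so the solver converges on this unscaled data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary outcome: malignant vs. benign tumors
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# predict_proba returns the modeled probability of each outcome per sample
print(clf.predict_proba(X_test[:3]))
print(clf.score(X_test, y_test))
```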

Last but not least, kNN ("k Nearest Neighbors") is also often used for classification problems. kNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g. distance functions). It has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.
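A minimal sketch with scikit-learn's KNeighborsClassifier (k = 5 is just a common default choice):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 5 nearest neighbors, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "training" just stores the samples

print(knn.score(X_test, y_test))
```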
