Demystifying Classification Models

Eram Khan · Published in Nerd For Tech · Jun 14, 2024

Classification models are quite popular in the field of machine learning, responsible for tasks like spam filtering, image recognition, and even fraud detection. These models are used when you want to parse through data and categorise it based on patterns learned from previously labelled examples.

A vast array of classification models are available and choosing the right one can be a challenge. This article dives into five popular models: K-Nearest Neighbours, Logistic Regression, Support Vector Machines, Decision Trees, and Random Forests, explaining their inner workings and ideal applications.

Please refer to Section 3, "Additional Terms", if any of the terminology is unfamiliar.

1. Popular Classification Models

1.1 K-Nearest Neighbours:

When to use: It is ideal when the dataset is small, labelled, and free of significant outliers. It also works well when decision boundaries are irregular (that is, when the data plotted on a graph is difficult to segregate into two or more categories with a simple boundary). It is widely used for Recommendation Systems, EKG Patterns, Image Classification and Text Mining. Since there are no underlying assumptions about the distribution of the data, it generalises well across problem types.

How it Works: It is a memory-based classifier and does not require a model to be fit. The k-NN algorithm works by identifying the k nearest points based on feature data. Most commonly, Euclidean distance is used to measure distance, which indicates how similar two points are based on their feature vectors. Given the training data, an unclassified data point is assigned a label based on its neighbours.

Distance between points x and y (Euclidean distance):

d(x, y) = sqrt((x₁ - y₁)² + (x₂ - y₂)² + … + (xₙ - yₙ)²)
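
As a quick illustration, here is a minimal sketch in Python (NumPy, with made-up feature vectors) that computes the Euclidean distance from the formula above, alongside the Manhattan distance mentioned under the tuning parameters below:

import numpy as np

# Two hypothetical feature vectors (e.g. standardised feature values)
x = np.array([1.0, 2.0, 0.5])
y = np.array([2.0, 0.0, 1.5])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences, an alternative metric
manhattan = np.sum(np.abs(x - y))

print(euclidean, manhattan)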

Tuning Parameters:

  • Different values of k can be tested, that is, how many neighbouring points should be considered.
  • Different underlying distance metrics (Euclidean distance, Manhattan distance, etc.) can be tested.

Data Preprocessing: Data Standardisation (mean 0 and variance 1), Feature Selection, Dimensionality Reduction.

Pros and Cons: It is effective for small datasets, and because it is memory based, new data points can be incorporated without retraining. However, it is computationally intensive at prediction time, which makes it difficult to scale, and it does not classify accurately in high-dimensional settings (a high number of predictors). It may also be prone to overfitting when k is too small.

Model Inference: A new data point is classified by a vote of its k nearest neighbours; it is assigned to the majority category among them.
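
Putting the pieces together, the sketch below shows one possible k-NN workflow with scikit-learn; the synthetic dataset, the choice of k = 5 and the Euclidean metric are purely illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labelled data (a small dataset, as recommended for k-NN)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardise features (mean 0, variance 1) so no feature dominates the distance
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# k and the distance metric are the main tuning parameters
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Each test point is assigned the majority label among its 5 nearest neighbours
print("Test accuracy:", knn.score(X_test, y_test))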

1.2 Logistic Regression:

When to use: It is best suited for binary classification problems where a large sample size is available. Popular applications include Fraud Detection and Credit Scoring. Logistic regression models are also widely used as an inference tool to quantify the impact of input variables.

How it Works: The model passes the output of a linear regression through a sigmoid function, which maps the linear combination of the independent variables to a value between 0 and 1; thresholding this probability yields the categorical output. Maximum likelihood estimation is used to determine the coefficients of the independent variables, and techniques like L1 (LASSO) and L2 (ridge) regularisation can be used to avoid overfitting. Below is the equation for this model; a short numerical sketch follows the breakdown of its terms:

p(y = 1 | X) = 1 / (1 + exp(-(β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ)))

  • Linear Combination: The part within the parentheses (β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ) is a linear combination of the coefficients (β) and the features (X). This means the equation assumes a linear relationship between the features and the log-odds of the target variable.
  • Logistic (Sigmoid) Transformation: The linear combination is passed through the logistic function 1 / (1 + exp(-(…))), which bends the linear relationship into an S-shaped curve and ensures the predicted probability p(y = 1 | X) always falls between 0 and 1, as a probability should.
  • Logit Interpretation: Equivalently, the model assumes that the log-odds, ln(p / (1 - p)), equal the linear combination, which is why the coefficients are interpreted on the log-odds scale.
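
To make the mapping concrete, here is a minimal sketch in Python (NumPy) with arbitrary, non-fitted coefficient values, showing how the linear combination is squashed into a probability:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients and a single feature vector (not fitted values)
beta = np.array([-3.0, 0.12])   # beta_0 (intercept), beta_1
x = np.array([1.0, 45.0])       # 1 for the intercept, then the feature value

z = beta @ x                    # linear combination: beta_0 + beta_1 * x_1
p = sigmoid(z)                  # predicted probability p(y = 1 | X)
print(f"log-odds = {z:.2f}, probability = {p:.3f}")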

The Wald test can be used to assess the effect of excluding an independent variable.

Tuning Parameters: Different values of C can be tested to penalise model complexity. C is a positive value; smaller values of C lead to stronger regularisation, resulting in a simpler model with fewer effective variables.

Data Preparation: Data Scaling, One-hot encoding for categorical variables

Pros and Cons: It provides easy-to-interpret probability estimates for the output variable. However, it can overfit in higher dimensions (with more predictors), and it requires a linear relationship between the predictors and the log-odds of the target variable (although the data can be transformed to create such a relationship).

Model Inference: The coefficients represent the effect of the independent variables on the log-odds of the dependent variable. Consider this example: we are trying to predict the likelihood of a patient having a heart attack based on their age. If the coefficient of age is 0.12, the odds ratio is exp(0.12), which is approximately 1.13. This means that for every one-year increase in age, the odds of having a heart attack increase by about 13%.

The logistic regression output also includes a z-score, that is, the coefficient divided by its standard error. It corresponds to the hypothesis test whose null hypothesis is that the coefficient is 0 and the predictor has no impact on the target variable. If the absolute z-score is higher than about 2, the coefficient is significant at the 95% confidence level. Analysis of deviance can also be used to further narrow down the important predictors.
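
The sketch below illustrates this kind of inference with scikit-learn on synthetic data; the C value and the use of L2 regularisation are illustrative choices, and the printed odds ratios are interpreted exactly as in the heart-attack example above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for e.g. credit-scoring features
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Smaller C means stronger L2 regularisation (a simpler model)
model = LogisticRegression(C=1.0, penalty="l2")
model.fit(X, y)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios = np.exp(model.coef_[0])
for i, (coef, ratio) in enumerate(zip(model.coef_[0], odds_ratios)):
    print(f"feature {i}: coefficient = {coef:.3f}, odds ratio = {ratio:.3f}")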

1.3 Support Vector Machine:

When to use: They excel in high-dimensional classification problems where data is limited. They are particularly used in Bioinformatics, Text Classification and Image Recognition.

How it Works: The objective of the model is to identify a hyperplane in the N-dimensional feature space that separates the data into different categories. The dimension of this hyperplane depends on the number of input features.

There are two types of classification possible:

(i) Data can be separated perfectly (Hard Margin): the hyperplane is chosen by maximising the margin, so that the distance between the hyperplane and the nearest data points is as large as possible.

(ii) Some data points may fall on the wrong side (Soft Margin): when complete separation is not possible, certain misclassified points are tolerated. A penalty is associated with each such point, and the objective (1/margin + ∑penalty) is minimised. Hinge loss is one of the most common penalties used.

If the boundary separating the categories is not linear, different kernel functions can be used to find it by implicitly mapping the data into a higher-dimensional space.

Tuning Parameters:

C (cost parameter trading off margin maximisation against the misclassification penalty), Kernels (linear, polynomial, radial basis function and sigmoid)

Data Preparation: Data Scaling

Pros and Cons: It is robust to outliers and performs well on high-dimensional data. It is, however, computationally expensive and difficult to interpret (particularly with non-linear kernels).

Model Inference: New data points are classified according to which side of the hyperplane they fall on.
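
A minimal sketch of an SVM classifier with scikit-learn's SVC; the synthetic data, the RBF kernel and the C value are illustrative assumptions rather than recommendations:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Scaling matters for SVMs because the margin is distance-based
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# C trades off margin width against the misclassification penalty (soft margin);
# the RBF kernel handles non-linear boundaries
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)

# New points are classified by which side of the learned hyperplane they fall on
print("Test accuracy:", svm.score(X_test, y_test))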

1.4 Decision Trees:

When to use: Decision trees are easy to interpret and work with multiple data types. They find practical application in Marketing Analytics and in determining important features for further data modelling.

How it Works: Entropy measures the level of randomness (impurity) in a dataset. Information gain is measured as the reduction in entropy achieved by each split in the decision tree, and the split with the highest information gain is selected. The process is repeated iteratively to build the tree. Removing branches that do not provide additional information gain is called pruning.

Entropy(S) = -Σ(pᵢ * log₂(pᵢ))

  • S is the set of data points at a node.
  • pᵢ is the proportion of data points in S belonging to class i.
  • log₂ is the logarithm with base 2.
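
Here is a minimal sketch in Python (NumPy, with made-up class counts) of how the entropy above, and the information gain of a candidate split, would be computed:

import numpy as np

def entropy(counts):
    # counts: number of data points of each class at a node
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent node with 10 positive and 10 negative examples
parent = [10, 10]

# A candidate split producing two child nodes
left, right = [8, 2], [2, 8]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right

# Information gain = parent entropy - weighted average of child entropies
gain = entropy(parent) - (n_left / n) * entropy(left) - (n_right / n) * entropy(right)
print(f"information gain = {gain:.3f}")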

The tree is typically grown until each node reaches a minimum size, and then pruned back based on a threshold of the chosen information criterion. The Gini index or cross-entropy should be used for growing the tree, while the misclassification rate is typically used for pruning. There can be significant overfitting if there are too many possible categories. Also, if a certain category is to be preferred as the default, the Gini index can be modified accordingly; for example, in medical applications, false positives are preferred over false negatives.

Tuning Parameters: maximum depth, minimum samples per split, minimum samples per leaf and splitting criterion; cross-validation error is typically used to choose among candidate settings.

Data Preparation: Categorical variable encoding, Missing value imputation

Pros and Cons: It is prone to overfitting. Because variables are considered step-wise (greedily) to split the data, small changes in the training data or in the order of candidate splits can produce very different trees. This problem can be mitigated by bagging multiple trees (Random Forest).

Model Inference: A new data point is passed through the tree, following the splits based on feature values, until it reaches a leaf node representing the predicted class or value.
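
A minimal sketch of a decision tree with scikit-learn's DecisionTreeClassifier; the synthetic data and the depth and leaf-size limits are illustrative ways of controlling overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# criterion chooses the impurity measure used to grow the tree;
# max_depth and min_samples_leaf limit complexity to reduce overfitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_leaf=10)
tree.fit(X_train, y_train)

# A new point follows the splits down to a leaf, whose majority class is returned
print("Test accuracy:", tree.score(X_test, y_test))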

1.5 Random Forest:

When to use: It is an ensemble model used to improve accuracy and reduce overfitting. Practical applications include Churn Prediction, Stock Price Prediction and Medical Diagnosis.

How it Works: Bagging is used to reduce the variance of the estimated prediction function, and it works well for high-variance, low-bias procedures like trees. Random Forest leverages the power of multiple trees to make a decision: multiple bootstrap samples are created from a single dataset, and a random subset of features is considered at each split of each decision tree so that no single feature gets too much weight. The final result is the aggregation of the results from the different trees.

Ensemble_Prediction(x) = (1/T) * Σ(T_i(x))

  • Ensemble_Prediction(x) represents the predicted value (classification label or regression output) for a new data point x, and T indicates the number of trees.
  • T_i(x) represents the prediction made by the i-th tree in the forest for the data point x.
  • (1/T) is the averaging factor (can be modified for weighted voting schemes).

When there is a large number of predictors but only a few relevant ones, Random Forest can perform poorly, because many splits end up being made on uninformative features.

Tuning Parameters: Number of trees, Maximum depth, Minimum samples per leaf, Maximum features

Data Preparation: Categorical variable encoding, Missing value imputation, Variable Selection

Pros and Cons: It is robust to overfitting, generally more accurate than a single decision tree, and can handle large datasets. It is, however, difficult to interpret and computationally expensive.

Model Inference: A new data point passes through all trees in the forest; for classification, the mode (majority vote) of the trees' predictions is selected, while for regression the predictions are averaged.
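
A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; the number of trees and the max_features setting are illustrative values of the tuning parameters listed above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# n_estimators is the number of trees T; max_features controls the random
# subset of features considered at each split so no single feature dominates
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=3)
forest.fit(X_train, y_train)

# Each tree votes; the predicted class is the majority (mode) across trees
print("Test accuracy:", forest.score(X_test, y_test))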

2. Model Validation and Performance Metrics

  1. Accuracy: (True Positives + True Negatives) / Total Predictions. It should be used for a balanced dataset where each category is roughly equally likely. It is not a great measure for imbalanced datasets: the model may always predict the majority class and still achieve high accuracy while performing poorly on the minority class.
  2. Precision: True Positives / (True Positives + False Positives). It shows the ability to avoid false positives and should be used when the cost of a false positive is high.
  3. Recall: True Positives / (True Positives + False Negatives). It shows how well the model identifies positive cases and should be used when missing a positive instance is critical, for example in fraud detection systems.
  4. F1 Score: F1 = 2 * (Precision * Recall) / (Precision + Recall). It is a single metric that balances Precision and Recall, and should be used when both are equally important.
  5. ROC AUC (Area Under the Receiver Operating Characteristic Curve): Represents probability that model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Indicates ability to discriminate between positive and negative classes. Particularly used for binary classification.
  6. k-fold Cross Validation: Data is split into k folds; the model is trained on k-1 folds and tested on the remaining fold, and the process is repeated k times. It provides a robust estimate of model performance on unseen data.
  7. Misclassification Error: Proportion of instances that the model classified incorrectly to evaluate model performance. It is not ideal for imbalanced datasets.
  8. Gini Index: It reflects the likelihood of randomly picking two instances from the same node and getting different classifications. It is primarily used in decision tree algorithms.
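
Most of these metrics are straightforward to compute once a model has produced predictions; the sketch below uses scikit-learn's metric functions with a logistic regression fitted on synthetic data purely as an example:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = make_classification(n_samples=1000, n_features=6, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probabilities for ROC AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))

# 5-fold cross-validation on the full dataset for a more robust estimate
print("5-fold CV accuracy:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())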

3. Additional Terms

  1. Feature Set: It is a collection of attributes or characteristics used to describe a data point. Also referred to as independent variables or predictors, these attributes are used to classify or predict the response variable.
  2. Training and Testing Data: Before building a model, data is typically split into training and testing data. Training data is used to train the model and test data is used to evaluate performance using unseen data.
  3. Overfitting: When a model memorises the training data too well, including its noise. This can lead to the model performing well on the training data but poorly on new, unseen data.
  4. Labelled Data: Type of data where each data point has a corresponding label or target value.

By understanding the strengths and weaknesses of these five models, you can make informed decisions and unlock the power of classification for your machine learning endeavours.
