Choosing the Right Classification Algorithm

Albane Colmenares
8 min read · Nov 13, 2023

Image credit: https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning

I am preparing for my third project in my Data Science course. The task is to build and tune machine learning classification models. This part of the course taught me about a multitude of options. Among them, four algorithms stood out as fundamental tools for tackling classification problems: Logistic Regression, Decision Trees, K-Nearest Neighbors (KNN), and Bayes Classification.

It seemed daunting to choose which one would apply the most to my dataset.

Just like an important decision in life, I thought it would be helpful to put on paper the pros and cons of each model, to have a clearer idea.

This is the goal of this post: exploring these algorithms, their strengths, weaknesses, and how to make an informed choice when facing a classification task.

Understanding Classification Problems

Before diving into the algorithms themselves, it is essential to understand what a classification problem requires. In classification, the goal is to categorize data points into predefined classes or labels. For example, classifying emails as spam or not spam, diagnosing diseases as benign or malignant, or identifying a water well as functional or non-functional.

The Iterative Nature of Classification Modeling

Credit: https://thedolectures.com/

In addition to understanding classification problems, it is important to realize that the classification process is rarely a one-off task where one algorithm will magically provide the best predictions. Instead, it is a cyclical process that involves continuous refinement. The approach begins with:

  1. the collection and preprocessing of data
  2. followed by the selection of features
  3. and the crucial task of splitting the dataset into training and testing sets.

The first iteration often reveals a general understanding of the characteristics of the data, which can then lead to adjustments in the model.
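The steps above can be sketched in a few lines with scikit-learn. This is a minimal first iteration, using a synthetic dataset in place of real data, that you would then refine in later passes:

```python
# A minimal first iteration of the workflow: collect (here: generate) data,
# split into training and testing sets, fit a simple model, and evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collect and preprocess the data (synthetic stand-in here)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 3. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# First iteration: a simple baseline model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Baseline accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The baseline score from this first pass becomes the reference point that later iterations (feature selection, other algorithms, tuning) try to beat.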

It Takes A Village

Image credit: https://ar.inspiredpencil.com/pictures-2023/two-puzzle-pieces-coming-together

It would of course seem simpler if one algorithm was assigned to each set of classification problems. But isn’t it so much more interesting to join forces? The number of existing algorithms is rich and diverse, and each has strengths and weaknesses. It then seems natural to experiment with multiple algorithms, each of which makes different assumptions about the data, to identify the one that best suits the problem and the dataset. The chances of discovering the model that makes the best and most relevant predictions are higher when several options have been tested.

1. Logistic Regression: A Simple yet Effective Choice

Image credit: gettyimages.com

Logistic Regression is a widely used classification algorithm. It is common to start with Logistic Regression as a baseline model thanks to its simplicity.

Despite its name, it is not a regression algorithm but rather a classification one. It is mainly used for binary classification (a two-class problem), although it can be extended to multiclass problems.

Strengths:

  • Interpretability: logistic regression provides clear results. You can easily understand how each feature influences the probability of belonging to a particular class.
  • Efficiency: it can handle large datasets without much trouble, while remaining fast at producing predictions.
  • Regularization: the model can be regularized to prevent overfitting, which makes it robust to noisy data.

Weaknesses:

  • Linear Assumption: Logistic Regression assumes a linear relationship between features and the target variable. This can be limiting for more complex datasets.
  • Binary Classification: while it can handle multiclass problems with techniques such as one-vs-rest, it may not perform as well or be the most intuitive choice for multi-category classification problems.

Some best cases:

Some examples where Logistic Regression modeling might be helpful include:

  • Medical diagnoses and predicting if a patient has a disease: have the disease, or not
  • Spam emails, and predicting if an email is spam or not
  • Fault detection — this can be in industrial problems, or to predict if a water point is functional or not

Go with logistic regression in cases where interpretability and simplicity are essential, especially in binary classification problems where you need to understand feature influence clearly.
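To illustrate that interpretability, here is a hedged sketch using scikit-learn’s built-in breast cancer dataset: after fitting, the coefficients show how strongly each (scaled) feature pushes the prediction toward one class or the other.

```python
# Logistic Regression on the breast cancer dataset, inspecting coefficients
# to see each feature's influence on the predicted class.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Scaling helps the solver converge and makes coefficients comparable
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)  # C would control regularization strength
clf.fit(scaler.transform(X_train), y_train)

print("Test accuracy:", round(clf.score(scaler.transform(X_test), y_test), 3))
# The largest absolute coefficients are the most influential features
for name, coef in sorted(zip(data.feature_names, clf.coef_[0]),
                         key=lambda t: abs(t[1]), reverse=True)[:3]:
    print(f"{name}: {coef:+.2f}")
```

The sign of each coefficient tells you which class a feature pushes toward, which is exactly the kind of clear feature influence the strengths list describes.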

2. K-Nearest Neighbors: The Power of Proximity

Image credit: https://www.reddit.com/r/titanfolk/comments/eqrxk7/i_dont_know_netflix_ive_only_watched_this_show_4/

K-Nearest Neighbors (KNN) is a classification algorithm that makes decisions for data points by looking at the majority class among their closest neighbors in the feature space.

Strengths:

  • Simplicity: KNN is intuitive and easy to implement. It doesn’t make strong assumptions about the data’s underlying distribution.
  • Non-linearity: unlike Logistic Regression, this model can capture complex, non-linear relationships between features and the target variable.
  • Adaptability: it can be used for both binary and multiclass classification problems.
  • Training speed: fast. As a lazy learner, it simply stores the training data; the real work happens at prediction time.

Weaknesses:

  • Complexity: the algorithm has to compute distances between data points, which makes it computationally ‘heavy’. Prediction is also slow, as the model examines all training records to find the k closest to a new record.
  • Sensitivity to noise: KNN is sensitive to noisy data and irrelevant features, which can lead to inaccurate predictions.

Some best cases:

K-Nearest Neighbors modeling might be helpful to predict the following scenarios:

  • ‘You may also like…’: providing recommendations for a product, based on similar reviewed ones, such as TV Series recommendations on Netflix
  • Sentiment Analysis in Reviews: determining whether a Tripadvisor review is positive or negative based on the choice of words
  • Credit Card Fraud Detection: predicting if a transaction is fraudulent based on past spending habits

KNN is a good choice when the dataset’s size is small to medium, and when the decision boundary you are targeting is simple yet non-linear. It is especially useful when you don’t want to make strong assumptions about the data distribution.
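A short sketch on the classic iris dataset shows the two practical points above: distances are scale-sensitive, so scaling belongs in the pipeline, and the choice of k is a hyperparameter worth trying at several values.

```python
# KNN on the iris dataset, trying a few values of k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Distances are scale-sensitive, so scaling goes inside the pipeline
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    print(f"k={k}: test accuracy {knn.score(X_test, y_test):.3f}")
```

A small k follows the training data closely (and its noise); a large k smooths the decision boundary, which is the usual bias–variance trade-off for this algorithm.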

3. Decision Trees: The Power of Tree-Based Learning

Image credit: https://www.ntara.com/blog/how-to-conduct-a-customer-segmentation-study/

Decision Trees are a non-linear and highly interpretable classification algorithm. They work by recursively splitting the data into subsets based on feature values, ultimately leading to a decision or classification. This algorithm provides a flexible and interpretable machine learning approach, allowing various options to adapt to different datasets and objectives by fine-tuning hyperparameters.

Strengths:

  • Interpretability: decision trees provide a straightforward visual representation of the decision-making process. You can follow the tree’s path to understand why a particular decision was made
  • Non-linearity: they can capture non-linear relationships between features and the target variable, making them suitable for complex problems
  • Feature Importance: feature importance can be drawn from this algorithm, helping you identify which factors drive classification decisions
  • No scaling requirement: decision trees do not require feature scaling, and categorical data can be handled with a LabelEncoder, since trees don’t assume equal distances between categories

Weaknesses:

  • Overfitting: this model is prone to overfitting, especially when the tree grows too deep. Pruning techniques are often necessary to mitigate this issue
  • Instability: small changes in the data can result in significantly different trees, making them unstable
  • Training speed: decision tree training can be slow, because candidate splits over all features are evaluated at every node, which becomes expensive on wide datasets

Some best cases:

Decision Trees are well-suited for scenarios where decisions follow a chain of if/else-style rules:

  • Loan Approval: based on several criteria, i.e. credit score, debt to income ratio, income
  • Customer Segmentation: grouping customers based on demographics, purchasing behavior and source of business
  • Weather Prediction: classifying weather conditions based on a set of variables: humidity, temperatures, wind

Decision trees are an excellent choice when interpretability is important, and when the classification problem is non-linear. However, pruning and hyperparameter tuning are essential for this algorithm, as it is prone to overfitting otherwise.
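A hedged sketch of that pruning advice: limiting `max_depth` acts as pre-pruning, and the fitted tree’s feature importances reveal which factors drive the splits, as the strengths list mentions.

```python
# A depth-limited decision tree on the breast cancer dataset,
# with feature importances showing which factors drive the splits.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# max_depth acts as pre-pruning to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", round(tree.score(X_test, y_test), 3))
top = sorted(zip(data.feature_names, tree.feature_importances_),
             key=lambda t: t[1], reverse=True)[:3]
for name, importance in top:
    print(f"{name}: {importance:.2f}")
```

Comparing the score of this shallow tree against an unconstrained one (and inspecting `tree.get_depth()`) is a quick way to see how much the unpruned version was overfitting.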

4. Bayes Classification: The Probability-Based Approach

Image credit: https://safjan.com/visual-text-exploration-as-part-of-preprocessing-before-classification/

Bayes Classification methods, such as Naive Bayes, are probabilistic classification algorithms based on Bayes’ theorem. Their main assumption is that features are independent of one another given the class; this is considered naive, yet it often proves reasonable in practice.

Strengths:

  • Efficiency: Bayes classifiers are computationally efficient and can handle large datasets, with a high number of features.
  • Interpretability: they provide probabilistic predictions, making it easy to assess the model’s confidence in its predictions.
  • Robustness to Irrelevant Features: Naive Bayes is particularly robust to irrelevant features due to its independence assumption.

Weaknesses:

  • Independence Assumption: the naive independence assumption might not hold in many real-world scenarios, reducing predictive accuracy.
  • Linear Preference: Naive Bayes may struggle with capturing complex, non-linear relationships in the data.

Some best cases:

Bayes classification modeling would be helpful to predict the following scenarios:

  • Spam emails: classifying emails as spam or not based on a set of keywords
  • Text classification: pre-defining topics of a given text based on the frequency of words
  • Sentiment analysis: again, determining if a Tripadvisor review is positive or negative based on the choice of words.

Bayes classification methods are a good choice when you have high-dimensional data and want a probabilistic approach that is computationally efficient. They work well for text classification, spam detection, and other tasks with sparse data.
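The spam example from the best cases can be sketched with Multinomial Naive Bayes on a tiny, made-up corpus (the four texts below are purely illustrative): word counts feed the per-class word-frequency probabilities the classifier multiplies together.

```python
# Multinomial Naive Bayes on a toy spam-vs-ham text problem.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "claim your free reward",
         "meeting at noon tomorrow", "lunch with the team today"]
labels = ["spam", "spam", "ham", "ham"]  # tiny illustrative corpus

# Bag-of-words counts feed Naive Bayes' per-class word probabilities
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize waiting"]))     # → ['spam']
print(model.predict(["team meeting tomorrow"]))  # → ['ham']
```

Because the pipeline only counts words and multiplies independent per-word probabilities, it stays fast even with very high-dimensional, sparse vocabularies, which is why this family works so well for text.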

Choosing the Right Algorithm: A Suggested Checklist

Image credit: https://virtualpro24.com/que-es-un-checklist/

Now that we have explored the strengths and weaknesses of these classification algorithms, here is a proposed set of questions you can take with you. It can be a good idea to start answering these before starting your next project.

  • 1. Nature of the Problem:
    Is it a binary or multiclass classification problem? Some algorithms are naturally suited for binary classification, while others can handle both.
  • 2. Interpretability:
    Do you need to understand and explain the model’s decisions clearly? If interpretability is a priority, logistic regression or decision trees might be better choices.
  • 3. Data Size:
    How large is your dataset? KNN can become computationally expensive for large datasets due to its reliance on distance calculations.
  • 4. Data Complexity:
    Is the relationship between features and the target variable linear or non-linear? Decision trees and KNN are more flexible in capturing non-linear relationships.
  • 5. Handling Irrelevant Features:
    If you suspect that some features are irrelevant, Naive Bayes can be strong, due to its independence assumption.

In a nutshell, the choice of a classification algorithm should be driven by the specific characteristics of your problem and your project’s goals. All classification models have their own strengths and weaknesses, making them suitable for different scenarios.

Remember that classification modeling is a dynamic and iterative process, where trying multiple algorithms improves the model’s predictive power. Experimenting with multiple methods allows you to refine the models until the evaluation metrics reach their best values, and make the most sense in the context of your dataset.
