So, which ML Algorithm to use?!

Ahmed Abulkhair
15 min read · Apr 28, 2023



As a data science practitioner, you may have found yourself scratching your head when trying to choose the best machine learning algorithm for your project. With so many options available, the process can be overwhelming and confusing. But fear not, because we’re here to simplify it for you.

Let’s start by answering a fundamental question: what is a machine learning algorithm trying to do? At its core, any algorithm takes a set of features and translates them into a useful prediction using a mathematical procedure unique to that algorithm, so this translation process can vary widely from one algorithm to another.

So, what’s the keyword that plays a major role in selecting an algorithm? It’s simple: the features themselves. You’re just trying to choose the best algorithm (or translator) for your specific set of features.

In the next section, we’ll break down the most widely known algorithms based on their general behavior. By understanding the different categories of algorithms, you’ll be well-equipped to choose the right one for your project and move forward with confidence. Say goodbye to overwhelm and hello to productivity!

Decision Boundary Concept

Decision Boundary

It’s crucial to have a clear understanding of the decision boundary concept. The decision boundary defines how an algorithm behaves and how it interprets and processes data. It essentially highlights the strengths of each algorithm, providing valuable insights into their performance.

To illustrate this concept, the graph below, taken from the sklearn documentation, is an excellent example. It shows how different algorithms perform on different datasets, giving us a clear indication of their decision boundary and how they interpret the data.

Understanding the decision boundary concept is essential for selecting the best algorithm for a particular dataset. By knowing how an algorithm behaves and interprets data, data scientists can make informed decisions and choose the right algorithm for their specific needs.

Decision Boundaries for Different ML Algorithms

The graph above is a powerful tool for data scientists, revealing valuable insights into the behavior of each algorithm. By examining the graph, we can see how different algorithms interpret and process data in unique ways.

For instance, the Nearest Neighbors algorithm (KNN) relies heavily on the proximity of data points, while Linear SVM tries to slice the data with a straight boundary to determine the class of each point. RBF SVM and Neural Networks both produce smooth, non-linear boundaries, but they get there in different ways: the RBF kernel implicitly maps the data into a space where a linear separator can be found, while neural networks stack linear combinations of the features with non-linear activations.

Decision Trees and Random Forest both use a strategy of splitting the data to separate classes, with Random Forest averaging many such trees built on random subsets of the data and features. Meanwhile, AdaBoost reweights the data between successive trees so that later splits focus on the points the earlier ones got wrong.

By categorizing algorithms based on their behavior, we can make informed decisions when selecting the right algorithm for our needs. Understanding how an algorithm processes and interprets data is crucial for choosing the best algorithm for a particular dataset.

Linear Models

In machine learning, a linear model refers to a model that is specified as a linear combination of features. During the learning process, the algorithm computes a weight for each feature based on the training data to form a model that can predict or estimate the target value.

Essentially, the linear model uses a linear equation to represent the relationship between the input features and the output value. By adjusting the weight of each feature during training, the model learns to estimate the target value for new input data.

Linear models are widely used in various machine learning applications, including regression and classification problems.

Linear Models

This category includes the following algorithms:

  • Linear Regression
  • Logistic Regression
  • SVM
  • Neural Networks
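
As a minimal sketch (assuming scikit-learn is available, and using its built-in breast cancer dataset purely as a stand-in for your own features), a logistic regression learns one weight per feature:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Any tabular dataset works here; the breast cancer data is just a convenient stand-in.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Logistic regression learns one weight per feature plus an intercept,
# i.e. a linear combination of the inputs pushed through a sigmoid.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

print("Learned weights per feature:", model.coef_[0])
print("Test accuracy:", model.score(X_test, y_test))
```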

Tree-based Models

Tree-based models are a category of machine learning algorithms that use a series of if-then rules to generate predictions from one or more decision trees. This results in a splitting effect that can be seen in the graph mentioned earlier.

The decision tree is a graphical representation of the rules that a tree-based model uses to make predictions. The tree starts with a single node, which represents the entire dataset, and then splits into branches that represent subsets of the data based on a particular feature. The process is repeated until each subset is sufficiently pure or a stopping criterion, such as a maximum depth, is reached.

There are several popular tree-based models, such as Decision Trees, Random Forests, and Gradient Boosted Trees. Decision Trees are simple and easy to interpret, while Random Forests combine multiple trees to improve accuracy and reduce overfitting. Gradient Boosted Trees, on the other hand, build trees sequentially, with each new tree correcting the errors of the ensemble built so far.

Tree-based models are widely used in various machine learning applications, including classification and regression problems.

Tree-based Models

This category includes the following algorithms:

  • Decision Tree
  • Random Forest (ensemble methods)
  • XGBoost (ensemble methods)
  • LightGBM (ensemble methods)
  • GradientBoosting (ensemble methods)
  • AdaBoost (ensemble methods)
  • CatBoost (ensemble methods)
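
A rough illustration under the same assumptions (scikit-learn, with the built-in breast cancer dataset as a placeholder): a single decision tree and a random forest share the same interface, and the forest reports how much each feature contributed to its splits:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single tree: a sequence of if-then splits learned from the data.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# A random forest: many trees trained on random subsets, then averaged.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
print("Feature importances (forest):", forest.feature_importances_.round(3))
```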

Distance-based Models

Distance-based models are a category of machine learning algorithms that determine the decision boundary based on the closeness of points to each other. These models are heavily influenced by the scale of each feature.

The most popular distance-based model is the k-Nearest Neighbors (KNN) algorithm. KNN finds the k closest data points to the input and predicts the output based on the majority class (or average value) of these neighbors. Another algorithm often grouped in this category is Naive Bayes, which uses Bayes’ theorem, together with an assumption of conditional independence between features, to estimate the probability of each class given the input features.

Distance-based models are used in various machine learning applications, including classification and regression problems. However, they can be sensitive to the scale of the features, and data normalization or standardization is often required to improve their performance.

Distance-based Models
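
For intuition, here is a minimal k-NN sketch (scikit-learn, iris data as a placeholder): the prediction for a new point is simply the majority class among its k nearest training points.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each prediction is a majority vote among the 5 closest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
# Note: in practice the features should be scaled first (see the scaling section below).
```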

When selecting a machine learning algorithm, it’s essential to understand the relationship between the features and the target value. This can help data scientists choose the right model for their data and accurately predict the target value.

Based on the categorization of machine learning algorithms, a simple question arises: Are the features in our data helpful for creating splits between classes, or are they more useful in creating linear trends between the features and the target value?

To answer this question, data scientists can perform exploratory data analysis (EDA) to understand the relationship between the features and the target value. They can use visualizations and statistical methods to analyze the data and identify patterns and trends.

If the features show a clear relationship with the target value that can be expressed linearly, a linear model may be appropriate. On the other hand, if the data requires splits to separate classes, a tree-based model may be more suitable.

By considering the behavior of each algorithm and understanding the relationship between the features and the target value, data scientists can choose the right machine learning algorithm for their data and achieve better predictive performance.

Exploratory Data Analysis (EDA)

Model Selection Process

Exploratory Data Analysis (EDA) is the cornerstone of data science. The ultimate goal is to know, explore, and visualize your data to make informed decisions. By understanding your data, EDA can help you decide on the best algorithm for your model.

While there is no clear-cut process for EDA, we can summarize the essential steps to reveal valuable insights.

Step 1: Summary statistics

By looking at percentiles, ranges, variance, and standard deviation, you can identify the range for most of the data. Averages and medians describe the central tendency, while correlations indicate strong relationships.
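
Assuming your data fits in a pandas DataFrame with a numeric target column (the breast cancer frame below is only a stand-in for your own data), the basic summary pass might look like this:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for your own data: any DataFrame with a numeric "target" column works.
df = load_breast_cancer(as_frame=True).frame

print(df.describe())                                       # percentiles, ranges, mean, std
print(df.median(numeric_only=True))                        # medians for central tendency
print(df.corr(numeric_only=True)["target"].sort_values())  # correlation of each feature with the target
```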

Step 2: Visualize the data

Box plots identify outliers, density plots and histograms show the spread of data, and scatter plots describe bivariate relationships.
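
A matching visualization pass might look like the sketch below (matplotlib and seaborn assumed installed; the column names belong to the stand-in dataset, not to your data):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame  # same stand-in DataFrame as above

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Box plot: quickly exposes outliers in a single feature.
sns.boxplot(data=df, y="mean area", ax=axes[0])

# Histogram with a density estimate: shows the spread and skew of the data.
sns.histplot(data=df, x="mean area", kde=True, ax=axes[1])

# Scatter plot: bivariate relationship between two features, colored by the target.
sns.scatterplot(data=df, x="mean radius", y="mean texture", hue="target", ax=axes[2])

plt.tight_layout()
plt.show()
```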

The insights gathered from these two steps will contribute to model selection. By identifying the range, central tendency, and relationships in your data, you can determine which algorithm is best suited for your model. It’s all about exploring your data thoroughly and making informed decisions based on what you find.

Now, let’s look at how the outcomes of these two steps feed into model selection.

Outliers

Outliers can significantly impact any linear model, because the loss function (typically squared error) gives extreme points a disproportionate influence on the fitted weights. This can hurt the accuracy of the model and raises some important questions.

Firstly, would a linear model be appropriate given the existence of outliers? Ordinary linear models assume roughly normally distributed errors and are not robust to extreme values, so they may not be the best choice for data with significant outliers.

If outliers persist, the question then becomes how to handle them. One way to deal with outliers is to remove them from the dataset. However, this approach can be risky as it may lead to a loss of valuable information.

Another option is to transform the data, for example with a logarithmic transformation, or to use a robust scaling technique based on the median and interquartile range (IQR). These techniques can reduce the impact of outliers on the model without removing any data points.

It’s important to note that the handling method for outliers should be chosen based on the specific data and model being used. Handling outliers can improve the accuracy of the model, but it requires careful consideration and an understanding of the data’s characteristics.

So, if you encounter outliers in your data, don’t panic! Instead, consider whether a linear model is appropriate, and if not, explore different techniques to handle the outliers and improve the accuracy of your model.

Data Distribution and Correlation

The distribution of data and the correlation between features and the target variable are crucial considerations in selecting an appropriate model. Two key factors to keep in mind are normality and correlation.

If the data is not normally distributed, it may be beneficial to consider a tree-based model, since the idea of splits is largely unaffected by the shape of the distribution. If the data is roughly normally distributed, a linear model may be a better fit, as its assumptions about the underlying distribution are more likely to hold.

The strength of the correlation between features and the target variable is also important in selecting an appropriate model. If there is a strong linear correlation, a linear model may be the best choice, as it can construct linear combination boundaries with ease. A tree-based model, by contrast, is less dependent on linear correlation, since it can exploit non-linear, threshold-style relationships.

In short, normality and correlation are critical factors in model selection: a lack of normality may point towards a tree-based model, while a strong linear correlation may indicate a linear model. Each model family has its strengths and weaknesses, so carefully evaluate the data and choose the model that best fits your specific needs.

Missing Values

Missing values are a common issue that data scientists must address when working with datasets. When missing values are present, it’s essential to consider a few critical questions to determine the best approach for dealing with them.

First, it’s important to consider whether the missing values are related to a specific event. If so, the missingness itself carries information: it can be treated as a separate category of data, which tends to be more useful to a tree-based model.

If a linear model is still required despite the missing values, it’s important to determine the best imputation technique that will minimize the impact on the model. There are many imputation techniques available, including mean imputation, median imputation, and K-nearest neighbors imputation.

Another option is to use a model that can handle missing values internally, such as Naive Bayes or XGBoost. However, it’s important to note that the sklearn implementation of the Naive Bayes algorithm does not allow missing values, so manual implementation may be required.
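
As an illustration (assuming the xgboost package is installed; the missing values below are injected artificially), XGBoost can be trained on data containing NaNs directly:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Artificially knock out 10% of the values to simulate missing data.
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.10] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# XGBoost learns a default split direction for missing values at every node,
# so the NaNs can be passed in as-is.
model = XGBClassifier(n_estimators=200)
model.fit(X_train, y_train)
print("Test accuracy with missing values:", model.score(X_test, y_test))
```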

By considering the nature of the missing values and the specific needs of the model, data scientists can select the most appropriate approach for their project.

Feature Engineering

Once the process of exploratory data analysis (EDA) is complete, it’s important to consider the model family that will be used. This will guide the process of feature engineering, which is crucial for improving model performance.

Feature engineering involves transforming raw data into features that are more informative and useful for the model. Some standard procedures in feature engineering include scaling, one-hot encoding, and creating new features through combinations or interactions.

Scaling is essential for models that rely on distance metrics, such as K-nearest neighbors and support vector machines. One-hot encoding is useful for categorical features, allowing them to be represented as binary vectors. Creating new features through combinations or interactions can help capture complex relationships between features that may be missed by the model.

It’s important to note that feature engineering should be done with respect to the model family that will be used. For example, decision trees can handle categorical features and do not require scaling, while linear models benefit from scaled features.

Missing Values Handling

Here are some tips to keep in mind when handling missing values (a short sketch after this list illustrates the first two):

  1. Imputing the missing values with the “Unknown” category when dealing with categorical variables will be more beneficial to tree-based models. This is because tree-based models can handle categorical variables well and do not rely on the linear relationship between variables.
  2. Imputing the missing values with the mean or median will be more beneficial to linear models over tree-based models. Linear models rely on the linear relationship between variables, and imputing with mean or median values helps preserve this relationship.
  3. Even models that can handle missing values internally can still be sensitive to them, and leaving missing values untreated can lead to reduced performance. For instance, Naive Bayes can in principle work around missing values, but they can still hurt the model’s performance if not handled correctly.
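
A minimal sketch of the first two tips with scikit-learn’s SimpleImputer (the toy DataFrame and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "city": ["Cairo", np.nan, "Giza", "Cairo", np.nan],   # categorical with missing values
    "income": [3200.0, np.nan, 2800.0, np.nan, 4100.0],   # numeric with missing values
})

# Tip 1: fill categorical gaps with an explicit "Unknown" category (tree-friendly).
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
df["city"] = cat_imputer.fit_transform(df[["city"]]).ravel()

# Tip 2: fill numeric gaps with the median (preserves the trend linear models rely on).
num_imputer = SimpleImputer(strategy="median")
df["income"] = num_imputer.fit_transform(df[["income"]]).ravel()

print(df)
```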

Outliers Handling

Outliers can affect your model’s performance negatively and dramatically; however, in some cases they can be a golden feature with strong predictive power. Here are some tips to keep in mind when handling outliers (a brief sketch follows the list):

  1. Clipping extreme values can be useful for any linear model, as it limits the influence of the tails, but it causes some information loss.
  2. Transformations such as logarithmic or square root transformation can add a damping effect to the values without any information loss, making them more beneficial for linear models.
  3. Tree-based models are generally less affected by outliers because the idea of splits in them will most probably assign outliers to a separate split.
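
A brief sketch of the first two tips with pandas and NumPy (the skewed income series is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8, sigma=1, size=1000))  # skewed, outlier-prone data

# Tip 1: clip to the 1st-99th percentile range (limits extreme values, loses some information).
clipped = income.clip(lower=income.quantile(0.01), upper=income.quantile(0.99))

# Tip 2: a log transform dampens large values while keeping the ordering intact.
logged = np.log1p(income)

print("Original skew:", round(income.skew(), 2))
print("Clipped skew: ", round(clipped.skew(), 2))
print("Logged skew:  ", round(logged.skew(), 2))
```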

Scaling and Normalization

  • Impact on Tree-based Models:

Tree-based models, such as decision trees and random forests, are not affected by scaling and normalization. The reason is that these models work by recursively splitting the data using thresholds on individual features, and the ordering of values within a feature does not change under (monotonic) scaling. Therefore, the relative scale of the features does not matter.

  • Impact on Linear Models:

Linear models, such as linear regression and logistic regression, are highly affected by scaling and normalization. Here are some reasons why normalization is an excellent approach for linear models:

  1. Faster Training: Normalization can speed up the training process by reducing the number of iterations required by the optimization algorithm.
  2. Better Scores: Putting all features on a similar scale prevents features with large numeric ranges from dominating the learned weights, and it lets regularization penalize all features comparably, which often improves accuracy.
  • Impact on Distance-based Models:

Distance-based models, such as k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), are profoundly affected by the scale of the features. Here are some reasons why scaling is necessary for distance-based models (see the sketch after this list):

  1. Feature Importance: Scaling ensures that all features contribute comparably to the model. Otherwise, a feature with a larger scale will dominate the model and may lead to biased predictions (although, in rare cases, you might deliberately leave a feature unscaled to make it dominant).
  2. Distance Metric: Scaling ensures that the distance metric used in the model is meaningful. Otherwise, the distance between two points may be dominated by a single feature with a larger scale.
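
A small sketch of why this matters (scikit-learn’s wine dataset is used only because its features live on very different scales): compare a k-NN classifier with and without standardization.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Without scaling, the distance metric is dominated by the features with the largest ranges.
raw_knn = KNeighborsClassifier(n_neighbors=5)

# With scaling, every feature contributes comparably to the distance.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("k-NN without scaling:", round(cross_val_score(raw_knn, X, y, cv=5).mean(), 3))
print("k-NN with scaling:   ", round(cross_val_score(scaled_knn, X, y, cv=5).mean(), 3))
```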

Categorical Variables Handling

Categorical variables are variables that take a limited number of values. They are common in many real-world applications, such as marketing and finance. Handling categorical variables is an important preprocessing step in machine learning. We will discuss how different approaches for handling categorical variables impact different types of models.

  • Impact on Linear Models:

Linear models, such as linear regression and logistic regression, require numerical input. Therefore, categorical variables need to be encoded before feeding them into the model. Here are two popular encoding methods that are beneficial for linear models:

  1. One-hot Encoding: One-hot encoding transforms each category into a binary vector with a length equal to the number of categories. This method is useful for linear models because it creates a separate feature for each category, which can capture non-linear relationships between the category and the target variable.
  2. Frequency Encoding: Frequency encoding replaces each category with its frequency in the dataset. This method is useful for linear models because it captures the distribution of the categories in the dataset, which can be informative for predicting the target variable.
  • Impact on Tree-based Models:

Tree-based models, such as decision trees and random forests, can handle categorical variables without any encoding. However, some encoding methods can improve the performance of these models. Here are two popular encoding methods that are beneficial for tree-based models (a compact sketch of all four encodings follows this list):

  1. Label Encoding: Label encoding replaces each category with a numerical label. This works for tree-based models because, even though the ordering it imposes is arbitrary, the trees can still isolate individual categories through repeated splits. However, label encoding is usually not appropriate for models that interpret the labels as ordered when the categories have no natural ordering.
  2. Target Encoding: Target encoding replaces each category with the mean of the target variable for that category. This method is useful for tree-based models because it captures the relationship between the category and the target variable. However, it is important to avoid overfitting when using target encoding, especially when dealing with rare categories.
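
A compact sketch of the four encodings discussed above, using pandas only (the color column and target values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "red"],
    "target": [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one binary column per category (linear-model friendly).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category with how often it occurs.
freq = df["color"].map(df["color"].value_counts(normalize=True))

# Label encoding: replace each category with an arbitrary integer code.
label = df["color"].astype("category").cat.codes

# Target encoding: replace each category with the mean target of that category
# (in practice, compute this on training folds only to avoid leakage).
target_enc = df["color"].map(df.groupby("color")["target"].mean())

print(pd.concat([df, one_hot, freq.rename("freq"), label.rename("label"),
                 target_enc.rename("target_enc")], axis=1))
```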

Once we’ve shared some thoughts during the exploratory data analysis and feature engineering stages, we’ll have a better sense of which model will best fit our problem. However, it’s also important to consider non-data-related factors that could influence our decision-making process when selecting a model. Let’s delve into these considerations further.

Deployment Considerations

As we consider our model of choice, it’s important to also take into account deployment considerations that are unrelated to the data itself. For instance, we may ask ourselves:

  • What is our data storage capacity? If our system has limited storage capacity, we may not be able to store large classification or regression models, or gigabytes of data for training.
  • Does our prediction need to be fast? In real-time applications, it’s crucial to generate predictions as quickly as possible. For instance, in autonomous driving, road signs must be classified rapidly to prevent accidents.

Usability Considerations

Now that we’ve covered different model families, discussed best practices for exploratory data analysis and feature engineering, and touched on deployment considerations that may favor one model over another, it’s time to delve into a crucial aspect that all data scientists should be aware of:

Explainability Vs. Predictability

One important factor to consider is the trade-off between explainability and predictability of a model. Explainability refers to the extent to which we can explain the model’s prediction. For example, a decision tree is a highly explainable model because we can easily determine why it made a particular prediction by following the sequence of splits (if-then rules). These types of models are often referred to as “white-box” models.
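
For example, the learned if-then rules of a small decision tree can be printed directly (a minimal sketch with scikit-learn’s iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# The entire model is a readable sequence of splits: a "white-box" prediction path.
print(export_text(tree, feature_names=list(data.feature_names)))
```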

On the other hand, predictability refers to the model’s ability to make accurate predictions, regardless of our ability to explain how it arrived at a specific prediction for a given input. For instance, a neural network is very complex, making it difficult to understand why it provides a particular prediction. These types of models are often referred to as “black-box” models.

This explainability vs. predictability trade-off is illustrated in the following graph for various machine learning algorithms.

Interpretability vs. Complexity trade-off

It’s essential to keep this trade-off in mind because there may be instances where the application requires some level of explainability in the selected model, which would mean opting for simpler models. Conversely, there may be situations where explainability is not necessary, and prioritizing a model with high predictability over a simpler one may be more advantageous. Ultimately, it’s crucial to strike the right balance between explainability and predictability, depending on the specific requirements of the application.
