Machine Learning Classification Algorithms

Yogesh V
15 min read · Dec 16, 2023


MLPClassifier

MLPClassifier is a class in the scikit-learn library for training a multi-layer perceptron (MLP) neural network for classification tasks. An MLP is a type of feedforward neural network that consists of multiple layers of neurons, with each layer fully connected to the next. The MLPClassifier class provides various options for defining the network architecture, training the model, and evaluating its performance. It can be used for a wide range of classification tasks, including image classification, text classification, and time series classification.

Architecture of MLPClassifier:

An MLPClassifier network consists of an input layer, one or more hidden layers, and an output layer. The input layer is responsible for receiving the inputs to the model, which are then passed on to the hidden layers. The hidden layers perform computations on the input data, and the output layer generates the final output of the model. In scikit-learn you only specify the hidden layers (through hidden_layer_sizes); the sizes of the input and output layers are inferred from the training data.

Training MLPClassifier:

Training MLPClassifier involves providing it with a set of labelled training data, and then iteratively adjusting the weights of the connections between the neurons in the network to minimise the error between the predicted outputs and the actual outputs. The gradients of the error with respect to the weights are computed by backpropagation, and an optimization algorithm such as stochastic gradient descent or Adam (the default solver in scikit-learn) uses those gradients to update the weights.

Hyperparameters of MLPClassifier:

There are several hyperparameters that can be tuned when using MLPClassifier. These include the number of hidden layers and the number of neurons in each of them (hidden_layer_sizes), the activation function shared by the hidden layers (activation), the solver, the initial learning rate (learning_rate_init), and the L2 regularisation strength (alpha).

Using MLPClassifier for Classification Tasks:

MLPClassifier can be used for a wide range of classification tasks, including image classification, text classification, and sentiment analysis. A classic example is handwritten digit classification, as in the MNIST dataset.
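A minimal sketch of the handwritten digit task is shown here. It assumes scikit-learn's bundled 8x8 digits dataset (load_digits) as a small stand-in for MNIST, and the hyperparameter values are illustrative choices rather than tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# 8x8 digit images bundled with scikit-learn (assumed stand-in for MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# MLPs usually train more reliably on standardized inputs.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

mlp = MLPClassifier(
    hidden_layer_sizes=(64,),   # one hidden layer with 64 neurons
    activation="relu",          # activation used by the hidden layer
    solver="adam",              # optimizer that applies the gradient updates
    alpha=1e-4,                 # L2 regularisation strength
    learning_rate_init=1e-3,    # initial learning rate
    max_iter=500,
    random_state=0,
)
mlp.fit(X_train_s, y_train)

test_accuracy = mlp.score(X_test_s, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The weight matrices learned for each layer are available afterwards in mlp.coefs_, one matrix per connection between consecutive layers.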

KNeighborsClassifier

KNeighborsClassifier is a supervised machine learning algorithm used for classification tasks. It is a non-parametric algorithm that falls under the category of lazy learning algorithms, as it doesn’t fit a model during training but rather stores the training data and uses it to make predictions at prediction time.

The basic idea behind KNeighborsClassifier is to find the K nearest neighbors of a given data point in the feature space and then predict the class label of that point based on the majority class among its K neighbors. The value of K is a hyperparameter that needs to be chosen by the user, and it can be any positive integer.

Architecture:

The architecture of KNeighborsClassifier is simple, as it only requires storing the training data in memory. During training, the algorithm doesn’t fit any model but rather stores the feature vectors and their corresponding labels, optionally indexing them in a tree structure (a KD-tree or ball tree) to speed up neighbor searches. During prediction time, the algorithm takes a new data point as input, finds its K nearest neighbors in the training set using some distance metric (such as Euclidean distance), and then predicts the class label based on the majority class among its K neighbors.

Training:

To train a KNeighborsClassifier, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then stores this data in memory and is ready to make predictions.
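As a rough illustration of the fit and predict steps, the sketch below trains on the Iris dataset (an assumption; any labelled numeric dataset would do) and then looks up the neighbors behind a single prediction to show the majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "Training" only stores (and possibly indexes) the data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Look up the 5 nearest training points for the first test sample.
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]

# The prediction is the most common label among those neighbors.
majority_label = np.bincount(neighbor_labels).argmax()
predicted_label = knn.predict(X_test[:1])[0]

print("neighbor labels:", neighbor_labels)
print("majority vote:", majority_label, "| predict():", predicted_label)
print("test accuracy:", knn.score(X_test, y_test))
```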

Hyperparameters:

KNeighborsClassifier has several hyperparameters that can be tuned to optimize performance for specific use cases:

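a) N neighbors (n_neighbors, K): The number of nearest neighbors to consider when making predictions (default is 5). Higher values of K smooth the decision boundary and make predictions less sensitive to noise, but values that are too high blur the boundaries between classes and add computation at prediction time.

b) Weights (weights): The weight given to each neighbor’s vote during prediction time. Options include ‘uniform’ (all neighbors have equal weight), ‘distance’ (neighbors are weighted by the inverse of their distance, so closer neighbors have more influence), and a user-defined callable that turns an array of distances into an array of weights.

c) P (p): The power parameter of the Minkowski distance. p = 1 gives Manhattan distance and p = 2 (the default) gives Euclidean distance.

d) Leaf size (leaf_size): The leaf size passed to the KD-tree or ball tree used for faster neighbor searches (default is 30). It affects the speed of building and querying the tree and the memory it needs, not the predictions themselves.

e) Metric (metric): The distance metric used to compute distances between points. The default is ‘minkowski’, which with p = 2 is Euclidean distance.

f) Algorithm (algorithm): The search algorithm used to find nearest neighbors. The default is ‘auto’, which picks a method based on the training data. The explicit options are ‘brute’ (exhaustive search), ‘kd_tree’ (fast in low dimensions), and ‘ball_tree’ (tends to cope better as the number of features grows).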
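One way to tune these settings is a cross-validated grid search. The sketch below is only an illustration: the Iris dataset and the grid values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Library defaults, for reference.
defaults = KNeighborsClassifier().get_params()
print({k: defaults[k] for k in ["n_neighbors", "weights", "p", "metric", "algorithm"]})

# Illustrative grid (assumed values, not recommendations).
param_grid = {
    "n_neighbors": [1, 5, 15],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```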

Classification Task:

KNeighborsClassifier can be used for classification tasks where the input features are numerical (categorical features need to be encoded as numbers first) and the output label is categorical. It can handle both linear and non-linear decision boundaries. Because predictions depend directly on distances, features should usually be scaled to comparable ranges, and performance tends to degrade on high-dimensional data, where the curse of dimensionality makes distances less informative.

SVC

Support Vector Classifier (SVC) is a supervised machine learning algorithm used for classification tasks. It is a discriminative classifier, which means it learns a decision boundary that separates the input data into different classes. The decision boundary is defined by a set of support vectors, which are the data points that lie closest to the boundary.

Architecture:

The architecture of SVC involves finding the optimal hyperplane that separates the input data into different classes. With a non-linear kernel function (such as polynomial, radial basis function (RBF), or sigmoid), the algorithm implicitly maps the input data into a higher-dimensional feature space without computing the mapping explicitly (the kernel trick); with a linear kernel it works in the original feature space. It then finds the hyperplane in this feature space that maximizes the margin between two classes, and scikit-learn handles multi-class problems by combining one-vs-one binary classifiers. The decision boundary is defined by this hyperplane, and the support vectors are the data points that lie on or inside the margin.

Training:

To train an SVC, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then finds the optimal hyperplane and support vectors by solving a quadratic optimization problem.
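The following sketch shows these steps on a synthetic two-class dataset. The make_moons data, the scaling step, and the parameter values are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (assumed for illustration).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Scaling first, because SVC is sensitive to feature ranges.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

# After fitting, the support vectors are exposed on the SVC step.
svc = model.named_steps["svc"]
print("support vectors per class:", svc.n_support_)
print("support vector array shape:", svc.support_vectors_.shape)
print("training accuracy:", model.score(X, y))
```

Only a subset of the training points ends up as support vectors; the remaining points do not affect the fitted decision boundary.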

Hyperparameters:

SVC has several hyperparameters that can be tuned to optimize performance for specific use cases:

a) C: The regularization parameter that controls the trade-off between model complexity and training error (default is 1.0). The strength of the regularization is inversely proportional to C. Higher values of C penalise training errors more heavily, which leads to a more complex model with fewer training errors but a higher risk of overfitting; lower values give a wider margin and a simpler model.

b) gamma: The kernel coefficient for the ‘rbf’, ‘poly’, and ‘sigmoid’ kernels, which controls how far the influence of a single training point reaches (default is ‘scale’). Larger values of gamma make that influence more local and lead to more complex decision boundaries, while smaller values lead to smoother, simpler boundaries.

c) kernel: The type of kernel function used to implicitly map the input data into a higher-dimensional feature space (default is ‘rbf’). Other options include ‘linear’, ‘poly’ (for polynomial kernels), and ‘sigmoid’ (for sigmoid kernels).

d) degree: The degree of the polynomial kernel function used for ‘poly’ kernels (default is 3); it is ignored by the other kernels. Higher values of degree lead to more complex decision boundaries, while lower values lead to simpler boundaries.
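To get a feel for gamma, the sketch below fits an RBF-kernel SVC with three assumed gamma values on noisy synthetic data and records training accuracy next to cross-validated accuracy. A large gap between the two is a common sign of overfitting.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Noisy synthetic data (assumed for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

results = {}
for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma)
    train_acc = clf.fit(X, y).score(X, y)            # accuracy on the data it was fit on
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # accuracy on held-out folds
    results[gamma] = (train_acc, cv_acc)
    print(f"gamma={gamma:>6}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```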

Classification Task:

SVC can be used for classification tasks where the input features are numerical (categorical features need to be encoded as numbers first) and the output label is categorical. It can handle both linear and non-linear decision boundaries and remains effective on high-dimensional data, although it is sensitive to feature scaling. However, SVC can be computationally expensive because training solves a quadratic optimization problem whose cost grows at least quadratically with the number of samples, so it may not be suitable for very large datasets or real-time applications.

GaussianProcessClassifier

The Gaussian Process Classifier is a probabilistic machine learning algorithm used for classification tasks. It places a Gaussian process prior, defined by a covariance function (kernel), on a latent function of the input features, and passes that latent function through a link function (the logistic function in scikit-learn) to obtain class probabilities. The prior is updated based on the training data using Bayes’ theorem; because the exact posterior is not Gaussian for classification, scikit-learn approximates it with the Laplace approximation. The algorithm can handle non-linear decision boundaries and is particularly useful for small to medium-sized datasets, but it can be computationally expensive during both the training and prediction phases.

Architecture:

The architecture of GPC involves defining a prior distribution over a latent function through a mean function (fixed at zero in scikit-learn) and a covariance function, and then updating it based on the training data using Bayes’ theorem. The algorithm then predicts the probability distribution of the output label for new input features by propagating the uncertainty in the latent function through the link function.

Training:

To train a GPC, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then optimizes the kernel hyperparameters by maximizing the approximate log marginal likelihood, computes the posterior approximation on the training data, and returns the fitted classifier itself.
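A small sketch of this workflow follows. The synthetic make_moons data and the starting kernel 1.0 * RBF(1.0) are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Small synthetic dataset (assumed for illustration).
X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

# Starting kernel: amplitude * RBF(length scale); both are tuned during fit.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
fitted = gpc.fit(X, y)

# Class probabilities for the first five samples.
proba = gpc.predict_proba(X[:5])

print("fit returns the classifier itself:", fitted is gpc)
print("kernel after hyperparameter optimization:", gpc.kernel_)
print(proba)
```

The kernel_ attribute holds the kernel with the hyperparameters found during fitting, which can differ from the starting values passed to the constructor.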

Hyperparameters:

GPC has several hyperparameters that can be tuned to optimize performance for specific use cases:

a) Mean Function: The mean function is the prior expectation of the latent function. General Gaussian process frameworks allow constant, linear, or polynomial mean functions chosen from domain knowledge, but scikit-learn’s GaussianProcessClassifier fixes the prior mean at zero, so it is not a tunable parameter there.

b) Covariance Function (kernel): The covariance function encodes the prior assumption about how similar the latent function values are at two inputs. It can be chosen based on domain knowledge or set to a simple kernel such as squared exponential (RBF) or Matern; the scikit-learn default is 1.0 * RBF(1.0). The hyperparameters of the covariance function include lengthscale (which controls how quickly correlation decays with distance) and amplitude (which controls how strongly the latent function varies). These are optimized during fit() by default, and n_restarts_optimizer controls how many extra random restarts the optimizer makes.

c) Acquisition Function: An acquisition function is used to select new inputs to label or evaluate during active learning or Bayesian optimization, which can improve model performance by reducing uncertainty in the predictions. Some options include expected improvement (EI), probability of improvement (PI), and upper confidence bound (UCB). It belongs to the loop built around a Gaussian process model and is not a parameter of GaussianProcessClassifier itself.

Classification Task:

GPC can be used for classification tasks where the input features are numerical (categorical features need to be encoded as numbers first) and the output label is categorical. It can handle non-linear decision boundaries and is particularly useful for small to medium-sized datasets where probabilistic predictions are wanted. However, exact Gaussian process inference scales roughly cubically with the number of training samples, so it may not be suitable for very large datasets or real-time applications.

DecisionTreeClassifier

DecisionTreeClassifier is a supervised learning algorithm used for classification tasks. It is a decision tree model that recursively partitions the feature space into smaller subspaces based on the value of a feature at each node, in order to predict the output label.

Architecture:

The architecture of DecisionTreeClassifier involves creating a tree-like model where each internal node represents a test on an input feature, each branch represents the outcome of the test, and each leaf node represents a class label. The algorithm predicts the class label for a new sample by traversing the tree from the root node to a leaf node, following at each internal node the branch that matches the sample’s feature value.

Training:

To train a DecisionTreeClassifier, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then recursively partitions the feature space into smaller subspaces based on the values of the input features at each node until a stopping criterion is met, such as reaching a minimum number of samples in a leaf node or maximum depth of the tree.
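As an illustration of the fitted structure, this sketch trains a shallow tree on the Iris dataset (an assumed example dataset) and prints the learned tests with export_text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth is an assumed value).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on one feature; "class:" lines are leaves.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
print("depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())
```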

Hyperparameters:

DecisionTreeClassifier has several hyperparameters that can be tuned to optimize performance for specific use cases:

a) Max Depth (max_depth): The maximum depth of the tree, which controls how deep the tree can grow during training. A higher max depth may result in overfitting, while a lower max depth may result in underfitting.

b) Min Samples Split (min_samples_split): The minimum number of samples required to split an internal node into child nodes during training. A higher min samples split may result in fewer splits and simpler trees, while a lower min samples split may result in more splits and more complex trees.

c) Min Samples Leaf (min_samples_leaf): The minimum number of samples required at a leaf node during training. A higher min samples leaf may result in fewer leaf nodes and simpler trees, while a lower min samples leaf may result in more leaf nodes and more complex trees.

d) Random State (random_state): A seed value for the randomness used during training, such as the order in which features are considered at each split. Fixing it to the same value makes results reproducible across runs; different seeds can produce different trees.
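The effect of max depth and of a fixed seed can be checked with a short experiment. The breast cancer dataset and the depth values below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Compare a stump, a shallow tree, and an unrestricted tree (max_depth=None).
scores = {}
for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")

# The same seed on the same data gives the same predictions.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
same_predictions = bool((tree_a.predict(X_test) == tree_b.predict(X_test)).all())
print("same seed, same predictions:", same_predictions)
```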

Classification Task:

DecisionTreeClassifier can be used for classification tasks where the output label is categorical. The scikit-learn implementation expects numerical input, so categorical features need to be encoded first. It can handle both linear and non-linear decision boundaries, needs little data preparation such as feature scaling, and produces rules that are easy to inspect. However, a single tree overfits easily on noisy or irrelevant features and can change substantially with small changes in the training data, so it usually benefits from depth limits, pruning, or use within an ensemble.

RandomForestClassifier

RandomForestClassifier is an ensemble learning algorithm used for classification tasks. It is a combination of multiple decision trees, where each tree is trained on a bootstrap sample of the training data and considers a random subset of the features at each split. The final prediction is made by combining the predictions from all the trees in the forest: the original algorithm takes a majority vote, while scikit-learn averages the trees’ predicted class probabilities and picks the most probable class.

Architecture:

The architecture of RandomForestClassifier involves creating multiple decision trees that are trained independently of each other, each on its own bootstrap sample and with a random subset of the features considered at each split. This randomness makes the trees differ from one another, so their individual errors tend to cancel out when their predictions are combined. The number of trees in the forest is a hyperparameter that can be tuned to optimize performance for specific use cases.

Training:

To train a RandomForestClassifier, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then grows each tree by recursively partitioning the feature space into smaller subspaces based on the values of a random subset of features at each node until a stopping criterion is met, such as reaching a minimum number of samples in a leaf node or maximum depth of the tree.
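The sketch below fits a forest and then recomputes its predicted probabilities by hand from the individual trees, to show how scikit-learn combines them. The breast cancer dataset and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

# Average the class probabilities predicted by the individual trees.
tree_probas = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0
)
forest_probas = forest.predict_proba(X_test)

print("manual average matches forest:", np.allclose(tree_probas, forest_probas))
print("test accuracy:", forest.score(X_test, y_test))
```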

Hyperparameters:

RandomForestClassifier has several hyperparameters that can be tuned to optimize performance for specific use cases:

a) N_estimators (n_estimators): The number of decision trees in the forest (default is 100). A higher n_estimators usually gives more stable and accurate predictions, with diminishing returns, but also increases computational requirements and training time.

b) Max Depth (max_depth): The maximum depth of each decision tree, which controls how deep each tree can grow during training. A higher max depth may result in more complex trees and better accuracy, but may also increase overfitting and computational requirements.

c) Min Samples Split (min_samples_split): The minimum number of samples required to split an internal node into child nodes during training. A higher min samples split may result in fewer splits and simpler trees, while a lower min samples split may result in more splits and more complex trees.

d) Min Samples Leaf (min_samples_leaf): The minimum number of samples required at a leaf node during training. A higher min samples leaf may result in fewer leaf nodes and simpler trees, while a lower min samples leaf may result in more leaf nodes and more complex trees.

e) Random State (random_state): A seed value that controls the bootstrap sampling and the random choice of features at each split. Fixing it to the same value makes results reproducible across runs; different seeds can produce different forests.

Classification Task:

RandomForestClassifier can be used for classification tasks where the output label is categorical. The scikit-learn implementation expects numerical input, so categorical features need to be encoded first. It can handle both linear and non-linear decision boundaries and is generally much less prone to overfitting than a single decision tree. However, a forest is harder to interpret than a single tree, and a large number of deep trees increases memory use and prediction time, so it may not be suitable for real-time applications with tight latency requirements.

AdaBoostClassifier

AdaBoostClassifier is an ensemble learning algorithm used for classification tasks. It is a combination of multiple weak classifiers, where each classifier is trained on the same input features but with a different weight assigned to each sample based on its misclassification error in the previous iteration. The final prediction is made by taking the weighted majority vote of the predictions from all the classifiers.

Architecture:

The architecture of AdaBoostClassifier involves creating a sequence of weak classifiers (by default in scikit-learn, decision trees of depth 1, also called decision stumps), where each classifier is trained on the same input features but with a different weight assigned to each sample based on its misclassification error in the previous iteration. The weights are updated iteratively to focus more on misclassified samples in subsequent iterations. The final prediction is made by taking the weighted majority vote of the predictions from all the classifiers.

Training:

To train an AdaBoostClassifier, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then iteratively trains each weak classifier on the weighted training data, where the weights are updated based on the misclassification error of the previous iteration.
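A minimal sketch of this training loop from the user's side is shown below. The breast cancer dataset and the parameter values are assumptions for illustration, and the weak learner is left at the library default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)

# Inspect the first weak learner and the weights given to the first few learners.
first_learner = ada.estimators_[0]
print("weak learner:", type(first_learner).__name__, "| depth:", first_learner.get_depth())
print("weights of first learners:", ada.estimator_weights_[:5])
print("test accuracy:", ada.score(X_test, y_test))
```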

Hyperparameters:

AdaBoostClassifier has several hyperparameters that can be tuned to optimize performance for specific use cases:

a) N_estimators (n_estimators): The maximum number of weak classifiers in the ensemble; boosting stops early if the training data is fit perfectly. A higher n_estimators may result in more accurate predictions, but may also increase computational requirements and training time.

b) Learning Rate (learning_rate): The weight applied to each classifier at each boosting iteration, which controls how much each classifier contributes to the ensemble. A higher learning rate may result in faster convergence but may also result in overfitting, while a lower learning rate may result in slower convergence and usually needs more estimators, but may also result in better generalization performance.

c) Loss Function: AdaBoostClassifier does not expose a loss parameter. The AdaBoost classification algorithm minimises an exponential loss by construction; the loss option (linear, square, or exponential) belongs to AdaBoostRegressor. Depending on the scikit-learn version, an algorithm parameter selects the boosting variant (SAMME or SAMME.R).

d) Random State (random_state): A seed value passed to each weak classifier at each boosting iteration. Fixing it to the same value makes results reproducible across runs; different seeds can produce different ensembles.
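The interaction between the number of estimators and the learning rate can be observed with staged_score, which reports accuracy after each boosting iteration. The dataset and the two learning rates below are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

staged = {}
for lr in [0.1, 1.0]:
    ada = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=0)
    ada.fit(X_train, y_train)
    # Test accuracy after 1, 2, ..., n boosting iterations.
    staged[lr] = list(ada.staged_score(X_test, y_test))
    print(
        f"learning_rate={lr}: after 10 rounds={staged[lr][9]:.3f}, "
        f"final={staged[lr][-1]:.3f}"
    )
```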

Classification Task:

AdaBoostClassifier can be used for classification tasks where the output label is categorical. The scikit-learn implementation expects numerical input, so categorical features need to be encoded first. It can handle both linear and non-linear decision boundaries and often performs well with very simple weak classifiers. However, because each iteration puts more weight on misclassified samples, AdaBoostClassifier is sensitive to noisy data and outliers, and because the classifiers are trained one after another, training cannot be parallelised across estimators.

GaussianNB

GaussianNB (Gaussian Naive Bayes) is a probabilistic classification algorithm based on Bayes’ theorem, which assumes that the features are conditionally independent given the class and that each feature follows a Gaussian distribution within each class. It calculates the posterior probability of each class given the input features and predicts the class with the highest probability as the output label. It has very few hyperparameters to tune and is suitable for binary and multi-class classification tasks with continuous, roughly Gaussian distributed features.

Architecture:

The architecture of GaussianNB involves calculating the probability of each class given the input features using the Bayes theorem and assuming that each feature follows a Gaussian distribution. The algorithm then predicts the class with the highest posterior probability as the output label.

Training:

To train a GaussianNB model, you first need to create an instance of the class and then fit it on the training data using the fit() method. The fit() method takes two arguments: X (the input features) and y (the corresponding labels). The algorithm then calculates the prior probability of each class and the mean and variance of each feature for each class, and uses them to calculate the posterior probabilities of each class given the input features.
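The quantities estimated by fit() can be inspected directly. This sketch uses the Iris dataset (an assumed example) and compares the learned per-class means and class priors with a direct computation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# Per-class feature means and class frequencies computed by hand.
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])
manual_priors = np.array([(y == c).mean() for c in gnb.classes_])

print("means match:", np.allclose(gnb.theta_, manual_means))
print("priors match:", np.allclose(gnb.class_prior_, manual_priors))

# Posterior class probabilities for the first sample.
posterior = gnb.predict_proba(X[:1])
print("posterior:", posterior.round(3), "| predicted class:", gnb.predict(X[:1])[0])
```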

Hyperparameters:

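GaussianNB has very few hyperparameters, because the mean and variance of each feature for each class are estimated directly from the training data. The two it does expose are priors (fixed class prior probabilities, used instead of the class frequencies in the data) and var_smoothing (a small fraction of the largest feature variance that is added to all variances for numerical stability, default 1e-9). You can use techniques like GridSearchCV and RandomizedSearchCV to tune var_smoothing, and a preprocessing transform can help when features are far from Gaussian.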
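GaussianNB in scikit-learn does accept a var_smoothing parameter, and it can be tuned with GridSearchCV as sketched below. The breast cancer dataset and the search range are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Log-spaced candidates around the default of 1e-9 (assumed search range).
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}
search = GridSearchCV(GaussianNB(), param_grid, cv=5)
search.fit(X, y)

print("best var_smoothing:", search.best_params_["var_smoothing"])
print("best cross-validated accuracy:", round(search.best_score_, 3))
```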

Classification Task:

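GaussianNB is suitable for binary and multi-class classification tasks where the features are continuous, roughly independent given the class, and approximately Gaussian within each class. It is fast to train and often serves as a baseline for problems such as credit scoring; for word-count features in spam filtering or sentiment analysis, other Naive Bayes variants such as MultinomialNB are the more common choice. However, it may not be suitable for strongly correlated features or clearly non-Gaussian distributions, as it assumes that each feature is independent of the others given the class and follows a Gaussian distribution.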
