ML Interview Prep: Part II: Traditional ML Algorithms

Jeniya Tabassum
9 min read · Jul 6, 2023


This is a curated list of algorithms for training machine learning models that are often asked about during MLE/MLS interviews.

[This is Part II of the three-part ML interview refresher]

SVM

Imagine you have a dataset with two classes, represented by points in a two-dimensional space. The goal is to find a decision boundary that separates the two classes as best as possible.

SVM workflow:

1. Data Representation:
Let’s say we have a training dataset consisting of input features X and corresponding class labels y. Each training example is represented as (xᵢ, yᵢ), where xᵢ is the input feature vector and yᵢ is the corresponding class label (+1 or -1).

2. Margin and Decision Boundary:
The SVM algorithm aims to find a hyperplane that separates the two classes with the largest possible margin. The margin is the distance between the hyperplane and the nearest data points from each class. The hyperplane is defined by the equation:
wᵀx + b = 0
Here, w is the normal vector to the hyperplane, x is the input feature vector, and b is the bias term.

3. Maximum Margin:
SVM selects the hyperplane that maximizes the margin while still correctly classifying the training examples. The margin is computed as the distance between the hyperplane and the closest data points from each class. These closest points are called support vectors, hence the name “Support Vector Machines.”

4. Formulating the Optimization Problem:
To find the optimal hyperplane, SVM formulates an optimization problem. It aims to minimize the norm of the weight vector (||w||) subject to the constraint that all training examples are correctly classified according to their class labels:
minimize: (1/2) * ||w||²
subject to: yᵢ * (wᵀxᵢ + b) ≥ 1
for all training examples (xᵢ, yᵢ)

The objective function (1/2) * ||w||² is the margin-maximization term: the margin width equals 2 / ||w||, so minimizing ||w|| maximizes the margin. The inequality constraints ensure that every training example is correctly classified with a functional margin of at least 1.
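
As a rough illustration (not from the original article), this primal problem can be handed directly to a convex solver. The sketch below uses CVXPY on a tiny separable toy dataset; the library choice, data, and variable names are assumptions made purely for illustration.

```python
# Hedged sketch: solving the hard-margin SVM primal with CVXPY (illustrative toy data).
import cvxpy as cp
import numpy as np

# Tiny linearly separable dataset with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
y = np.array([-1, -1, 1, 1])

w = cp.Variable(2)   # normal vector of the hyperplane
b = cp.Variable()    # bias term

# minimize (1/2)||w||²  subject to  yᵢ(wᵀxᵢ + b) ≥ 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w:", w.value, "b:", b.value)
```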

5. Solving the Optimization Problem:
The optimization problem is typically solved with quadratic programming. The Lagrange multiplier method is used to derive the dual problem, which is itself a quadratic program that can be solved efficiently; in the dual, the training data appear only through inner products, which is what later makes the kernel trick possible.

6. Support Vectors and Decision Rule:
The training examples that lie on the margin or violate the margin are the support vectors. These support vectors play a crucial role in defining the decision boundary.

The decision rule for classifying a new test example x can be derived from the learned weights and biases as follows:
If wᵀx + b > 0, predict class +1.
If wᵀx + b < 0, predict class -1.
The sign of the dot product wᵀx + b determines the class label.

SVM finds an optimal hyperplane that maximizes the margin and separates the classes, with the support vectors playing a crucial role in the decision boundary. The optimization problem ensures correct classification while minimizing the norm of the weight vector.
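
Below is a minimal sketch of the same idea with scikit-learn's SVC and a linear kernel; the toy data and hyperparameters are purely illustrative.

```python
# Hedged sketch: linear SVM with scikit-learn on illustrative 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# w and b define the separating hyperplane wᵀx + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("support vectors:", clf.support_vectors_)

# The decision rule is the sign of wᵀx + b.
x_new = np.array([[4.0, 4.0]])
print("decision value:", clf.decision_function(x_new))
print("predicted class:", clf.predict(x_new))
```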

Crammer-Singer SVM

Crammer-Singer SVM is a multiclass extension of Support Vector Machines (SVMs) that directly formulates the optimization problem for multiclass classification. It avoids the need for transforming the problem into multiple binary classification tasks using the one-vs-one or one-vs-rest approaches. Instead, it optimizes a joint objective function to find a single decision boundary for all classes.

Crammer-Singer workflow:

1. Data Representation:
Let’s consider a multiclass classification problem with K classes. We have a training dataset consisting of input features X and corresponding class labels y. Each training example is represented as (xᵢ, yᵢ), where xᵢ is the input feature vector and yᵢ ∈ {1, …, K} is the corresponding class label (equivalently, the labels can be one-hot encoded vectors with a single element set to 1 for the true class).

2. Objective Function:
Crammer-Singer SVM aims to minimize a joint objective function that incorporates both the margin maximization and classification accuracy. The objective function is defined as:
minimize: (1/2) Σₖ ||wₖ||² + C Σᵢ ξᵢ
subject to: w_yᵢᵀφ(xᵢ) - wₖᵀφ(xᵢ) ≥ 1 - ξᵢ for all k ≠ yᵢ

  • ξᵢ ≥ 0 for all training examples (xᵢ, yᵢ)
  • wₖ is the weight vector for class k (and w collects all of them)
  • φ(x) is the feature map that implicitly maps the input features x to a higher-dimensional space
  • C is the regularization parameter that controls the trade-off between margin maximization and training error
  • ξᵢ is a slack variable that allows some training error on example i
  • The first term in the objective function, (1/2) Σₖ ||wₖ||², represents the margin-maximization objective, analogous to the standard SVM.
  • The second term, C Σᵢ ξᵢ, penalizes the training errors.
  • The constraint ensures that the score of the true class, w_yᵢᵀφ(xᵢ), exceeds the score of every other class, wₖᵀφ(xᵢ) for k ≠ yᵢ, by a margin of at least 1 - ξᵢ.

3. Optimization:
The objective function is solved using optimization techniques such as quadratic programming or gradient-based methods. The optimization process aims to find the optimal values for the weight vector w and slack variables ξ that minimize the objective function while satisfying the constraints.

4. Classification:
To classify a new test example x, the score wₖᵀφ(x) is computed for each class k using the weight vectors learned during optimization, and the class with the highest score is selected as the predicted class label.

Crammer-Singer SVM directly learns a single decision boundary for all classes, considering the joint optimization problem. It aims to maximize the margin between classes while ensuring correct classification. It can handle multiclass classification efficiently and avoids the need for transforming the problem into multiple binary tasks.

It’s worth noting that Crammer-Singer SVM is just one approach for multiclass SVM formulation, and other techniques like one-vs-one and one-vs-rest are also available. The choice of formulation depends on the specific problem and requirements.
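
For reference, scikit-learn's LinearSVC exposes this formulation via multi_class="crammer_singer". The sketch below uses the Iris dataset; the dataset and hyperparameters are illustrative choices, not part of the article.

```python
# Hedged sketch: Crammer-Singer multiclass SVM via scikit-learn's LinearSVC.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# multi_class="crammer_singer" optimizes the joint multiclass objective directly.
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
clf.fit(X_train, y_train)

print("per-class weights shape:", clf.coef_.shape)   # (n_classes, n_features)
print("test accuracy:", clf.score(X_test, y_test))
```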

Kernel trick in SVM

The kernel trick is a technique used in machine learning, particularly in Support Vector Machines (SVMs), to implicitly transform the input features into a higher-dimensional feature space without explicitly calculating the transformed features. It allows SVMs to efficiently handle non-linear decision boundaries and complex data distributions.

The concept of the kernel trick can be explained as follows:

1. Linear SVM and Non-Linear Data:
SVMs with linear kernels can effectively separate data when the classes are linearly separable. However, many real-world datasets are not linearly separable. In such cases, a non-linear decision boundary is required to accurately classify the data.

2. Mapping to a Higher-Dimensional Space:
The kernel trick allows us to implicitly map the original input features into a higher-dimensional feature space, where the classes may become linearly separable. This transformation is done by using a kernel function that calculates the similarity or inner product between pairs of data points in the higher-dimensional space.

3. Kernel Function:
A kernel function measures the similarity between two samples without explicitly computing the transformed features in the higher-dimensional space. It directly operates on the original input feature space, making the computation more efficient. Commonly used kernel functions include:
* Linear Kernel: K(x, y) = xᵀy
* Polynomial Kernel: K(x, y) = (αxᵀy + c)ᵈ
* Gaussian (RBF) Kernel: K(x, y) = exp(-γ||x-y||²)

Here, x and y are input feature vectors, and α, c, d, and γ are kernel parameters.
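
The three kernels above can be written directly in NumPy; the parameter values below are illustrative defaults, not prescribed by the article.

```python
# Hedged sketch: the linear, polynomial, and Gaussian (RBF) kernels in NumPy.
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, alpha=1.0, c=1.0, d=3):
    return (alpha * (x @ y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```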

4. Training and Classification:
Using the kernel trick, SVMs can be trained and applied in the original input feature space while implicitly operating in the higher-dimensional feature space. The SVM optimization problem is formulated using the kernel function, enabling the learning of complex, non-linear decision boundaries.

5. Benefits of the Kernel Trick:
The kernel trick allows SVMs to capture complex patterns and non-linear relationships in the data without explicitly calculating the transformed features. It offers the following benefits:

- Enables handling non-linear decision boundaries.
- Avoids the computational cost of explicitly transforming features into a higher-dimensional space.
- Can work with infinite-dimensional feature spaces (e.g., Gaussian kernel).
- Supports efficient training and inference.

The kernel trick is a powerful tool in SVMs as it allows for the efficient handling of non-linear and complex data distributions. It allows SVMs to implicitly work in higher-dimensional feature spaces, opening up the possibility of accurately classifying data that cannot be linearly separated in the original input feature space.
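
As a small illustration (the dataset, parameters, and accuracy comparison are assumptions made for this sketch, not claims from the article), comparing a linear kernel with an RBF kernel on data that is not linearly separable makes the benefit concrete:

```python
# Hedged sketch: linear vs. RBF kernel on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))
```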

Logistic Regression

Logistic Regression is a popular algorithm used for binary classification tasks. It models the relationship between the input features and the probability of the target variable belonging to a particular class. The algorithm estimates the parameters of a logistic function to make predictions.

Logistic Regression workflow:

1. Data Representation:
The training data consists of input feature vectors X and corresponding binary class labels y. Each training example is represented as (xᵢ, yᵢ), where xᵢ is the input feature vector, and yᵢ is the binary class label (0 or 1).

2. Logistic Function (Sigmoid):
Logistic Regression uses the logistic function, also known as the sigmoid function, to model the probability of the target variable being in class 1.

The sigmoid function is defined as:
σ(z) = 1 / (1 + exp(-z))
where z is the linear combination of the input features and the corresponding weights.

3. Hypothesis Function:
The hypothesis function of Logistic Regression is given by:
hθ(x) = σ(θᵀx)
Here, hθ(x) represents the predicted probability of the target variable being in class 1 for the input feature vector x, θ is the vector of weights (parameters), and x is the input feature vector.
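
In code, the hypothesis is just the sigmoid applied to a dot product; the toy weights and features below are illustrative values.

```python
# Hedged sketch: hθ(x) = σ(θᵀx) in NumPy with illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.25, 1.0])   # weight vector θ (illustrative)
x = np.array([1.0, 2.0, 0.5])         # input feature vector x (illustrative)

prob_class_1 = sigmoid(theta @ x)     # predicted probability of class 1
print(prob_class_1)
```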

4. Cost Function:
The cost function in Logistic Regression is the logarithmic loss function (also called the cross-entropy loss function) and is defined as:
J(θ) = -(1/m) * Σ [yᵢ * log(hθ(xᵢ)) + (1-yᵢ) * log(1 - hθ(xᵢ))]
where m is the number of training examples, yᵢ is the true class label (0 or 1) for the i-th example, and hθ(xᵢ) is the predicted probability for the i-th example.
The cost function penalizes large errors in predictions and aims to minimize the discrepancy between the predicted probabilities and the true labels.
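
A hedged NumPy version of this cost is shown below; clipping is added to avoid log(0), and the example labels and probabilities are illustrative.

```python
# Hedged sketch: cross-entropy (log) loss J(θ) in NumPy.
import numpy as np

def cross_entropy_cost(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0), then average the per-example log loss.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])            # true labels yᵢ
y_pred = np.array([0.9, 0.2, 0.7, 0.6])    # predicted probabilities hθ(xᵢ)
print(cross_entropy_cost(y_true, y_pred))
```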

5. Optimization:
The goal is to find the optimal values of the weight vector θ that minimize the cost function. This is typically achieved using optimization algorithms such as gradient descent. The gradient descent algorithm iteratively updates the weights to find the minimum of the cost function.

6. Decision Boundary and Prediction:
Once the weights are optimized, the decision boundary is determined based on a probability threshold (usually 0.5). If the predicted probability is above the threshold, the example is classified as class 1; otherwise, it is classified as class 0.

To make predictions on new examples, the input features are passed through the hypothesis function, and the resulting probability is compared to the threshold to determine the predicted class.
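
The sketch below ties steps 5 and 6 together with batch gradient descent in NumPy; the toy data, learning rate, and iteration count are assumptions made for illustration.

```python
# Hedged sketch: logistic regression trained with batch gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the first column of 1s absorbs the bias term into θ.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
lr, n_iters = 0.1, 5000
m = len(y)

for _ in range(n_iters):
    h = sigmoid(X @ theta)          # predicted probabilities hθ(xᵢ)
    grad = (X.T @ (h - y)) / m      # gradient of the cross-entropy cost J(θ)
    theta -= lr * grad              # gradient descent update

# Classify with a 0.5 probability threshold.
probs = sigmoid(X @ theta)
preds = (probs >= 0.5).astype(int)
print("theta:", theta)
print("probabilities:", probs)
print("predictions:", preds)
```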

Logistic Regression is a straightforward and interpretable algorithm that works well when the decision boundary between classes is relatively linear. It is widely used in various applications, and extensions such as multinomial logistic regression can be used for multiclass classification.

Logistic Regression can also be extended to handle multiclass classification problems by using one of the following approaches: one-vs-rest (one-vs-all) or multinomial logistic regression (softmax regression).

One-vs-Rest (One-vs-All) approach for multi-class classification

In the one-vs-rest approach, a separate logistic regression model is trained for each class, treating it as the positive class, while considering the remaining classes as the negative class. The steps involved in this approach are as follows:

Data Representation: The training data consists of input feature vectors X and corresponding class labels y, where y can take on multiple discrete values representing different classes.

Model Training: For each class, a binary logistic regression model is trained independently. During training, the positive class is set to the current class being considered, while all other classes are treated as the negative class.

Prediction: To make predictions, each binary logistic regression model is used to predict the probability of the input belonging to the positive class. The class with the highest probability is selected as the predicted class label.

The one-vs-rest approach converts the multiclass classification problem into multiple binary classification problems. It can handle imbalanced class distributions and allows for interpretable class-specific decision boundaries. However, it may struggle with overlapping or highly correlated classes.
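
A short sketch of this approach using scikit-learn's OneVsRestClassifier wrapped around LogisticRegression; the Iris dataset and settings are illustrative choices.

```python
# Hedged sketch: one-vs-rest logistic regression with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression model is fit per class (class k vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print("number of binary models:", len(ovr.estimators_))
print("predicted classes for first 5 rows:", ovr.predict(X[:5]))
```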

Multinomial Logistic Regression (Softmax Regression) for multi-class classification

Multinomial logistic regression, also known as softmax regression, directly models the probabilities of the input belonging to each class. It generalizes logistic regression to handle multiple classes simultaneously. The steps involved in this approach are as follows:

Data Representation: The training data consists of input feature vectors X and corresponding class labels y, where y can take on multiple discrete values representing different classes.

Model Training: The softmax function is used as the activation function. The model learns a set of weights for each class and assigns a probability to each class. The weights are trained to maximize the likelihood of the observed class labels.

Prediction: To make predictions, the input features are passed through the trained model, resulting in class probabilities for each input. The class with the highest probability is selected as the predicted class label.

Multinomial logistic regression directly models the joint probability distribution over all classes, considering their relationships. It provides a unified approach for multiclass classification and can capture complex interactions between classes. However, it may be sensitive to imbalanced class distributions and requires more computational resources compared to the one-vs-rest approach.
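
The NumPy sketch below shows the softmax step that turns per-class scores into probabilities; the weight matrix and input are illustrative values (scikit-learn's LogisticRegression can fit a multinomial model of this form when given multiclass labels).

```python
# Hedged sketch: softmax over per-class scores wₖᵀx.
import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# One weight vector per class (K = 3 classes, 3 features), illustrative values.
W = np.array([[ 0.2, -0.1,  0.4],
              [ 0.0,  0.3, -0.2],
              [-0.5,  0.1,  0.1]])
x = np.array([1.0, 2.0, 0.5])

probs = softmax(W @ x)                    # probability of each class
print("class probabilities:", probs)
print("predicted class:", np.argmax(probs))
```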

Both approaches extend logistic regression for multiclass classification tasks, with each having its advantages and limitations. The choice between them depends on the specific problem, the characteristics of the data, and the interpretability requirements.

Part I: Fundamentals
Part III: Neural Network
