ML Interview Prep: Part-I: Fundamentals

Jeniya Tabassum
Jul 6, 2023


This curated list covers fundamental concepts in machine learning that are often asked about during MLE/MLS interviews.

[This is Part I of a three-part ML interview refresher.]

Supervised vs Unsupervised vs Reinforcement

Supervised learning uses labeled data to learn from known examples, unsupervised learning discovers patterns and structures in unlabeled data, and reinforcement learning focuses on learning through interactions with an environment to maximize a reward. Each of these learning approaches serves different purposes and has various applications in machine learning and artificial intelligence.

Supervised learning:

Supervised learning is a machine learning approach where an algorithm learns from labeled training data. In this type of learning, the training dataset consists of input features (also known as independent variables) and their corresponding output labels (also known as dependent variables or targets). The goal is to learn a function that maps the input features to the output labels accurately. The algorithm is “supervised” because it is provided with the correct answers during training, allowing it to make predictions or classifications on unseen data. Examples of supervised learning algorithms include: linear regression, decision trees, support vector machines.

Example: Suppose you have a dataset of emails, and each email is labeled as either “spam” or “not spam.” In supervised learning, you would train a model using this labeled dataset. The input features could be various attributes of the email (e.g., sender, subject, body), and the output labels would be “spam” or “not spam.” The goal is to learn a function that can accurately classify new, unseen emails as spam or not spam. The model learns from the labeled examples in the training data to make predictions on new, unlabeled emails. Algorithms like logistic regression or random forests can be used in supervised learning for this task.
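
A minimal sketch of this spam-classification setup, assuming scikit-learn is available; the four example emails and their labels are invented purely for illustration:

```python
# Minimal supervised-learning sketch for the spam example (scikit-learn).
# The tiny inline dataset is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now",          # spam
    "Meeting agenda for Monday",     # not spam
    "Claim your free reward today",  # spam
    "Lunch tomorrow with the team",  # not spam
]
labels = ["spam", "not spam", "spam", "not spam"]

# Turn raw text into bag-of-words features, then fit a discriminative classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["free prize waiting for you"]))  # likely ['spam']
```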

Unsupervised Learning:

Unsupervised learning, on the other hand, deals with unlabeled data, where the training dataset contains only input features without any corresponding output labels. The objective of unsupervised learning is to discover meaningful patterns, structures, or relationships within the data. Algorithms in unsupervised learning aim to cluster similar data points together or find hidden representations that capture the underlying characteristics of the data. Unlike supervised learning, there is no ground truth or correct answers provided, and the algorithm explores the data without prior knowledge. Common unsupervised learning algorithms include: clustering methods like k-means and hierarchical clustering, dimensionality reduction techniques such as principal component analysis (PCA) and t-SNE.

Example: Let’s consider an example where you have a dataset of customer shopping behavior. This dataset consists of customer attributes like age, gender, and purchase history but lacks any specific labels. In unsupervised learning, you can use clustering algorithms to group similar customers together based on their attributes. For instance, using an algorithm like k-means clustering, the data might naturally form distinct clusters, suggesting different customer segments. This allows you to gain insights into the underlying patterns and structure of the data without any predefined labels.
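
A sketch of this customer-segmentation idea with k-means, assuming scikit-learn and NumPy; the two synthetic customer groups (age, annual spend) are fabricated so the clusters are easy to see:

```python
# Unsupervised-learning sketch: clustering synthetic customer data with k-means.
# Features (age, annual spend) are fabricated for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
young_low  = rng.normal([25, 200], [3, 50], size=(50, 2))
older_high = rng.normal([55, 900], [5, 80], size=(50, 2))
X = np.vstack([young_low, older_high])

# Scale features so age and spend contribute comparably to distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:5])        # cluster assignment per customer
print(kmeans.cluster_centers_)   # the two discovered segment centers
```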

Reinforcement Learning:

Reinforcement learning is a type of machine learning that deals with an agent learning to interact with an environment to maximize a cumulative reward. It is based on the concept of learning through trial and error. In reinforcement learning, the agent learns by taking actions in the environment and receiving feedback in the form of rewards or penalties. The goal of the agent is to learn an optimal policy — a set of actions that maximize the cumulative reward over time. Reinforcement learning is commonly used in scenarios where an agent needs to make sequential decisions, such as in robotics, game playing, and autonomous driving. Algorithms in reinforcement learning include: Q-learning, policy gradient methods, and deep reinforcement learning with neural networks.

Example: Imagine an autonomous driving scenario where an agent (e.g., a self-driving car) needs to learn how to navigate a road without any prior knowledge. In reinforcement learning, the agent interacts with the environment (the road) and receives feedback in the form of rewards or penalties. For example, the agent receives a positive reward when it stays on the road and reaches its destination safely, and it receives a negative reward or penalty when it deviates from the road or gets into an accident. The goal of the agent is to learn an optimal policy — a set of actions that maximizes the cumulative reward over time. Through trial and error, the agent explores different actions (accelerate, brake, turn, etc.) and learns which actions lead to better outcomes in terms of rewards. Reinforcement learning algorithms like Q-learning or deep reinforcement learning with neural networks can be employed to train the agent in this scenario.
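
The core of Q-learning can be sketched on a toy stand-in for the road: a one-dimensional track where moving right eventually reaches the destination. The environment, rewards, and hyperparameters below are invented for illustration:

```python
# Reinforcement-learning sketch: tabular Q-learning on a toy 1-D "road".
# States 0..4; action 0 moves left, action 1 moves right; reaching state 4
# (the destination) yields +1, every other step costs -0.01.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy exploration: mostly exploit, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else -0.01
        # Q-learning update: move Q[s, a] toward the bootstrapped target.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # learned policy; expect action 1 ("right") for states 0-3
```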

Steps in ML pipeline

A machine learning pipeline typically involves several key steps to go from raw data to a trained model that can make predictions. Here are the main steps involved in a typical machine learning pipeline:

1. Data Collection: The first step is to gather the relevant data needed for training the model. This can involve collecting data from various sources, such as databases, APIs, or external datasets.

2. Data Preprocessing: Once the data is collected, it needs to be preprocessed to ensure its quality and compatibility with the learning algorithms. This step involves tasks such as data cleaning, handling missing values, dealing with outliers, and data normalization or scaling.

3. Feature Engineering: Feature engineering involves transforming the raw data into a set of features that can be used by the machine learning algorithm. This step can include tasks such as selecting relevant features, creating new features, encoding categorical variables, and scaling or transforming features.

4. Data Splitting: The dataset is divided into training and testing subsets. The training set is used to train the model, while the testing set is used to evaluate its performance. Additional subsets, such as validation sets, can be created for hyperparameter tuning and model selection.

5. Model Selection: This step involves choosing the appropriate model or algorithm to train on the data. The choice depends on the type of problem (classification, regression, etc.), the available data, and other factors such as interpretability, computational efficiency, and scalability.

6. Model Training: The selected model is trained on the training data, where it learns the patterns and relationships between the input features and the target variable. The training process involves adjusting the model’s parameters or weights using optimization algorithms to minimize the error or maximize the likelihood of the training data.

7. Model Evaluation: The trained model is evaluated on the testing set to assess its performance and generalization ability. Evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error are used to measure the model’s performance.

8. Model Optimization: If the model’s performance is not satisfactory, optimization techniques can be applied. This includes tuning hyperparameters, adjusting regularization techniques, trying different feature combinations, or applying ensemble methods to improve the model’s performance.

9. Model Deployment: Once the model is trained and evaluated, it can be deployed to make predictions on new, unseen data. This can involve creating an application or integrating the model into an existing system.

10. Model Monitoring and Maintenance: After deployment, the model’s performance needs to be monitored and maintained. This may involve retraining the model periodically with new data, updating the model with new features or changes in the data distribution, and addressing issues such as concept drift or model decay.

These steps form a general framework for a machine learning pipeline, but the specific implementation can vary depending on the problem, data, and the chosen algorithms and tools.
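
As one concrete sketch, several of these steps can be wired together with scikit-learn on a built-in dataset (the dataset choice and parameter grid here are arbitrary):

```python
# End-to-end sketch of the pipeline steps above: load data, split,
# preprocess + train inside a Pipeline, tune lightly, then evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                  # 1. data collection
X_train, X_test, y_train, y_test = train_test_split(        # 4. data splitting
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                            # 2-3. preprocessing
    ("clf", LogisticRegression(max_iter=5000)),             # 5. model selection
])

# 8. (light) optimization: tune regularization strength via cross-validation.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                                # 6. model training

y_pred = search.predict(X_test)                             # 7. model evaluation
print(accuracy_score(y_test, y_pred))
```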

Generative Models:

Generative models aim to model the underlying probability distribution of the input data. They learn the joint probability distribution of the input features and the corresponding output labels (if available). In other words, generative models try to understand how the data is generated and capture the underlying patterns and dependencies between the features. Once the model has learned the distribution, it can generate new samples that resemble the original data. Some examples of generative models include: Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Variational Autoencoders (VAEs). Generative models can be used for tasks such as data synthesis, data augmentation, and anomaly detection.

In generative models, we aim to model the joint probability distribution of the input features, denoted as X, and the corresponding output labels, denoted as Y. The objective is to estimate this joint distribution and learn the underlying patterns and dependencies between X and Y.

Mathematically, this can be represented as P(X, Y). Using Bayes’ theorem, we can express the joint distribution as: P(X, Y) = P(Y) * P(X | Y), where P(Y) represents the prior probability of Y, and P(X | Y) represents the conditional probability of X given Y. Generative models learn both the prior probability distribution P(Y) and the conditional probability distribution P(X | Y).
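
A sketch of this factorization on one-dimensional synthetic data: estimate P(Y) from label frequencies and P(X | Y) with one Gaussian per class, then score the joint distribution and sample new points (the class means and counts are made up):

```python
# Generative-model sketch: estimate P(Y) and P(X | Y) from labeled 1-D data,
# then both score the joint P(X, Y) and sample new points. Data are synthetic.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 200)   # class 0 feature values
x1 = rng.normal(3.0, 1.0, 100)   # class 1 feature values

# P(Y): class priors from label frequencies.
prior = {0: len(x0) / 300, 1: len(x1) / 300}
# P(X | Y): one Gaussian per class, fit by mean and standard deviation.
cond = {0: norm(x0.mean(), x0.std()), 1: norm(x1.mean(), x1.std())}

def joint(x, y):
    """P(X = x, Y = y) = P(Y = y) * P(X = x | Y = y)."""
    return prior[y] * cond[y].pdf(x)

# Because the full distribution is modeled, we can generate new samples:
y_new = int(rng.choice([0, 1], p=[prior[0], prior[1]]))
x_new = cond[y_new].rvs(random_state=rng)
print(joint(1.5, 0), joint(1.5, 1), (y_new, float(x_new)))
```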

Discriminative Models:

Discriminative models focus on learning the decision boundary that separates different classes or categories. They learn the conditional probability distribution of the output labels given the input features, concentrating on the features that are most relevant for making predictions and classifying new instances. Some examples of discriminative models include: logistic regression, support vector machines (SVMs), deep neural networks (DNNs). Discriminative models are commonly used for tasks like classification, regression, and object recognition.

In discriminative models, we focus on learning the conditional probability distribution of the output labels Y given the input features X, denoted as P(Y | X). The objective is to directly model this conditional distribution and learn the decision boundary that separates different classes.

An example of a discriminative model is logistic regression, where the conditional probability is represented as P(Y | X) = sigmoid(W * X + b). Here, W represents the weight parameters, X the input features, and b the bias term; sigmoid() maps the linear combination of the features to a probability value between 0 and 1.
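
The same equation in a few lines of NumPy, with arbitrary illustrative weights and input:

```python
# Discriminative-model sketch: logistic regression's P(Y=1 | X) in NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([0.8, -0.4])   # weight parameters
b = -0.1                    # bias term
x = np.array([2.0, 1.0])    # one input feature vector

p = sigmoid(W @ x + b)      # P(Y = 1 | X = x)
print(p, int(p >= 0.5))     # probability and the implied class decision
```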

Generative vs Discriminative

Generative models focus on understanding and modeling the underlying probability distribution of the data, while discriminative models concentrate on learning the decision boundary between different classes. The choice between the two approaches depends on the specific problem at hand and the desired application.

Differences between Generative and Discriminative Models:

  • Objective:
    - Generative models aim to understand and model the joint distribution of the input features and output labels, while discriminative models focus on modeling the conditional distribution of the output labels given the input features.
  • Representation:
    - Generative models aim to model the joint probability distribution P(X, Y), while discriminative models focus on the conditional probability distribution P(Y | X).
  • Data Generation vs. Decision Boundary:
    - Generative models aim to understand the data generation process and capture the underlying patterns and dependencies between X and Y. By learning the entire probability distribution, generative models are capable of generating new samples that resemble the original data.
    - Discriminative models, on the other hand, directly learn the decision boundary that separates different classes or categories, without explicitly generating new samples.
  • Use Cases:
    - Generative models are useful in tasks where understanding the underlying data distribution and generating new data samples are important, such as data synthesis or anomaly detection.
    - Discriminative models are more commonly used in tasks that involve classification, regression, or decision-making, where accurately predicting the output label is the primary objective.
  • Training:
    - Generative models typically involve estimating the joint probability distribution, which can be more complex and computationally expensive. Generative models estimate both the prior probability P(Y) and the conditional probability P(X | Y).
    - Discriminative models, on the other hand, directly estimate the conditional probability distribution P(Y | X), which is often simpler and more computationally efficient.

Bias Variance Trade off

The bias-variance trade-off is a fundamental concept in machine learning that relates to a model's overall predictive performance. It represents the balance between two types of errors a model can make, bias error and variance error, and it emphasizes the need to balance simplicity against complexity to achieve good generalization.

  • Bias Error: Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. A model with high bias tends to oversimplify the underlying patterns in the data and make strong assumptions about the relationships between features and the target variable. Consequently, it may consistently underfit the training data and have difficulty capturing complex relationships or variations in the data. High bias leads to high training error.
  • Variance Error: Variance refers to the amount of fluctuation or instability in a model’s predictions caused by small changes in the training data. A model with high variance is overly complex and sensitive to the noise or randomness in the training data. Such a model tends to fit the training data very closely, but it fails to generalize well to unseen data. High variance leads to high testing error or poor generalization.

The trade-off is commonly visualized as a plot of error against model complexity: bias error falls as complexity grows, variance error rises, and the total error traces a U-shaped curve whose minimum marks the best balance.

To achieve good predictive performance, a model needs to strike the right balance between bias and variance. In general:

As we decrease bias (make the model more complex and flexible), the model becomes better at fitting the training data, reducing training error. However, this often increases variance, causing the model to be overly sensitive to the training data and resulting in higher testing error.

As we increase bias (make the model simpler), the model becomes less sensitive to the specific details of the training data, reducing variance. However, this may increase bias error, causing the model to underfit the training data and resulting in higher training and testing errors.

The goal is to find an optimal trade-off point that minimizes the total error on unseen data, balancing the ability to capture complex patterns without overfitting the training data. This is often achieved through techniques like regularization, cross-validation, and model selection.

Regularization

Regularization is a technique in machine learning that helps prevent overfitting and improves the generalization performance of models. It involves adding a regularization term to the loss function during training, which penalizes certain characteristics of the model to encourage simpler or smoother solutions and reduce the impact of noisy or irrelevant features.

Given a loss function J, the regularized loss function J_reg is defined as:

J_reg = J + α * R(w)

where:
- J is the original loss function that measures the discrepancy between the model's predictions and the true labels or targets
- α is the regularization parameter that controls the strength of regularization. It determines the trade-off between fitting the training data well (minimizing J) and keeping the model's complexity low (minimizing R(w)); a higher value of α increases the impact of the regularization term, leading to a simpler model with potentially higher bias and lower variance
- R(w) is the regularization term that penalizes certain characteristics of the model. The specific form of R(w) depends on the type of regularization used

During training, the regularized loss function J_reg is minimized with respect to the model's parameters w, typically using optimization algorithms such as gradient descent or stochastic gradient descent. By including the penalty term in the loss, the optimization process is steered toward parameter values that balance fitting the training data against keeping the model's complexity in check, which helps prevent overfitting.
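
A sketch of this optimization loop for linear regression with an L2 penalty, i.e., J = MSE and R(w) = ||w||₂²; the data, learning rate, and α below are illustrative:

```python
# Minimizing a regularized loss with gradient descent:
# linear regression with J = MSE and an L2 (ridge) penalty R(w) = ||w||^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, 0.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

alpha, lr = 0.1, 0.05
w = np.zeros(3)
for _ in range(500):
    grad_J = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE term
    grad_R = 2 * w                            # gradient of the L2 penalty
    w -= lr * (grad_J + alpha * grad_R)       # step on J_reg = J + alpha * R(w)

print(w)  # shrunk toward zero relative to the unregularized solution
```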

Importance

  • Overfitting Prevention:
    - Overfitting occurs when a model learns to fit the training data too closely, capturing noise and irrelevant patterns that are specific to the training set. As a result, the model’s performance on unseen data deteriorates.
    - Regularization helps mitigate overfitting by imposing constraints on the model’s complexity, discouraging it from memorizing noise and focusing on more meaningful patterns.
  • Generalization Improvement:
    - By penalizing large weights, regularization helps improve the model’s ability to generalize to new, unseen data.
    - It promotes the learning of underlying patterns and dependencies that are applicable to the entire dataset, rather than relying on specific idiosyncrasies of the training set.
  • Model Simplicity:
    - Regularization encourages models to be simpler and more interpretable.
    - It discourages complex or intricate solutions that may be prone to overfitting and are harder to understand.
    - Simpler models are often preferred in practice as they are easier to explain, validate, and maintain.
  • Bias-Variance Trade-off:
    - Regularization plays a crucial role in the bias-variance trade-off. By introducing a regularization term, the model’s ability to fit the training data precisely is reduced (higher bias), which can help reduce the model’s sensitivity to noise and improve its performance on unseen data (lower variance).
  • Parameter Shrinking:
    - Regularization can shrink the values of the model’s parameters towards zero.
    - This helps reduce the impact of less important features, preventing them from dominating the model’s predictions.
    - It effectively performs feature selection by assigning smaller weights to irrelevant or noisy features.

Common regularization techniques include: L1 Regularization (Lasso), L2 Regularization (Ridge), and Dropout Regularization.

L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the model’s parameters as a penalty term to the loss function. Because this penalty can drive some parameter values exactly to zero, it encourages sparsity in the parameter vector w and acts as an implicit form of feature selection:

R(w) = ||w||₁ = |w₁| + |w₂| + … + |w_n|

The regularization term R(w) penalizes large parameter values and promotes sparsity.

L2 Regularization (Ridge)

L2 regularization adds the sum of the squared values of the model’s parameters as a penalty term to the loss function, which promotes smaller parameter values:

R(w) = ||w||₂² = w₁² + w₂² + … + w_n²

The regularization term R(w) penalizes large parameter values, shrinking the weights smoothly toward zero without forcing any of them to be exactly zero.
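
To see the behavioral difference between the two penalties, here is a sketch contrasting scikit-learn's Lasso and Ridge on synthetic data where only 2 of 10 features matter (the α values are arbitrary):

```python
# Side-by-side sketch: Lasso (L1) zeroes out coefficients, Ridge (L2) only
# shrinks them. Only 2 of the 10 synthetic features are truly informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(np.round(lasso.coef_, 3))  # irrelevant coefficients driven to exactly 0
print(np.round(ridge.coef_, 3))  # all coefficients shrunk but typically nonzero
```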

Dropout Regularization:

It randomly sets a fraction of the model’s input units or weights to zero during training, effectively creating an ensemble of smaller sub-networks. This helps prevent complex co-adaptations of neurons and reduces overfitting.
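
A sketch of (inverted) dropout applied to a layer's activations in plain NumPy; the drop probability and activation values are illustrative:

```python
# Inverted dropout on a layer's activations, NumPy only.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    if not training:
        return activations  # dropout is disabled at inference time
    mask = rng.random(activations.shape) >= p_drop
    # Scale survivors by 1/(1 - p_drop) so expected activations match inference.
    return activations * mask / (1.0 - p_drop)

a = np.array([0.2, 1.5, -0.7, 0.9, 2.1])
print(dropout(a))                   # some units zeroed, survivors scaled up
print(dropout(a, training=False))   # unchanged at inference
```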

Difference between L1 and L2 regularization

L1 regularization and L2 regularization are two common techniques used to prevent overfitting in machine learning models by adding penalty terms to the loss function. The key differences between L1 and L2 regularization lie in the type of penalty applied and the effect on the model’s behavior.

  • Penalty Type:
    - L1 regularization adds the sum of the absolute values of the model’s parameters to the loss function. Mathematically, it can be represented as the L1 norm of the parameter vector: λ * ||w||₁, where λ is the regularization parameter and ||w||₁ represents the L1 norm of the parameter vector w.
    - L2 Regularization (Ridge) adds the sum of the squared values of the model’s parameters to the loss function. Mathematically, it can be represented as the L2 norm of the parameter vector: λ * ||w||₂², where λ is the regularization parameter and ||w||₂ represents the L2 norm of the parameter vector w.
  • Effect on Model’s Behavior:
    - L1 Regularization encourages sparsity in the model’s parameter values, meaning it pushes some of the parameter values to exactly zero. This results in feature selection, where less important features are effectively ignored by the model, leading to a more interpretable and compact model. L1 regularization has the potential to create models with fewer parameters and a more focused set of features.
    - L2 Regularization: L2 regularization penalizes large parameter values and encourages smaller parameter values. It does not force the parameters to become exactly zero. Instead, it smoothly reduces the impact of less important features but still keeps them in the model. L2 regularization helps to distribute the weight values more evenly across features, reducing the influence of any individual feature and leading to smoother and more stable models.
  • Choice
    - The choice between L1 and L2 regularization depends on the specific problem, the nature of the data, and the desired trade-off between feature selection and model stability. In some cases, a combination of both techniques (Elastic Net regularization) might be preferred to leverage the benefits of each.

When L1 is preferred over L2 regularization?

L1 regularization is often preferred over L2 regularization when feature selection is desired. This includes cases where we want the model to focus on a smaller subset of important features by forcing some parameter values to exactly zero. L1 regularization can effectively eliminate irrelevant or noisy features, leading to a more interpretable and efficient model, and it is useful when the dataset has a large number of features and we want to identify the most relevant ones.

L1 regularization (Lasso regularization) is typically preferred over L2 regularization (Ridge regularization) in the following scenarios:

  • Feature Selection:
    L1 regularization has a tendency to drive some of the parameter values exactly to zero, resulting in sparse models. This makes L1 regularization useful when feature selection is desired, i.e., when you want to identify and focus on a smaller subset of important features. By effectively eliminating irrelevant or noisy features, L1 regularization can lead to more interpretable and efficient models. If you have a large number of features and want to identify the most relevant ones, L1 regularization is a good choice.
  • Model Interpretability:
    L1 regularization promotes sparsity in the model’s parameter values. Sparse models are easier to interpret because they only consider a subset of features, allowing you to identify the most influential variables in the model. This can be important in domains where interpretability is a priority, such as healthcare or finance, where understanding the underlying factors driving predictions is crucial.
  • Computation Efficiency:
    L1 regularization can be computationally efficient when dealing with high-dimensional data. Since L1 regularization encourages sparse solutions by driving some parameter values to zero, it effectively reduces the number of features considered by the model. This can lead to faster training and inference times, especially when dealing with large datasets or complex models.
  • Handling Collinear Features:
    L1 regularization handles correlated or collinear features better than L2 regularization. Due to the nature of L1 regularization, it tends to select one feature from a group of highly correlated features and drive the others to zero. This can help mitigate multicollinearity issues in the data, where multiple features carry similar information.
  • Outlier Robustness:
    L1 regularization is generally more robust to the presence of outliers compared to L2 regularization. Since L1 regularization penalizes using the absolute values of the parameter weights, it is less affected by extreme values or outliers in the data. In contrast, L2 regularization squares the parameter weights, making it more sensitive to outliers.

When L2 is preferred over L1 regularization?

L2 regularization is generally more commonly used as a default choice due to its smoothness and stable behavior. It improves model stability and generalization performance, especially in the presence of correlated features and outliers. L2 regularization (Ridge regularization) is typically preferred over L1 regularization (Lasso regularization) in the following scenarios:

  • Continuous Parameter Weights:
    L2 regularization promotes smaller parameter values without driving them to exactly zero. This is beneficial when you want the model to consider all features and avoid excluding any potentially informative variables. L2 regularization helps maintain the continuity of the parameter weights and avoids discarding variables completely.
  • Multicollinearity Handling:
    L2 regularization handles multicollinearity (high correlation between features) better than L1 regularization. When features are highly correlated, L2 regularization spreads the penalty across all correlated features, preventing one feature from being favored over others. This can help stabilize the model’s behavior and provide more robust predictions when dealing with correlated predictors.
  • Stability and Generalization:
    L2 regularization tends to produce smoother models with more stable behavior. By reducing the impact of less important features, L2 regularization helps prevent overfitting and improves the generalization performance of the model. It is often preferred when the primary goal is to achieve good performance on unseen data rather than explicitly performing feature selection.
  • Models with Large Numbers of Features:
    L2 regularization is well-suited for models with a large number of features. Unlike L1 regularization, which can drive some feature weights to exactly zero, L2 regularization keeps all features in the model but reduces their individual contributions. This can be beneficial in cases where it is not practical or desirable to exclude any features from consideration.
  • Noise Reduction:
    L2 regularization can be effective in reducing the impact of noise in the data. By shrinking the parameter weights, L2 regularization helps suppress the influence of noisy or irrelevant features, making the model more robust to random fluctuations in the data.

Elastic Net regularization

Elastic Net regularization is a combination of L1 regularization (Lasso) and L2 regularization (Ridge). It addresses some limitations of each technique by providing a compromise between feature selection (L1) and parameter shrinkage (L2). The regularized loss function in Elastic Net is defined as follows:

J_reg = J + α * (λ * L1(w) + 0.5 * (1 - λ) * L2(w))

where:
- J is the original loss function that measures the discrepancy between the model’s predictions and the true labels or targets
- α is the regularization parameter that controls the overall strength of regularization
- λ is the mixing parameter that determines the balance between L1 and L2 regularization. It takes values between 0 and 1
- L1(w) represents the L1 regularization term (sum of absolute values of the parameters). L1(w) = ||w||₁ = |w₁| + |w₂| + … + |w_n|
- L2(w) represents the L2 regularization term (sum of squared values of the parameters). L2(w) = ||w||₂² = w₁² + w₂² + … + w_n²

The L1 regularization term encourages sparsity in the parameter vector w, driving some parameter values exactly to zero and promoting feature selection.

The L2 regularization term encourages smaller parameter values, which helps in parameter shrinkage and reduces the impact of outliers.

The mixing parameter λ controls the trade-off between L1 and L2 regularization.

  • When λ is set to 1, Elastic Net becomes equivalent to L1 regularization (Lasso), emphasizing feature selection.
  • When λ is set to 0, Elastic Net becomes equivalent to L2 regularization (Ridge), emphasizing parameter shrinkage.
  • Intermediate values of λ allow for a combination of both regularization techniques, leveraging their strengths.

As with the other penalties, the regularized loss J_reg is minimized with respect to the model’s parameters w, typically using gradient descent or stochastic gradient descent, balancing data fit, feature selection, and parameter shrinkage. Elastic Net is useful in scenarios where both feature selection and parameter shrinkage are desired, and it provides more flexibility than using L1 or L2 regularization individually.
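
A sketch using scikit-learn's ElasticNet, whose l1_ratio parameter plays the role of the mixing parameter λ above; the data and hyperparameters are arbitrary:

```python
# Elastic Net sketch with scikit-learn: l1_ratio mixes the L1 and L2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1.0 -> pure Lasso; l1_ratio=0.0 -> pure Ridge; 0.5 mixes both.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))  # sparse-ish coefficients, survivors shrunk
```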

Classification vs Regression

Classification and regression are two fundamental types of problems in machine learning, where the difference is in the type of target variable that they predict. In classification, the target variable is categorical, and the model aims to classify instances into specific classes. In regression, the target variable is continuous, and the model aims to predict numerical values.

In both classification and regression, the goal is to learn a function or model that can generalize well to unseen data. However, the difference lies in the nature of the target variable and the mathematical representation of the problem.

Regression Models

In regression, the goal is to predict a continuous, numerical value. Regression can be defined as finding a function f(x) that maps an input feature vector x to a continuous output value y, where y belongs to the real numbers (y ∈ ℝ). The target variable in regression is typically represented as continuous numerical data. The objective is to learn a regression model that captures the relationship between the input features and the target variable. The output of a regression model is a predicted numerical value, y_pred, which represents a quantity or measurement.

Example:

Given a dataset of houses with various features, the task is to predict the price of a house (regression). Mathematically, the target variable y is a continuous value representing the house price (y ∈ ℝ). The regression model learns a function f(x) that maps the house features x to the corresponding predicted house price y.
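
A sketch of this house-price setup with scikit-learn's LinearRegression; the five (square footage, bedrooms) → price rows are fabricated:

```python
# Regression sketch: predicting a continuous house price from two features.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 405000])  # prices in dollars

reg = LinearRegression().fit(X, y)
print(reg.predict([[2000, 4]]))  # predicted price for an unseen house
```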

Regression models are generally considered discriminative rather than generative: they focus on predicting a numerical value from the input features, whereas generative models model the joint distribution and can generate new samples. Because regression models learn the mapping from input features to output values rather than explicitly modeling the data-generation process, they fall under the discriminative category.

Evaluation of Regression

In regression problems, several evaluation metrics are commonly used to assess the performance of the model in predicting continuous numerical values. Some of the main evaluation metrics for regression problems are:

1. Mean Squared Error (MSE):
The Mean Squared Error calculates the average squared difference between the predicted values and the actual values. MSE is widely used as it emphasizes larger errors due to the squaring operation.

The MSE has several properties that make it a popular choice as a loss function in regression. It is non-negative, as each squared difference is non-negative. It penalizes larger errors more heavily due to the squaring operation. It is differentiable, allowing for gradient-based optimization algorithms to be applied during model training. It is widely used and easily interpretable, providing a measure of the average squared error between predicted and true values.

When minimizing the MSE, the regression model adjusts its coefficients to minimize the overall squared error, seeking to improve the model’s fit to the data. However, it is important to note that the MSE is sensitive to outliers, as their squared differences can dominate the overall loss. Therefore, in situations where outliers are present or the data has a skewed distribution, alternative loss functions, such as the mean absolute error (MAE) or Huber loss, may be used to mitigate the impact of outliers.

2. Root Mean Squared Error (RMSE):
The Root Mean Squared Error is the square root of the Mean Squared Error and provides a measure of the average magnitude of the errors. RMSE is commonly used as it is in the same scale as the target variable.

3. Mean Absolute Error (MAE):
The Mean Absolute Error calculates the average absolute difference between the predicted values and the actual values. MAE is less sensitive to outliers compared to MSE as it does not involve squaring.

The MAE has several properties that make it a suitable choice as a loss function in regression. It is non-negative, as each absolute difference is non-negative. It provides a direct measure of the average absolute error between predicted and true values. It is robust to outliers, as it does not amplify their effect like the squared differences in the MSE. It is easily interpretable and intuitive, representing the average magnitude of errors in the same units as the dependent variable.

When minimizing the MAE, the regression model adjusts its coefficients to minimize the overall absolute error, seeking to improve the model’s fit to the data. However, the MAE is not differentiable at zero, which can pose challenges for gradient-based optimization algorithms. In such cases, alternative loss functions, such as the mean squared error (MSE) or Huber loss, which is a compromise between the MAE and MSE, can be used to balance between robustness and differentiability.

4. R-squared (Coefficient of Determination)
The R-squared metric measures the proportion of the variance in the target variable that is explained by the model. For models fit with an intercept, R-squared ranges from 0 to 1 on the training data, with higher values indicating a better fit; on held-out data it can even be negative if the model fits worse than simply predicting the mean.

5. Adjusted R-squared
The Adjusted R-squared metric is a modified version of R-squared that penalizes the inclusion of unnecessary features in the model. It accounts for the number of predictors p and the sample size n, providing a more reliable measure of goodness-of-fit: Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1).

These evaluation metrics provide quantitative measures to assess the performance of regression models. The choice of the evaluation metric depends on the specific problem, the desired trade-offs between accuracy and interpretability, and the nature of the data. It is important to consider multiple evaluation metrics to gain a comprehensive understanding of the model’s performance.
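
A sketch computing these regression metrics for a toy set of predictions, assuming scikit-learn and NumPy:

```python
# Computing the regression metrics above for a toy set of predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)     # mean((y - y_hat)^2)
rmse = np.sqrt(mse)                          # same units as the target
mae = mean_absolute_error(y_true, y_pred)    # mean(|y - y_hat|)
r2 = r2_score(y_true, y_pred)                # proportion of variance explained
print(mse, rmse, mae, r2)
```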

Classification Models

In classification, the goal is to predict a discrete, categorical outcome or label. Classification can be defined as finding a function f(x) that maps an input feature vector x to a discrete output class y, where y belongs to a finite set of possible classes C = {c1, c2, …, cn}. The objective is to learn a decision boundary or decision function that separates the different classes in the feature space. The output of a classification model is a predicted class label, y_pred, which is assigned to a specific class from the set of possible classes.

Example:
Given a dataset of emails, the task is to classify each email as either spam or not spam (binary classification). Mathematically, the target variable y belongs to the set C = {spam, not spam}. The classification model learns a function f(x) that maps the email features x to the corresponding class label y.

Evaluation of Classification

There are several evaluation metrics used for classification problems to assess the performance of a machine learning model. Here are some of the main evaluation metrics along with their equations:

1. Accuracy:
Accuracy measures the overall correctness of the model’s predictions. It calculates the ratio of correct predictions (true positives and true negatives) to the total number of instances.
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

2. Precision:
Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It focuses on the accuracy of positive predictions.
Precision = True Positives / (True Positives + False Positives)

3. Recall:
Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances. It focuses on capturing all positive instances.
Recall = True Positives / (True Positives + False Negatives)

4. F1 Score:
The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

5. Specificity:
Specificity measures the proportion of correctly predicted negative instances out of all actual negative instances. It focuses on capturing all negative instances.
Specificity = True Negatives / (True Negatives + False Positives)

6. Area Under the ROC Curve (AUC-ROC):
The AUC-ROC metric assesses the model’s ability to discriminate between positive and negative instances across different classification thresholds. It calculates the area under the Receiver Operating Characteristic (ROC) curve.
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. The AUC-ROC ranges from 0 to 1, where a higher value indicates better classification performance.

The choice of evaluation metric depends on the specific problem, the class distribution, and the relative importance of false positives and false negatives in the given context.
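
A sketch computing these metrics for toy binary predictions with scikit-learn (the labels and scores are made up):

```python
# Computing the classification metrics above for toy binary predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted P(y=1)

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_scores))  # AUC uses scores, not hard labels
```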

Confusion matrix in Classification

The confusion matrix provides a tabular representation of the predicted and actual class labels, enabling a detailed analysis of the model’s performance across different classes.

The confusion matrix is typically a square matrix of size N x N, where N is the number of classes in the classification problem. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. The cells of the matrix contain the counts or frequencies of the instances falling into each combination of predicted and actual classes.

The main purpose of the confusion matrix is to provide insights into the model’s classification results, allowing for the calculation of various evaluation metrics. From the confusion matrix, several performance measures can be derived, including:

1. True Positives (TP): The number of instances correctly classified as positive (correctly predicted as the class of interest).

2. True Negatives (TN): The number of instances correctly classified as negative (correctly predicted as not belonging to the class of interest).

3. False Positives (FP): The number of instances incorrectly classified as positive (predicted as the class of interest but actually belonging to a different class).

4. False Negatives (FN): The number of instances incorrectly classified as negative (predicted as not belonging to the class of interest but actually belonging to the class of interest).

Using the values in the confusion matrix, various evaluation metrics can be calculated, such as accuracy, precision, recall, and F1 score for each class. These metrics provide insights into the model’s performance in terms of correctly identifying instances of a specific class, avoiding false positives or negatives, and overall accuracy.

The confusion matrix is particularly useful in scenarios where class imbalance exists or when different classes have varying degrees of importance. It helps identify which classes are being misclassified more frequently, providing guidance on potential areas of improvement.

  • Actual Positive = TP + FN
  • Actual Negative = FP + TN
  • TPR (True Positive Rate), also known as sensitivity or recall, measures the proportion of actual positive instances correctly classified as positive: TPR = TP / (TP + FN). It reflects the model’s ability to correctly identify positive instances; a higher TPR indicates a more sensitive model that captures a larger proportion of positive instances.
  • TNR (True Negative Rate), also known as specificity, measures the proportion of actual negative instances correctly classified as negative: TNR = TN / (TN + FP). It reflects the model’s ability to correctly identify negative instances; a higher TNR indicates a more specific model that avoids misclassifying negative instances.
  • FPR (False Positive Rate) is the proportion of actual negative instances incorrectly classified as positive: FPR = FP / (FP + TN). A lower FPR indicates a better ability to avoid false positives.
  • FNR (False Negative Rate) is the proportion of actual positive instances incorrectly classified as negative: FNR = FN / (FN + TP). A lower FNR indicates a better ability to avoid false negatives.
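
A sketch deriving these counts and rates from scikit-learn's confusion_matrix for toy binary predictions:

```python
# Deriving TP/TN/FP/FN and the rates above from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn's convention: rows = actual class, columns = predicted class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # sensitivity / recall
tnr = tn / (tn + fp)  # specificity
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
print(tpr, tnr, fpr, fnr)
```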

Part II: Traditional ML Algorithms

Part III: Neural Network
