ML Interview Prep, Part III: Neural Networks

Jeniya Tabassum
Jul 5, 2023


This curated list of topics covers the fundamental concepts of neural networks that are often asked about during MLE/MLS interviews.

[This is Part III of the three-part ML interview refresher.]

Gradient Descent

Gradient descent is an iterative optimization algorithm used in machine learning to find the optimal values of the parameters of a model by minimizing a cost or loss function. It is a widely used algorithm in training models, especially in cases where the number of parameters is large.

The main idea behind gradient descent is to update the parameter values in the direction of the steepest descent of the cost function. The algorithm starts with initial values for the parameters and iteratively updates them by taking steps proportional to the negative gradient of the cost function with respect to the parameters.

The update rule for gradient descent can be expressed as:
θ_new = θ_old - learning_rate * ∇J(θ_old)
where:
- θ_old represents the current values of the parameters.
- θ_new represents the updated values of the parameters.
- learning_rate is a hyperparameter that determines the step size of the algorithm.
- ∇J(θ_old) is the gradient of the cost function J with respect to the parameters θ_old.
- The gradient (∇) represents the vector of partial derivatives of the cost function with respect to each parameter. It indicates the direction and magnitude of the steepest ascent of the cost function.

In each iteration of gradient descent, the algorithm computes the gradient of the cost function at the current parameter values and updates the parameters by subtracting the learning rate times the gradient. The learning rate determines the size of the step taken in each iteration and affects the convergence speed of the algorithm.

The process continues iteratively until a stopping criterion is met, such as reaching a maximum number of iterations or when the change in the cost function becomes sufficiently small.
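To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a least-squares problem; the data, cost function, learning rate, and stopping tolerance are illustrative choices, not prescribed by anything above:

```python
import numpy as np

# Illustrative data: fit y = 2x + 1 with the model y_hat = theta[0] + theta[1] * x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

def grad(theta):
    # Gradient of the MSE cost J(theta) = mean((theta0 + theta1*x - y)^2)
    residual = (theta[0] + theta[1] * x) - y
    return np.array([2 * np.mean(residual), 2 * np.mean(residual * x)])

theta = np.zeros(2)        # initial parameter values
learning_rate = 0.1        # step-size hyperparameter
for step in range(1000):
    g = grad(theta)
    theta = theta - learning_rate * g   # theta_new = theta_old - lr * grad J
    if np.linalg.norm(g) < 1e-6:        # stop when the gradient is tiny
        break

print(theta)  # approaches [1.0, 2.0]
```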

Gradient descent can be categorized into different variants, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, based on the number of training examples used to compute the gradient at each iteration.

Gradient descent is a fundamental optimization algorithm used in machine learning to train models by iteratively updating the parameter values in the direction that minimizes the cost function.

Batch vs Stochastic Gradient Descent

The main difference between batch gradient descent and stochastic gradient descent lies in the way they update the model parameters during the training process.

Batch Gradient Descent:
In batch gradient descent, the entire training dataset is used to compute the gradient of the cost function with respect to the parameters. The gradient is averaged over all the training examples, and the parameters are updated once for each iteration or epoch. Batch gradient descent takes into account the information from the entire dataset, which can provide a more accurate estimate of the true gradient. However, it can be computationally expensive, especially for large datasets, as it requires processing the entire dataset for each iteration.

Stochastic Gradient Descent:
In stochastic gradient descent, only a single training example or a small subset (mini-batch) of training examples is used to compute the gradient at each iteration. The parameters are updated after processing each individual training example or mini-batch. Stochastic gradient descent updates the parameters more frequently, leading to faster convergence and better generalization in certain cases. However, the estimate of the gradient is noisier due to the use of a single example or a subset, which can introduce more variance in the parameter updates. Stochastic gradient descent is computationally efficient as it processes one or a few examples at a time, making it suitable for large datasets.

The choice between batch gradient descent and stochastic gradient descent depends on the characteristics of the dataset and the optimization requirements.

  • Batch gradient descent is commonly used when the dataset fits in memory, and the goal is to obtain an accurate estimate of the gradient. It provides smoother convergence but can be slower for large datasets.
  • Stochastic gradient descent is useful when dealing with large datasets, online learning scenarios, or when the goal is to find a good solution quickly. It may converge faster but may exhibit more oscillations in the optimization process.

Additionally, there is a middle ground known as mini-batch gradient descent, which combines the advantages of both approaches. It computes the gradient using a small batch of training examples, striking a balance between accuracy and computational efficiency.
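A sketch contrasting the three schemes on the same toy least-squares objective; the dataset, learning rate, batch size, and epoch counts are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # illustrative dataset
w_true = rng.normal(size=5)
y = X @ w_true

def grad(w, Xb, yb):
    # Gradient of MSE computed on the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr = 0.05

# Batch gradient descent: one update per pass over the full dataset.
w = np.zeros(5)
for epoch in range(100):
    w -= lr * grad(w, X, y)

# Stochastic gradient descent: one (noisier) update per training example.
w = np.zeros(5)
for epoch in range(10):
    for i in rng.permutation(len(y)):
        w -= lr * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per small batch (here 32 examples).
w = np.zeros(5)
for epoch in range(20):
    for start in range(0, len(y), 32):
        batch = slice(start, start + 32)
        w -= lr * grad(w, X[batch], y[batch])

print(np.allclose(w, w_true, atol=1e-2))  # all three recover w_true here
```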

Learning Rate Scheduling in Gradient Descent

The learning rate schedule, also known as the learning rate decay or learning rate annealing, is a technique used in gradient descent optimization to dynamically adjust the learning rate during training. Instead of using a fixed learning rate throughout the entire training process, the learning rate schedule modifies the learning rate over time, allowing for more efficient and effective optimization.

The primary goal of a learning rate schedule is to strike a balance between two conflicting objectives: rapid progress in the initial stages of training and fine-tuning in the later stages. It addresses challenges such as oscillations, slow convergence, and overshooting that can occur when using a fixed learning rate.

There are several common learning rate schedule strategies:

1. Fixed Learning Rate: In this approach, the learning rate remains constant throughout training. While simple to implement, it may not be optimal as the same learning rate is applied regardless of the current state of the optimization process.

2. Step Decay: The learning rate is reduced by a fixed factor after a predefined number of epochs or iterations. For example, the learning rate may be halved every few epochs. This strategy allows for rapid progress initially and fine-tuning later on.

3. Exponential Decay: The learning rate is decreased exponentially over time. It follows a function of the form:
learning_rate = initial_learning_rate * decay_rate^(epoch/decay_steps)
The decay rate and decay steps control the rate at which the learning rate decreases. As training progresses, the learning rate decreases exponentially, allowing for finer adjustments (see the sketch after this list).

4. Piecewise Decay: The learning rate is reduced at specific milestones during training. For example, the learning rate may be decreased by a factor after a certain number of epochs or when a validation metric plateaus. This approach allows for more control over the learning rate adjustments based on the specific characteristics of the training process.

5. Adaptive Methods: Adaptive learning rate methods, such as AdaGrad, RMSProp, and Adam, dynamically adjust the learning rate based on the accumulated gradients or the second-order derivatives of the loss function. These methods effectively adapt the learning rate for each parameter based on their individual characteristics, resulting in more efficient optimization.
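As a rough illustration, here are the step and exponential decay schedules from items 2 and 3 written out as functions; the default drop factors and step counts are arbitrary choices:

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs (item 2).
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, decay_rate=0.96, decay_steps=10):
    # learning_rate = initial_lr * decay_rate^(epoch / decay_steps) (item 3).
    return initial_lr * decay_rate ** (epoch / decay_steps)

for epoch in (0, 10, 50, 100):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
```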

The choice of a learning rate schedule depends on various factors, including the dataset, model architecture, and optimization problem. It often requires empirical experimentation and fine-tuning to find the optimal schedule for a specific task.

By using a learning rate schedule, the optimization process can benefit from a higher learning rate initially for faster progress and gradually decrease the learning rate as the model approaches convergence, enabling more precise fine-tuning. This approach can improve the stability, convergence speed, and generalization performance.

Back Propagation

Back propagation is a fundamental algorithm used to train artificial neural networks. It is an efficient method for computing the gradients of the model parameters with respect to the loss function, which allows for the optimization of the network’s weights and biases.

The back propagation algorithm operates by propagating the error gradient backwards through the network, starting from the output layer and moving towards the input layer. The key idea behind backpropagation is to iteratively update the model parameters in the opposite direction of the gradient, effectively minimizing the loss function.

Here is a step-by-step overview of the back propagation algorithm:

1. Forward Pass: The input data is fed forward through the network, layer by layer, to compute the predicted output. The activation values of each neuron in the network are calculated using a specific activation function (e.g., sigmoid, ReLU).

2. Loss Calculation: The predicted output is compared to the true output using a loss function, such as mean squared error (MSE) or cross-entropy. The loss function quantifies the discrepancy between the predicted and true values.

3. Backward Pass: The error gradient is computed by propagating the loss backwards through the network. The gradient represents the sensitivity of the loss function with respect to each parameter in the network. The chain rule of calculus is used to calculate the gradient at each layer.

4. Parameter Update: The gradients are used to update the model parameters (weights and biases) in the direction that minimizes the loss. This update is typically performed using an optimization algorithm, such as stochastic gradient descent (SGD) or its variants. The learning rate determines the step size taken in the direction of the gradient.

5. Iterative Process: Steps 1 to 4 are repeated for multiple iterations or epochs until the network converges or reaches a stopping criterion (e.g., predefined number of iterations, small change in loss).
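The five steps above can be traced in a minimal two-layer network trained with manual backpropagation; the architecture, sigmoid hidden layer, MSE loss, and toy data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                 # toy inputs
y = rng.normal(size=(64, 1))                 # toy targets

W1, b1 = rng.normal(size=(3, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    # 1. Forward pass: compute activations layer by layer.
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2                     # linear output layer

    # 2. Loss calculation: mean squared error.
    loss = np.mean((y_hat - y) ** 2)

    # 3. Backward pass: chain rule, from output back toward input.
    d_yhat = 2 * (y_hat - y) / len(y)        # dL/dy_hat
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T                     # propagate error to hidden layer
    d_z1 = d_a1 * a1 * (1 - a1)              # sigmoid' = a1 * (1 - a1)
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # 4. Parameter update: gradient descent step on every weight and bias.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # 5. Repeating steps 1-4 drives the loss down
```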

The back propagation algorithm efficiently computes the gradients by leveraging the chain rule of calculus, which allows for the propagation of errors from the output layer to the input layer. This allows the network to adjust its weights and biases based on the magnitude and direction of the gradients, improving its ability to make accurate predictions.

Back propagation is a key component in training deep neural networks, enabling them to learn complex representations from large amounts of data. It has revolutionized the field of machine learning and has been instrumental in the success of various applications, such as computer vision, natural language processing, and reinforcement learning.

Activation Function

The purpose of the activation function in neural networks is to introduce non-linearity into the network’s output. Activation functions are applied to the outputs of individual neurons or to the entire layer of neurons in a neural network.

The non-linearity introduced by activation functions enables neural networks to learn and model complex relationships in the data. Without activation functions, a neural network would simply be a linear combination of its input, and multiple layers of linear transformations would collapse into a single linear transformation. As a result, the network’s ability to approximate non-linear functions and capture complex patterns would be severely limited.

Activation functions bring several benefits to neural networks:

1. Non-linearity: Activation functions allow neural networks to learn non-linear mappings between inputs and outputs. This is crucial for handling complex data patterns and capturing non-linear relationships in the data.

2. Model Expressiveness: By introducing non-linearity, activation functions increase the model’s expressiveness, enabling it to represent a wider range of functions and make more sophisticated predictions.

3. Gradient Propagation: Activation functions help propagate gradients during the backpropagation algorithm, which is used to train neural networks. By applying the activation function to the neuron’s output, the gradient can be computed and used to update the network’s weights during the training process.

There are various types of activation functions used in neural networks, each with its own characteristics. Some common activation functions include:

1. Sigmoid: σ(z) = 1 / (1 + e^(-z)), which squashes its input into the range (0, 1).

2. Tanh: tanh(z), which squashes its input into the range (-1, 1) and is zero-centered.

3. ReLU: f(z) = max(0, z), which is cheap to compute and helps mitigate vanishing gradients.

4. Softmax: softmax(z_i) = e^(z_i) / Σ_j e^(z_j), which converts a vector of scores into a probability distribution.

Softmax is particularly useful for multi-class classification tasks because it assigns probabilities to each class in a mutually exclusive manner. It allows the neural network to output a probability distribution over all possible classes, making it suitable for problems where an input can belong to only one class. Here, the exponentiation ensures that the resulting values are positive, and the normalization ensures that the outputs lie between 0 and 1 and sum to 1, representing a valid probability distribution.
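A minimal softmax implementation following this description; subtracting the maximum before exponentiating is a standard numerical-stability trick not mentioned above:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; exp keeps values positive,
    # normalization makes them sum to 1 (a valid probability distribution).
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```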

The choice of activation function depends on the nature of the problem and the characteristics of the data. Different activation functions have different properties in terms of their range, smoothness, and computational efficiency. Selecting an appropriate activation function is important to ensure that the neural network can effectively model the desired relationships in the data and achieve good performance.

Bias in NN

In neural networks, bias refers to a trainable parameter that allows the network to make predictions that are not solely based on the input data. It represents the intercept term or the offset of the decision boundary.

The bias term is added to the weighted sum of inputs in each neuron and is passed through the activation function to introduce a shift or a bias in the output. It allows the neural network to capture patterns that may not necessarily pass through the origin or have a zero mean.

The presence of a bias term in neural networks helps the model to fit more complex and diverse patterns in the data. It provides flexibility in shifting and adjusting the decision boundary, enabling the network to better capture the relationships between inputs and outputs.

To reduce bias in neural networks, one approach is to use more complex architectures with larger numbers of neurons or layers. Increasing the model’s capacity can help it capture more intricate patterns and reduce the bias in the predictions.

Additionally, using a diverse and representative training dataset can also help reduce bias. By exposing the network to a wide range of examples and variations in the data, it can learn to generalize better and make predictions that are less biased towards specific patterns or subsets of the data.

Regularization techniques such as L1 or L2 regularization can also play a role in reducing bias. These techniques introduce a penalty term in the loss function, which discourages the network from relying too heavily on individual features or parameters. Regularization helps to prevent overfitting, which can lead to high bias and poor generalization performance.

Finally, it’s important to strike a balance between reducing bias and increasing variance. While reducing bias is desirable, it’s crucial to avoid overfitting the training data, as this can lead to high variance and poor performance on unseen data. Regularization techniques and model validation with proper evaluation metrics can help find an optimal balance between bias and variance.

Feedback Connection

In a neural network, feedback connections refer to connections that allow information to flow in a loop, where the output of a neuron or a layer is fed back as input to the same or previous layers. These feedback connections enable the network to incorporate information from previous time steps or iterations, allowing it to capture temporal dependencies and context.

Mathematically, feedback connections can be represented by recurrent connections in the network architecture. One of the commonly used recurrent neural network (RNN) architectures is the Elman network, which consists of a hidden layer with feedback connections. Let's consider a simple Elman network with one hidden layer and one output layer.

The hidden layer in an Elman network has two sets of weights: one set of weights for the current input, and another set of weights for the feedback connections from the previous hidden state. The equations for computing the activations in an Elman network can be described as follows:

1. Input to the hidden layer at time step t:

H_t = f(W_x * X_t + W_h * H_(t-1) + b_h)

2. Output of the network at time step t:

Y_t = g(W_y * H_t + b_y)

The key difference in the equations above compared to a feedforward neural network is the presence of the feedback term: W_h * H_(t-1). This term allows the network to incorporate information from previous time steps, enabling it to model sequential data and capture temporal dependencies.

During the training process, the weights of the network, including the feedback connections, are updated using backpropagation through time (BPTT), which is a variant of the standard backpropagation algorithm. BPTT computes gradients for the feedback connections by propagating errors through time, taking into account the dependencies introduced by the recurrent connections.
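As a minimal sketch, one Elman step in NumPy, showing the feedback term W_h @ h_prev from the equations above; the layer sizes, weight scaling, and tanh/linear activations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 2
W_x = rng.normal(size=(hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # feedback (recurrent) weights
W_y = rng.normal(size=(output_size, hidden_size)) * 0.1  # hidden-to-output weights
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

def elman_step(x_t, h_prev):
    # The feedback term W_h @ h_prev carries information from the previous step.
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 inputs
    h, y = elman_step(x_t, h)
```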

In summary, feedback connections in a neural network, such as those found in recurrent neural networks, allow the network to retain and utilize information from previous time steps or iterations. This enables the network to model sequential data and capture temporal dependencies, making it suitable for tasks such as time series analysis, natural language processing, and speech recognition.

Dropout Regularization in NN

The purpose of dropout regularization in neural networks is to mitigate overfitting and improve the generalization ability of the model. Dropout randomly “drops out” or deactivates a proportion of neurons in a layer during training, forcing the remaining neurons to learn more robust and independent features.

The key idea behind dropout is to introduce noise and reduce the interdependence between neurons. By randomly dropping out neurons, the network becomes less sensitive to the presence of specific neurons and is forced to distribute the learning across different combinations of neurons. This prevents complex co-adaptations between neurons, making the network more resilient and less likely to overfit to the training data.

During training, dropout is applied stochastically at each iteration. For each neuron, a binary mask is generated with a specified dropout rate (usually between 0.2 and 0.5). The mask indicates which neurons are dropped out (set to zero) and which are kept active (scaled by 1/(1 - dropout rate)). This random dropout process is performed independently for each training example and each layer.

The effect of dropout can be seen as training an ensemble of multiple thinned networks, where each network is a subset of the original network with different combinations of active neurons. At test time, the dropout is turned off, and the full network is used to make predictions.
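A sketch of this "inverted" dropout scheme, with the 1/(1 - rate) scaling applied during training so that nothing needs to change at test time:

```python
import numpy as np

def dropout(activations, rate, training, rng):
    # Inverted dropout: zero out a fraction `rate` of units during training
    # and scale the survivors by 1 / (1 - rate); at test time, pass through.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((2, 8))
print(dropout(a, rate=0.5, training=True, rng=rng))   # roughly half zeros, rest 2.0
print(dropout(a, rate=0.5, training=False, rng=rng))  # unchanged at test time
```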

The benefits of dropout regularization include:

1. Regularization: Dropout acts as a form of regularization by preventing overfitting. It discourages the network from relying too heavily on specific features or neurons, forcing it to learn more generalized representations.

2. Improved Generalization: Dropout improves the generalization performance of the network by reducing the co-adaptation of neurons. It encourages the network to learn more diverse and informative features that are more likely to generalize well to unseen data.

3. Robustness: Dropout enhances the robustness of the model by introducing redundancy. As different subsets of neurons are dropped out during training, the network becomes more robust to missing or noisy inputs at test time.

Overall, dropout regularization is a powerful technique that helps neural networks generalize better and reduce overfitting by introducing noise and preventing complex co-adaptations between neurons.

Weight Decay Regularization

The purpose of weight decay regularization in neural networks is to prevent overfitting and improve generalization by encouraging the model to have smaller weights. It is also known as L2 regularization or ridge regularization.

In weight decay regularization, an additional term is added to the loss function during training that penalizes large weights. This term is proportional to the squared magnitude of the weights, thus encouraging them to be smaller. The effect is that the model will not rely heavily on any single feature and will distribute its learning across all the features, resulting in a more robust and generalizable model.

The weight decay term is added to the loss function as follows:

Total_Loss = Loss + λ * ||W||²

Here, Loss represents the original loss function (e.g., mean squared error or cross-entropy loss), W represents the weights of the neural network, ||W||² represents the squared L2 norm of the weight matrix, and λ is the regularization parameter that controls the strength of the regularization.

During training, the model aims to minimize this combined loss, which encourages the weights to be small while also minimizing the original loss function. The λ parameter determines the trade-off between the two objectives, with higher values of λ emphasizing more on weight decay regularization.

By adding the weight decay term to the loss function, the model is discouraged from assigning excessive importance to individual features, reducing the risk of overfitting. The regularization term acts as a form of control on the complexity of the model, helping to prevent the model from becoming too sensitive to the training data.

Overall, weight decay regularization helps to improve the generalization ability of the neural network by preventing overfitting and promoting more balanced and smoother weight values.
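A minimal sketch of how the penalty and its gradient enter a training step; the matrix sizes, λ, learning rate, and the stand-in data gradient are illustrative placeholders:

```python
import numpy as np

def l2_penalty(W, lam):
    # lam * ||W||^2: the weight decay term added to the data loss.
    return lam * np.sum(W ** 2)

# Illustrative gradient step for one weight matrix W: the penalty adds
# 2 * lam * W to the data gradient, shrinking ("decaying") W each update.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
lam, lr = 0.01, 0.1
data_grad = rng.normal(size=(4, 4))          # stand-in for dLoss/dW
total_grad = data_grad + 2 * lam * W         # d(Loss + lam * ||W||^2)/dW
W = W - lr * total_grad
```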

RNN

Architecture:

A Recurrent Neural Network (RNN) is designed to handle sequential data by incorporating recurrent connections. It has a hidden state that allows information to be passed from one step to the next. Let’s go through the architecture and the mathematical explanations:

1. Recurrent Connection:

- In an RNN, each step in the sequence takes an input and produces an output along with a hidden state.

- The hidden state at time step t is computed based on the input at time step t and the previous hidden state.

- The hidden state at time step t can be represented as:

h_t = f(W_x * x_t + W_h * h_(t-1) + b_h)

2. Output Computation:

- The output at each time step can be computed based on the hidden state at that time step.

- The output at time step t can be represented as:

y_t = g(W_y * h_t + b_y)

Learning and Weight Updates:

RNNs, similar to other neural networks, use gradient-based optimization algorithms to update the weights during the learning process. Here’s an overview of the weight update process:

1. Forward Propagation:

- In forward propagation, the input sequence is fed through the RNN, and the outputs and hidden states are computed at each time step.

2. Loss Calculation:

- A loss function, such as mean squared error (MSE) or cross-entropy loss, is used to quantify the difference between the predicted outputs and the true labels.

3. Backpropagation Through Time (BPTT):

- Backpropagation Through Time (BPTT) is a variant of backpropagation used for training RNNs.

- BPTT calculates the gradients of the loss with respect to the network parameters by unrolling the computation through time.

- The gradients are computed at each time step and accumulated over the entire sequence.

4. Weight Update:

- The weights are updated using an optimization algorithm, typically SGD.

- Mathematically, the weight update can be represented as:

W_new = W_old - learning_rate * ∇L(W_old)

- The learning rate controls the step size during weight updates, similar to other neural network architectures.

5. Iterative Training:

- The forward propagation, loss calculation, backpropagation through time, and weight update steps are repeated for multiple iterations or epochs to train the model.

- The objective is to minimize the loss function and improve the model’s performance on the training data.
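Putting the forward pass into code: a minimal sketch of unrolling the recurrence over a sequence and accumulating a per-step loss. The sizes, initialization, and MSE loss are illustrative; BPTT would reuse the stored hidden states on the backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 6, 3, 5, 2
xs = rng.normal(size=(T, input_size))          # input sequence
targets = rng.normal(size=(T, output_size))    # per-step targets

W_x = rng.normal(size=(hidden_size, input_size)) * 0.1
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_y = rng.normal(size=(output_size, hidden_size)) * 0.1

# Forward propagation: unroll the recurrence over the whole sequence,
# keeping every hidden state (BPTT needs them for the backward pass).
hs = [np.zeros(hidden_size)]
loss = 0.0
for t in range(T):
    h_t = np.tanh(W_x @ xs[t] + W_h @ hs[-1])   # h_t depends on h_(t-1)
    y_t = W_y @ h_t
    hs.append(h_t)
    loss += np.mean((y_t - targets[t]) ** 2)    # accumulate MSE over time
loss /= T
```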

The recurrent connections in an RNN allow the model to capture dependencies and patterns in sequential data. The backpropagation through time algorithm helps propagate the gradients through the unfolded time steps, enabling the model to learn the optimal weights. The exact details of the RNN architecture and learning process can vary depending on the specific RNN variant (e.g., LSTM, GRU) and the task at hand.

Vanishing Gradient

The vanishing gradient problem in traditional recurrent neural networks (RNNs) refers to the issue where the gradients during the backpropagation process diminish exponentially as they propagate from the output layer to the earlier layers. This makes it difficult for the network to learn long-term dependencies in sequential data.

During backpropagation, gradients are computed by recursively multiplying gradients at each time step, starting from the output and propagating backwards to the initial time step. In RNNs, the gradients at each time step are obtained by taking the derivative of the loss function with respect to the parameters and recursively multiplying it with the previous time step’s gradient.

The problem occurs when the gradients diminish significantly as they propagate backward through time. This happens because the gradients are multiplied by the weights of the recurrent connections in each time step. If these weights are less than 1, the gradients can exponentially decay as they are multiplied repeatedly, resulting in very small gradients. As a result, the network fails to learn long-range dependencies and may struggle to capture information from earlier time steps.

The vanishing gradient problem hinders the training process as the small gradients provide weak signals for updating the network’s parameters. This leads to slow convergence, difficulty in capturing long-term dependencies, and limitations in the overall performance of the model.

Consider a traditional RNN with a single hidden layer. The hidden state at time step t can be computed as:

h_t = σ(W_h * h_(t-1) + W_x * x_t + b_h)

During the backpropagation process, the gradients are calculated with respect to the weights W_h, W_x. Specifically, the gradients of the loss function with respect to the hidden state at time step t, denoted as δL/δh(t) are propagated backward through time.

To illustrate the vanishing gradient problem, let's consider the case where the activation function is the sigmoid:

σ(z) = 1 / (1 + e^(-z)), with derivative σ'(z) = σ(z) * (1 - σ(z)), which is at most 0.25.

When calculating the gradient of the hidden state at time step t with respect to the weights W_h, W_x, we need to multiply the gradients by the following factor for each step backward in time:

δh(t)/δh(t-1) = σ'(W_h * h_(t-1) + W_x * x_t + b_h) * W_h

Since the derivative of the sigmoid function is less than 1 for most values, the gradients can diminish exponentially as they propagate backward through time, leading to the vanishing gradient problem.
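A quick numeric illustration of this decay, assuming a scalar hidden state with recurrent weight w_h = 0.9 and the worst-case sigmoid derivative of 0.25 (both values are arbitrary for the example):

```python
# Per-step backward factor for a scalar RNN: sigma'(z) * w_h <= 0.25 * w_h.
w_h = 0.9
max_sigmoid_deriv = 0.25
for steps_back in (1, 5, 10, 20):
    factor = (max_sigmoid_deriv * w_h) ** steps_back
    print(steps_back, factor)   # 0.225, ~5.8e-04, ~3.4e-07, ~1.1e-13
```

After only 20 steps the gradient contribution is on the order of 1e-13, which is why earlier time steps receive essentially no learning signal.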

As a result, the RNN struggles to capture long-term dependencies in the input sequence, as the gradients become too small to effectively update the weights in the earlier layers. This limitation can hinder the learning process and affect the model’s ability to make accurate predictions on sequential data.

To address the vanishing gradient problem, various techniques have been developed, such as using activation functions that alleviate the saturation problem (e.g., ReLU), employing different weight initialization strategies, using gradient clipping to limit the magnitude of gradients, and using specialized architectures like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) that incorporate gating mechanisms to control the flow of information and gradients. These techniques help mitigate the vanishing gradient problem and enable more effective training of deep RNNs.

LSTM

LSTM is designed to address the vanishing gradient problem in traditional RNNs and is capable of capturing long-term dependencies in sequential data. It achieves this by introducing memory cells and gates that control the flow of information. Here’s an overview of the LSTM architecture and the mathematical equations involved:

1. LSTM Cell Structure:

- LSTM consists of memory cells that store and update information over time.

- Each memory cell has three main components: an input gate, a forget gate, and an output gate.

2. Input Gate:

- The input gate determines how much new information should be stored in the memory cell.

- The input gate can be represented as:

i_t = σ(W_i * [h_(t-1), x_t] + b_i)

3. Forget Gate:

- The forget gate controls the amount of information that is forgotten from the memory cell.

- The forget gate can be represented as:

f_t = σ(W_f * [h_(t-1), x_t] + b_f)

4. Cell State Update:

- The cell state (memory) is updated based on the input and forget gates, as well as a candidate update.

- The candidate update and the updated cell state can be represented as:

C̃_t = tanh(W_c * [h_(t-1), x_t] + b_c)

C_t = f_t * C_(t-1) + i_t * C̃_t

5. Output Gate:

- The output gate determines how much of the updated cell state should be passed to the next time step.

- The output gate can be represented as:

o_t = σ(W_o * [h_(t-1), x_t] + b_o)

6. Hidden State:

- The hidden state is computed based on the updated cell state and the output gate.

- The hidden state can be represented as:

h_t = o_t * tanh(C_t)
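Putting the gate equations together, a minimal single-step LSTM cell in NumPy; the weight shapes assume the concatenation [h_(t-1), x_t], and the sizes and initialization are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    # One LSTM step following the gate equations above.
    W_i, b_i, W_f, b_f, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])       # [h_(t-1), x_t]
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    c_tilde = np.tanh(W_c @ z + b_c)        # candidate cell update
    c_t = f_t * c_prev + i_t * c_tilde      # cell state update
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3

def make_gate():
    return rng.normal(size=(hidden, hidden + inp)) * 0.1, np.zeros(hidden)

W_i, b_i = make_gate(); W_f, b_f = make_gate()
W_c, b_c = make_gate(); W_o, b_o = make_gate()
params = (W_i, b_i, W_f, b_f, W_c, b_c, W_o, b_o)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):       # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
```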

GRU

GRU is another variant of RNN that simplifies the LSTM architecture by combining the forget and input gates into a single update gate. It has a more compact structure while still being effective in capturing long-term dependencies. Here’s an overview of the GRU architecture and the mathematical equations involved:

1. Update Gate:

- The update gate determines how much information from the previous hidden state should be passed to the current time step.

- The update gate can be represented as:

z_t = σ(W_z * [h_(t-1), x_t] + b_z)

2. Reset Gate:

- The reset gate controls how much of the previous hidden state should be ignored when computing the current hidden state.

- The reset gate can be represented as:

r_t = σ(W_r * [h_(t-1), x_t] + b_r)

3. Candidate Hidden State:

- The candidate hidden state is a combination of the current input and the reset gate-applied previous hidden state.

- The candidate hidden state can be represented as:

h̃_t = tanh(W_h * [r_t * h_(t-1), x_t] + b_h)

4. Final Hidden State:

- The final hidden state interpolates between the previous hidden state h_(t-1) and the candidate h̃_t, controlled by the update gate:

h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t
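The corresponding single-step GRU cell, with biases omitted for brevity and arbitrary sizes and weight initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    # One GRU step following the equations above (biases omitted).
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde                     # final hidden state

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + inp)) * 0.1 for _ in range(3))
h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):       # a toy sequence of 5 inputs
    h = gru_step(x_t, h, W_z, W_r, W_h)
```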

CNN

Architecture:

A typical CNN consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers. Below is a description of each layer:

1. Convolutional Layer:

- The convolutional layer performs the convolution operation, which involves applying filters (kernels) to the input data to extract local features.

- The convolution operation for a single filter can be represented as:

Output(i, j) = Activation( Σ_x Σ_y Input(x, y) * Filter(i - x, j - y) + Bias )

- Here,

  • (i, j) represents the position in the output feature map,
  • (x, y) represents the position in the input data,
  • Filter(i-x, j-y) denotes the filter weights.
  • The bias term is added to introduce a shift in the activation.
  • The activation function introduces non-linearity, allowing the model to capture complex patterns.

2. Pooling Layer:

- The pooling layer reduces the spatial dimensions of the feature maps, capturing important information while reducing computation.

- Max pooling is a commonly used technique in CNNs, where the maximum value within a pooling window is selected as the output.

- Max pooling can be represented as:

Output(i, j) = max Input(x, y), over all (x, y) in the pooling window centered at (i, j)

3. Fully Connected Layer:

- Fully connected layers connect every neuron in the current layer to every neuron in the next layer.

- The output of a fully connected layer is computed as a linear combination of the input neurons followed by an activation function.

- The output of a fully connected layer can be represented as:

Output = Activation(W * Input + b)
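A naive NumPy sketch of the three layer types above on a single-channel image; as in most deep learning libraries, the "convolution" is implemented as cross-correlation, a ReLU stands in for the activation, and all sizes are illustrative:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    # Naive valid convolution (implemented as cross-correlation) of a 2D
    # image with a single 2D filter, followed by a ReLU activation.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel) + bias
    return np.maximum(out, 0)

def max_pool(feature_map, size=2):
    # Keep the maximum value in each non-overlapping `size` x `size` window.
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max()
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
pooled = max_pool(conv2d(image, kernel))    # (8,8) -> (6,6) -> (3,3)
flat = pooled.reshape(-1)                   # flatten for the fully connected layer
W_fc, b_fc = rng.normal(size=(10, flat.size)) * 0.1, np.zeros(10)
logits = W_fc @ flat + b_fc                 # fully connected: W * Input + b
```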

Learning and Weight Updates:

CNNs use gradient-based optimization algorithms, such as Stochastic Gradient Descent (SGD), to update the weights during the learning process. Here’s an overview of the weight update process:

1. Forward Propagation:

- In forward propagation, the input data is fed through the layers, and the output predictions are calculated.

- The activations and outputs at each layer are computed using the appropriate mathematical operations described earlier.

2. Loss Calculation:

- A loss function, such as mean squared error (MSE) or cross-entropy loss, is used to quantify the difference between the predicted outputs and the true labels.

3. Backpropagation:

- Backpropagation is used to compute the gradients of the loss with respect to the network parameters.

- The gradients are calculated layer-by-layer, starting from the last layer and moving backward.

- The chain rule of calculus is used to calculate the gradients efficiently.

4. Weight Update:

- The weights are updated using an optimization algorithm, typically SGD.

- The weight update can be represented as:

W_new = W_old - learning_rate * ∇L(W_old)

- The learning rate controls the step size during weight updates and affects the convergence and stability of the learning process.

5. Iterative Training:

- The forward propagation, loss calculation, backpropagation, and weight update steps are repeated for multiple iterations or epochs to train the model.

- The objective is to minimize the loss function and improve the model’s performance on the training data.

This iterative training process helps the CNN learn the optimal weights that minimize the loss and improve the model’s ability to make accurate predictions. The exact details of the architecture and learning process can vary depending on the specific CNN model and task at hand.

Batch Normalization vs Layer Normalization

The main difference between batch normalization and layer normalization lies in the scope of normalization.

Batch Normalization: Batch normalization is a technique used in neural networks to normalize the activations of a layer across a mini-batch of training examples. It normalizes the input values by subtracting the batch mean and dividing by the batch standard deviation. Batch normalization is typically applied after the linear transformation and before the activation function in each layer. The normalization is performed independently for each feature dimension within the batch.

Layer Normalization: Layer normalization, on the other hand, normalizes the activations of a layer across the feature dimension (or the “layer” dimension) rather than the batch dimension. It normalizes the input values by subtracting the mean and dividing by the standard deviation calculated across all the units in the layer. Layer normalization is applied independently for each training example, treating each example as a separate “batch”.
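A side-by-side sketch of the two normalizations on a batch of activations; the learnable scale and shift parameters (γ, β) that both methods normally include are omitted for brevity:

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    # Normalize each feature (column) across the mini-batch dimension.
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    return (X - mean) / (std + eps)

def layer_norm(X, eps=1e-5):
    # Normalize each example (row) across the feature dimension.
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))            # batch of 32 examples, 16 features
print(batch_norm(X).mean(axis=0)[:3])    # ~0 per feature
print(layer_norm(X).mean(axis=1)[:3])    # ~0 per example
```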

Key Differences:

1. Normalization Scope: Batch normalization normalizes the activations across the mini-batch, while layer normalization normalizes the activations across the layer or feature dimension.

2. Training and Inference: Batch normalization computes the mean and standard deviation of the mini-batch during training and uses these values to normalize the activations. During inference, it uses the estimated population statistics (mean and standard deviation) computed during training. Layer normalization, on the other hand, calculates the mean and standard deviation for each training example during both training and inference.

3. Dependency: Batch normalization introduces a dependency on the batch size since it normalizes the activations across the mini-batch. Layer normalization, being independent of the batch size, can be more suitable for scenarios where the batch size is small or varies.

4. Application: Batch normalization is commonly used in convolutional neural networks (CNNs) and deep neural networks (DNNs), where mini-batch processing is prevalent. Layer normalization is often applied in recurrent neural networks (RNNs) or sequential models, where the length of the sequences can vary.

Both batch normalization and layer normalization aim to address the internal covariate shift problem and improve the training of neural networks. They help stabilize and accelerate the training process by normalizing the activations, but they differ in terms of their scope and dependency on batch size. The choice between them depends on the specific requirements of the model and the nature of the data.

Part I: Fundamentals
Part II: Traditional ML Algorithms
