Cheat Sheet for Deep Learning

Walid Ahmed
Nov 3 · 3 min read

Underfitting: The model fails to produce low error on the training dataset. A typical remedy is to use a more complex model.

Overfitting: The model fails to produce low error on the validation dataset. Typical remedies are to use a simpler model or to add more training data.

L1 regularization: A method to limit the growth of the weights by adding a penalty term λ Σ|wᵢ| to the loss function.

L2 regularization: A method to limit the growth of the weights by adding a penalty term λ Σ wᵢ² to the loss function.
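As a minimal sketch, the two penalty terms can be computed directly (the weight values and λ here are illustrative, not from any real model):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])  # hypothetical weight vector
lam = 0.1                       # regularization strength (hyperparameter)

l1_penalty = lam * np.sum(np.abs(w))  # L1: lambda * sum(|w_i|)
l2_penalty = lam * np.sum(w ** 2)     # L2: lambda * sum(w_i^2)

# total_loss = data_loss + l1_penalty   (or + l2_penalty)
```

L1 tends to drive weights exactly to zero (sparsity), while L2 shrinks all weights smoothly toward zero.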

L1 Loss: A loss function for regression outputs; it minimizes the absolute differences between the estimated values and the target values. It is also known as "Mean Absolute Error" (MAE).

L2 Loss: A loss function for regression outputs; it minimizes the squared differences between the estimated values and the target values, which makes it more sensitive to outliers. It is also known as "Mean Squared Error" (MSE).

The L2 loss function is preferred in most cases, but when outliers are present in the dataset the L1 loss function performs better.
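A short sketch makes the outlier sensitivity concrete (the data here is made up for illustration): a single outlier inflates the L2 loss far more than the L1 loss.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # last value is an outlier
y_pred = np.array([1.0, 2.0, 3.0, 4.0])

mae = np.mean(np.abs(y_true - y_pred))  # L1 loss (MAE)
mse = np.mean((y_true - y_pred) ** 2)   # L2 loss (MSE), squares the outlier error
```

Here the single error of 96 gives MAE = 24 but MSE = 2304, which is why L1 is more robust to outliers.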

Cross Entropy Loss: The type of loss associated with classification networks. Cross-entropy loss increases as the predicted probability diverges from the actual label. For one-hot labels yᵢ and predicted probabilities pᵢ, cross-entropy can be calculated as: −Σᵢ yᵢ log(pᵢ)

Batch Normalization: It normalizes the output Z of a layer by subtracting the batch mean and dividing by the batch standard deviation (just as we do for the initial input of the network) to create Z_norm. It uses two extra learnable parameters (β and ℽ) to scale and shift Z_norm before it is passed to the activation function. It decreases overfitting and helps allow a larger learning rate.
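The forward pass can be sketched in a few lines (here with scalar γ and β for simplicity; in a real layer they are learned per-feature):

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    mu = Z.mean(axis=0)                      # batch mean, per feature
    var = Z.var(axis=0)                      # batch variance, per feature
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * Z_norm + beta             # learnable scale and shift

Z = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                   # batch of 3 examples, 2 features
out = batch_norm(Z, gamma=1.0, beta=0.0)
```

With γ = 1 and β = 0 the output of each feature has mean ≈ 0 and standard deviation ≈ 1 across the batch.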

Vanishing gradients: In deep networks, gradients become so small that the weights stop changing; better weight initialization can help reduce this problem.

Exploding gradients: In deep networks, gradients become so large that the weights grow dramatically; better weight initialization can help reduce this problem.
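One common "better initialization" the text alludes to is Glorot/Xavier initialization, sketched below; it scales the weight range by the layer's fan-in and fan-out so that activations and gradients neither shrink nor blow up as they pass through many layers:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Glorot/Xavier uniform initialization:
    # draw weights from U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_init(256, 128, rng)  # weight matrix for a 256 -> 128 layer
```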

ReLU: A common activation function used in deep learning. It suffers less from the vanishing gradient problem than the sigmoid function, which makes it more convenient for deep networks.
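A quick sketch shows why: for a large input, the sigmoid's gradient is nearly zero, while ReLU's gradient stays at 1 for any positive input.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 5.0
relu_grad = 1.0 if x > 0 else 0.0           # ReLU gradient: 1 for x > 0
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid gradient: ~0.0066 at x = 5
```

Multiplying many such tiny sigmoid gradients across layers is exactly what makes gradients vanish in deep sigmoid networks.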

Data Normalization: Forcing all your data onto the same scale (e.g. 0–1).
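The simplest version is min-max scaling, sketched here on illustrative values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
x_norm = (x - x.min()) / (x.max() - x.min())  # rescale to the range [0, 1]
```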

Convolution layer benefits: 1- uses the spatial information in the data. 2- is trained to capture useful features. 3- is translation invariant, as it scans the whole image (and still benefits from max pooling).

Pooling benefits: 1- decreases the dimensionality of the data without much loss of information. 2- decreases overfitting. 3- helps find a feature even if it is shifted in the image.
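A minimal sketch of 2×2 max pooling with stride 2 (the input values are illustrative): each 2×2 block is reduced to its maximum, quartering the data size while keeping the strongest responses.

```python
import numpy as np

def max_pool_2x2(img):
    # 2x2 max pooling with stride 2 on a square feature map
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [1, 1, 4, 0]])
pooled = max_pool_2x2(img)  # 4x4 input -> 2x2 output
```

Because only the block maximum survives, the output is unchanged if a strong feature moves a pixel or two within its block, which is the shift tolerance the text describes.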

Residual Networks: Created by Microsoft, they include residual blocks with skip connections, where the activation from a unit contributes directly to the computation of deeper units. This helps reduce vanishing gradients, which makes them more convenient for deep networks.
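The core idea can be sketched as "output = F(x) + x" (here F is a toy one-layer transform; real residual blocks use convolutions and batch normalization):

```python
import numpy as np

def residual_block(x, W):
    # F(x): the block's learned transform (a toy linear + ReLU here)
    fx = np.maximum(0.0, W @ x)
    # skip connection: add the input back, so gradients can flow
    # around F(x) even when its own gradient is tiny
    return fx + x

x = np.array([1.0, 2.0])
W = np.zeros((2, 2))            # extreme case: F learns nothing
out = residual_block(x, W)      # output is still x, passed through intact
```

Even in the extreme case where F(x) contributes nothing, the block passes x through unchanged, which is why very deep residual networks remain trainable.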

Recurrent Neural Network: The output for an input x_i is affected by the outputs for the previous inputs in the sequence.

Dropout: Decreases overfitting by decreasing the dependence between different neurons.
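A sketch of the standard "inverted dropout" formulation: at training time each activation is zeroed with probability p and the survivors are rescaled, so the expected activation is unchanged and no rescaling is needed at test time.

```python
import numpy as np

def dropout(a, p_drop, rng):
    # inverted dropout: zero units with probability p_drop,
    # scale survivors by 1/(1 - p_drop) to keep the expected value
    mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones(1000)               # toy activations
out = dropout(a, 0.5, rng)      # ~half are zero, the rest are 2.0
```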

Maximum likelihood estimation (MLE): Finds the parameter values that best fit a distribution to the data, i.e. the values that maximize the probability of observing the data.
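A classic worked case: for a Gaussian, the MLE of the mean is the sample average and the MLE of the variance is the population variance (dividing by N, not N−1). The data below is illustrative.

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])

mu_mle = data.mean()        # maximizes the Gaussian log-likelihood over mu
sigma_mle = data.std()      # NumPy's default std divides by N, matching the MLE
```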
