The Essential Deep Learning Mathematics Cheat Sheet

Shitij Nigam
8 min read · Apr 17, 2024


Table of Contents

Activation Functions
ReLU
Sigmoid Activation Function
Softmax
ArgMax
TanH
Loss Functions
Mean Squared Error
Cross Entropy Loss
Important Loss Adjustments
Calculus for Backpropagation
Common Derivatives
Important Calculus Concepts
Tensor Related Math
Matrix Multiplication
Broadcasting
Other Concepts
One Hot Encoding
Dot Product
Logits

Activation Functions

Activation functions are usually applied to the output of a math equation (e.g. a weighted sum of inputs and parameters), squashing or transforming the result and, in some ways, helping us decide how much each parameter matters.

ReLU

Also known as the Rectified Linear Unit. Basically ReLU(x) = max(0, x). Can be combined with other ReLUs to form fancy piecewise-linear fitting graphs. Super cool in my opinion (and a cute name to boot).
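
A minimal sketch of both ideas in plain Python (the coefficients are made up):

```python
def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

def piecewise(x):
    # Summing scaled / shifted ReLUs produces a piecewise-linear "fancy fitting graph"
    return 2 * relu(x - 1) - 3 * relu(x - 2) + relu(x - 4)

print([piecewise(float(x)) for x in range(6)])  # [0.0, 0.0, 2.0, 1.0, 0.0, 0.0]
```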

Sigmoid Activation Function

Basically squishes numbers into a nice smooth curve ranging from 0 to 1 via sigmoid(x) = 1 / (1 + e^(-x)), with extremely large and extremely small numbers having a diminishing effect as the curve saturates

Euler’s number is the real MVP of math
As they say, when in doubt, chuck it through a sigmoid
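
A one-liner sketch, assuming the standard definition sigmoid(x) = 1 / (1 + e^(-x)):

```python
import math

def sigmoid(x):
    # Squashes any real number into (0, 1); very large or very small inputs saturate
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.000045, 0.5, ~0.999955
```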

Softmax

Softmax normalizes a set of raw scores into probabilities by passing them through the following function: e^(n_i) / (e^(n_1) + e^(n_2) + … + e^(n_N)) for N categories. It is especially useful when predictions can be negative, since e^x is always positive

softmax!
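
A minimal sketch (real implementations subtract the max score first for numerical stability, which is skipped here):

```python
import torch

def softmax(logits):
    # e^x turns every score positive (even negative ones), then we normalize so they sum to 1
    exps = torch.exp(logits)
    return exps / exps.sum()

print(softmax(torch.tensor([2.0, -1.0, 0.5])))  # tensor([0.7856, 0.0391, 0.1753])
```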

ArgMax

Returns the index of the max value of an array. Pretty nifty when a range of predictions is being made for the same input

Index instead of value, which is the important bit
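
For example, with PyTorch (the probabilities are made up):

```python
import torch

probs = torch.tensor([0.1, 0.7, 0.2])  # e.g. softmax output over 3 categories
print(torch.argmax(probs))             # tensor(1) -> the index of the winner, not the value 0.7
```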

TanH

Similar to sigmoid; it transforms inputs into a range between -1 and 1.

Note the difference vs. sigmoid, which converts inputs to a range of y = 0 to 1, vs. tanH which converts to a range of y = -1 to 1
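
A tiny comparison of the two ranges (inputs are made up):

```python
import math

xs = [-5.0, 0.0, 5.0]
print([round(math.tanh(x), 3) for x in xs])            # [-1.0, 0.0, 1.0] -> tanh lives in (-1, 1)
print([round(1 / (1 + math.exp(-x)), 3) for x in xs])  # [0.007, 0.5, 0.993] -> sigmoid lives in (0, 1)
```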

Loss Functions

These functions help us determine how good or bad our model is relative to our validation data. Some examples of loss functions:

Mean Squared Error

Sum of (Actual - Predictions)² / N

mean squared error!
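
A quick sketch with made-up numbers:

```python
import torch

actual = torch.tensor([3.0, 5.0, 2.0])
preds  = torch.tensor([2.5, 5.0, 4.0])

mse = ((actual - preds) ** 2).mean()  # sum of squared errors divided by N
print(mse)                            # tensor(1.4167)
```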

Cross Entropy Loss

Explained by the following formula, broken down step by step below:

CE Loss = -Σ y-true * log(softmax(y-pred))
  • (i) Take the softmax of predicted variable (‘y-pred’) across multiple categories
    N.B. I haven’t seen too many examples apply softmax to a network’s raw outputs explicitly, so I assume it is already ‘implied’ in the definition, since taking a log of negative raw values doesn’t make much sense. TLDR: if your y-pred is already a set of probabilities that sum up to 1, then you don’t need to do this step :)
  • (ii) Generate the one-hot encoded actual variable for the actual y (‘y-true’)
    i.e. assuming our actual y-values for 4 predictions are Apple, Blueberry, Orange, Apple — the one-hot encoded vector would be [1,0,0], [0,1,0],[0,0,1],[1,0,0] — with ‘1’ implying that the actual probability of prediction should be 100% for the category that was one-hot encoded, and 0% for the rest
  • (iii) Multiply the log of the softmax, i.e. log(softmax(y-pred)), with the one-hot encoded probability
  • (iv) Sum it all up and add a negative

N.B. Binary cross entropy is very similar, just two cases: either you’re Schrödinger’s cat or you’re not. If your one-hot encoding of being Schrödinger’s cat is 1 and the predicted probability of being Schrödinger’s cat is 0.7, then your row of isCat, isNotCat, probCat, probNotCat is 1, 0, 0.7, 0.3, and the CE loss for that row is -1*log(0.7) - 0*log(0.3). Easy! Sum it all up across all predictions, and you get the total cross entropy loss for a set of predictions
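
A sketch of steps (i)-(iv) using the Apple / Blueberry / Orange example above, with made-up logits; it averages over rows so it matches PyTorch's built-in F.cross_entropy (whose default reduction is the mean):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],   # prediction for an "Apple" example
                       [0.2, 1.5, 0.3],   # "Blueberry"
                       [0.1, 0.2, 2.2],   # "Orange"
                       [1.8, 0.4, 0.6]])  # "Apple"
y_true = torch.tensor([0, 1, 2, 0])       # class indices: Apple, Blueberry, Orange, Apple

probs   = F.softmax(logits, dim=1)                      # step (i)
one_hot = F.one_hot(y_true, num_classes=3).float()      # step (ii)
loss    = -(one_hot * probs.log()).sum() / len(y_true)  # steps (iii) + (iv), averaged per row

print(loss, F.cross_entropy(logits, y_true))            # the two values match
```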

Important Loss Adjustments

Regularization is used to combat overfitting by keeping weights under control. The idea is to add an extra term or “penalty” to your loss function to “regularize” it so that your weights don’t become too large.

  • L1 regularization (Lasso) adds the sum of the absolute values of the weights to the loss function. This tends to drive some weights all the way down to zero and can help with feature selection
  • L2 regularization (Ridge) adds the sum of squares of weights to the loss function; this doesn’t necessarily drive weights to zero but does shrink them towards zero to minimize the overall loss
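
A sketch of how the penalty attaches to the loss (the tensors and the lambda value are hypothetical):

```python
import torch

weights   = torch.randn(10, requires_grad=True)
base_loss = torch.tensor(1.25)  # pretend this came from MSE or cross entropy
lam = 0.01                      # regularization strength

l1_loss = base_loss + lam * weights.abs().sum()   # Lasso: sum of absolute weights
l2_loss = base_loss + lam * (weights ** 2).sum()  # Ridge: sum of squared weights
```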

Calculus for Backpropagation

Common Derivatives

  • f(x) = a * x; f’(x) = a
    if a = 1, f(x) = x, and f’(x) = 1
  • f(x) = x^n; f’(x) = n*(x^(n-1))
    e.g. f(x) = 3x³, f’(x) = 9x² (combined with the above rule)
  • f(x) = log₁₀(x); f’(x) = 1/(x*ln(10))
    N.B. ln is a log function with base e. 10 is replaced by the corresponding base of the log function. If the base is e instead of 10, i.e. f(x) = ln(x), then f’(x) = 1/x, since ln(e) = 1
    Useful for Sigmoid activation
  • f(x) = tanh(x); f’(x) = 1 - (tanh(x))^2
    Useful for TanH activation
  • f(x) = e^x; f’(x) = e^x
  • f(x) = a @ x (matrix multiplication; more on this below in the Matrix Multiplication section). The general formula is awkward, but in practice, if b = a @ x, then dL/da = dL/db @ x^T and dL/dx = a^T @ dL/db; the easiest sanity check is to match the dimensions of each gradient to the dimensions of the tensor it belongs to (see the sketch after this list)
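
One way to sanity-check these rules is against PyTorch autograd; a sketch for the tanh rule and the matrix-multiplication shapes:

```python
import torch

# tanh rule: f'(x) = 1 - tanh(x)^2
x = torch.tensor(0.7, requires_grad=True)
torch.tanh(x).backward()
print(x.grad, 1 - torch.tanh(torch.tensor(0.7)) ** 2)  # both ≈ 0.635

# matmul rule by shape: for b = a @ W, each gradient has the shape of its own tensor
a = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 5, requires_grad=True)
(a @ W).sum().backward()
print(a.grad.shape, W.grad.shape)  # torch.Size([4, 3]) torch.Size([3, 5])
```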

Important Calculus Concepts

  • Chain Rule: Possibly the most useful tool. If L = a + b, a = x + y, x = m + n, then dL/dm = (dL/da) * (da/dx) * (dx/dm)
    Essentially derivatives have multiplicative properties and can be ‘chained’ together
  • Gradients are additive, e.g. if a tensor / variable occurs multiple times in a neural network, then its impact (and therefore its gradient) on the final output (e.g. on the loss function) will cumulatively add up, since a small change to that tensor / variable has an effect through every place it appears
  • When backpropagating across tensors, it’s important to check the shape of each tensor and ensure consistency in gradient and tensor dimensions
    e.g. if x_normal = x - x_max, with x being a [10, 10] tensor and x_max being a [10, 1] tensor (i.e. the max of each row), then the gradient of x_max should be a tensor of dimensions [10, 1]
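
A small autograd sketch of the first two bullets: gradients multiply through a chain of intermediate variables and add up when a variable is used more than once (the variables here are hypothetical):

```python
import torch

m = torch.tensor(2.0, requires_grad=True)

x = m * 3     # dx/dm = 3
a = x ** 2    # da/dx = 2x = 12
L = a + a     # 'a' is used twice, so its gradients add: dL/da = 1 + 1 = 2
L.backward()

print(m.grad)  # dL/dm = (dL/da) * (da/dx) * (dx/dm) = 2 * 12 * 3 -> tensor(72.)
```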

Tensor Related Math

Matrix Multiplication

  • Basic math behind matrix multiplication: Multiply the numbers in the row of matrix 1 (‘a’) by the numbers in the column of matrix 2 (‘W’), then add them up together
    This is represented by aW = b
  • Order & Size matters: The order will dictate the size and outputs. Because the order matters, the size matters, i.e. the # of columns of matrix 1 (‘a’) should be equal to the # of rows of matrix 2 (‘W’)
    e.g. if matrix 1 (‘a’) is 1x2 (i.e. 1 row, 2 columns) and matrix 2 (‘W’) is 2x2 (i.e. 2 rows, 2 columns), then each row of ‘a’ is multiplied element-wise with each column of ‘W’ and summed up to generate one item of the final matrix, which here ends up being 1x2
  • Final matrix: The # of rows of the final matrix (‘b’) is the equal to the # of rows in matrix 1 (‘a’). The # of columns in the final matrix is equal to the # of columns in matrix 2 (‘W’)
  • Transposing the output requires flipping the order and transposing the original matrices
    e.g. if the transpose of ‘a’ is ‘A’, the transpose of ‘W’ is ‘w’, and the transpose of ‘b’ is ‘B’, then..
    - aW = b
    - wA = B

    .. the order is flipped and the final matrix is transposed
Amazing summary here: https://androidkt.com/pytorch-matrix-multiplication-from-scratch/

N.B. Why do matrix multiplication? It’s faster and more efficient than running loops: it comes down to using multiple threads instead of just one to boost performance, making better use of the cache by working on small blocks of the matrices, and SIMD instructions that operate on several numbers in parallel. Python is slow, especially for loops, but matrix multiplication is fast, especially thanks to underlying optimizations done by PyTorch.
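
A quick sketch checking the shape rule and the transpose rule with small random tensors:

```python
import torch

a = torch.randn(1, 2)  # 1 row, 2 columns
W = torch.randn(2, 2)  # 2 rows, 2 columns
b = a @ W              # rows of 'a' dotted with columns of 'W'
print(b.shape)         # torch.Size([1, 2]): rows of 'a', columns of 'W'

# Transpose rule: aW = b implies W^T a^T = b^T
print(torch.allclose(b.T, W.T @ a.T))  # True
```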

Broadcasting

Source: Pytorch Documentation

  • a) Each tensor should have at least one dimension
  • b) Tensor dimensions should be compared starting from the right
    e.g. when multiplying a tensor of shape (27, 25) with a tensor of shape (25,), the trailing 25 is compared with 25
  • c) Dimensions being compared to each other should either
    - have the same size e.g. (28,27) & (25) can’t be broadcast because 25 != 27, but a (28,27) & (27) tensor can be broadcast since 27==27
    - not exist, e.g (28,27) & (27) tensor can be broadcast, since the missing dimension can be filled in as 1 to make the sizes equal
    - be of size 1, e.g. (28,27) & (3,27) can’t be broadcast, but (28,27) & (1,27) can be; similarly, (27,1) & (27,27) can be broadcast, but (27,3) & (27,27) can’t be

Assuming both tensors are broadcastable, each dimension of the output is the max of the two tensors’ sizes along that dimension.
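
A sketch of the rules with the shapes used in the examples above:

```python
import torch

x   = torch.randn(28, 27)
row = torch.randn(27)     # missing left dimension is treated as 1, so it broadcasts to (28, 27)
col = torch.randn(28, 1)  # size-1 dimension is stretched across the 27 columns

print((x + row).shape, (x + col).shape)  # torch.Size([28, 27]) twice
# torch.randn(28, 27) + torch.randn(25)  # would fail: 25 != 27 on the rightmost dimension
```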

Other Concepts

One Hot Encoding

https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179
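
The link above covers the concept in depth; here is a one-call sketch with PyTorch, reusing the Apple / Blueberry / Orange example from the cross entropy section (the class order is my own assumption):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([0, 1, 2, 0])  # Apple, Blueberry, Orange, Apple as class indices
print(F.one_hot(y, num_classes=3))
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [1, 0, 0]])
```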

Dot Product

Essentially the numerator of the cosine similarity function, meant to capture similarity between vectors, minus the denominator that handles normalization

The denominator normalizes for the lengths of the vectors, which is sometimes useful but often ignored. The larger the dot product, the more likely it is that the two vectors are similar.
Numerator, basically.
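
A sketch with made-up vectors, showing the dot product as the numerator of cosine similarity:

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 4.0, 6.0])

dot    = torch.dot(u, v)              # the numerator: 1*2 + 2*4 + 3*6 = 28
cosine = dot / (u.norm() * v.norm())  # the full cosine similarity, normalized by the norms
print(dot, cosine)                    # tensor(28.) tensor(1.) -> parallel vectors are maximally similar
```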

Logits

Logits are the outputs of a network before any sort of activation function is applied (see above for activation functions). They are typically unnormalized. For example:

  • A one-hot encoded vector (X) matrix multiplied by a set of weights (W) would yield a set of values called logits, i.e. logits = X @ W
  • Passing them through an exponentiation function, i.e. counts = e^logits, will turn them into a set of strictly positive numbers (not yet normalized).
  • Normalizing counts will give probabilities, i.e. probs = counts / counts.sum(1, keepdims = True)

This is different from the logit function in statistics, which maps a probability p to log(p / (1 - p)).
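
The three bullets above, sketched end to end with a hypothetical 3-category vocabulary (this is just softmax written out by hand):

```python
import torch
import torch.nn.functional as F

X = F.one_hot(torch.tensor([0, 2, 1]), num_classes=3).float()  # one-hot encoded inputs
W = torch.randn(3, 3)                                          # weights

logits = X @ W                                 # raw, unnormalized outputs
counts = logits.exp()                          # all positive now, but not yet normalized
probs  = counts / counts.sum(1, keepdim=True)  # each row sums to 1
print(probs.sum(1))                            # tensor([1., 1., 1.])
```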

Will keep adding more relevant math as I uncover it!
