From Intuitively Understanding Variational Autoencoders by Irhum Shafkat

The model is now exposed to a certain degree of local variation by varying the encoding of one sample, resulting in smooth latent spaces on a local scale, that is, for similar samples. Ideally, we want overlap between samples that are not very similar too, in order to interpolate *be…*
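The local variation described above comes from encoding each input as a distribution and sampling from it, so the decoder sees slightly different latent points for the same input. A minimal sketch of that sampling step (the reparameterization trick), with illustrative means and standard deviations rather than values from any real encoder:

```python
import random

# Hedged sketch: a VAE encoder outputs a (mean, std) pair per input rather
# than a single point; sampling z = mean + std * eps exposes the decoder
# to local variation around the mean. All numbers here are illustrative.
def sample_latent(mean, std, rng):
    # Reparameterization: z = mean + std * eps, with eps ~ N(0, 1)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)
mean, std = [0.5, -1.2], [0.1, 0.1]

# Two samples for the SAME input land near, but not exactly on, the mean:
z1 = sample_latent(mean, std, rng)
z2 = sample_latent(mean, std, rng)
print(z1, z2)
```

Because the decoder must reconstruct the input from any of these nearby points, the latent space becomes smooth in that neighborhood.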

From Knowledge Distillation by Ujjwal Upadhyay

For distilling the learned knowledge, we use **logits** (the inputs to the final softmax). The small model can learn from them by minimizing the squared difference between the logits produced by the cumbersome model and the logits produced by the small model.
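The logit-matching objective described above can be sketched as a mean squared difference between the two models' logits. The logit values below are illustrative, not from any real teacher or student model:

```python
# Hedged sketch of logit matching for distillation: minimize the mean
# squared difference between the cumbersome (teacher) model's logits and
# the small (student) model's logits.
def logit_matching_loss(teacher_logits, student_logits):
    assert len(teacher_logits) == len(student_logits)
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n

teacher = [2.0, -1.0, 0.5]   # illustrative logits from the large model
student = [1.5, -0.5, 0.5]   # illustrative logits from the small model
print(logit_matching_loss(teacher, student))  # (0.25 + 0.25 + 0.0) / 3
```

In practice this loss would be averaged over a batch and backpropagated through the student only; the teacher's logits are treated as fixed targets.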

From Rohan & Lenny #3: Recurrent Neural Networks & LSTMs by Rohan Kapur

…xactly how much these weights contribute, and by how much we modify them to decrease overall error. To do this, we use the backpropagation algorithm; this algorithm propagates the error between the predicted output of a recurrent net and the actual output in the dataset all the way back to the beginning of the network. Using the chain rule from differential calculus, backprop helps us calculate the gradients of the output error w.r.t. each individual weight (sort of like the error of each individual weight).
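The chain-rule bookkeeping described above can be made concrete on a toy scalar "RNN": unroll the network over time, then walk backward multiplying local derivatives. This is a hedged sketch with illustrative numbers, checked against a numerical gradient rather than any real framework:

```python
import math

# Toy recurrence: h_t = tanh(w * h_{t-1} + x_t), loss = (h_T - y)^2.
def forward(w, xs, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def loss(w, xs, y):
    return (forward(w, xs)[-1] - y) ** 2

def grad_w(w, xs, y):
    hs = forward(w, xs)
    g = 2.0 * (hs[-1] - y)          # dL/dh_T
    dw = 0.0
    # Propagate the error backward through every timestep (the chain rule).
    for t in range(len(xs), 0, -1):
        pre = w * hs[t - 1] + xs[t - 1]
        dpre = g * (1.0 - math.tanh(pre) ** 2)  # tanh derivative
        dw += dpre * hs[t - 1]                  # this timestep's share of dL/dw
        g = dpre * w                            # pass the error back to h_{t-1}
    return dw

w, xs, y = 0.7, [0.5, -0.3], 0.2
analytic = grad_w(w, xs, y)
numeric = (loss(w + 1e-6, xs, y) - loss(w - 1e-6, xs, y)) / 2e-6
print(analytic, numeric)  # the two agree closely
```

The analytic gradient matching the finite-difference estimate is exactly the guarantee backpropagation gives us: each weight's contribution to the output error, computed via the chain rule.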

From Rohan & Lenny #3: Recurrent Neural Networks & LSTMs by Rohan Kapur

…each value is the tanh of what it was in the inputted vector (sort of like an element-wise tanh). Remember, this contrasts with ANNs because RNNs operate over vectors versus scalars.
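The element-wise tanh mentioned here is simply the scalar activation applied independently to every entry of the hidden-state vector. A minimal sketch:

```python
import math

# Sketch of an element-wise tanh: apply the scalar tanh to each entry of
# the vector independently, which is how RNN activations operate on the
# hidden-state vector.
def elementwise_tanh(vec):
    return [math.tanh(v) for v in vec]

print(elementwise_tanh([0.0, 1.0, -2.0]))
```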

From Rohan & Lenny #3: Recurrent Neural Networks & LSTMs by Rohan Kapur

…(the network would be extremely uninteresting if this wasn’t the case), including the bias vector. However, inside a single hidden layer, all timesteps share the same weight matrix. This is important because the number of timesteps is a variable; we may train on sequences with up …
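Weight sharing across timesteps is what lets one set of parameters handle sequences of any length. A hedged sketch on a scalar toy model, with illustrative parameter values:

```python
import math

# Sketch of weight sharing across timesteps: the SAME parameters
# (w_h, w_x, b) are reused at every step of the loop, so the number of
# timesteps can vary freely. Parameter values are illustrative.
def rnn_forward(xs, w_h=0.5, w_x=1.0, b=0.1):
    h = 0.0
    for x in xs:  # one loop iteration = one timestep, identical parameters
        h = math.tanh(w_h * h + w_x * x + b)
    return h

# The same three parameters process sequences of different lengths:
print(rnn_forward([0.2, 0.4]))
print(rnn_forward([0.2, 0.4, -0.1, 0.3]))
```

If each timestep had its own matrix, the parameter count would depend on sequence length and the network could not generalize across lengths.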