Curious programmer, tinkers around in Python and deep learning.

You’ve actually asked a pretty amazing pair of questions, so, in order:

1. The brief answer is: yes, deeper filters, even without pooling or striding, have larger receptive fields, simply because they eventually pick up “cross terms” with pixels very far away. In fact, the paper Attention Is All You Need…
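To make that growth concrete, here is a minimal sketch (the `receptive_field` helper is hypothetical, written just for this illustration) of how the receptive field of stacked stride-1 convolutions grows linearly with depth, even with no pooling at all:

```python
def receptive_field(num_layers, kernel_size=3):
    """Each stride-1 conv layer adds (kernel_size - 1) pixels
    to the receptive field seen by a single output unit."""
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1))   # 3: one 3x3 conv sees a 3x3 patch
print(receptive_field(10))  # 21: ten stacked 3x3 convs see a 21x21 patch
```

So a deep enough stack of small kernels eventually “sees” the whole image, which is exactly where those far-away cross terms come from.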

There are no explicit μ and σ for each class, but the encoding network can learn to implicitly generate different μ and σ for samples of each class. That is, the encoder will generate different μ vectors for an image of a 2 and an 8, not because it was explicitly programmed to do so (it doesn’t even have access to the labels), but because it implicitly learns to during the optimization process.

Thank you, happy to learn that the animations themselves were so useful!

The key insight here lies in the fact that we don’t force it to be 0; we force it to be as close to 0 as possible, while minimizing the overall loss. If the model were optimized with the KL loss alone, the latent vectors would be pure noise, but because it is optimized in tandem with the reconstruction loss (which would be extremely high if the latent vectors were…

Yes, they’re just standard dense layers. Halfway into the network, you’re effectively making the neural network “predict” the mean and standard deviation of the input’s latent vector, and just like any other prediction problem, dense layers do the job.
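As a rough sketch of that idea (the names and sizes here are made up for illustration, and I’m using plain NumPy rather than any particular framework), the μ and σ heads are just two ordinary dense layers branching off the same hidden vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for the demo.
hidden_dim, latent_dim = 16, 4
h = rng.standard_normal(hidden_dim)  # shared hidden representation

# Two ordinary dense layers branch off the same hidden vector:
W_mu, b_mu = rng.standard_normal((latent_dim, hidden_dim)), np.zeros(latent_dim)
W_lv, b_lv = rng.standard_normal((latent_dim, hidden_dim)), np.zeros(latent_dim)

mu = W_mu @ h + b_mu          # predicted mean
log_var = W_lv @ h + b_lv     # predicted log-variance
sigma = np.exp(0.5 * log_var) # standard deviation (always positive)

# Reparameterization trick: sample z = mu + sigma * eps
eps = rng.standard_normal(latent_dim)
z = mu + sigma * eps
```

Predicting the log-variance rather than σ directly is a common trick: exponentiating guarantees a positive standard deviation without constraining the dense layer’s output.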

This link may be of interest to you, if you’d like to learn more about pooling layers.

If you’d like to see the kernel filter matrices directly, it’s pretty straightforward: get the weight tensor and index into it accordingly. For instance, to see the first kernel of the first filter of a conv layer named conv1 in PyTorch, use conv1.weight[0, 0, :, :] (PyTorch indexing is 0-based).
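Here is a self-contained sketch of that inspection (the layer shapes are arbitrary, picked just for the demo):

```python
import torch
import torch.nn as nn

# A small conv layer for illustration: 3 input channels, 8 filters, 3x3 kernels.
conv1 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# The weight tensor has shape (out_channels, in_channels, kernel_h, kernel_w):
print(conv1.weight.shape)  # torch.Size([8, 3, 3, 3])

# First kernel of the first filter (0-based indices):
first_kernel = conv1.weight[0, 0, :, :]
print(first_kernel.shape)  # torch.Size([3, 3])
```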

The math makes it far easier for me to see the concern now (thank you), so let me address that.

You don’t need to add the KL loss to the loss/cost function halfway through backpropagation. It is included from the very start of the backward pass through the decoder.

If you look at the sample code for the loss function, we just sum up the reconstruction and KL loss for each instance directly, and average them. Backpropagation only needs to run on this combined loss once.
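A minimal sketch of that single-pass setup in PyTorch (the tensors and loss terms here are stand-ins for illustration, not the article’s actual code):

```python
import torch

torch.manual_seed(0)

# Hypothetical batch: 5 inputs of dimension 10, latent dimension 3.
x = torch.rand(5, 10)                           # original inputs
x_hat = torch.rand(5, 10, requires_grad=True)   # stand-in "reconstructions"
mu = torch.randn(5, 3, requires_grad=True)      # predicted means
log_var = torch.randn(5, 3, requires_grad=True) # predicted log-variances

# Per-instance reconstruction loss (squared error, for simplicity)...
recon = ((x_hat - x) ** 2).sum(dim=1)
# ...and per-instance KL divergence against the standard normal prior.
kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)

# Sum the two terms per instance, then average over the batch:
loss = (recon + kl).mean()

# One backward pass handles both terms at once.
loss.backward()
```

Because the total loss is a single scalar, one `backward()` call propagates gradients from both the reconstruction term and the KL term through all the relevant parameters simultaneously.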

Since we express this loss as a sum, during backprop through the decoder layers, the derivative will only be non-zero with…