Reza Roboubi
Sep 8, 2018 · 1 min read

Arthur, thank you so much for sharing that. I haven’t quite understood what this means exactly:

“[…] high temperature often comes down to having a distribution with high variance, which usually means a flat minimum. Since flat minima are often considered to generalize better, it’s consistent with the empirical finding that high learning [rate] and low batch size often lead to better minima.”

The question is, what exactly is a “minimum” when we’re talking about a distribution over parameters? Presumably at some point we have to take the mean of this distribution to get our final “result,” right? And at equilibrium this “mean” varies, because our approximation to the ideal distribution varies.

So the different means (minima) are flatter because of the higher entropy.
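For concreteness, here is the rough picture I have in mind (purely my own sketch of the usual Gibbs/Langevin view of SGD, so the notation — loss $L(\theta)$, learning rate $\eta$, batch size $B$, Hessian $H$ at a minimum $\theta^{*}$ — is my assumption, not necessarily yours or the paper’s):

$$p(\theta) \;\propto\; \exp\!\left(-\frac{L(\theta)}{T}\right), \qquad T \;\propto\; \frac{\eta}{B},$$

and near a minimum $\theta^{*}$,

$$p(\theta) \;\approx\; \mathcal{N}\!\left(\theta^{*},\; T\,H^{-1}\right),$$

so either a high temperature (large $\eta$, small $B$) or a flat minimum (small eigenvalues of $H$) gives high variance around $\theta^{*}$. Is that the sense in which the distribution’s variance and the flatness of the minimum are tied together?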

Have I understood this so far? Or am I completely off base?

Now… are you saying that is a good thing? I don’t understand why it would be, or in what way exactly the (cited) paper proves it.

It all seems very interesting. The paper seems well written, but I haven’t read it in depth.

Thanks so much for sharing.
