From An intuitive understanding of the LAMB optimizer by Ben Mann

…ntuitive interpretation is that when we first start updating parameters, we’ll probably be way off. If the gradients are all pointing in different directions (high variance), we’ll take a small, cautious step. Conversely, if all the gradients are telling us to move in the same direction, the variance will be small, so we’ll take a bigger step in that direction. Either way, if the scale of all the gradients is large, the constant factor will cancel out when we …
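To make the intuition concrete, here is a minimal sketch of the ratio this paragraph describes. It assumes an Adam-style step proportional to the mean gradient divided by the square root of the mean squared gradient. To keep it short, it uses plain averages over a handful of scalar gradients rather than exponential moving averages, and it skips bias correction and the epsilon term. The function name `adam_style_ratio` and the sample gradient values are made up for illustration.

```python
import numpy as np

def adam_style_ratio(grads):
    """Mean gradient divided by the root of the mean squared gradient.

    Simplifying assumptions: plain averages instead of exponential
    moving averages, no bias correction, no epsilon in the denominator.
    """
    grads = np.asarray(grads, dtype=float)
    m = grads.mean()           # first moment: average direction
    v = (grads ** 2).mean()    # second moment: average squared size
    return m / np.sqrt(v)

# Hypothetical gradients for a single parameter over four steps.
agree = [1.0, 1.1, 0.9, 1.0]        # all pointing the same way
disagree = [1.0, -1.1, 0.9, -1.0]   # pointing in different directions

r_agree = adam_style_ratio(agree)        # close to 1: take a big step
r_disagree = adam_style_ratio(disagree)  # close to 0: take a cautious step

# Multiplying every gradient by the same constant leaves the ratio unchanged,
# because the constant appears in both the numerator and the denominator.
r_scaled = adam_style_ratio([1000.0 * g for g in agree])

print(r_agree, r_disagree, r_scaled)
```

In this toy setup, the gradients that agree give a ratio near 1, the ones that disagree give a ratio near 0, and scaling every gradient by 1000 changes nothing.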