Hi Barak, I think you mean zeroing out the lower parts of the Values matrix (since using -Inf there would make the matmul always produce -Inf at the output).
Theoretically it is possible to do that, but we would lose the benefit of a single matmul: each row of the Weights matrix (which corresponds to a time step) would have to be multiplied separately with a Values matrix whose lower rows are zeroed out according to that row's time step. In addition, the output would be unbalanced, since the weights applied to the remaining (non-zeroed) values no longer sum to 1. The LayerNorm operation may partially compensate for that, but I suspect training would take longer.
So from an implementation point of view it is more efficient and stable to apply the masking on the Weights matrix, before the softmax. Hope this clears things up!
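
If it helps, here is a rough sketch of that standard approach (the shapes and variable names are just for illustration, using PyTorch-style tensors): future positions in the scores matrix are set to -inf before the softmax, so each row of the resulting Weights matrix still sums to 1, and one matmul with Values handles every time step at once.

```python
import torch
import torch.nn.functional as F

T, d = 4, 8                      # sequence length, head dimension (illustrative)
Q = torch.randn(T, d)            # queries
K = torch.randn(T, d)            # keys
V = torch.randn(T, d)            # values

scores = Q @ K.T / d ** 0.5      # raw attention scores, shape (T, T)

# Causal mask: entries above the diagonal correspond to future time steps.
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))

weights = F.softmax(scores, dim=-1)  # -inf entries become 0; each row sums to 1
out = weights @ V                    # a single matmul covers all time steps
```

Because the -inf entries turn into exact zeros after the softmax, the masked positions contribute nothing to the output, and the remaining weights in each row are still a proper probability distribution.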
