Hi Ethan. My students cited this in their final project reports, so I'm just now reading it. I also find it curious that SGD outperforms ALS, and I'm curious whether you've thought about this more. The ALS you show is a second-order optimization, whereas SGD is first-order, so the former should technically fit the training data more closely (though maybe too well, of course). You do change the loss function for SGD by adding bias terms, though, so is it fair to compare it to ALS? It could be that the explicit bias terms capture a large part of the signal and are more difficult to overfit.

On your hunches, per my point above, I believe ALS should overfit more, since it is the MLE estimate of each alternating variable's regression problem: it should fit the rare cases as well as the popular ones (in terms of optimization, not generalization). SGD likely does better on popular cases because it only updates a factor when it sees an instance. Thus a movie with 100 ratings will get 10x more updates than a movie with 10 ratings. Given the same learning rate, this acts as a form of extra regularization (similar to early stopping) on the rare cases.
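To make the update-count point concrete, here is a minimal sketch (my own toy example, not code from your post): two item-factor vectors trained by SGD against randomly drawn user factors, one item with 100 ratings and one with 10. With the same learning rate, the rare item's factors stay much closer to their initialization, which is the early-stopping-like regularization I mean.

```python
import numpy as np

rng = np.random.default_rng(0)
k, lr = 8, 0.05                      # latent dimension and shared learning rate
true_item = rng.normal(size=k)       # hypothetical "true" item factors

def run_sgd(n_ratings):
    """Run one SGD pass per observed rating; return distance moved from init."""
    item = np.zeros(k)               # item factors initialized at zero
    for _ in range(n_ratings):
        user = rng.normal(size=k)                 # a fresh user's factors
        err = user @ true_item - user @ item      # residual on this rating
        item += lr * err * user                   # one SGD step for this item
    return np.linalg.norm(item)      # how far the factors moved from zero

popular = run_sgd(100)   # item seen 100 times
rare = run_sgd(10)       # item seen 10 times
print(popular, rare)     # the popular item's factors move much farther
```

Same loss, same learning rate; the only difference is how many updates each item receives, so the rare item is effectively stopped early.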
In any case, great post!