Yeah, you're saving on I/O, but don't you lose out on convergence and accuracy? Because your loss is the mean of the individual losses, your gradient estimate has higher variance, and therefore your convergence is slower.
Is there any way to just do this concurrently over many threads on the GPU?
What I mean is this:
Run the threads concurrently, then collect the gradients and apply them one after the other. That way you only parallelise the gradient computation, without losing out on accuracy!
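A rough sketch of what I'm imagining, using plain Python threads and a toy 1-D linear model instead of a GPU (all names here are illustrative, not from any real framework): each worker computes its batch gradient against the same snapshot of the weights, and only the updates are applied sequentially.

```python
# Hypothetical sketch: compute per-batch gradients concurrently,
# then apply them one after another so the update order stays deterministic.
# Toy model: y ≈ w * x with squared loss; names are illustrative only.
from concurrent.futures import ThreadPoolExecutor

def grad(w, batch):
    # dL/dw for mean squared error over one batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def parallel_grad_step(w, batches, lr=0.05):
    # 1) parallel: every worker sees the SAME snapshot of w
    with ThreadPoolExecutor() as pool:
        grads = list(pool.map(lambda b: grad(w, b), batches))
    # 2) sequential: apply the collected gradients one after the other
    for g in grads:
        w -= lr * g
    return w

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
batches = [data[:2], data[2:]]
w = 0.0
for _ in range(200):
    w = parallel_grad_step(w, batches)
print(round(w, 3))  # → 3.0
```

Note the caveat: because all gradients are taken at the same snapshot, this is effectively a (parallelised) batch-gradient step, not true sequential SGD, so the two aren't exactly equivalent.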