Hey Preetham, thanks for this post, it’s rich in intuitive & executable comprehension on the…
Jaideep Khare

This is a good question (not trivial), Jaideep. Specific to dropout, notice that you may not really be trimming down the problem space for the network. While you randomly drop each hidden unit with probability 0.5 on every iteration, you are still sharing parameters across all the connections that are NOT dropped in that iteration. The efficiency of dropout actually comes from this parameter sharing.
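To make the parameter-sharing point concrete, here is a minimal NumPy sketch. The layer sizes, the ReLU activation, and the helper name `dropout_forward` are my own assumptions for illustration, not something from the post. Each call samples a different "thinned" network, but both calls read the same `W1` and `W2`.

```python
import numpy as np

def dropout_forward(x, W1, W2, rng, p_drop=0.5):
    """One forward pass through a randomly thinned network (inverted dropout).

    Hypothetical helper for illustration: one hidden ReLU layer, linear output.
    """
    h = np.maximum(0.0, x @ W1)             # hidden activations
    mask = rng.random(h.shape) >= p_drop    # keep each hidden unit with prob 1 - p_drop
    h = h * mask / (1.0 - p_drop)           # rescale so the expected activation is unchanged
    return h @ W2, mask

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # shared weights: every thinned network uses these
W2 = rng.normal(size=(8, 1))
x = rng.normal(size=(1, 4))

# Two iterations sample two (usually different) sub-networks over the SAME parameters.
out_a, mask_a = dropout_forward(x, W1, W2, rng)
out_b, mask_b = dropout_forward(x, W1, W2, rng)
print("units kept in pass A:", int(mask_a.sum()), "| pass B:", int(mask_b.sum()))
```

During training, a gradient step taken through either sub-network would update the same `W1` and `W2`, which is why the sub-networks are not learned independently.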

This creates strong regularization, with the added benefit of averaging over many thinned networks, which typically improves accuracy. In fact, dropout has become a standard technique for most deep-learning problems involving high-dimensional data, where you want to control variance without compromising accuracy.
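The network-averaging view can be checked on a toy case. This is a sketch under my own assumptions: a tiny hypothetical network with 3 hidden ReLU units, a linear output layer, and a keep probability of 0.5. It enumerates every possible dropout mask and compares the average prediction of all the thinned networks with a single pass through the full network.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hypothetical tiny network: 4 inputs, 3 hidden units
W2 = rng.normal(size=(3, 1))
x = rng.normal(size=(1, 4))
keep = 0.5                     # with keep = 0.5 every mask is equally likely

h = np.maximum(0.0, x @ W1)    # hidden activations before dropout

# Enumerate all 2**3 thinned networks and collect their outputs.
outputs = []
for bits in itertools.product([0.0, 1.0], repeat=h.shape[1]):
    mask = np.array(bits)
    outputs.append((h * mask / keep) @ W2)   # inverted-dropout scaling
ensemble_mean = np.mean(outputs, axis=0)

full_output = h @ W2           # one pass through the full network, no dropout
print(np.allclose(ensemble_mean, full_output))
```

The match is exact here only because the output layer is linear in the dropped activations. In deeper networks with nonlinearities after the dropout layer, the single full-network pass is an approximation to the ensemble average.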

The good news is that it removes the complexity of setting up multiple network architectures and gives an effect similar to (NOT the same as) a Mixture of Experts. That said, a Mixture of Experts model has the added benefit of letting you play around with heterogeneous expert architectures, and even throw an ML algorithm that is not based on neural nets into the mix.

(I hope I understood and answered your question correctly)