Hi, the layers do not represent groups of pixels. The number of layers is the depth of the network: the more layers, the more interesting features the network can learn. But then why don't we simply use 100 layers? Because as we increase the number of layers, the training time of the network increases: more parameters have to be learned, and each iteration (forward and backward pass) takes much longer. Even with 3 layers, training time was barely feasible on a normal PC. There was a bit of experimentation involved in deciding the architecture. A general rule for classification is to scale each layer down by a factor of 2 or 4, so the first layer would have been ~1600 units. But 1024 in the first layer gives comparable results with less training time. :)
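
To make the sizing rule concrete, here is a minimal sketch of what "scale down by a factor of 2 or 4" could look like in plain Python. The `layer_sizes` helper and the input dimension of 6400 are my own assumptions for illustration, chosen so that dividing by 4 gives the ~1600 figure mentioned above:

```python
def layer_sizes(input_dim, n_layers, factor=2):
    """Hypothetical helper: shrink each successive hidden layer
    by `factor`, starting from the input dimension."""
    sizes = []
    size = input_dim
    for _ in range(n_layers):
        size = max(1, size // factor)  # integer division, never below 1 unit
        sizes.append(size)
    return sizes

# Assuming a 6400-dimensional input and a scale-down factor of 4,
# the first hidden layer comes out at 1600 units:
print(layer_sizes(6400, 3, factor=4))  # → [1600, 400, 100]
```

In practice you'd then try nearby round numbers (like 1024 instead of 1600) and keep whichever gives comparable accuracy with less training time.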
