Consider an image and a network that detects if a cat is present in the image. Now consider another network that detects if a car is present in the image. Now each of them is a small feature. If we construct something more interesting(complex) from these neural networks. More interesting simply means more closer to real world(say a cat sitting on a car). Can you construct such a network by using the two smaller networks and other smaller blocks. This cat sitting on a car is more interesting than just car or just cat. The network will obviously be more complex and will have higher depth.
To recognize numbers, you mean from images, one hidden layer can work to some extent but there are better and easier options for image.
Convolutional networks perform much better than simple networks on images. Here is a network which tells if there is a car in an image. A slight modification of this will do the job for you. https://github.com/g-akash/convolutional_neural_network
lemme know if you need more info on CNN.