In the example above, the mere presence of two eyes, a mouth, and a nose in a picture does not mean there is a face; we also need to know how these objects are oriented relative to each other.
Understanding Hinton’s Capsule Networks. Part I: Intuition.
Max Pechyonkin

I would argue that a CNN without max pooling would not be confused by this so easily!
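
To make the point concrete, here is a minimal sketch of how pooling can discard the spatial arrangement of parts. It is a toy under stated assumptions, not a real CNN: the three "part detector" channels (eyes, nose, mouth), the helper names `place_parts` and `global_max_pool`, and the 8x8 grid are all hypothetical, and global max pooling is used as the extreme case (the 2x2 pooling typical in practice discards less).

```python
import numpy as np

def global_max_pool(feature_maps):
    # Collapse each (height, width) map to its single largest activation.
    # Input shape: (channels, height, width); output shape: (channels,).
    return feature_maps.max(axis=(1, 2))

def place_parts(positions, size=8):
    # Hypothetical detector outputs: one channel per part, with a single
    # activation of 1.0 at the (row, col) where that part was "found".
    maps = np.zeros((len(positions), size, size))
    for channel, (row, col) in enumerate(positions):
        maps[channel, row, col] = 1.0
    return maps

# Channels are assumed to be: eyes, nose, mouth.
face = place_parts([(2, 4), (4, 4), (6, 4)])       # eyes above nose above mouth
scrambled = place_parts([(6, 1), (1, 6), (3, 2)])  # same parts, shuffled around

# After global max pooling, both images look identical to later layers.
pooled_same = np.array_equal(global_max_pool(face), global_max_pool(scrambled))

# Without pooling, the feature maps still differ, so position is preserved.
maps_same = np.array_equal(face, scrambled)

print(pooled_same, maps_same)
```

In this toy setup the pooled vectors match while the raw feature maps do not, which is the sense in which the layers after the convolutions still have access to the relative positions when pooling is left out. Whether a trained network actually uses that information is a separate question that this sketch does not answer.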

