I see one very relevant difference between the voice interface and the visual interface: emotions. Our current human-mimicking voice assistants make you feel like you are talking to a human when you are not. Hearing and talking to a human voice triggers emotions in a much stronger way than a visual interface can. We are entering the territory of the psychology of robotics here.
I don’t know exactly what a de-skeuomorphized voice interface will be (or sound) like. I don’t think it necessarily has to be “dry, unnatural, and unenjoyable”. It certainly will be more authentic.
The design of voice interfaces is not only about intonation. It also includes naming, whether the assistant is given a gender, and conversational ornamentation (I’m thinking of Siri telling jokes here). All of these pieces will be part of the evolution of voice interface design.