The model can create lifelike audio from a 15-second clip and a text prompt.
A ‘materially better’ model is reportedly months away.