Exploring OpenAI’s New Voice Engine: Ethical Considerations and Regulation

AI Ethics @ ShiftSC
4 min read · Apr 8, 2024


As AI image and text generators become increasingly well known, OpenAI, best known for ChatGPT, has launched its new AI voice engine. According to CNN, the engine “uses a 15-second sample of someone speaking to generate a convincing replica of their voice,” which can then be used to generate a “reading” of a provided paragraph of text. While the tool is currently available only to educators and those in the health industry, AI voice engines raise even more questions surrounding personal privacy and misinformation.

One of the main concerns, as with many generative AI systems, is privacy. While OpenAI may claim to store the data securely, data can never be 100% secure; it takes only one leak for the information to become available to those who would use it for nefarious purposes. For instance, a politician’s opponents could use her voice to generate an offensive voice clip that damages her popularity, or a scammer could use a working adult’s voice to trick their parents into sending large, unrecoverable sums of money. Banks and other systems that rely on voice authentication are also at risk, as scammers could exploit this hole in their security with AI-generated voices.

Another important facet of this issue is consent. Currently, everyone whose voice is “fed” into the engine has given full consent, and all voice clips are clearly labeled as AI-generated. However, there is no guarantee that, if the technology is made public, the vetting process to ensure consent will be watertight. There are many forms of identity verification on the internet today, and equally many ways to circumvent them. It may take only a single bad actor to cause irreparable damage, and these are concerns that OpenAI must seriously consider before making this technology public.

This is not to say, of course, that voice engines have no benefits. They may help with translation and language learning, whether for those learning additional languages or for children, as native voices can be analyzed to model accurate pronunciation. They may also help those who have lost the ability to speak, as recorded clips can be used to create AI voices that restore some autonomy. OpenAI’s blog provides specific examples, including the ability to “give interactive feedback in each worker’s primary language including Swahili or more informal languages like Sheng, a code-mixed language popular in Kenya,” a potentially revolutionary capability given how underrepresented many informal languages are in existing language “databanks.”

OpenAI also claims to be taking measures to address the aforementioned concerns, stating that these include “watermarking to trace the origin of any audio generated by Voice Engine, as well as proactive monitoring of how it’s being used,” and a “no-go voice list that detects and prevents the creation of voices that are too similar to prominent figures.” Whether these measures are effective remains to be seen; for the time being, OpenAI has chosen to keep the tool private because of the many safety concerns surrounding the engine.

OpenAI’s voice engine does have the potential to be helpful in many ways; however, the issues that come with it are not to be taken lightly. An instructive comparison is the Japanese voice banks (and associated musical software) known as “Vocaloids,” which have been available for nearly two decades as of this newsletter. On the surface, Vocaloids are similar: they consist of a “voice bank” of syllables, which can be assembled into “words” and thus made to “speak.” However, Vocaloids require far more extensive training and development than OpenAI’s engine; they frequently consist of hundreds of individually recorded syllables, and making them “speak” or “sing” takes hard work, as they must be “tuned” to mimic human pitch and tone. AI assistance exists for Vocaloids, but human work is still required, and many musicians have made careers as “Vocaloid producers,” tuning these voices and using them as “singers” in their songs much as they might use instruments like a trumpet or guitar.

Most importantly, Vocaloids are tightly regulated. Their production and usage have strict legal requirements: voice banks are a legally sold product, and all “voice providers” who contribute their voices to a Vocaloid’s voice bank do so with full consent and payment. To fully realize their benefits, OpenAI’s voice engines must be regulated with similar strictness and legal clarity, lest they cause significant damage to the people whose voices they are trained on.
