Speech synthesis has come a long way since the 1978 Speak & Spell toy, which once wowed people with its cutting-edge ability to read words aloud using an electronic voice. Now, using deep-learning AI models, software can not only create realistic-sounding voices but also convincingly imitate existing voices from small samples of audio.
Along these lines, OpenAI this week announced Voice Engine, a text-to-speech AI model that creates synthetic speech based on a 15-second segment of recorded audio. OpenAI has provided audio samples of Voice Engine in action on its website.
Once a voice is cloned, a user can input text into Voice Engine and get back AI-generated speech in that voice. But OpenAI isn't ready to make its technology widely available yet. The company initially planned to launch a pilot program earlier this month that would let developers sign up for the Voice Engine API. However, after further consideration of the ethical implications, the company has decided to scale back its ambitions for now.
“In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time,” the company wrote. “We hope this preview of Voice Engine both underscores its potential and also motivates the need to bolster societal resilience against the challenges brought by ever more convincing generative models.”
Voice cloning technology in general is not particularly new. Several AI voice synthesis models have existed since 2022, and the technology is active in the open source community with packages such as OpenVoice and XTTSv2. But the idea that OpenAI is inching toward letting anyone use its particular brand of voice technology is notable. And in some ways, the company's reluctance to release it fully may be the bigger story.
According to OpenAI, the benefits of its voice technology include providing reading assistance through natural-sounding voices, enabling creators to reach a global audience by translating content while preserving native accents, offering personalized speech options to nonverbal individuals, and helping patients recover their own voices after speech-impairing conditions.
But it also means that anyone with 15 seconds of someone's recorded voice could effectively clone it, which carries obvious potential for misuse. Even if OpenAI never widely releases Voice Engine, the ability to clone voices has already caused trouble in society, through phone scams in which someone imitates a loved one's voice and through election campaign robocalls featuring cloned voices of politicians such as Joe Biden.
Researchers and reporters have also shown that voice cloning technology can be used to break into bank accounts that rely on voice authentication (such as Chase's Voice ID). That prompted Sen. Sherrod Brown (D-Ohio), chairman of the US Senate Committee on Banking, Housing, and Urban Affairs, to send a letter in May 2023 to the CEOs of several major banks asking what security measures they have in place to counter AI-powered risks.
Recognizing that the technology could cause problems if widely released, OpenAI is initially trying to head off those issues with a set of rules. The company has been testing the technology with a small group of partner companies since last year. For example, video synthesis company HeyGen has used the model to translate a speaker's voice into other languages while keeping the same vocal sound.