Poised for mass adoption? Synthesized voices for the media and entertainment industry – Take 1

By Dom Bourne, Take 1 Founder

Wednesday, 12 April 2023

Synthesized speech is nothing new. We’ve long been accustomed to hearing the robotic tones of computer-generated voices in messaging systems, access services and, more recently, in virtual assistant technologies like Siri and Alexa. But we’d never consider replacing the talent in our media productions with a synthesized voice. Or would we?

At Take 1 we’ve always been big advocates of combining the power of technology and a skilled team to deliver the best results in the transcription, access and localization services we provide to our broadcast, streaming and production clients. In recent years, AI-powered tools have become increasingly valuable in improving the efficiency and speed of these workflows but, when it comes to recording dubbing tracks and audio descriptions, we would always choose live talent over a synthesized voice. However, the demonstrations that we witnessed at the IBC Show in 2022 have our team wondering whether synthesized speech is now poised for mass adoption across the media and entertainment industry, and we’re intrigued to see what developments there might be at the 2023 NAB Show.

Setting the stage for synthesized voices

Until just a few years ago, synthetic voices were created by recording a real human voice, chopping the speech into component sounds and then ‘mixing and matching’ those sounds to make new words – with no way to alter the tone or inflection of the result. From around 2016, however, developments in deep learning spurred a massive progression in speech synthesis: neural networks are now trained on large sets of speech data and generate far more realistic results.
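
To hear the difference for yourself, the older generation of synthetic voices is still within easy reach through operating-system speech engines. The sketch below uses the open-source pyttsx3 Python wrapper as a purely illustrative stand-in for that pre-neural, rule-based sound; it is not a tool from any production workflow:

    # Minimal sketch of old-style, rule-based speech synthesis.
    # pyttsx3 (pip install pyttsx3) drives the operating system's own
    # engine: SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on
    # Linux. These voices have the characteristically robotic delivery
    # that pre-dates neural text-to-speech.
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 150)  # speaking rate in words per minute
    engine.say("This sentence is assembled by a rule-based engine.")
    engine.runAndWait()  # block until playback finishes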

These developments couldn’t come at a better time for the M&E sector. The industry is facing both an unprecedented demand for accessible and localized content and a talent and facility shortage that threatens to undermine our ability to meet these ever-increasing demands.

So, could these two sets of circumstances combine to provide an elegant solution?

In pursuit of synthesized speech perfection

Contrary to popular misconception, not all speech synthesis is the same.

Text-to-speech, or synthesized speech, converts written text into a computer simulation of human speech, using machine learning to create voices that can ‘read’ any written content aloud. These solutions are a dramatic improvement on the original cut-and-splice approach and are often used for corporate and brand messages or for access services. But these voices are still criticized for sounding robotic and lacking the emotion of a human delivery.
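
For comparison, modern neural text-to-speech is easy to experiment with. The sketch below assumes the open-source Coqui TTS package and one of its publicly available pre-trained English models; it is illustrative only, not a description of any broadcaster’s pipeline:

    # Minimal sketch of neural text-to-speech with Coqui TTS
    # (pip install TTS). The model name is one of Coqui's published
    # pre-trained models: Tacotron 2 trained on the LJSpeech corpus.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

    # 'Read' any written content aloud and save the result to a WAV file.
    tts.tts_to_file(
        text="Welcome back. Tonight we ask whether synthesized voices are ready for prime time.",
        file_path="narration.wav",
    )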

Voice cloning software, or speech-to-speech technology, on the other hand, uses one person’s recorded speech to generate speech in another person’s voice. This AI-powered software uses audio recordings to clone a voice, identifying patterns like tone, pace, emphasis and pronunciation to create a model that can then voice completely new scripts. The results are far more convincing because they include the pauses, breathing, emotion and other characteristics of a real voice – in short clips in particular, they can be indistinguishable from human recordings. This technology has already been used to recreate Val Kilmer’s voice in “Top Gun: Maverick” and to de-age actor Mark Hamill’s voice in “The Mandalorian” – and it could be a game changer for access and localization services in M&E.
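
Some open-source models now expose voice cloning in much the same way. The sketch below assumes Coqui’s XTTS v2 model and a hypothetical reference clip of the talent; the commercial systems used on actual productions are considerably more sophisticated, so treat this purely as an illustration of the idea:

    # Illustrative voice-cloning sketch with Coqui's XTTS v2 model, which
    # conditions on a short reference recording of a speaker. The file
    # names are hypothetical placeholders.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

    # The model picks up tone, pace and pronunciation from the reference
    # clip, then voices a completely new script in that voice, here in
    # Spanish, as one might for a dubbing track.
    tts.tts_to_file(
        text="Bienvenidos al programa de esta noche.",
        speaker_wav="reference_voice_sample.wav",  # a few seconds of the talent
        language="es",
        file_path="dubbed_line.wav",
    )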

The benefits of using speech synthesis to create realistic-sounding voices are obvious – from saving on studio recording time, to cutting out performance and usage fees for voice artists, to making revisions to voice-over tracks simply by editing the text in the script. We’d also be able to ensure that renowned and much-loved voices like Sir David Attenborough’s remain available for centuries to come, and to give producers any-time access to an almost limitless catalogue of voices to suit any creative brief.

From an access and localization perspective, voice cloning could solve many of the budget and capacity issues that we’re currently facing. In fact, the first feature-length film has already been dubbed into Latin American Spanish using AI voices.

But, before we consider all our problems solved, there are some issues to iron out.

The (potential) issues with synthesized voices

One of the biggest concerns about the increasing use of AI in any industry is that it will put people out of work. In this case, voice artists and actors could face the possibility of being replaced. However, voice cloning companies argue that the software could also provide opportunities for talent to increase their earnings. The process requires real voices to build the models initially and provides the opportunity for actors to license their voices and make them available to much wider markets. This means voice talent could potentially earn royalties for a limitless number of products without having to attend any recordings other than the original data capture.

Rights management is another area that will need to be considered. If voice artists license the use of their voices at a premium to make up for the loss in performance fees, the industry will need to devise systems to ensure that sources are authorized suppliers and that the appropriate usage fees make their way back to the original artist. No doubt many underlying processes and workflows will need to be reconsidered should voice cloning become an intrinsic part of future production pipelines.

The term “fake news” barely existed before 2016, yet fictitious reports shared over social media are now blamed for influencing elections, inciting violence and even threatening democracy. And that’s just written content. While sceptical audiences might question the accuracy of a text-based piece, most of us wouldn’t doubt the legitimacy of a video or audio recording that looks and sounds like the real thing. Distinguishing fact from fiction will only become harder as synthesized speech grows more convincing and more widespread.

There has also been some resistance from the blind and low-vision community to the use of synthesized voices in the production of audio description. They contend that, if every element of a production – from wardrobe to set dressing, make-up, lighting and camera angles – is so carefully considered for sighted audiences, it isn’t fair to compromise the experience for audiences with visual disabilities by using synthesized voices that can’t match the dramatic delivery of professional voice actors.

The future potential of synthesized speech

Clearly there’s work to be done, but all indications are that synthesized speech will have a big impact on production, access and localization workflows in the future. Just how far away that future is remains to be determined. The media and entertainment industry can be slow to adopt new approaches, and processes like proofs of concept, vendor agreements, API integration, staff training and large-scale implementation have a habit of eating up a lot of time. So it may be a good few years until we notice a difference in the content we’re watching – although, if we get it right, the viewer shouldn’t actually notice a difference at all.

Get in touch with the Take 1 team to discuss your dubbing or audio description requirements at www.take1.tv
