TECH NEWS: AI creates natural-sounding speech and music

AudioLM has the ability to learn the inherent structure at multiple levels and is able to create realistic piano music by capturing the subtle vibrations contained in each note when the piano keys are played.

Published Nov 28, 2022

A few weeks ago I discussed the powerful advancement in artificial intelligence (AI) that enables the creation of intricate works of art when a user provides a description in plain English.

These instructions can also specify the style of a well-known artist, and the AI will then design the required artwork in the typical style of that artist. Understandably, artists all over the world were concerned, since the creation of remarkable art was now within the reach of any person with a computer and access to the necessary software.

Now a group of Google researchers has developed a new AI system that can create natural-sounding speech and music after being prompted with only a few seconds of audio. The framework for high-quality audio generation with long-term consistency, called AudioLM, generates true-to-life sounds without the need for any human intervention.

What makes AudioLM so remarkable is that it generates very realistic audio that fits the style of a relatively short audio prompt, including complex sounds like piano music or a person speaking. What is more, the AI does this in a way that is almost indistinguishable from the original recording. The technique also seems promising as a way to expedite the tedious process of training AI to generate audio.

AI-generated audio is, however, nothing new and is widely used in home assistants like Alexa, whose voices are built with natural language processing. Similarly, AI music systems like OpenAI’s Jukebox have used neural networks to generate impressive results, including rudimentary singing, as raw audio in a variety of genres and artist styles. But most existing techniques need people to prepare transcriptions and label text-based training data, which takes considerable time and human labour. Jukebox, for example, uses text-based data to generate song lyrics.

AudioLM is very different and does not require transcription or labelling. Instead, sound databases are fed into the program, and machine learning is used to compress the audio files into sound snippets, called semantic and acoustic “tokens”, without losing too much information. This tokenised training data is then fed into a machine-learning model that treats the audio as a sequence of discrete tokens and, much like a natural language processing model, learns the patterns in the sound.
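For readers who want a rough sense of what such a token-based pipeline looks like in code, the Python sketch below is a deliberately simplified illustration, not AudioLM itself: a basic k-means quantiser stands in for AudioLM's neural tokenisers, and a simple table of token-to-token frequencies stands in for its transformer language model. The function names and parameters are illustrative assumptions.

```python
# Illustrative sketch only: AudioLM's real tokenisers are neural networks;
# here a simple k-means quantiser stands in so the token-based pipeline
# (audio -> discrete tokens -> sequence model) is visible end to end.
import numpy as np
from sklearn.cluster import KMeans


def audio_to_tokens(waveform, frame_size=320, n_tokens=64):
    """Chop a mono waveform into frames and map each frame to a discrete token."""
    n_frames = len(waveform) // frame_size
    frames = waveform[: n_frames * frame_size].reshape(n_frames, frame_size)
    kmeans = KMeans(n_clusters=n_tokens, n_init=10, random_state=0).fit(frames)
    return kmeans.labels_  # one integer token per frame


def learn_transitions(tokens, n_tokens=64):
    """Count which token tends to follow which -- a toy stand-in for the
    language model AudioLM trains over its token sequences."""
    counts = np.ones((n_tokens, n_tokens))  # add-one smoothing
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)


# Example: tokenise ten seconds of a synthetic 16 kHz signal and learn its token statistics.
rng = np.random.default_rng(0)
t = np.arange(16000 * 10) / 16000
waveform = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(t.size)
tokens = audio_to_tokens(waveform)
transition_probs = learn_transitions(tokens)
```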

To generate reliable audio, only a few seconds of sound need to be fed into AudioLM, which then predicts what comes next. This process is very similar to the way autoregressive language models that use deep learning to produce human-like text, such as Generative Pre-trained Transformer 3 (GPT-3), predict which words and sentences typically follow one another.
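The sketch below illustrates this autoregressive "continuation" step in simplified form: starting from a short prompt of tokens, the next token is repeatedly sampled from a model of which token tends to follow which. The random transition table used here is a placeholder and the function names are assumptions, not AudioLM's actual components, which are transformer language models over its tokens.

```python
# A hedged sketch of the "predict what comes next" idea: given a short prompt
# expressed as tokens, repeatedly sample the next token from a learned model.
import numpy as np


def continue_tokens(prompt, transition_probs, n_new=50, seed=0):
    """Autoregressively extend a token sequence, one token at a time."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        next_probs = transition_probs[tokens[-1]]  # distribution over the next token
        tokens.append(int(rng.choice(len(next_probs), p=next_probs)))
    return tokens


# Example with a random 64-token transition table standing in for the trained model.
n_tokens = 64
table = np.random.default_rng(1).random((n_tokens, n_tokens))
table /= table.sum(axis=1, keepdims=True)  # each row becomes a probability distribution
prompt = [3, 17, 42]  # the "few seconds" of prompt audio, already tokenised
continuation = continue_tokens(prompt, table, n_new=20)
print(continuation)
```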

The result is that audio produced by AudioLM sounds very natural. What is particularly remarkable is that the piano music generated by AudioLM sounds much more realistic and fluid than the music usually generated by AI techniques, which often sounds chaotic. There is no doubt that AudioLM already has much better sound quality than previous music generation programs. In particular, AudioLM is surprisingly good at recreating some of the repeating patterns inherent in human-made music. It generated convincing continuations that are coherent with the short prompt in terms of melody, harmony, tone and rhythm.

AudioLM has the ability to learn the inherent structure at multiple levels and is able to create realistic piano music by capturing the subtle vibrations contained in each note when the piano keys are played, as well as the rhythms and harmonies. AudioLM was able to generate coherent piano music continuations, despite being trained without any symbolic representation of music.

But AudioLM is not limited to music. Since it was trained on a library of recordings of humans speaking sentences, the system can also generate speech that continues in the accent and cadence of the original speaker. Without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody.

AudioLM is trained to pick up the kinds of sound snippets that frequently occur together, and it uses this process in reverse to produce sentences. Even more impressive, it is able to learn the pauses and exclamations that are inherent in spoken language but not easily translated into text.

When conditioned on a prefix (or prompt) of only three seconds of speech from a speaker not seen during training, AudioLM produces consistent continuations while maintaining the original speaker’s identity, voice, prosody, accent and the recording conditions of the prompt (e.g. level of reverberation and background noise), while also producing syntactically correct and semantically coherent content.

The difference between AudioLM and previous AI systems is that it learns the various nuances from the input data automatically, whereas previous systems could capture the nuances only if they were explicitly annotated in the training data. It is this characteristic that adds to the realistic effect of the generated speech, since much important linguistic information lies not in the words that are pronounced, but in the way things are expressed.

The value of this breakthrough in synthesising high-quality audio with long-term coherent structure is that it could help people with speech impediments. Speech generation technology that sounds more natural could also help to improve internet accessibility tools and bots that, for instance, work in healthcare settings. AI-generated music could be used in the composition of more natural-sounding background soundtracks for videos and slide shows without infringing on copyright or royalties.

However, this technology is not without far-reaching ethical implications. It is important to determine whether the musicians who produce the clips used as training data will get attribution or royalties from the end product.

Similarly, AI-generated speech that is indistinguishable from the real thing could become so convincing that it enables the easier spread of deepfakes and misinformation. The ability to continue short speech segments while maintaining speaker identity and prosody could potentially lead to the spoofing of biometric identification or the impersonation of a specific speaker. One way of mitigating this risk is to train a classifier that can distinguish natural sounds from AI-generated sounds with very high accuracy.
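As a simplified illustration of such a detector, the sketch below trains an ordinary logistic-regression classifier on crude spectral features of placeholder "natural" and "generated" clips. The features, data and model are illustrative assumptions rather than the classifier described by the researchers.

```python
# A minimal sketch of the mitigation mentioned above: a binary classifier that
# separates natural from AI-generated audio. Clips and features are synthetic
# placeholders, not the detector Google describes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def spectral_features(waveform, n_bins=64):
    """Crude feature vector: the average magnitude spectrum of a clip, in bins."""
    spectrum = np.abs(np.fft.rfft(waveform))
    edges = np.linspace(0, len(spectrum), n_bins + 1, dtype=int)
    return np.array([spectrum[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])


# Placeholder "natural" and "generated" clips with slightly different statistics.
natural = [rng.standard_normal(16000) for _ in range(50)]
generated = [0.5 * rng.standard_normal(16000) + 0.1 for _ in range(50)]

X = np.array([spectral_features(w) for w in natural + generated])
y = np.array([0] * len(natural) + [1] * len(generated))  # 0 = natural, 1 = AI-generated

detector = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", detector.score(X, y))
```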

What is certain is that artificial intelligence will impact our future dramatically and will create not only amazing art, but also realistic speech and music.

Professor Louis Fourie

Professor Louis CH Fourie is an Extraordinary Professor of the University of the Western Cape.

BUSINESS REPORT