Researchers at Google announced AudioPaLM, a large language model (LLM) that performs text-to-speech (TTS), automatic speech recognition (ASR), and speech-to-speech translation (S2ST) with voice transfer. AudioPaLM is based on the PaLM-2 LLM and outperforms OpenAI's Whisper on speech-translation benchmarks.
AudioPaLM is a decoder-only Transformer-based model that combines text and audio input into a single embedding representation. Unlike conventional S2ST systems, which use a cascade of separate ASR, machine translation (MT), and TTS models, AudioPaLM can preserve acoustic features such as the speaker's voice. AudioPaLM achieves state-of-the-art scores on automatic speech translation (AST) and S2ST benchmarks, and also exhibits zero-shot capabilities, performing AST on input and target language combinations not present in its training data. When evaluated on the FLEURS dataset, AudioPaLM "significantly" outperforms OpenAI's Whisper on AST tasks.
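The difference between the two approaches can be sketched in a few lines of Python. The function names, model interface, and the "[S2ST English French]" task tag below are illustrative placeholders, not AudioPaLM's actual API:

```python
# Illustrative sketch only: contrasts a cascade S2ST pipeline with a single
# decoder-only model operating on a mixed token stream. All names here are
# placeholders rather than AudioPaLM's real interfaces.

def cascade_s2st(speech_in, asr, mt, tts):
    """Conventional cascade: each stage passes only text forward, so
    acoustic details such as the speaker's voice are lost after ASR."""
    source_text = asr(speech_in)      # speech -> source-language text
    target_text = mt(source_text)     # source text -> translated text
    return tts(target_text)           # translated text -> speech, generic voice


def direct_s2st(audio_tokens, model):
    """Direct approach (simplified): one decoder-only model consumes a task
    tag plus discrete audio tokens and generates audio tokens directly,
    which lets it carry acoustic features through to the output."""
    prompt = ["[S2ST English French]"] + audio_tokens
    return model.generate(prompt)     # output tokens are later converted to audio
```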
InfoQ recently covered several other multilingual AI speech models. In 2022, OpenAI released Whisper, an encoder/decoder Transformer-based ASR model which can transcribe and translate speech audio from 97 different languages. Earlier this year, Meta released MMS, a wav2vec-based model which can do ASR and TTS in over 1,100 languages.
In contrast to these, AudioPaLM is a decoder-only model built on a pre-trained PaLM-2. The model's token dictionary is extended with acoustic tokens, which represent short segments of an audio waveform and are mapped into the same embedding space as the original model's text tokens. The input to the model can therefore consist of both audio and text, with the text portion including a short description of the task, such as "[ASR Italian]". When the model's output is decoded, the acoustic tokens can be converted back to audio waveforms using an AudioLM model.
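The vocabulary-extension idea can be illustrated with a short, hypothetical PyTorch snippet. The vocabulary sizes, embedding width, and token ids below are invented for illustration and do not reflect PaLM-2 or AudioPaLM internals:

```python
# Minimal sketch of extending a text token embedding table with acoustic
# tokens, assuming a PyTorch-style embedding layer. All sizes and ids are
# hypothetical.
import torch
import torch.nn as nn

TEXT_VOCAB_SIZE = 32_000   # hypothetical size of the pre-trained text vocabulary
NUM_AUDIO_TOKENS = 1_024   # hypothetical number of discrete acoustic tokens
EMBED_DIM = 512            # hypothetical embedding width

# Stand-in for the pre-trained text embeddings (random here, not PaLM-2 weights).
text_embeddings = nn.Embedding(TEXT_VOCAB_SIZE, EMBED_DIM)

# Extend the table: acoustic tokens get new rows in the same embedding space,
# so a single decoder can attend over text and audio tokens interchangeably.
combined = nn.Embedding(TEXT_VOCAB_SIZE + NUM_AUDIO_TOKENS, EMBED_DIM)
with torch.no_grad():
    combined.weight[:TEXT_VOCAB_SIZE] = text_embeddings.weight      # reuse text rows
    nn.init.normal_(combined.weight[TEXT_VOCAB_SIZE:], std=0.02)    # new audio rows

# An input then mixes a text task prefix (e.g. "[ASR Italian]") with acoustic
# token ids offset past the text vocabulary.
task_prefix_ids = torch.tensor([17, 942, 8])               # hypothetical text ids
audio_ids = torch.tensor([3, 512, 87]) + TEXT_VOCAB_SIZE   # hypothetical audio ids
input_ids = torch.cat([task_prefix_ids, audio_ids])
input_embeddings = combined(input_ids)                     # shape: (6, EMBED_DIM)
```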
AudioPaLM Architecture. Image Source: https://google-research.github.io/seanet/audiopalm/examples/
AudioPaLM was trained on thousands of hours of audio data, drawn from over 100 languages. It was evaluated on several benchmarks, including CoVoST2 (AST), CVSS (S2ST), and VoxPopuli (ASR). It outperformed baseline models on AST and S2ST, and was "competitive" on ASR. In zero-shot AST using the FLEURS benchmark, AudioPaLM "significantly" outperformed Whisper. It also outperformed Whisper on ASR tasks involving languages that Whisper had been trained on, but AudioPaLM had not.
The researchers also evaluated AudioPaLM's audio generation quality, especially with regard to preserving the original speaker's voice during S2ST. They used a combination of "objective metrics and subjective evaluation studies" to compare its performance with baseline models, finding that it "significantly" outperformed them. In their paper, the Google team pointed out the need for better benchmarks for measuring the quality of audio generation:
In comparison to text, the richness of the set of established benchmarks for generative text/audio tasks is less developed. This work has focused on speech recognition and speech translation, for which the benchmarks are more mature. The establishment of more benchmarks and metrics for generative audio tasks will help to accelerate research further.
Several users discussed AudioPaLM in a Hacker News thread. In response to a question about the translation accuracy of LLMs, given their propensity to "hallucinate," one user remarked that for state-of-the-art models like AudioPaLM, hallucinations are "near non-existent." Regarding AudioPaLM's translations, another user observed:
Impressive that it translated "Morgenstund hat Gold im Mund" (morning hour has gold in the mouth) to the equivalent English expression "the early bird gets the worm", instead of going for a literal translation.
Several examples of AudioPaLM's output are available on the project's examples website.