Researchers from Meta AI and the Hebrew University of Jerusalem present AudioGen, an auto-regressive, Transformer-based AI model that generates audio samples conditioned on text inputs. Their study, entitled "AudioGen: Textually Guided Audio Generation", was published on arXiv on September 30th.
Generating audio samples conditioned on descriptive textual captions is a complex task. Among the challenges cited by the researchers is source differentiation (e.g., separating multiple people speaking simultaneously), which is made difficult by the way sound propagates through a medium. The task is further complicated by real recording conditions (background noise, reverberation, etc.). The scarcity of textual annotations imposes another constraint, limiting the ability to scale the models. Finally, modeling high-fidelity audio requires encoding it at a high sampling rate, which leads to extremely long sequences.
AudioGen, a textually guided auto-regressive generation model
To overcome these challenges, the researchers used an augmentation technique that mixes different audio samples, leading the model to learn internally to separate multiple sources. They curated 10 datasets containing different types of audio and text annotations to mitigate the scarcity of paired text-audio data, and trained AudioGen on roughly 4,000 hours of audio.
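To make the idea concrete, here is a minimal sketch of such a mixing augmentation in Python. The random gain and the way the two captions are joined are illustrative assumptions, not the exact recipe described in the paper.

```python
import numpy as np

def mix_examples(wav_a, caption_a, wav_b, caption_b, rng=None):
    """Toy version of the mixing augmentation: overlay two waveforms
    and merge their captions so the model sees multi-source scenes.
    The random gain and the caption-joining rule are illustrative
    choices, not the paper's exact recipe."""
    rng = rng or np.random.default_rng()
    # Pad the shorter clip so both waveforms have the same length.
    n = max(len(wav_a), len(wav_b))
    a = np.pad(wav_a, (0, n - len(wav_a)))
    b = np.pad(wav_b, (0, n - len(wav_b)))
    # Draw a random mixing weight so neither source always dominates.
    w = rng.uniform(0.3, 0.7)
    mixed = w * a + (1.0 - w) * b
    # Normalize to avoid clipping after the overlay.
    mixed /= max(1e-8, np.abs(mixed).max())
    return mixed, f"{caption_a} and {caption_b}"

# Example: two 1-second mono clips at 16 kHz with synthetic content.
sr = 16_000
dog = np.random.randn(sr).astype(np.float32) * 0.1   # stand-in for "a dog barks"
bird = np.random.randn(sr).astype(np.float32) * 0.1  # stand-in for "a bird sings"
wav, caption = mix_examples(dog, "a dog barks", bird, "a bird sings")
print(caption)  # -> "a dog barks and a bird sings"
```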
AudioGen relies on two main stages. In the first, raw audio is encoded into a discrete sequence of tokens using a neural audio compression model. This end-to-end model is trained to reconstruct the input audio from the compressed representation, with an added perceptual loss in the form of a set of discriminators, and yields a compact representation from which high-fidelity audio samples can be generated.
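The sketch below illustrates the general shape of this first stage with a deliberately tiny PyTorch model: a strided convolutional encoder, a single learned codebook, and a transposed-convolution decoder. The architecture, dimensions, and single codebook are simplifying assumptions made for brevity; the adversarial discriminators used as a perceptual loss during training are only mentioned in a comment.

```python
import torch
import torch.nn as nn

class TinyAudioTokenizer(nn.Module):
    """Illustrative stand-in for the neural audio compression model:
    a strided convolutional encoder maps the waveform to latent frames,
    a small codebook quantizes each frame to a discrete token, and a
    transposed-convolution decoder reconstructs the waveform."""
    def __init__(self, codebook_size=1024, dim=64, hop=320):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=hop, stride=hop)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=hop, stride=hop)

    def encode(self, wav):                          # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)       # (batch, frames, dim)
        # Nearest-codebook-entry lookup gives one discrete token per frame.
        dist = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        return dist.argmin(dim=-1)                  # (batch, frames) integer tokens

    def decode(self, tokens):
        z = self.codebook(tokens).transpose(1, 2)   # (batch, dim, frames)
        return self.decoder(z)                      # reconstructed waveform

model = TinyAudioTokenizer()
wav = torch.randn(1, 1, 16_000)                     # one second of 16 kHz audio
tokens = model.encode(wav)
recon = model.decode(tokens)
# Training would combine a reconstruction loss with adversarial losses
# from a set of discriminators (omitted here) acting as a perceptual term.
print(tokens.shape, recon.shape)                    # 50 tokens for one second of audio
```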
The second stage exploits an autoregressive Transformer-decoder language model that operates on the discrete audio tokens obtained from the first stage while being conditioned on textual inputs.
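Again as an illustration only, the following sketch shows how such a second stage can be wired together: a small Transformer decoder with a causal mask predicts the next audio token while cross-attending to placeholder text features that stand in for the T5 encoder outputs used in the paper, and a greedy decoding loop generates a token sequence from a caption embedding. All sizes and the greedy decoding strategy are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class TinyAudioLM(nn.Module):
    """Illustrative sketch of stage two: a Transformer decoder predicts the
    next audio token while cross-attending to text-encoder outputs
    (the text features below are a placeholder tensor, not real T5 outputs)."""
    def __init__(self, vocab_size=1024, dim=256, text_dim=256, layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.text_proj = nn.Linear(text_dim, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, audio_tokens, text_features):
        x = self.token_emb(audio_tokens)                        # (B, T, dim)
        # Causal mask so each position only attends to earlier audio tokens.
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(x, self.text_proj(text_features), tgt_mask=causal)
        return self.head(h)                                     # next-token logits

@torch.no_grad()
def generate(model, text_features, start_token=0, steps=50):
    """Greedy autoregressive decoding of audio tokens, conditioned on text."""
    tokens = torch.full((1, 1), start_token, dtype=torch.long)
    for _ in range(steps):
        logits = model(tokens, text_features)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens  # discrete tokens the stage-one decoder would turn into audio

model = TinyAudioLM()
text_features = torch.randn(1, 12, 256)  # placeholder for T5 embeddings of a caption
audio_tokens = generate(model, text_features)
print(audio_tokens.shape)                # (1, 51)
```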
The results
AudioGen can produce a very large variety of sounds and combine them in a single sample; it can also extend a short musical excerpt into a longer piece of music.
The researchers asked raters recruited through the Amazon Mechanical Turk platform to rate audio samples on a scale of 1 to 100. Four models were evaluated: the CLIP-based DiffSound, with 400 million parameters, and three T5-based AudioGen variants ranging from 285 million to one billion parameters.
They were asked to rate both the quality of the sound and the text relevance, i.e. the correspondence between the audio and the caption. The 1-billion-parameter AudioGen model obtained the best results for quality and relevance (about 70 and 68, respectively), while DiffSound scored about 66 and 55.
It is possible to listen to some samples on this project page.
Limitations of AudioGen
The researchers concede that their model, while able to separate sources and create complex compositions, still lacks an understanding of temporal order within a scene. For example, it does not distinguish between a dog barking followed by a bird singing and a dog barking while a bird sings in the background.
However, this work may provide a basis for building better speech synthesis models. The proposed research could also open up future directions in comparative analysis, semantic audio editing, the separation of audio sources from discrete units, and more.
Article sources:
"AudioGen: Textually Guided Audio Generation"
arXiv:2209.15352v1, 30 Sep 2022.
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman (FAIR Team, Meta AI);
Yossi Adi (FAIR Team, Meta AI, and the Hebrew University of Jerusalem).
Translated from AudioGen, le modèle d’IA text-to-audio de Meta AI