Machine learning for content creation has recently exploded, and several Text-to-Audio (TTA) models have been published, showing astonishing results. However, questions remain regarding the capabilities of these models in more specific use cases, such as synthesizing sound effects and thereby supporting video game development. These models have the potential to reduce reliance on pre-recorded sounds or low-quality procedural audio, as well as to simplify the editing process, provided the generated audio meets the high quality standards of sound designers and gamers. Furthermore, since first generations may fall short of user expectations, a key question is whether such models can be controlled to edit audio. This work investigates four state-of-the-art TTA models (AudioGen, AudioLDM2, Stable Audio Open and AudioLCM), evaluating their sound effect generation using both objective metrics (FAD, KL-Divergence, CLAP) and a small-scale subjective listener study. The results show that these models frequently fail to generate accurate sounds when provided with detailed and specific prompts, a notable limitation given that sound effects need to fit particular scenarios. We selected the best-suited model for the use case, Stable Audio Open, for further investigation into editing capabilities. We adapted and implemented five distinct approaches for editing audio at inference time. We evaluated three of these methods, namely semantic guidance, DDPM inversion (Zeta) and the blending of cross-attention maps, in a user study involving groups of non-experts and audio professionals. The study's results indicate that Stable Audio Open lacks consistency in generating high-quality audio. Moreover, both groups agreed that, although success remains inconsistent across audio samples, meaningful editing is indeed achievable. 
The cross-attention blending method achieved a relevance score of 54/100, where relevance describes how closely the edited audio matches the editing prompt, while improving perceived audio quality by an average of 10 points. While the evaluated methods may not yet be ready for production-level deployment, the outcomes demonstrate clear potential for future development. Furthermore, our results show that CLAP, a widely used metric in the domain, does not align with subjective ratings and fails to reliably distinguish between source and edited prompts for sound effects. This underscores the need for more robust evaluation methods in the field of audio.
Theresa Anna Hösl