March 3, 2026, Open Access

Audio Foundation Models for Sound Effect Generation: Evaluation, Controllability, and Editability of State-of-the-Art Models

Key Points

Sound effect generation often lacks accuracy under specific prompts, limiting use cases.
The model Stable Audio Open was chosen for further investigation due to its editing capabilities.
Evaluation included user studies with methods like semantic guidance and cross-attention map blending.
The outcomes suggest a need for improved evaluation metrics to align subjective ratings with objective assessments.

Abstract

Machine learning for content creation has expanded rapidly in recent years, and several Text-to-Audio (TTA) models have been published with impressive results. However, questions remain regarding the capabilities of these models in more specific use cases, such as synthesizing sound effects to support video game development. These models have the potential to reduce reliance on pre-recorded sounds or low-quality procedural audio and to simplify the editing process, provided the generations meet the high quality standards of sound designers and gamers. Furthermore, since first generations may fall short of user expectations, a key question is whether such models can be controlled to edit audio. This work investigates four state-of-the-art TTA models (AudioGen, AudioLDM2, Stable Audio Open, and AudioLCM), evaluating their sound effect generation using both objective metrics (FAD, KL divergence, CLAP) and a small-scale subjective listener study. The results show that these models frequently fail to generate accurate sounds when given detailed and specific prompts, a notable limitation given that sound effects need to fit particular scenarios. We selected the model best suited to the use case, Stable Audio Open, for further investigation into editing capabilities. We adapted and implemented five distinct approaches for editing audio at inference time. We evaluated three of these methods, namely semantic guidance, DDPM inversion (Zeta), and the blending of cross-attention maps, in a user study involving groups of non-experts and audio professionals. The study's results indicate that Stable Audio Open lacks consistency in generating high-quality audio. Moreover, both groups agreed that, although success remains inconsistent across audio samples, meaningful editing is indeed achievable.
The cross-attention blending method achieved a relevance score of 54/100, where relevance describes how closely the edited audio matches the editing prompt, while improving perceived audio quality by an average of 10 points. While the evaluated methods may not yet be ready for production-level deployment, the outcomes demonstrate clear potential for future development. Furthermore, our results show that CLAP, a widely used metric in the domain, does not align with subjective ratings and fails to reliably distinguish between source and edited prompts for sound effects. This underscores the need for more robust evaluation methods in the field of audio.

Authors

Theresa Anna Hösl
