This article proposes an adaptive hybrid transformer framework for controllable audio (Foley) synthesis. It addresses the persistent “control gap” between the perceptual attributes a user intends (e.g., pitch and intensity) and the characteristics actually realized in diffusion-based generative latent spaces. The method integrates three complementary mechanisms: Gated Cross-Attention (GCA), which stabilizes multimodal fusion and suppresses irrelevant visual tokens, mitigating attention collapse and attention-sink behavior; Dynamic Attention Fusion (DAF), which assigns context-dependent modality weights using normalized Shannon entropy as a differentiable reliability proxy, improving robustness under modality degradation (e.g., visual noise or vague prompts); and improved Representation Alignment (iREPA), which distills structural knowledge from frozen teacher encoders to accelerate convergence while preserving the spatial and temporal structure relevant to synchronization. For parameter-efficient controllability, the framework employs LoRA/MoE-LoRA adapters as functional control bases, enabling fine-grained manipulation of acoustic attributes with minimal additional parameters. Quantitative evaluation uses controllability-specific metrics (CSS/COI) and automated validation via AuditEval-ssl; the authors report strong correlation with expert ratings and improved robustness in combined-noise scenarios.
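The abstract does not give the DAF formula, so the following is only a minimal sketch of how normalized Shannon entropy could serve as a reliability proxy for modality weighting. The mapping from entropy to weight (reliability = 1 − normalized entropy, followed by a softmax with a `temperature` parameter) and the names `normalized_entropy` and `fusion_weights` are assumptions made for illustration, not the authors' implementation.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector divided by log(n), so it lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(attn_by_modality, temperature=1.0):
    """Turn each modality's attention distribution into a fusion weight.

    Assumption (not from the paper): reliability = 1 - normalized entropy,
    and the weights are a softmax over reliability / temperature.
    """
    names = list(attn_by_modality)
    reliability = [1.0 - normalized_entropy(attn_by_modality[k]) for k in names]
    weights = softmax([r / temperature for r in reliability])
    return dict(zip(names, weights))


weights = fusion_weights({
    "visual": [0.25, 0.25, 0.25, 0.25],  # uniform attention: maximally uncertain
    "text": [0.85, 0.05, 0.05, 0.05],    # peaked attention: more confident
})
```

In this sketch the modality with the flatter (higher-entropy) attention distribution receives the smaller weight. In a trainable model the same computation would be written with tensor operations so that gradients flow through the weights.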
Вадим Мухін
Ярослав Хабло
Problems of Informatization and Management
Мухін et al. studied this question.
www.synapsesocial.com/papers/69f6e60f8071d4f1bdfc6afc — DOI: https://doi.org/10.18372/2073-4751.85.21098