May 3, 2026 · Open Access

Adaptive hybrid transformers for controllable audio synthesis via representation alignment and dynamic modality weighting

Key Points

The study aims to bridge the control gap in audio synthesis by aligning user intentions with generated attributes.
Developed an adaptive hybrid transformer framework for audio synthesis.
Integrated gated cross-attention, dynamic attention fusion, and improved representation alignment.
Evaluated using controllability-specific metrics and automated validation.
Automated validation scores correlated strongly with expert ratings of the synthesised audio.
Demonstrated improved robustness under combined-noise conditions, such as visual noise or vague prompts.
Enabled fine-grained control of acoustic attributes with minimal additional parameters.
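
The "minimal additional parameters" point can be made concrete with a generic low-rank adapter. The following is a minimal sketch of a standard LoRA update on a single linear layer, not the authors' implementation; the layer size (1024 x 1024), rank 8, and the helper names `matvec`, `lora_forward`, and `lora_param_counts` are illustrative assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. The rank r is read from the number of rows of A.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full weight, parameters in the adapter)."""
    return d_out * d_in, rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(1024, 1024, 8)

# With B initialised to zero, the adapter leaves the frozen layer unchanged.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B = [[0], [0]]
y = lora_forward(W, A, B, [2, 3])
```

Under these assumed sizes the adapter holds 16,384 trainable values against 1,048,576 in the frozen weight, roughly 1.6% of the layer.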

Abstract

This article proposes an adaptive hybrid transformer framework for controllable audio (Foley) synthesis that addresses the persistent “control gap” between user-intended perceptual attributes (e.g., pitch and intensity) and the characteristics realized in diffusion-based generative latent spaces. The method integrates three complementary mechanisms: Gated Cross-Attention (GCA), which stabilizes multimodal fusion and suppresses irrelevant visual tokens, mitigating attention collapse and attention-sink behavior; Dynamic Attention Fusion (DAF), which assigns context-dependent modality weights using normalized Shannon entropy as a differentiable reliability proxy, improving robustness under modality degradation (e.g., visual noise or vague prompts); and improved Representation Alignment (iREPA), which distills structural knowledge from frozen teacher encoders to accelerate convergence while preserving the spatial and temporal structure relevant to synchronization. For parameter-efficient controllability, the framework employs LoRA/MoE-LoRA adapters as functional control bases, enabling fine-grained manipulation of acoustic attributes with minimal additional parameters. Quantitative evaluation uses controllability-specific metrics (CSS/COI) and automated validation via AuditEval-ssl, demonstrating strong correlation with expert ratings and improved robustness in combined-noise scenarios.
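
To illustrate how normalized Shannon entropy could act as a reliability proxy for modality weighting, here is a minimal sketch. It assumes that a more peaked attention distribution (lower entropy) signals a more reliable modality and that weights come from a softmax over `1 - entropy`; the abstract does not specify either choice, and the function names, `temperature` parameter, and sample distributions are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, divided by log(n) so it lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modality_weights(attn_by_modality, temperature=1.0):
    """Turn per-modality attention distributions into fusion weights.

    Assumption: reliability = 1 - normalized entropy, so a diffuse
    (near-uniform) attention map is down-weighted.
    """
    names = list(attn_by_modality)
    reliability = [1.0 - normalized_entropy(attn_by_modality[n]) for n in names]
    weights = softmax([r / temperature for r in reliability])
    return dict(zip(names, weights))


# Hypothetical attention distributions over four tokens per modality.
attn = {
    "text": [0.85, 0.05, 0.05, 0.05],   # peaked: treated as reliable
    "video": [0.25, 0.25, 0.25, 0.25],  # uniform: e.g. degraded frames
}
weights = modality_weights(attn)
```

In this sketch the uniform video distribution has normalized entropy 1, so the text modality receives the larger weight. Every step is built from smooth operations, which is the property that would let such a weighting be trained end to end in an autodiff framework.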

Authors

Вадим Мухін

Ярослав Хабло

Journals

Problems of Informatization and Management
