What question did this study set out to answer?

This research aims to address challenges in AI music generation by developing an innovative framework that improves emotional representation and audio quality.

May 29, 2026 · Open Access

Collaborative optimisation of emotion regulation and audio synthesis based on PerformanceNet and multi-emotional music generation model

Key points

Developed PEMF, an end-to-end framework integrating PerformanceNet with a multi-emotion music generation model.
Implemented a four-dimensional chord encoding method and a dual-encoding transformer architecture for melody and chord processing.
Utilized an asymmetric U-net structure with a multi-band residual mechanism for audio synthesis.
Achieved a chord vocabulary coverage near 1.0 and an emotion recognition accuracy of 92.3%, compared with 78.6% for a symbolic transformer.
Retained 89.1% of high-frequency energy, with a Fréchet audio distance of 0.5.
Demonstrated a 36.9% improvement in emotional consistency and a 64.2% reduction in latency compared with staged training.
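
As a rough illustration of the four-dimensional chord encoding mentioned above, the sketch below represents each chord as a (root, third, fifth, crown) tuple of pitch classes. The summary does not spell out the encoding, so everything here is an assumption: the names `QUALITIES`, `ChordCode`, and `encode_chord` are hypothetical, the "crown" is interpreted as an optional top note such as a seventh, and the five chord qualities were picked only for illustration. Twelve roots times five qualities happens to give 60 entries, but the paper's actual 60-type vocabulary may be composed differently.

```python
from typing import NamedTuple

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical quality table (not from the paper): semitone offsets above
# the root for (third, fifth, crown). A crown offset of 0 means "no crown".
QUALITIES = {
    "maj": (4, 7, 0),
    "min": (3, 7, 0),
    "dim": (3, 6, 0),
    "aug": (4, 8, 0),
    "7": (4, 7, 10),
}


class ChordCode(NamedTuple):
    root: int   # pitch class 0-11
    third: int  # pitch class 0-11
    fifth: int  # pitch class 0-11
    crown: int  # pitch class 0-11, or -1 when the chord has no crown note


def encode_chord(root_name: str, quality: str) -> ChordCode:
    """Encode a chord symbol as four pitch classes (assumed layout)."""
    root = NOTE_NAMES.index(root_name)
    third, fifth, crown = QUALITIES[quality]
    return ChordCode(
        root=root,
        third=(root + third) % 12,
        fifth=(root + fifth) % 12,
        crown=(root + crown) % 12 if crown else -1,
    )


# Every root/quality combination yields a distinct four-tuple.
VOCAB = {encode_chord(n, q) for n in NOTE_NAMES for q in QUALITIES}

if __name__ == "__main__":
    print(encode_chord("G", "7"))    # G dominant seventh: G, B, D, F
    print(encode_chord("A", "min"))  # A minor: A, C, E, no crown
    print(len(VOCAB))
```

Keeping the four components as separate fields, rather than a single chord-class index, is one way a model could share information between chords that have notes in common.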

Abstract

In response to three major challenges in AI music generation (limited chord representation, monotonous emotions, and low audio fidelity), this research proposes a novel end-to-end framework, termed PEMF, that integrates PerformanceNet with a multi-emotion music generation model. The core innovations include a structured four-dimensional chord encoding method that uses root, third, fifth, and crown notes to expand harmonic diversity to 60 chord types; a dual-encoding transformer architecture that processes melody and chord streams independently for better structural coherence; and a fine-grained emotion regulation mechanism that maps pitch histograms and rhythm density parameters to Russell's two-dimensional emotion space for continuous control. For audio synthesis, an asymmetric U-net structure combined with a multi-band residual learning mechanism and a flooding loss strategy significantly enhances spectral fidelity and training stability. Experimental results demonstrate that PEMF achieves a chord vocabulary coverage near 1.0, an emotion recognition accuracy of 92.3% (significantly outperforming the symbolic transformer's 78.6%), a high-frequency energy retention rate of 89.1%, and a Fréchet audio distance of 0.5. System performance shows a 36.9% improvement in emotional consistency and a 64.2% reduction in latency compared with staged training, validating its efficacy in practical applications such as music therapy and film scoring.
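
The flooding loss strategy named in the abstract is not detailed here. A minimal sketch, assuming it follows the commonly used flooding formulation in which the training loss L is replaced by |L - b| + b for a flood level b, might look like the following. The function names and the flood level of 0.05 are hypothetical and are not taken from the paper.

```python
def flooded_loss(loss: float, flood_level: float) -> float:
    """Flooding: |loss - b| + b, which never drops below the flood level b."""
    return abs(loss - flood_level) + flood_level


def flooding_grad_sign(loss: float, flood_level: float) -> float:
    """Factor applied to the original gradient under flooding.

    +1.0 above the flood level (ordinary descent), -1.0 below it (the
    update direction is reversed), and 0.0 exactly at the flood level.
    """
    if loss > flood_level:
        return 1.0
    if loss < flood_level:
        return -1.0
    return 0.0


if __name__ == "__main__":
    b = 0.05  # hypothetical flood level
    for raw in (0.30, 0.05, 0.02):
        print(raw, flooded_loss(raw, b), flooding_grad_sign(raw, b))
```

Above the flood level the loss is unchanged; below it, the sign flip pushes the training loss back up toward b, which is the mechanism usually credited with stabilising training and limiting overfitting.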
