What question did this study set out to answer?

The aim is to enhance intelligent music composition by improving the alignment between text and music through an innovative framework.

April 10, 2026 | Open Access

Cross-modal deep collaborative networks for intelligent music composition and interactive performance

Key Points

Developed a text and music-based generation net (TMG-Net) framework.
Utilized Transformer and Audio Transformer for semantic feature capture.
Employed contrastive learning for cross-modal alignment.
Implemented a conditional diffusion model for Mel-spectrogram generation.
Reconstructed audio using a vocoder to ensure high fidelity.
TMG-Net outperformed MuseGAN, Restyle-MusicVAE, and Mustango on critical metrics.
Achieved significant improvements in Fréchet Audio Distance (FAD), CLAP score, and Recall at 10 (R@10).
Demonstrated capability in aligning textual semantics with high-quality music generation.
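
As an illustration of the Recall at 10 (R@10) metric named above, the following is a minimal sketch of how text-to-music retrieval recall is commonly computed. The source does not describe its evaluation code, so the function names, the cosine-similarity ranking, and the pairing convention (text query i belongs with audio clip i) are assumptions made for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(text_embs, audio_embs, k=10):
    """Fraction of text queries whose paired audio clip ranks in the top k.

    Assumption: text_embs[i] and audio_embs[i] form the ground-truth pair.
    """
    hits = 0
    for i, text_vec in enumerate(text_embs):
        sims = [cosine(text_vec, audio_vec) for audio_vec in audio_embs]
        # Indices of audio clips, most similar first.
        ranked = sorted(range(len(audio_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)
```

Under these assumptions, a higher value means that text descriptions retrieve their matching music clips more reliably; with k=10 the function corresponds to R@10.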

Abstract

With the continued growth of intelligent music generation and rising demand for human–computer interaction, music composition driven by natural language descriptions has become a significant research direction. However, existing approaches remain limited in semantic alignment between text and music, in the audio quality of generated music, and in controllability, which makes it difficult to balance high fidelity with semantic consistency. To address these limitations, this study proposes a text and music-based generation net (TMG-Net) framework that integrates a Transformer, an Audio Transformer, contrastive learning, and diffusion-based generation for text-driven music generation. The framework employs a Transformer to capture the semantic features of text and an Audio Transformer to capture the time–frequency information of audio, and achieves cross-modal alignment through contrastive learning. On this basis, a conditional diffusion model generates Mel-spectrograms, which a vocoder then reconstructs into high-fidelity music, enhancing both the naturalness and the semantic consistency of the generated music. Experiments on two public benchmarks, MusicCaps and the Song Describer Dataset, demonstrate that TMG-Net significantly outperforms representative methods such as MuseGAN, Restyle-MusicVAE, and Mustango on three key metrics, namely Fréchet Audio Distance (FAD), Contrastive Language-Audio Pretraining (CLAP) score, and Recall at 10 (R@10), while approaching the performance of MusicLLM. These results indicate that TMG-Net can align generated music with textual semantics while preserving audio quality, offering a novel technological pathway and application potential for intelligent music creation and interactive performance.
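
The cross-modal alignment step can be illustrated with a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired text and audio embeddings. The abstract does not state the exact loss TMG-Net uses, so this formulation, the function names, and the temperature value of 0.07 are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(text_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE loss; text_embs[i] is paired with audio_embs[i].

    The temperature default is an assumed value, not taken from the paper.
    """
    t = [l2_normalize(v) for v in text_embs]
    a = [l2_normalize(v) for v in audio_embs]
    n = len(t)
    # logits[i][j]: scaled cosine similarity between text i and audio j.
    logits = [
        [sum(x * y for x, y in zip(t[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    # Text-to-audio and audio-to-text cross-entropy with the diagonal as target.
    loss_t2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_a2t = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2a + loss_a2t)
```

In this sketch the loss is small when each text embedding is most similar to its own audio clip and large when it is closer to a mismatched clip, which is the behaviour that pulls the two modalities into a shared embedding space.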

Authors

Junyu Li

Yang Yu

Y. Shao

Journals

PeerJ Computer Science

Institutions

Nanjing Normal University

National University of Computer and Emerging Sciences

Nanjing Xiaozhuang University
