As intelligent music generation matures and demand for human–computer interaction grows, composing music from natural language descriptions has become a significant research direction. Existing approaches, however, remain limited in semantic alignment between text and music, in the audio quality of the generated output, and in controllability, which makes it difficult to balance high fidelity with semantic consistency. To address this, the study proposes TMG-Net (text and music-based generation net), a framework for text-driven music generation that integrates a Transformer, an Audio Transformer, contrastive learning, and diffusion-based generation. The framework employs Transformers to capture the semantic features of text and the time–frequency information of audio, and achieves cross-modal alignment through contrastive learning. On this basis, a conditional diffusion model generates Mel-spectrograms, which a vocoder then reconstructs into high-fidelity music, improving both the naturalness and the semantic consistency of the output. Experiments on two public benchmarks, MusicCaps and the Song Describer Dataset, show that TMG-Net significantly outperforms representative methods such as MuseGAN, Restyle-MusicVAE, and Mustango on three key metrics: Fréchet Audio Distance (FAD), Contrastive Language-Audio Pretraining (CLAP) score, and Recall at 10 (R@10), while approaching the performance of MusicLLM. These results indicate that TMG-Net can align with textual semantics while preserving audio quality, offering a new technical pathway for intelligent music creation and interactive performance.
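The abstract names contrastive learning for text–audio alignment and R@10 as a retrieval metric but gives no formulas. The sketch below shows one common way both are computed, purely as an illustration: it assumes a symmetric InfoNCE-style loss over paired text and audio embeddings (as popularized by CLIP- and CLAP-style training) and text-to-audio retrieval for recall. The function names, the temperature of 0.07, and the embedding shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, not confirmed by the paper).

    Row i of text_emb and row i of audio_emb are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    t = l2_normalize(text_emb)
    a = l2_normalize(audio_emb)
    logits = t @ a.T / temperature            # shape: (batch, batch)
    idx = np.arange(len(t))
    loss_t2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> audio
    loss_a2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> text
    return 0.5 * (loss_t2a + loss_a2t)


def recall_at_k(text_emb, audio_emb, k=10):
    """Fraction of text queries whose matched audio clip ranks in the top k."""
    sims = l2_normalize(text_emb) @ l2_normalize(audio_emb).T
    true_scores = np.diag(sims)[:, None]
    # Rank of the matched clip = number of clips scoring strictly higher.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())


# Toy data (hypothetical): 32 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(32, 16))
text_aligned = audio + 0.01 * rng.normal(size=(32, 16))  # near-perfect alignment
text_random = rng.normal(size=(32, 16))                  # no alignment

print("loss (aligned):", symmetric_contrastive_loss(text_aligned, audio))
print("loss (random): ", symmetric_contrastive_loss(text_random, audio))
print("R@10 (aligned):", recall_at_k(text_aligned, audio, k=10))
```

On the toy data, well-aligned embeddings should give a much lower loss and a higher recall than unrelated ones, which is the behavior a contrastive objective is meant to encourage.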
Junyu Li
Yang Yu
Y. Shao
PeerJ Computer Science
Nanjing Normal University
National University of Computer and Emerging Sciences
Nanjing Xiaozhuang University
Li et al. studied this question.
www.synapsesocial.com/papers/69d896046c1944d70ce07417 — DOI: https://doi.org/10.7717/peerj-cs.3749