Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial.

Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the 'prompt'), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. However, existing methods typically require a prompt-labelled speech corpus for jointly training a prompt-conditioned encoder.

In contrast, we employ an LLM to directly suggest prosodic modifications for a controllable TTS model, using contextual information provided in the prompt. The prompt can be designed for a multitude of tasks. Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context. The proposed method is rated most appropriate in 50% of cases vs. 31% for a baseline model.
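The general idea of letting an LLM suggest prosodic modifications can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the control features (`pitch`, `energy`, `duration`), the JSON reply format, the [-1, 1] range, and the `fake_llm` stub that stands in for a real model call. The sketch only shows the shape of the pipeline: build a prompt from the text and its context, then parse the reply into bounded control values that a controllable TTS model could consume.

```python
import json
from dataclasses import dataclass


@dataclass
class ProsodyControls:
    """Hypothetical relative prosody adjustments, each in [-1, 1]."""
    pitch: float = 0.0
    energy: float = 0.0
    duration: float = 0.0


def build_prompt(text: str, context: str) -> str:
    """Combine the target text and its context into a single LLM prompt."""
    return (
        "You control a text-to-speech system.\n"
        f"Context: {context}\n"
        f"Text to speak: {text}\n"
        'Reply with JSON only, for example '
        '{"pitch": 0.0, "energy": 0.0, "duration": 0.0}, '
        "with each value between -1 and 1."
    )


def _clamp(value) -> float:
    """Coerce a value to float and clamp it to [-1, 1]; fall back to 0.0."""
    try:
        return max(-1.0, min(1.0, float(value)))
    except (TypeError, ValueError):
        return 0.0


def parse_controls(reply: str) -> ProsodyControls:
    """Parse the LLM reply; return neutral controls if it is not usable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return ProsodyControls()
    if not isinstance(data, dict):
        return ProsodyControls()
    return ProsodyControls(
        pitch=_clamp(data.get("pitch", 0.0)),
        energy=_clamp(data.get("energy", 0.0)),
        duration=_clamp(data.get("duration", 0.0)),
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed, partly out-of-range reply."""
    return '{"pitch": 0.4, "energy": 1.7, "duration": -0.2}'


prompt = build_prompt("I can't believe you did that.", "The speaker is excited.")
controls = parse_controls(fake_llm(prompt))
print(controls)
```

Clamping and the neutral fallback are design choices of this sketch: an LLM reply is free-form text, so out-of-range or malformed suggestions are bounded or ignored before they reach the synthesiser.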
Sigurgeirsson et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68e7398bb6db6435876b2fda — DOI: https://doi.org/10.1109/icassp48485.2024.10448400
Atli Sigurgeirsson
Simon King
University of Edinburgh