What question did this study set out to answer?

This research aims to enhance gesture generation in dialogues by integrating both speaking and listening behaviors into a single model.

April 17, 2026

Conversational Gesture Model (CGM): Extending Speaker‐Centric Audio‐Driven Motion Generation to Full Conversation Gestures

Key Points

Developed the Conversational Gesture Model (CGM) using cross-attention mechanisms.
Fused interlocutor audio and text features with the character's gesture encodings.
Tested the model for producing gestures that align with conversational cues like tone and semantics.
Generated conversational gestures while preserving the quality of speaker-driven gestures.
Improved realism and coherence of conversational interactions.
Enhanced responsiveness to both speaker and listener roles in dialogues.

Abstract

In this work, we extend speaker‐centric audio‐driven gesture synthesis toward a unified conversational model that jointly captures both speaking and listening behaviors. Existing speaker‐centric models effectively generate gestures aligned with speech but overlook the bidirectional dynamics that characterize natural dialogue. To address this limitation, we propose the Conversational Gesture Model (CGM), a cross‐attention‐based model capable of synthesizing gestures conditioned on interlocutor conversational cues such as gestures, tone, and textual semantics. By leveraging cross‐attention mechanisms, the model fuses interlocutor audio and text features with character gesture encodings, so that a single system can alternate between the speaking and listening roles of the same character and capture the fluid role shifts and mutual influence inherent in conversation. Experiments demonstrate that this approach preserves the quality of speaker‐driven gestures while significantly improving the realism, coherence, and responsiveness of full conversational interactions.
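
The fusion step the abstract describes can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, in which the character's gesture encodings act as queries and the interlocutor's audio and text features act as keys and values. This is not the authors' implementation: the function names, the toy feature dimension of 4, the absence of learned projections, and the residual addition are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(gesture_queries, cue_features):
    """Fuse gesture encodings with interlocutor cues (hypothetical helper).

    gesture_queries: one vector per gesture frame (used as queries).
    cue_features: one vector per audio/text cue (used as keys and values).
    Returns the fused frames and the attention weights for each frame.
    """
    dim = len(cue_features[0])
    scale = math.sqrt(dim)
    fused, all_weights = [], []
    for query in gesture_queries:
        # Similarity of this gesture frame to every interlocutor cue.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in cue_features
        ]
        weights = softmax(scores)
        # Weighted sum of the cue vectors (values equal keys in this sketch).
        context = [
            sum(w * cue[j] for w, cue in zip(weights, cue_features))
            for j in range(dim)
        ]
        # Residual addition keeps the original gesture encoding (an assumption).
        fused.append([q + c for q, c in zip(query, context)])
        all_weights.append(weights)
    return fused, all_weights


# Toy inputs: two gesture frames, one audio cue, and two text cues.
gesture = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
audio_cues = [[2.0, 0.0, 0.0, 0.0]]
text_cues = [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
cues = audio_cues + text_cues  # concatenate cues along the sequence axis

fused, weights = cross_attention(gesture, cues)
```

In this toy setup each gesture frame places most of its attention on the cue it is most similar to, which is the behavior that would let one model condition the same character's motion on whichever interlocutor signal is most relevant at a given moment. A real system would add learned query, key, and value projections and multiple heads.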
