Adaptive multimodal generation now enables artificial interlocutors that perceive speech, gaze, and gesture simultaneously and adjust feedback within milliseconds. Leveraging these advances, the present study engineers and validates a learner-adaptive system that fuses wav2vec-based speech recognition, a vision transformer for non-verbal cues, and a diffusion-avatar prompt engine trained through reinforcement learning with human fluency rubrics as the reward. One hundred twenty intermediate English learners (B1–B2) practised for twelve weeks either with the agent or under a teacher-led communicative syllabus. Fine-grained telemetry captured 63 948 utterances, 5.7 million prosodic frames, and 173 hours of video. Mixed-effects growth modelling shows that the AI group improved words-per-minute by 48.6 wpm (95 % CI = 42.4–54.8), increased mean-length-of-run by 3.91 syllables (CI = 3.34–4.48), and reduced filled-pause density by 6.3 pauses per 100 words (CI = 5.1–7.5), outperforming controls on all endpoints (p < 0.001). Learner diaries corroborate the quantitative gains, citing lower anxiety and heightened prosodic experimentation. The findings indicate that synchronising cross-modal analytics with real-time generative feedback yields substantial fluency gains, and they offer design principles for scalable AI-assisted speaking tutors.
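The three fluency endpoints named in the abstract can be illustrated with a minimal Python sketch. It is not the study's pipeline: the `Run` structure, the `FILLED_PAUSES` inventory, and the choice to count filled pauses as words are all assumptions made here for illustration, and the abstract does not state how the authors operationalised each measure.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed inventory of filled pauses; the study's own list is not given.
FILLED_PAUSES = {"uh", "um", "er", "erm"}


@dataclass
class Run:
    """A stretch of speech between silent pauses (hypothetical structure)."""
    words: List[str]
    syllables: int


def fluency_metrics(runs: List[Run], speaking_seconds: float) -> Dict[str, float]:
    """Compute the three endpoints for one speech sample.

    Assumption: filled pauses are counted as words in the word total.
    """
    all_words = [w.lower() for run in runs for w in run.words]
    n_words = len(all_words)
    n_fillers = sum(w in FILLED_PAUSES for w in all_words)
    return {
        # Words spoken per minute of speaking time.
        "wpm": n_words / (speaking_seconds / 60),
        # Average number of syllables per run.
        "mean_length_of_run": sum(r.syllables for r in runs) / len(runs),
        # Filled pauses per 100 words.
        "filled_pauses_per_100_words": 100 * n_fillers / n_words,
    }


sample = [
    Run(words=["I", "um", "went", "home"], syllables=5),
    Run(words=["it", "was", "late"], syllables=4),
]
metrics = fluency_metrics(sample, speaking_seconds=30)
```

Under these assumptions, a learner's gain on each endpoint is the difference between the metric computed on a post-test sample and on a pre-test sample; the reported group effects come from the mixed-effects growth models, which this sketch does not reproduce.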
Ye Li
Yan Liang
Theoretical and Natural Science
University of Edinburgh
Changchun University
Li et al. studied this question.
www.synapsesocial.com/papers/68c1a25a54b1d3bfb60dd266 — DOI: https://doi.org/10.54254/2753-8818/2025.25631