Adaptive multimodal generation now enables artificial interlocutors that perceive speech, gaze, and gesture simultaneously and adjust feedback within milliseconds. Leveraging these advances, the present study engineers and validates a learner-adaptive system that fuses wav2vec-based speech recognition, a vision transformer for non-verbal cues, and a diffusion-avatar prompt engine trained through reinforcement learning, with human fluency rubrics as the reward signal. One hundred twenty intermediate English learners (B1–B2) practised for twelve weeks either with the agent or under a teacher-led communicative syllabus. Fine-grained telemetry captured 63,948 utterances, 5.7 million prosodic frames, and 173 hours of video. Mixed-effects growth modelling shows that the AI group increased speech rate by 48.6 words per minute (95% CI = 42.4–54.8), increased mean length of run by 3.91 syllables (CI = 3.34–4.48), and reduced filled-pause density by 6.3 pauses per 100 words (CI = 5.1–7.5), outperforming controls on all endpoints (p < 0.001). Learner diaries corroborate the quantitative gains, citing lower anxiety and heightened prosodic experimentation. The findings indicate that synchronising cross-modal analytics with real-time generative feedback yields substantial fluency dividends, and they offer design principles for scalable AI-assisted speaking tutors.
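The three fluency endpoints named in the abstract (speech rate, mean length of run, filled-pause density) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Token` structure, the filler inventory, the 0.25 s silence threshold that ends a run, and the choice to let only silent gaps (not fillers) break a run are all assumptions made here for illustration.

```python
from dataclasses import dataclass

# Assumptions for illustration only; the paper does not specify these values.
FILLERS = {"um", "uh", "er", "erm"}   # hypothetical filled-pause inventory
PAUSE_THRESHOLD_S = 0.25              # hypothetical silence length that ends a run


@dataclass
class Token:
    text: str       # transcribed word or filler
    start: float    # onset time in seconds
    end: float      # offset time in seconds
    syllables: int  # syllable count of the token


def fluency_metrics(tokens):
    """Return speech rate, mean length of run, and filled-pause density."""
    words = [t for t in tokens if t.text.lower() not in FILLERS]
    if not words:
        raise ValueError("need at least one non-filler word")
    duration_min = (tokens[-1].end - tokens[0].start) / 60.0
    if duration_min <= 0:
        raise ValueError("speech sample must have positive duration")
    filler_count = len(tokens) - len(words)

    # A run is a stretch of speech bounded by silent gaps >= the threshold.
    # Fillers contribute no syllables but do not end a run in this sketch.
    runs = [0]
    prev_end = None
    for t in tokens:
        if prev_end is not None and t.start - prev_end >= PAUSE_THRESHOLD_S:
            runs.append(0)
        if t.text.lower() not in FILLERS:
            runs[-1] += t.syllables
        prev_end = t.end
    runs = [r for r in runs if r > 0]

    return {
        "words_per_minute": len(words) / duration_min,
        "mean_length_of_run": sum(runs) / len(runs),
        "filled_pauses_per_100_words": 100.0 * filler_count / len(words),
    }


# Usage with a made-up four-token sample.
sample = [
    Token("I", 0.0, 0.2, 1),
    Token("um", 0.3, 0.5, 1),
    Token("like", 0.6, 0.9, 1),
    Token("apples", 1.5, 2.0, 2),
]
metrics = fluency_metrics(sample)
```

Per-learner values computed this way at each session would be the kind of repeated measure that a mixed-effects growth model takes as its outcome.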
Li et al. studied this question.
www.synapsesocial.com/papers/68c1a25a54b1d3bfb60dd266 — DOI: https://doi.org/10.54254/2753-8818/2025.25631
Ye Li
Yan Liang
Theoretical and Natural Science
University of Edinburgh
Changchun University