To address insufficient multimodal data fusion and weak analysis of temporal dynamics in current teaching-effect evaluation, this study proposes a multimodal Transformer-based time series analysis model that integrates visual, auditory, and textual features to reveal how teaching effectiveness evolves over a lesson. A dataset containing 500 hours of classroom teaching video, audio, and text records is first constructed. ResNet-50 extracts the teacher’s gesture and expression features, openSMILE extracts vocal emotion parameters, and BERT extracts semantic vectors of teacher-student dialogue. A cross-modal attention mechanism aligns and fuses these features, and a temporal convolution layer captures sequential dependencies in teaching behavior. Finally, a fully connected layer outputs a teaching-effect score (0-100 points) every 10 minutes. In experiments, the model reaches an accuracy of 89.7% and an F1 score of 0.87. The time series analysis finds that students’ accuracy on cognitive tasks drops to 44.6% at the 15th minute of the course and to 14.1% at the 33rd minute. The results show that the model can effectively quantify teaching dynamics, provide a data basis for real-time adjustment of teaching methods, and support a shift in personalized education decision-making from static evaluation to dynamic optimization.
Han Li (Thu,) studied this question.
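The fusion pipeline described in the abstract (cross-modal attention, then a temporal convolution, then a fully connected scoring layer) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy feature vectors, the causal zero-padding, the mean pooling over time, and the sigmoid scaling to the 0-100 range are all assumptions made for illustration, and a real model would use learned projections and a deep-learning framework.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query step (e.g. text) attends
    over every time step of another modality (e.g. audio)."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

def temporal_conv(seq, kernel):
    """Causal 1-D convolution over time, applied to each feature
    independently, with implicit zero-padding on the left (assumption)."""
    k = len(kernel)
    out = []
    for t in range(len(seq)):
        row = []
        for j in range(len(seq[0])):
            acc = 0.0
            for i, w in enumerate(kernel):
                idx = t - (k - 1) + i
                if idx >= 0:
                    acc += w * seq[idx][j]
            row.append(acc)
        out.append(row)
    return out

def score_head(seq, weights, bias):
    """Mean-pool over time, apply one linear layer, and squash to 0-100
    with a sigmoid (the pooling and scaling are assumptions)."""
    dim = len(seq[0])
    pooled = [sum(row[j] for row in seq) / len(seq) for j in range(dim)]
    z = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 100.0 / (1.0 + math.exp(-z))

# Toy, hand-written features standing in for BERT (text) and openSMILE (audio).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3], [0.8, 0.1]]

fused = cross_modal_attention(text, audio, audio)
smoothed = temporal_conv(fused, kernel=[0.25, 0.25, 0.5])
score = score_head(smoothed, weights=[0.4, -0.2], bias=0.1)
print(len(fused), round(score, 1))
```

Because the attention weights form a convex combination, each fused vector stays within the range of the attended modality's features, which is one reason attention is a convenient way to align sequences of different lengths (here three text steps against four audio steps).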