To address the problem that current English vocabulary pronunciation evaluation models cannot fully extract feature information from the different dimensions of spectrograms, this paper designs a multi-dimensional audio feature extraction algorithm based on multi-scale dilated convolution. The algorithm first constructs a shallow feature refinement module that uses parallel convolutions to capture shallow features of the Mel-frequency cepstral coefficients (MFCCs) along three dimensions: time, frequency, and time-frequency. It then combines the Res2Net structure, dilated convolution, and channel attention to capture finer-grained multi-scale information from these shallow multi-dimensional features. Next, a global feature fusion module with a multiplicative gating mechanism enhances cross-scale feature fusion. Finally, support vector machines optimised by a differential evolution algorithm score the multi-dimensional features. Experimental results indicate that the proposed model reaches an average evaluation accuracy of 94.57%, outperforming the comparison models and providing an objective and accurate assessment of English vocabulary pronunciation.
Can Du (Thu,) studied this question.
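
To make the idea of multi-scale dilated convolution with multiplicative gating more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names `dilated_conv1d` and `gated_fusion`, the toy kernel, the dilation rates, and the choice of a sigmoid gate computed from each branch itself are all assumptions made for illustration. The sketch only shows how the same kernel covers a wider context as the dilation rate grows, and how branch outputs can be weighted by a gate before being summed.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D dilated convolution (cross-correlation form).

    Kernel taps are spaced `dilation` samples apart, so the receptive
    field is (len(kernel) - 1) * dilation + 1 samples.
    """
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(branches):
    """Fuse branches with a multiplicative gate (illustrative assumption).

    Each branch is cropped to the shortest length, multiplied element-wise
    by a sigmoid gate derived from its own values, and the results are summed.
    """
    n = min(len(b) for b in branches)
    fused = [0.0] * n
    for branch in branches:
        for i in range(n):
            fused[i] += sigmoid(branch[i]) * branch[i]
    return fused


if __name__ == "__main__":
    signal = list(range(10))   # toy stand-in for one feature row
    kernel = [1, 0, -1]        # hypothetical difference kernel
    branches = [dilated_conv1d(signal, kernel, d) for d in (1, 2, 3)]
    for d, branch in zip((1, 2, 3), branches):
        print(f"dilation={d}: receptive field={2 * d + 1}, output={branch}")
    print("fused:", gated_fusion(branches))
```

In a real model the gate would typically be produced by learned layers, and the convolutions would run over two-dimensional feature maps; the one-dimensional version here is kept only to show the indexing pattern.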