As new exam question types continue to emerge, accurate classification of knowledge points is crucial for intelligent exam analysis. Existing methods rely on text alone or on text–image fusion and largely ignore spatial layout. To address this limitation, we propose a heterogeneous layout-aware cross-modal framework for knowledge point classification. The architecture begins with an encoding module in which independent text and layout encoders extract semantic content and spatial configurations, respectively. We then design a layout-aware enhancing module consisting of two parallel cross-modal blocks: a Layout-Aware Text-Enhancing block and a Context-Aware Layout-Enhancing block. This module fuses text and layout features bidirectionally and produces a comprehensive representation that integrates semantic and spatial information. Furthermore, a dynamic router with top-k expert selection adapts to question-specific knowledge distributions and focuses on core knowledge points for precise classification. Experimental results on the proposed QType-EDU dataset show that our method effectively integrates text and layout information and significantly improves performance. The approach achieves 91.56% accuracy for coarse-grained classification and 80.58% for fine-grained classification, with an overall F1-score of 91.39%, surpassing all baseline models.
Su et al. studied this question.
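The bidirectional fusion and top-k expert routing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the attention form, the pooling step, or the expert architecture, so everything below is an assumption made for illustration: single-head scaled dot-product cross-attention with a residual connection, mean pooling followed by concatenation, and linear experts. The names `cross_attend`, `mean_pool`, and `top_k_route` are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Assumed form of one cross-modal block: each query vector attends
    over the other modality's vectors and adds the result residually."""
    dim = len(queries[0])
    enhanced = []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(dim) for c in context])
        mixed = [sum(w * c[j] for w, c in zip(weights, context))
                 for j in range(dim)]
        enhanced.append([qj + mj for qj, mj in zip(q, mixed)])
    return enhanced


def mean_pool(rows):
    """Average a sequence of vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_k_route(fused, gate_weights, experts, k=2):
    """Score every expert, keep the k highest, renormalize their gates,
    and mix the selected experts' class scores.

    experts[i] is a list of per-class weight vectors (assumed linear experts).
    """
    logits = [dot(fused, w) for w in gate_weights]
    chosen = sorted(range(len(logits)),
                    key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in chosen])
    num_classes = len(experts[0])
    class_scores = [0.0] * num_classes
    for gate, i in zip(gates, chosen):
        for c in range(num_classes):
            class_scores[c] += gate * dot(fused, experts[i][c])
    return class_scores, chosen, gates


# Toy inputs (made up): 3 text tokens and 2 layout regions, 2 dimensions each.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layout = [[0.5, 0.5], [1.0, -1.0]]

text_enh = cross_attend(text, layout)      # text enhanced by layout
layout_enh = cross_attend(layout, text)    # layout enhanced by text
fused = mean_pool(text_enh) + mean_pool(layout_enh)  # concatenation, 4 dims

# Three hypothetical experts, each scoring two classes.
gate_weights = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
experts = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
]
scores, chosen, gates = top_k_route(fused, gate_weights, experts, k=2)
```

Only the `k` selected experts contribute to `scores`, and their gate values are renormalized to sum to one, which is the usual way a sparse top-k router concentrates on a few experts per input. A trained model would learn `gate_weights` and the experts rather than fixing them by hand.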