Effective software maintenance requires automated issue report classification due to increasing report volume and complexity. Although fine-tuned BERT models and Large Language Models (LLMs) have shown potential in this field, they face critical limitations in handling lengthy reports and ensuring classification consistency. This paper presents an LLM-based method for processing long reports and explores two classification perspectives, namely user intent understanding and example-based decision-making. On this basis, we propose three LLM-based methods: (1) Intent Extraction and Classification, which identifies and classifies user intent from issue reports; (2) Ensemble Classification, which enhances the intent-based method through majority voting; and (3) Explained Few-Shot Learning, which implements the example-based strategy with transparent rationales. We compare these methods against three baselines (a RoBERTa-based model, SETFIT, and a previous LLM-based method) using GPT-4o, GPT-3.5-turbo, and Qwen 2.5-32B, through an extensive evaluation that comprises consistency analysis, ablation studies, and an analysis of misclassification patterns. The results show that GPT-4o outperforms the state of the art by 5–8% and performs well across all methods. Furthermore, Qwen 2.5-32B performs better than the larger GPT-3.5-turbo, suggesting that factors beyond parameter count, including architectural design, training data composition, and optimization strategies, influence classification performance. The analysis yields a taxonomy of classification challenges and reveals the strengths and weaknesses of each approach. We also introduce ensemble techniques based on LLM perplexity, along with adaptive strategies that select the most effective LLM and classification method under specific privacy constraints.
Vito et al. studied this question.