What question did this study set out to answer?

This research aims to enhance automated issue report classification using advanced LLM techniques.

May 16, 2026

Advancing LLM-Based Issue Report Classification with Explained Few-Shot Learning, Intent Extraction, Ensemble, and Summarization

Key Points

Proposed intent extraction and classification to identify user intent in reports.
Developed ensemble classification to improve accuracy via majority voting.
Implemented explained few-shot learning to leverage example-based decision-making with clear rationales.
GPT-4 outperformed existing models by 5-8% in classification accuracy.
Qwen-2.5 surpassed the larger GPT-3.5-turbo, suggesting that factors beyond parameter count, such as architecture and training data, affect classification performance.
Identified key classification challenges and advantages of each method through extensive evaluations.

Abstract

Effective software maintenance requires automated issue report classification due to increasing report volume and complexity. Although fine-tuned BERT models and Large Language Models (LLMs) have exhibited potential in this field, they face critical limitations in handling lengthy reports and ensuring classification consistency. This paper presents an LLM-based method for processing long reports and explores two classification perspectives, namely user intent understanding and example-based decision-making. On this basis, we propose three LLM-based methods: (1) Intent Extraction and Classification, which identifies and classifies user intent from issue reports; (2) Ensemble Classification, which enhances the intent-based method through majority voting; and (3) Explained Few-Shot Learning, which implements the example-based strategy with transparent rationales. We compare these methods against three baselines (a RoBERTa-based model, SETFIT, and a previous LLM-based method) using GPT-4o, GPT-3.5-turbo, and Qwen 2.5-32B, through an extensive evaluation that comprises consistency analysis, ablation studies, and an analysis of misclassification patterns. The results show that GPT-4 outperforms the state-of-the-art by 5–8% and performs well across all methods. Furthermore, the results show that Qwen-2.5 performs better than the larger GPT-3.5-turbo, suggesting that multiple factors beyond parameter count, including architectural design, training data composition, and optimization strategies, influence classification performance. The analysis creates a taxonomy of classification challenges and reveals important findings about the pros and cons of each approach. We also introduce innovative ensemble techniques based on LLM perplexity and adaptive strategies capable of selecting the most effective LLM and proposed classification method under specific privacy constraints.
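The majority-voting step behind Ensemble Classification can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `majority_vote`, the `fallback` parameter, the tie-breaking rule, and the example labels are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions, fallback=None):
    """Return the label predicted most often across several classifier runs.

    `predictions` is a list of label strings, one per run or per model.
    Ties are resolved by `fallback` if given, otherwise by the tied label
    that appeared first (an assumed rule, not taken from the paper).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")

    counts = Counter(predictions)
    top = max(counts.values())
    # Counter keeps first-seen order, so winners are in order of appearance.
    winners = [label for label, n in counts.items() if n == top]

    if len(winners) == 1:
        return winners[0]
    return fallback if fallback is not None else winners[0]


# Hypothetical labels from three independent classification runs.
votes = ["bug", "feature", "bug"]
final_label = majority_vote(votes)
```

With three runs voting `bug`, `feature`, `bug`, the ensemble settles on `bug`. An odd number of voters reduces ties but does not rule them out when there are more than two labels, which is why the sketch includes an explicit tie rule.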

Cite This Study

Vito et al. studied this question.

synapsesocial.com/papers/6a0809f1a487c87a6a40bc64 https://doi.org/10.1145/3815577
