What question did this study set out to answer?

This research aims to develop a software refactoring model using abstract syntax trees to predict class-level refactoring effectively.

April 15, 2026 | Open Access

Investigating the effectiveness of abstract syntax tree for refactoring prediction at class level using LSTM

Key Points

Generated abstract syntax trees for the datasets using the Checkstyle tool.
Applied SMOTE for balancing the imbalanced dataset before tokenization.
Used LSTM with varying architectures to predict refactoring requirements.
Evaluated performance using metrics like accuracy and AUC.
The three-layer LSTM achieved an accuracy of 96.24% and AUC of 0.58.
Bernoulli Naive Bayes outperformed other classifiers with an AUC of 0.87.
Balancing the dataset with SMOTE increased AUC to 0.98.
Shorter input sequences of 150 words yielded the highest accuracy of 97.15%.
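
The key points report a high accuracy next to a much lower AUC for the LSTM. As a general illustration of how the two metrics can diverge on imbalanced data, here is a minimal sketch using only the Python standard library. The toy labels and the constant-score "model" are hypothetical and are not taken from the study; the sketch does not claim to explain the paper's numbers.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 96 classes that need no refactoring, 4 that do.
y_true = [0] * 96 + [1] * 4
# A degenerate model that assigns every class the same low score.
scores = [0.1] * 100
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 0.96
print(auc(y_true, scores))       # 0.5
```

The degenerate model never flags a class for refactoring, yet it reaches 96% accuracy because the majority class dominates, while its AUC stays at chance level. This is one reason AUC is commonly reported alongside accuracy when the classes are imbalanced.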

Abstract

Many researchers have recommended refactoring models based on source code metrics and a threshold value. However, this approach is not universally accepted in industry because each organisation uses its own threshold value. It is therefore desirable to develop an automated model that can determine a threshold value acceptable to most organisations. This paper aims to develop a Software Refactoring Model (SRM) that uses an abstract syntax tree (AST), rather than source code metrics, to predict class-level refactoring. An AST is generated for each data set (Antlr4, Mct, Titan, Junit) with the Checkstyle 9.0.1 tool. Because the data sets are highly imbalanced, we use the SMOTE data sampling technique to balance the data, which is then tokenized. After tokenization, words are collected through different tree traversal techniques. In total, we consider 500 words for each project. The LSTM then takes the words in groups that grow incrementally by 50 words, using padding, to predict which classes require refactoring. We estimate metrics such as the Area Under the Curve (AUC), F-score, and accuracy to measure the performance of the refactoring prediction model. We also perform a comparative analysis by applying LSTM with different numbers of layers and by applying other frequently used classifiers, and we evaluate the proposed model using both AST and source code metrics. In the proposed refactoring model, the three-layer LSTM (LSTM3) performed best among the LSTM architectures (accuracy = 96.24%, AUC = 0.58), but Bernoulli Naive Bayes (BNB) performed best overall (AUC = 0.87). Balancing the dataset with SMOTE further enhanced discrimination ability, increasing the AUC to 0.98 (median = 1.00), up from 0.78 before balancing. Sequence length also affected performance: shorter inputs of 150 words produced the best results, with a mean accuracy of 97.15% and a mean AUC of 0.61.
In comparative trials, BNB consistently beat the other classifiers, including LSTM, while AST-based models outperformed models based on object-oriented metrics (accuracy = 94.66%, AUC = 0.94). Our experimental results suggest that the initial 150 words achieve a mean AUC rank of 4.47, the best among all ten groups for predicting the classes that need refactoring. Our results also show that BNB performs better (based on AUC) than other well-known classifiers, including LSTM. We also observe that using a larger number of layers yields significant results for software refactoring prediction. Finally, we compared the results obtained with AST and with object-oriented metrics (OOM) and observed that AST yields better results than OOM.
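
The abstract describes collecting words by tree traversal and feeding the LSTM padded groups that grow in steps of 50 words up to 500. A minimal sketch of one way this could look is given below, using only the Python standard library. The `Node` class, the `prefix_groups` helper, the `<PAD>` token, and the tiny example tree are all hypothetical, and reading the "ten groups" as prefixes of 50, 100, ..., 500 words is an assumption; the authors' actual tokenization and padding scheme may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)


def preorder(node):
    """Yield labels parent-first (pre-order traversal)."""
    yield node.label
    for child in node.children:
        yield from preorder(child)


def postorder(node):
    """Yield labels children-first (post-order traversal)."""
    for child in node.children:
        yield from postorder(child)
    yield node.label


def prefix_groups(words, step=50, max_len=500, pad="<PAD>"):
    """Build prefixes of length step, 2*step, ..., max_len, padding each
    prefix on the right when the class has fewer words than required."""
    words = list(words)[:max_len]
    groups = []
    for length in range(step, max_len + 1, step):
        seq = words[:length]
        groups.append(seq + [pad] * (length - len(seq)))
    return groups


# Illustrative tree for a class with one method; the labels are examples only.
tree = Node("CLASS_DEF", [
    Node("IDENT"),
    Node("OBJBLOCK", [
        Node("METHOD_DEF", [Node("IDENT"), Node("SLIST")]),
    ]),
])

words = list(preorder(tree))
groups = prefix_groups(words)
print(words)
print(len(groups), [len(g) for g in groups])
```

Under this reading, the third group corresponds to the "initial 150 words" setting mentioned in the abstract. In a real pipeline, each word would then be mapped to an integer index before being passed to an embedding layer and the LSTM.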

Cite this study

Panigrahi et al. studied this question.

www.synapsesocial.com/papers/69df2abce4eeef8a2a6afc55 — DOI: https://doi.org/10.1007/s10586-026-05965-6

Authors

Rasmita Panigrahi

Sanjay Misra

Lov Kumar

Journals

Cluster Computing

Institutions

National Institute of Technology Kurukshetra

Institute for Energy Technology

GIET University
