In this paper, we present a comprehensive study of five large language models (LLMs), namely StarCoder2, LLaMA, CodeLlama, Mistral, and DeepSeek, on the task of abstracting UML class diagrams from code. Our aim is to give researchers and developers insight into the capabilities and limitations of various LLMs in a model-driven reverse engineering process. We evaluate the LLMs by prompting them to generate UML class diagrams for both Java and Python programs, focusing on accuracy, consistency, and F1 score. Our findings reveal that all LLMs achieve higher accuracy and F1 scores for Python than for Java. DeepSeek and Mistral perform best overall, while LLaMA consistently scores lowest on all metrics and for both languages.
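To make the F1 metric concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed for one generated class diagram against a reference diagram. The abstract does not define the metrics, so the set-based matching, the function name `diagram_scores`, the element label format (`class:`, `attr:`, `op:`), and the sample data are all illustrative assumptions, not the authors' evaluation procedure.

```python
def diagram_scores(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Score a predicted set of UML elements against a reference set.

    Assumption: each diagram is flattened into a set of string labels and
    an element counts as correct only on an exact label match.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference diagram elements (not data from the study).
reference = {
    "class:Account",
    "class:Customer",
    "attr:Account.balance",
    "op:Account.deposit",
}

# Hypothetical LLM output: one operation is wrong, the rest match.
predicted = {
    "class:Account",
    "class:Customer",
    "attr:Account.balance",
    "op:Account.withdraw",
}

scores = diagram_scores(predicted, reference)
print(scores)
```

In this made-up case three of four predicted elements match and three of four reference elements are recovered, so precision, recall, and F1 all equal 0.75. A real evaluation would also need a rule for near matches, such as renamed attributes or differing type annotations, which this sketch deliberately leaves out.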
Hanan Siala (ORCID: 0009-0003-4693-8707)
King's College London
Kevin Lano (ORCID: 0000-0002-9706-1410)
Siala et al. studied this question.