What question did this study set out to answer?

The study aims to evaluate the effectiveness of large language models in resolving code quality issues identified by SonarQube.

April 15, 2026 (Open Access)

An evaluation study of large language models for addressing code quality issues

Key Points

Investigated six LLMs, including GPT-4o and Grok 3, for automated code repair.
Mapped SonarQube issues to structured prompts for LLMs.
Used a unified prompt strategy for comparison of model performance.
Introduced the Static Repair Success Rate (SRSR) to evaluate repair effectiveness.
Achieved an average reduction of 36.02% in SonarQube-reported issues across all models and projects.
Grok 3 achieved the highest single-project reduction in issues, at 71.54%.
Demonstrated LLMs' potential to enhance automated refactoring in software development.

Abstract

This empirical study investigates how state-of-the-art Large Language Models (LLMs) can automatically resolve code issues identified by SonarQube, a widely used static analysis tool. As automated maintenance becomes more common, combining AI models with rule-based analysis offers a promising approach to improving code quality. We compare six LLMs (GPT-4o, Gemini 2.0 Flash, Claude 3 Opus, Mistral Large, Grok 3, and DeepSeek V3) on automated code repair. Using a unified prompt strategy, SonarQube issues are mapped into structured prompts, and LLM-generated fixes replace the affected functions in the source code. We evaluate repairs on syntactic correctness and on the reduction in SonarQube-reported issues, and we introduce the Static Repair Success Rate (SRSR), a strict metric that measures the proportion of syntactically valid repairs that resolve all original issues without introducing new ones. A semantic analysis then assesses whether the repaired code preserves the intended program behavior. Overall, the average reduction in SonarQube-reported issues, calculated across all models and projects, was 36.02%. The best result for a single project was achieved by Grok 3, which reduced issues by 71.54%. These findings suggest that LLMs can enhance automated refactoring and help reduce static analysis–reported issues. They offer insights for integrating AI into development workflows, helping companies streamline maintenance, reduce technical debt, and sustain high code quality.
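
The two quantitative measures named in the abstract can be sketched as follows. This is a minimal illustration based only on the abstract's wording, not the authors' implementation: the `RepairOutcome` record and both function names are hypothetical, and the sketch assumes that SRSR is computed over syntactically valid repairs only, which is one reading of the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    """Result of one LLM-generated repair (hypothetical record layout)."""
    syntactically_valid: bool
    original_issues_remaining: int  # original issues still reported after the repair
    new_issues_introduced: int      # issues reported only after the repair


def static_repair_success_rate(outcomes):
    """Share of syntactically valid repairs that clear every original
    issue and introduce no new one (assumed reading of SRSR)."""
    valid = [o for o in outcomes if o.syntactically_valid]
    if not valid:
        return 0.0
    successes = sum(
        1
        for o in valid
        if o.original_issues_remaining == 0 and o.new_issues_introduced == 0
    )
    return successes / len(valid)


def issue_reduction_percent(issues_before, issues_after):
    """Percentage reduction in reported issues for one project."""
    if issues_before <= 0:
        raise ValueError("issues_before must be positive")
    return 100.0 * (issues_before - issues_after) / issues_before


# Made-up numbers for illustration only; they are not results from the study.
sample = [
    RepairOutcome(True, 0, 0),   # counts as a success
    RepairOutcome(True, 0, 2),   # fixed the originals but added new issues
    RepairOutcome(True, 1, 0),   # left one original issue unresolved
    RepairOutcome(False, 0, 0),  # invalid syntax, excluded from the denominator
]
srsr = static_repair_success_rate(sample)
reduction = issue_reduction_percent(200, 50)
```

The strictness of SRSR shows in the second and third sample records: a repair that removes most issues, or removes all of them while adding a new one, still counts as a failure, whereas the plain reduction percentage would credit it.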

Cite this study

Patcas et al. studied this question.

www.synapsesocial.com/papers/69df2c01e4eeef8a2a6b0f12 — DOI: https://doi.org/10.1007/s10664-026-10858-8

Authors

Rares Patcas

Simona Motogna

Journals

Empirical Software Engineering

Institutions

Babeș-Bolyai University
