This empirical study investigates how state-of-the-art Large Language Models (LLMs) can automatically resolve code issues identified by SonarQube, a widely used static analysis tool. As automated maintenance becomes more common, combining AI models with rule-based analysis offers a promising approach to improving code quality. We compare six LLMs (GPT-4o, Gemini 2.0 Flash, Claude 3 Opus, Mistral Large, Grok 3, and DeepSeek V3) on automated code repair. Using a unified prompt strategy, we map SonarQube issues into structured prompts and replace the affected functions in the source code with the LLM-generated fixes. We evaluate repairs on syntactic correctness and on the reduction in SonarQube-reported issues, and we introduce the Static Repair Success Rate (SRSR), a strict metric that measures the proportion of syntactically valid repairs that resolve all original issues without introducing new ones. A semantic analysis then assesses whether the repaired code preserves the intended program behavior. Averaged across all models and projects, the reduction in SonarQube-reported issues was 36.02%. The best result for a single project was achieved by Grok 3, which reduced issues by 71.54%. These findings suggest that LLMs can enhance automated refactoring and help reduce the issues reported by static analysis. They also offer insights for integrating AI into development workflows, helping companies streamline maintenance, reduce technical debt, and sustain high code quality.
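The abstract defines SRSR only in words, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes the denominator is the set of syntactically valid repairs and that a repair succeeds when no original issue remains and no new issue appears. The `RepairOutcome` record, its field names, and both function names are hypothetical and introduced purely for illustration; the percentage helper shows the usual relative-reduction arithmetic, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RepairOutcome:
    """Hypothetical record of one LLM-generated repair (not from the paper)."""
    syntactically_valid: bool
    original_issues_remaining: int
    new_issues_introduced: int


def static_repair_success_rate(outcomes: Iterable[RepairOutcome]) -> float:
    """Fraction of syntactically valid repairs that clear every original
    issue and introduce none (assumed reading of SRSR)."""
    valid = [o for o in outcomes if o.syntactically_valid]
    if not valid:
        return 0.0  # assumption: report 0 when no repair is valid
    successes = sum(
        1
        for o in valid
        if o.original_issues_remaining == 0 and o.new_issues_introduced == 0
    )
    return successes / len(valid)


def issue_reduction_percent(issues_before: int, issues_after: int) -> float:
    """Relative reduction in reported issues, as a percentage."""
    if issues_before <= 0:
        raise ValueError("issues_before must be positive")
    return 100.0 * (issues_before - issues_after) / issues_before


if __name__ == "__main__":
    sample = [
        RepairOutcome(True, 0, 0),   # clean fix: counts as a success
        RepairOutcome(True, 1, 0),   # an original issue remains
        RepairOutcome(True, 0, 2),   # fixed, but new issues appeared
        RepairOutcome(False, 0, 0),  # does not parse: excluded from denominator
    ]
    print(static_repair_success_rate(sample))
    print(issue_reduction_percent(200, 150))
```

Under this reading, a repair that removes the flagged issue but triggers a different rule counts as a failure, which is why SRSR is stricter than the raw reduction in issue counts.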
Patcas et al. studied this question.
www.synapsesocial.com/papers/69df2c01e4eeef8a2a6b0f12 — DOI: https://doi.org/10.1007/s10664-026-10858-8
Rares Patcas
Simona Motogna
Empirical Software Engineering
Babeș-Bolyai University