What does this research mean for the field?

With a security-emphasized prompt, GPT-4 produced Java applications with 33.3% fewer critical and 53.8% fewer high-severity vulnerabilities than with a neutral prompt, while Llama 4 and DeepSeek showed increases in vulnerabilities under the same secure prompt. Novelty: novel finding. Consensus alignment: challenges consensus.

March 7, 2026 | Open Access

Evaluating the Efficiency of LLM-Generated Software in Resisting Malicious Attacks

Key Points

This research evaluates the security of Java applications generated by large language models using a structured framework.
The framework incorporates OWASP-supported tools: SpotBugs with FindSecBugs, OWASP Dependency Check, and OWASP ZAP.
Testing combined static code analysis, third-party dependency checks, and dynamic attack simulations.
The study compared vulnerabilities across three LLMs: GPT-4, Llama 4, and DeepSeek.
With the secure prompt, GPT-4 reduced critical and high-severity vulnerabilities by 33.3% and 53.8%, respectively.
Llama 4 showed a 10-15% increase across critical, high, and medium-severity vulnerabilities when shifting from the neutral to the secure prompt.
DeepSeek saw a 40% increase in low-severity vulnerabilities and no change in high-severity vulnerabilities.
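
The percentages above are relative changes in vulnerability counts between the neutral and the secure prompt. As a minimal sketch of that arithmetic, the snippet below computes the change per severity level. The counts are hypothetical, chosen only so the arithmetic reproduces the reported 33.3% and 53.8% reductions; they are not taken from the study, and the function name `percent_change` is an illustrative assumption.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent (negative means a reduction)."""
    if before <= 0:
        raise ValueError("baseline count must be positive to compute a relative change")
    return (after - before) / before * 100.0


# Hypothetical counts (NOT from the study), picked so the results match the reported percentages.
neutral_prompt = {"critical": 3, "high": 13}
secure_prompt = {"critical": 2, "high": 6}

changes = {
    severity: round(percent_change(neutral_prompt[severity], secure_prompt[severity]), 1)
    for severity in neutral_prompt
}
print(changes)  # {'critical': -33.3, 'high': -53.8}
```

A relative change is undefined when the baseline count is zero, which is why the sketch rejects that case rather than returning a misleading value.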

Abstract

This study introduces a structured framework for evaluating the security of Java applications generated by large language models (LLMs) and presents the results from its implementation across three models: DeepSeek, GPT-4, and Llama 4. The framework integrates Open Web Application Security Project (OWASP)-supported tools, such as SpotBugs with FindSecBugs, OWASP Dependency Check, and OWASP Zed Attack Proxy (ZAP), alongside the NIST Risk Management Framework. These tools and standards were selected for being publicly available, allowing this process to be replicated and extended without proprietary licensing, and for their alignment with widely adopted industry benchmarks. The testing methodology for generated Java applications includes static code analysis, third-party dependency checking, and dynamic attack simulation. Each of the tools specified for this study identifies a specific category of critical vulnerabilities. Identified vulnerabilities are then evaluated against NIST risk analysis standards to characterize their threat sources, likelihoods, and impacts, as well as their implications for the overall security risk profile of each application. The effect of prompt design is also explored by comparing a neutral prompt against a security-emphasized prompt incorporating OWASP best practices. Results varied considerably across models: GPT-4 showed noticeable improvements across critical and high-severity vulnerabilities, with 33.3% and 53.8% reductions, respectively. However, Llama 4 and DeepSeek saw an increase in vulnerabilities from the neutral to the secure prompt. Llama 4 had a general increase of 10-15% across critical, high, and medium-severity vulnerabilities, while DeepSeek saw no change in high-severity vulnerabilities and a 40% increase in low-severity vulnerabilities.
The framework presented provides a structured process for evaluating LLM-generated code against established software development and security standards, while identifying present limitations and possible directions for future work.
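
The risk characterization step described in the abstract can be sketched roughly as follows: each finding from the three testing stages carries a likelihood and an impact, and the two are combined into a qualitative risk level that feeds an overall profile. Everything in this sketch is an assumption made for illustration: the three-level scale, the `Finding` record, the example findings, and especially the combining rule (take the lower of the two levels), which is a simplification and not the NIST risk table or the study's actual procedure.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed three-level qualitative scale, ordered from lowest to highest.
LEVELS = ("low", "moderate", "high")


@dataclass(frozen=True)
class Finding:
    tool: str        # scanner that reported the finding
    category: str    # "static", "dependency", or "dynamic"
    likelihood: str  # one of LEVELS
    impact: str      # one of LEVELS


def risk_level(finding: Finding) -> str:
    """Illustrative rule only: overall risk is the lower of likelihood and impact."""
    rank = min(LEVELS.index(finding.likelihood), LEVELS.index(finding.impact))
    return LEVELS[rank]


def risk_profile(findings: list[Finding]) -> Counter:
    """Count findings per combined risk level."""
    return Counter(risk_level(f) for f in findings)


# Hypothetical findings, one per testing stage named in the abstract.
findings = [
    Finding("SpotBugs + FindSecBugs", "static", "high", "high"),
    Finding("OWASP Dependency Check", "dependency", "moderate", "high"),
    Finding("OWASP ZAP", "dynamic", "high", "low"),
]

profile = risk_profile(findings)
print(dict(profile))  # {'high': 1, 'moderate': 1, 'low': 1}
```

Keeping the tool and category on each record makes it possible to break the profile down by testing stage, which is one way the per-tool vulnerability categories could be compared across models.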
