What does this research mean for the field?

The LLM-as-a-Judge paradigm can provide scalable and reliable evaluations of software artifacts generated by Large Language Models (LLMs) in software engineering.

What question did this study set out to answer?

This research explores the potential of using LLMs as judges to evaluate LLM-generated software artifacts in software engineering.

February 21, 2026

LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead

Key Points

Explored the potential of using LLMs as judges for evaluating software artifacts in software engineering.
Conducted a literature review on existing studies regarding LLM-as-a-Judge in software engineering.
Analyzed limitations of current methodologies and identified key research gaps.
Outlined a roadmap to advance LLM-as-a-Judge frameworks for software artifact evaluation.
Highlighted the need for scalable and reliable evaluation methods.
Identified traditional metrics like BLEU as inadequate for assessing software quality.
Proposed LLM-as-a-Judge as a promising approach to enhance evaluation of software artifacts.

Abstract

The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks from code generation to program repair, producing a massive volume of software artifacts. This surge in automated creation has exposed a critical bottleneck: the lack of scalable and reliable methods to evaluate the quality of these outputs. Human evaluation, while effective, is very costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. This approach leverages the advanced reasoning and coding capabilities of LLMs themselves to perform automated evaluations, offering a compelling path toward achieving both the nuance of human insight and the scalability of automated systems. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
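The abstract's point about reference-based metrics can be made concrete with a small sketch. The code below is not BLEU itself but a simplified bigram precision (no brevity penalty, a single n-gram order), and the three code snippets are hypothetical examples chosen for illustration. It shows how surface overlap can rank a buggy near-copy of the reference above a functionally equivalent rewrite.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference.

    A simplified stand-in for BLEU's clipped n-gram precision; tokens are
    split on whitespace only.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate count by how often the n-gram appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical snippets, pre-tokenized with spaces for simplicity.
reference = "def add ( a , b ) : return a + b"
equivalent = "def add ( x , y ) : result = x + y ; return result"  # correct, different surface form
copy_with_bug = "def add ( a , b ) : return a - b"  # wrong, but nearly identical text

score_equivalent = ngram_precision(equivalent, reference)
score_buggy = ngram_precision(copy_with_bug, reference)

print(f"equivalent rewrite: {score_equivalent:.2f}")  # 0.20
print(f"buggy near-copy:    {score_buggy:.2f}")       # 0.82
```

Under these assumptions the buggy snippet scores far higher than the correct rewrite, which illustrates the kind of gap the abstract argues an LLM-based judge could help close.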

Authors

Junda He

Jieke Shi

Terry Yue Zhuo

Journals

ACM Transactions on Software Engineering and Methodology

Institutions

Monash University

Australian National University

Commonwealth Scientific and Industrial Research Organisation
