What type of study is this?

This is a quantitative study.

October 8, 2025 · Open Access

Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Key Points

Breakpoint automatically generates code-repair tasks at scale, enabling evaluation of system-level reasoning.
Across more than 900 generated tasks, success rates for state-of-the-art models range from 55% on the easiest tasks down to 0% on the hardest.
The methodology controls task difficulty along two dimensions: local reasoning and system-level reasoning.
This approach reduces the need for human effort in creating and tuning evaluation benchmarks.

Abstract

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g., SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks, we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
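To make the two difficulty dimensions concrete, the sketch below scores each top-level function in a Python module by an approximate cyclomatic complexity (local reasoning) and by the number of distinct callers within the same module (a simple stand-in for call-graph centrality). This is an illustration only, not the Breakpoint implementation: the names `cyclomatic_complexity`, `analyze`, and `EXAMPLE` are hypothetical, the complexity count is a common approximation of McCabe's metric, and in-degree is assumed here as the centrality measure because the abstract does not say which one is used.

```python
import ast
from collections import defaultdict

# Node types counted as decision points (an approximation of McCabe's metric).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                 ast.ExceptHandler, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes two extra branches.
            score += len(node.values) - 1
    return score


def analyze(source: str) -> dict:
    """Score every top-level function by complexity and caller count."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    # callers[f] = set of top-level functions that call f by bare name.
    # Assumption: method calls and imported names are ignored for simplicity.
    callers = defaultdict(set)
    for name, fn in funcs.items():
        for node in ast.walk(fn):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in funcs
                    and node.func.id != name):
                callers[node.func.id].add(name)

    return {
        name: {"complexity": cyclomatic_complexity(fn),
               "in_degree": len(callers[name])}
        for name, fn in funcs.items()
    }


# Hypothetical module used only to exercise the sketch.
EXAMPLE = '''
def parse(x):
    if x and x.strip():
        return x.strip()
    return None

def load(path):
    return parse(path)

def main(paths):
    for p in paths:
        load(p)
        parse(p)
'''

if __name__ == "__main__":
    for name, scores in analyze(EXAMPLE).items():
        print(name, scores)
```

In this toy module, `parse` scores highest on both axes: it has the most branching and is called by both other functions. Under the assumptions above, corrupting it would be the hardest single-function task of the three.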

Cite this study

Hariharan et al. studied this question.

www.synapsesocial.com/papers/68e6bc5f38ca8e474d549fbe — DOI: https://doi.org/10.48550/arxiv.2506.00172

Authors

Kaivalya Hariharan

Uzay Girit

Andrew Wang
