What type of study is this?

This is a quantitative study.

October 8, 2025 · Open Access

Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Key points

Breakpoint can generate scalable code-repair tasks, enabling evaluation of system-level reasoning.
Across more than 900 generated tasks, success rates for state-of-the-art models range from 55% on the easiest tasks down to 0% on the hardest.
The methodology controls task difficulty along two dimensions: local reasoning and system-level reasoning.
This approach reduces the need for human effort in creating and tuning evaluation benchmarks.

Abstract

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
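The two difficulty dimensions named in the abstract can be illustrated with a small sketch. This is not Breakpoint's implementation: the abstract does not say how the metrics are computed, so the code below assumes a simplified McCabe-style count of decision points for cyclomatic complexity and plain in-degree centrality over a name-based call graph. The helper names and the `SAMPLE` module are hypothetical and exist only for this example.

```python
import ast

# Hypothetical sample module, used only to exercise the helpers below.
SAMPLE = '''
def parse(x):
    if x and x.strip():
        return x.strip()
    return ""

def load(path):
    return parse(path)

def main(path):
    for item in [load(path), parse(path)]:
        print(item)
'''

# Node types counted as decision points in this simplified approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.Assert, ast.comprehension)


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes two extra paths.
            score += len(node.values) - 1
    return score


def call_graph(tree: ast.Module) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls by bare name."""
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {name: set() for name in funcs}
    for name, func in funcs.items():
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in funcs):
                graph[name].add(node.func.id)
    return graph


def in_degree_centrality(graph: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of the other functions that call each function directly."""
    counts = {name: 0 for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            if callee != caller:  # ignore direct recursion
                counts[callee] += 1
    denom = max(len(graph) - 1, 1)
    return {name: count / denom for name, count in counts.items()}


if __name__ == "__main__":
    tree = ast.parse(SAMPLE)
    centrality = in_degree_centrality(call_graph(tree))
    for fn in tree.body:
        if isinstance(fn, ast.FunctionDef):
            print(fn.name, cyclomatic_complexity(fn), centrality[fn.name])
```

Under these assumptions, a function such as `parse` that is called from many places scores high on the system-level axis even if its body is simple, while a heavily branched function scores high on the local axis regardless of who calls it. A real pipeline would also need to resolve method calls, imports, and cross-module references, which this sketch ignores.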

Authors

Kaivalya Hariharan

Uzay Girit

Andrew Wang
