Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g., SWE-bench) rely on manually curated issues, so expanding them or tuning their difficulty demands expensive human effort, and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks, we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
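To make the two difficulty dimensions concrete, the following is a minimal sketch of how a function's local complexity and its position in the call graph might be measured for a single Python module. It is an illustration under stated assumptions, not the authors' tooling: the helper names (`cyclomatic_complexity`, `caller_counts`, `difficulty_profile`) are hypothetical, the complexity count is a simplified approximation of McCabe's metric, and the number of distinct callers is used as a crude stand-in for call-graph centrality.

```python
import ast

# Node types treated as decision points in this simplified count (an assumption;
# full McCabe tooling handles more constructs).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two additional paths.
            score += len(node.values) - 1
    return score


def caller_counts(tree: ast.Module) -> dict:
    """Count distinct callers of each top-level function (name-based in-degree)."""
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    callers = {name: set() for name in funcs}
    for name, func in funcs.items():
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in funcs
                    and node.func.id != name):
                callers[node.func.id].add(name)
    return {name: len(who) for name, who in callers.items()}


def difficulty_profile(source: str) -> dict:
    """Map each top-level function to (complexity, number of callers)."""
    tree = ast.parse(source)
    degree = caller_counts(tree)
    return {n.name: (cyclomatic_complexity(n), degree[n.name])
            for n in tree.body if isinstance(n, ast.FunctionDef)}


# Hypothetical module used only to exercise the helpers.
SAMPLE = '''
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def scale(x):
    return clamp(x * 2, 0, 10)

def shift(x):
    return clamp(x + 1, 0, 10)
'''

profile = difficulty_profile(SAMPLE)
for name, (complexity, n_callers) in profile.items():
    print(f"{name}: complexity={complexity}, callers={n_callers}")
```

In this toy module, `clamp` scores highest on both axes, so under the assumptions above it would be the harder corruption target. A fuller treatment would resolve method calls and cross-module imports and use a proper graph centrality measure.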
Kaivalya Hariharan
Uzay Girit
Andrew Wang
Hariharan et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68e6bc5f38ca8e474d549fbe — DOI: https://doi.org/10.48550/arxiv.2506.00172