What type of study is this?

This is an experimental study.

September 29, 2025 · Open Access

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Key Points

The study introduces a length-controllable benchmark for evaluating the long-context capabilities of LLMs, addressing two shortcomings of existing benchmarks.
Existing benchmarks such as LongBench often lack metrics that separate long-context performance from a model's baseline ability.
The study introduces a metric that disentangles baseline knowledge from true long-context capability, making cross-model comparison clearer.
Fixed input lengths limit a benchmark's applicability across models and fail to reveal when a model begins to break down.

Abstract

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

Cite this study

Wang et al. studied this question.

www.synapsesocial.com/papers/68da58e0c1728099cfd118c8 — DOI: https://doi.org/10.48550/arxiv.2505.19293

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens · 2024 · 2 citations
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA · 2024 · 1 citation
XL²Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies · 2024 · 1 citation
LongIns: A Challenging Long-context Instruction-based Exam for LLMs · 2024
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Authors

Yang Wang

Hongye Jin

Shaochen Zhong
