What question did this study set out to answer?

To develop a benchmark for evaluating cognitive reasoning in language models and assess their limitations.

February 16, 2026 · Open Access

A Benchmark for Evaluating Cognitive Reasoning in Modern Language Models

Key Points

Proposed a modular benchmark for cognitive competence assessment in language models.
Integrated dimensions of language processing: factual, syntactic, and logical.
Tested eight language models using a uniform procedure and scoring scheme.
Larger language models performed better on general knowledge tasks but struggled with conjunctive reasoning.
Feedback led to unstable but measurable improvements, indicating reactive mechanisms.
Models failed to construct new, coherent chains of beliefs, undermining claims of cognitive understanding.

Abstract

With the growth of large language models (LLMs), there are increasing calls to interpret their behavior through analogies to human cognitive mechanisms. At the same time, the scientific literature points to fundamental limitations of these systems, describing them, among other things, as models that generate a superficial simulation of reasoning without real access to semantic meaning (“stochastic parrots” or the “illusion of reasoning”). This paper proposes an innovative, modular benchmark for assessing the cognitive competence of LLMs that integrates three complementary dimensions of language processing: factual, syntactic, and logical. Eight language models (Llama 3.2, Mistral 7B, Llama 3 8B, Gemini 2.5 Flash, ChatGPT-3, ChatGPT-4o mini, ChatGPT-4, and ChatGPT-5) were tested using a uniform procedure, with the context reset after each interaction and a three-point scoring scheme (0/0.5/1). The results showed a clear advantage for the largest models in tasks based on general knowledge and on formal transformations known from training, but a significant drop in effectiveness, regardless of model size, in tasks requiring conjunctive reasoning based solely on new, local premises. Importantly, some models showed unstable but measurable corrective abilities after feedback, which suggests the presence of reactive mechanisms; these abilities were nevertheless insufficient to regard the models as systems capable of cognitive self-reflection. The combined analysis indicates that LLMs effectively simulate the rules of syntax and logic when the task corresponds to recognizable formal patterns, but fail in situations requiring the construction of new, coherent chains of beliefs and symbolic inferences, which undermines the thesis of their cognitive “understanding”. The results justify the need for more complex and semantically restrictive evaluation frameworks that can distinguish statistical fit from systemic, multi-stage formal reasoning.
The proposed benchmark is a step towards a more multidimensional and diagnostic evaluation of LLMs, shifting the focus from “will the model respond correctly?” to “why and under what conditions is the model able to reason?”
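
The evaluation procedure described in the abstract (a fresh context for every interaction, a 0/0.5/1 score per task, results grouped by the factual, syntactic, and logical dimensions) can be illustrated with a minimal Python sketch. This is not the authors' code: the task format, the `ask` and `grade` callables, the toy model, and the toy grader are all assumptions made for illustration, and averaging per dimension is one plausible way to aggregate, not a method stated in the abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical task record: (dimension, prompt, reference answer).
Task = Tuple[str, str, str]

ALLOWED_SCORES = (0.0, 0.5, 1.0)  # three-point scheme named in the abstract


def evaluate(ask: Callable[[str], str],
             grade: Callable[[str, str], float],
             tasks: List[Task]) -> Dict[str, float]:
    """Return the mean score per dimension for one model.

    `ask` must start a fresh conversation on every call, which stands in
    for the context reset after each interaction.
    """
    scores = defaultdict(list)
    for dimension, prompt, reference in tasks:
        answer = ask(prompt)  # no history is carried between tasks
        score = grade(answer, reference)
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is outside the 0/0.5/1 scheme")
        scores[dimension].append(score)
    # Averaging is an assumed aggregation, chosen only for illustration.
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}


# Toy stand-ins so the sketch runs without any model API.
def toy_model(prompt: str) -> str:
    return {"Capital of France?": "Paris"}.get(prompt, "unknown")


def toy_grader(answer: str, reference: str) -> float:
    if answer == reference:
        return 1.0
    return 0.5 if reference.lower() in answer.lower() else 0.0


toy_tasks = [
    ("factual", "Capital of France?", "Paris"),
    ("logical", "If A implies B and A holds, what follows?", "B"),
]
result = evaluate(toy_model, toy_grader, toy_tasks)
print(result)
```

Passing the model as a stateless callable keeps the context reset explicit: each task sees only its own prompt, so a correct answer cannot depend on earlier interactions.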

Authors

Kinga Piętka

Michał Bereta
