With the growth of large language models (LLMs), there are increasing calls to interpret their behavior through analogies to human cognitive mechanisms. At the same time, the scientific literature points to fundamental limitations of these systems, describing them, among other things, as models that generate a superficial simulation of reasoning without real access to semantic meaning (“stochastic parrots” or the “illusion of reasoning”). This paper proposes an innovative, modular benchmark for assessing the cognitive competence of LLMs that integrates three complementary dimensions of language processing: factual, syntactic, and logical. Eight language models (LLama 3.2, Mistral 7B, LLama 3:8B, Gemini 2.5 Flash, ChatGPT-3, ChatGPT-4o mini, ChatGPT-4, and ChatGPT-5) were tested using a uniform procedure, with the context reset after each interaction and a three-point scoring scheme (0/0.5/1). The results showed a clear advantage for the largest models in tasks based on general knowledge and on formal transformations known from training, and a significant decrease in effectiveness, regardless of model size, in tasks requiring conjunctive reasoning based solely on new, local premises. Importantly, some models also showed unstable but measurable corrective abilities after feedback, which suggests the presence of reactive mechanisms but is insufficient to regard these models as systems capable of cognitive self-reflection. The combined analysis indicates that LLMs effectively simulate the rules of syntax and logic when the task corresponds to recognizable formal patterns, but fail in situations requiring the construction of new, coherent chains of beliefs and symbolic inferences, which undermines the thesis of their cognitive “understanding”. The results justify the need for more complex and semantically restrictive evaluation frameworks that can distinguish statistical fit from systemic, multi-stage formal reasoning.
The proposed benchmark is a step towards a more multidimensional and diagnostic evaluation of LLMs, shifting the focus from “will the model respond correctly?” to “why and under what conditions is the model able to reason?”
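The three-point scoring scheme (0/0.5/1) applied across the factual, syntactic, and logical dimensions could be aggregated along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mean_scores`, the `(model, dimension, score)` record layout, and the example scores are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Three-point scheme and the three dimensions named in the abstract.
ALLOWED_SCORES = {0.0, 0.5, 1.0}
DIMENSIONS = ("factual", "syntactic", "logical")


def mean_scores(records):
    """Return the average score per (model, dimension) pair.

    `records` is an iterable of (model, dimension, score) tuples.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, dimension, score in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score must be 0, 0.5, or 1, got {score!r}")
        totals[(model, dimension)] += score
        counts[(model, dimension)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up scores for a hypothetical model; these are not results from the paper.
example = [
    ("model-a", "factual", 1.0),
    ("model-a", "factual", 0.5),
    ("model-a", "logical", 0.0),
    ("model-a", "logical", 0.5),
]
averages = mean_scores(example)
```

Keeping the averages separate per dimension, rather than collapsing them into one number, matches the abstract's emphasis on diagnosing where a model succeeds or fails.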
Kinga Piętka
Michał Bereta
Piętka et al. studied this question.
www.synapsesocial.com/papers/69926552eb1f82dc367a1282 — DOI: https://doi.org/10.3390/app16041918