What question did this study set out to answer?

The research aims to diagnose the depth of linguistic abilities in ChatGPT models through targeted assessments.

February 14, 2026 · Open Access

Using Diagnostic Probing to Expose the Shallow Syntactic and Semantic Foundations of ChatGPT as a Large Language Model

Key Points

The research aims to diagnose the depth of linguistic abilities in ChatGPT models through targeted assessments.
The study ran a multi-phase series of controlled experiments.
The experiments targeted syntactic and semantic phenomena in GPT-3.5-turbo and GPT-4.
Probes used forced-choice grammaticality judgments and plausibility assessments.
Chain-of-Thought (CoT) analysis was used to assess accuracy, stability, and reasoning soundness.
GPT-3.5 accuracy on long-range dependencies fell to 67%.
Quantifier scope accuracy dropped to 42% for GPT-3.5.
GPT-4 showed better performance but exhibited similar qualitative limitations.
Findings highlighted frequent reliance on surface-level pattern matching.

Abstract

The remarkable conversational fluency of OpenAI's ChatGPT often creates an illusion of deep linguistic understanding, prompting its adoption across diverse sectors. This study critically evaluates this purported knowledge by implementing a comprehensive battery of diagnostic probes grounded in theoretical linguistics. We designed a multi-phase series of controlled experiments targeting core syntactic phenomena, including hierarchical agreement, syntactic islands, and binding theory, alongside semantic phenomena such as logical operators, quantifier scope, and presupposition. The study evaluated both GPT-3.5-turbo and GPT-4 models via the OpenAI API using forced-choice grammaticality judgments, plausibility assessments, and Chain-of-Thought (CoT) analysis to measure accuracy, stability, and reasoning soundness. Quantitative results revealed significant performance degradation on complex linguistic structures, with accuracy on long-range dependencies and quantifier scope falling to 67% and 42% for GPT-3.5, respectively. While GPT-4 demonstrated quantitatively superior performance, it exhibited qualitatively similar failure patterns, indicating that scaling alone does not address fundamental limitations. Qualitative analysis of reasoning chains revealed frequent post-hoc rationalization, associative drift, and a reliance on surface-level pattern matching rather than sound logical deduction. The findings robustly demonstrate that ChatGPT's linguistic knowledge is shallow, statistically driven, and non-causal, failing to reliably implement abstract grammatical rules or compositional semantics. We conclude that a paradigm shift in large language model (LLM) evaluation is necessary, moving from broad, aggregate benchmarks to targeted, causal probes that diagnose specific architectural limitations. These findings have significant implications for AI safety, reliability, and the future development of genuinely intelligent systems, underscoring the need for architectural innovations beyond mere scaling of parameters and data.
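
To make the forced-choice protocol described in the abstract concrete, the following is a minimal sketch of how per-phenomenon accuracy could be scored over grammatical/ungrammatical minimal pairs. It is not the study's actual code. The names `MinimalPair`, `forced_choice_accuracy`, and `shortest_option_judge` are hypothetical, the two toy sentence pairs are illustrative and not drawn from the study's stimuli, and the judge is a deliberately shallow stand-in where a real setup would query a model. Presenting each pair in both orders is an assumption about how position bias might be controlled.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class MinimalPair(NamedTuple):
    """One probe item (hypothetical structure, not the study's format)."""
    phenomenon: str
    grammatical: str
    ungrammatical: str


# A judge sees two options and answers "A" or "B" for the one it deems grammatical.
Judge = Callable[[str, str], str]


def forced_choice_accuracy(pairs: List[MinimalPair], judge: Judge) -> Dict[str, float]:
    """Return accuracy per phenomenon, presenting every pair in both orders."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        presentations = (
            (pair.grammatical, pair.ungrammatical, "A"),
            (pair.ungrammatical, pair.grammatical, "B"),
        )
        for option_a, option_b, gold in presentations:
            total[pair.phenomenon] += 1
            if judge(option_a, option_b) == gold:
                correct[pair.phenomenon] += 1
    return {name: correct[name] / total[name] for name in total}


def shortest_option_judge(option_a: str, option_b: str) -> str:
    """Toy surface heuristic (assumption): prefer the shorter string."""
    return "A" if len(option_a) <= len(option_b) else "B"


# Illustrative items only; these are not the study's stimuli.
TOY_PAIRS = [
    MinimalPair(
        "agreement",
        "The keys to the cabinet are on the table.",
        "The keys to the cabinet is on the table.",
    ),
    MinimalPair(
        "island",
        "What did she say that he bought?",
        "What did she wonder whether he bought?",
    ),
]

if __name__ == "__main__":
    for name, acc in forced_choice_accuracy(TOY_PAIRS, shortest_option_judge).items():
        print(f"{name}: {acc:.2f}")
```

The toy judge illustrates why targeted probes matter: a surface heuristic can score perfectly on one phenomenon and fail completely on another, which an aggregate benchmark score would hide.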

Authors

Hashim Raza Khan

Institutions

National Research University Higher School of Economics
