What type of study is this?

September 10, 2025Open Access

Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents

Key Points

The framework reveals how different LLMs handle dialectical reasoning through a three-stage process.
Evaluation scores are based on clarity, coherence, originality, and dialecticality, detected through dual agents.
Semantic diagnostics identified rhetorical flaws in the reasoning not reflected in scoring rubrics.
The integration of multi-stage reasoning and graph-based storage enhances interpretability and replicability.

Abstract

(1) Background and objectives: Large language models (LLMs) such as GPT, Mistral, and LLaMA exhibit strong capabilities in text generation, yet assessing the quality of their reasoning—particularly in open-ended and argumentative contexts—remains a persistent challenge. This study introduces Dialectical Agent, an internally developed modular framework designed to evaluate reasoning through a structured three-stage process: opinion, counterargument, and synthesis. The framework enables transparent and comparative analysis of how different LLMs handle dialectical reasoning. (2) Methods: Each stage is executed by a single model, and final syntheses are scored via two independent LLM evaluators (LLaMA 3.1 and GPT-4o) based on a rubric with four dimensions: clarity, coherence, originality, and dialecticality. In parallel, a rule-based semantic analyzer detects rhetorical anomalies and ethical values. All outputs and metadata are stored in a Neo4j graph database for structured exploration. (3) Results: The system was applied to four open-weight models (Gemma 7B, Mistral 7B, Dolphin-Mistral, Zephyr 7B) across ten open-ended prompts on ethical, political, and technological topics. The results show consistent stylistic and semantic variation across models, with moderate inter-rater agreement. Semantic diagnostics revealed differences in value expression and rhetorical flaws not captured by rubric scores. (4) Originality: The framework is, to our knowledge, the first to integrate multi-stage reasoning, rubric-based and semantic evaluation, and graph-based storage into a single system. It enables replicable, interpretable, and multidimensional assessment of generative reasoning—supporting researchers, developers, and educators working with LLMs in high-stakes contexts.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Cătălin Anghel

Andreea Alexandra Anghel

Emilia Pecheanu

Journals

Informatics

Actions

Institutions

"Dunarea de Jos" University of Galati

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study