Abstract
Large Language Models (LLMs) are increasingly discussed as tools for supporting mathematical problem solving. However, existing research predominantly evaluates LLM performance in terms of correctness, while the mathematics educational quality of AI-generated worked-out solutions (a set of quality criteria relating to the teaching and learning of mathematics) remains largely unexplored. This study investigates how that quality can be influenced by model choice and prompt engineering. Four LLMs (Gemini 1.5 Pro (Advanced), Claude 3.5 Sonnet, ChatGPT-o3 mini, and DeepSeek-R1) were tested on six problem-solving tasks from the domains of number and algebra using four prompt techniques (Zero Shot, Chain of Thought, Persona, and Retrieval-Augmented Generation). In total, 2880 solutions were analyzed by combining human expert coding of content-related, process-related, and pedagogical-contextual quality with binary logistic regression. Results show that content-related quality is mainly driven by model type and task characteristics, whereas process-related and pedagogical-contextual quality depend strongly on prompt design, particularly Persona prompting. Overall, no single model or prompt technique performs best across all dimensions, indicating that effective educational use of LLMs requires context-sensitive combinations of models, prompts, and tasks.
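As a rough check on the study's scale, the sketch below enumerates the model × prompt × task grid named in the abstract and divides the reported 2880 solutions by the number of cells. It assumes a fully crossed, balanced design, which the abstract does not confirm, and the task labels are hypothetical placeholders because the six tasks are not named.

```python
from itertools import product

# Models and prompt techniques as listed in the abstract.
models = [
    "Gemini 1.5 Pro (Advanced)",
    "Claude 3.5 Sonnet",
    "ChatGPT-o3 mini",
    "DeepSeek-R1",
]
prompts = [
    "Zero Shot",
    "Chain of Thought",
    "Persona",
    "Retrieval-Augmented Generation",
]
# Hypothetical labels: the abstract only says there are six tasks.
tasks = [f"task_{i}" for i in range(1, 7)]

# Every (model, prompt, task) combination, assuming a fully crossed design.
cells = list(product(models, prompts, tasks))

total_solutions = 2880  # figure reported in the abstract

# Assumption: solutions are spread evenly over the cells (balanced design).
per_cell, remainder = divmod(total_solutions, len(cells))

print(f"cells: {len(cells)}, solutions per cell: {per_cell}, remainder: {remainder}")
```

Under these assumptions the total divides evenly across the grid, which would be consistent with a fixed number of repeated generations per combination. The abstract itself does not state how the 2880 solutions were distributed.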
Sebastian Schorcht
Fabian Anton Müller
Nils Buchholtz
ZDM
Universität Hamburg
Technische Universität Dresden
Hochschule für Technik und Wirtschaft Dresden – University of Applied Sciences
Schorcht et al. studied this question.
www.synapsesocial.com/papers/69d893eb6c1944d70ce04e69 — DOI: https://doi.org/10.1007/s11858-026-01784-6