April 10, 2026 (Open Access)

No one-size-fits-all: a study of prompt techniques and large language models to enhance AI’s mathematics educational quality

Key Points

The study investigates how model choice and prompt techniques affect the mathematics educational quality of AI-generated solutions.
Tested four large language models on six math problem-solving tasks.
Applied four prompt techniques: Zero Shot, Chain of Thought, Persona, and Retrieval-Augmented Generation.
Conducted human expert coding for quality assessment across content, process, and pedagogical aspects.
Analyzed 2880 generated solutions using binary logistic regression.
Content-related quality is mainly influenced by the model type and task characteristics.
Process-related and pedagogical-contextual quality depend significantly on prompt design, especially Persona prompting.
No single model or prompt technique was best for all quality dimensions, pointing to the need for tailored approaches.
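
The counts in the key points can be cross-checked with a short sketch. It assumes a fully crossed, balanced design (every model paired with every prompt technique and every task, with the same number of solutions per combination); the source does not state this, so the per-combination figure is an inference, and the task labels are hypothetical placeholders.

```python
from itertools import product

# Factor levels named in the study summary.
MODELS = [
    "Gemini 1.5 Pro (Advanced)",
    "Claude 3.5 Sonnet",
    "ChatGPT-o3 mini",
    "DeepSeek-R1",
]
PROMPTS = [
    "Zero Shot",
    "Chain of Thought",
    "Persona",
    "Retrieval-Augmented Generation",
]
# Hypothetical placeholder labels: the summary gives only the number of tasks.
TASKS = [f"task_{i}" for i in range(1, 7)]

TOTAL_SOLUTIONS = 2880  # reported in the summary

# Assumption: every model x prompt x task combination was generated.
cells = list(product(MODELS, PROMPTS, TASKS))
n_cells = len(cells)

# Under a balanced design, each combination gets the same number of solutions.
per_cell, remainder = divmod(TOTAL_SOLUTIONS, n_cells)

print(f"{n_cells} combinations, {per_cell} solutions each, remainder {remainder}")
```

A remainder of zero is consistent with a balanced design but does not prove it.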

Abstract

Large Language Models (LLMs) are increasingly discussed as tools for supporting mathematical problem solving. However, existing research predominantly evaluates LLM performance in terms of correctness, while the mathematics educational quality of AI-generated worked-out solutions remains largely unexplored. This study investigates how the mathematics educational quality (a set of quality criteria relating to teaching and learning mathematics) of AI-generated problem solutions can be influenced by model choice and prompt engineering. Four LLMs (Gemini 1.5 Pro (Advanced), Claude 3.5 Sonnet, ChatGPT-o3 mini, and DeepSeek-R1) were tested on six problem-solving tasks from the domains of number and algebra using four prompt techniques (Zero Shot, Chain of Thought, Persona, and Retrieval-Augmented Generation). In total, 2880 solutions were analyzed, combining human expert coding of content-related, process-related, and pedagogical-contextual quality with binary logistic regression. Results show that content-related quality is mainly driven by model type and task characteristics, whereas process-related and pedagogical-contextual quality depend strongly on prompt design, particularly Persona prompting. Overall, no single model or prompt technique performs optimally across all dimensions, indicating that effective educational use of LLMs requires context-sensitive combinations of models, prompts, and tasks.
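
To illustrate what a binary logistic regression on coded solutions looks like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the counts are invented toy data (not the study's), the single dummy predictor `persona` and the helper `fit_logistic` are hypothetical, and the fitting method is plain gradient descent rather than whatever software the authors used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression weights by batch gradient descent on mean log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(k):
                grad[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Invented toy data, NOT the study's results:
# 100 solutions without Persona prompting, 27 coded 1 on a binary criterion;
# 100 solutions with Persona prompting, 73 coded 1.
rows = [(0, 1)] * 27 + [(0, 0)] * 73 + [(1, 1)] * 73 + [(1, 0)] * 27

X = [[1.0, float(persona)] for persona, _ in rows]  # intercept + dummy predictor
y = [float(coded) for _, coded in rows]

intercept, persona_coef = fit_logistic(X, y)

# exp(coefficient) is the odds ratio: how the odds of a positive code
# change when the dummy predictor switches from 0 to 1.
odds_ratio = math.exp(persona_coef)
print(f"persona coefficient: {persona_coef:.3f}, odds ratio: {odds_ratio:.2f}")
```

In an analysis like the one described, model, prompt technique, and task would each enter as dummy-coded predictors, and each binary quality code would be an outcome; the sketch keeps one predictor so the fitted odds ratio can be checked by hand against the 2x2 counts.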

Authors

Sebastian Schorcht

Fabian Anton Müller

Nils Buchholtz

Journals

ZDM

Institutions

Universität Hamburg

Technische Universität Dresden

Hochschule für Technik und Wirtschaft Dresden – University of Applied Sciences
