What question did this study set out to answer?

This research investigates how temperature and instruction strategies impact hallucination behavior in large language models.

January 18, 2026

Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models

Key Points

Evaluated three LLMs using the TruthfulQA benchmark
Applied zero-shot, few-shot, and 'say-I-don't-know' prompting
Measured performance through exact match, token-level F1, and other metrics
Conducted two-way ANOVA and Tukey post-hoc analyses
Instruction strategies significantly influenced factual accuracy across all models
Few-shot prompting achieved the highest mean F1 score (0.1889)
'Say-I-don't-know' prompts increased semantic alignment but reduced precision
Embedding-based similarity showed higher consistency for zero-shot responses
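
As a rough illustration of the two lexical metrics named above, the sketch below computes exact match and token-level F1 for a single prediction and reference pair. The normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption; the study does not state which normalization it used, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the sky is blue", "blue sky")` gives a precision of 2/3 and a recall of 1, so the score is 0.8, while `exact_match` for the same pair is 0.0.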

Abstract

Large Language Models (LLMs) have demonstrated impressive generative and reasoning abilities, yet their tendency to produce factually incorrect or fabricated information (so-called hallucinations) remains a key limitation. This study systematically examines how temperature and system instruction strategies affect hallucination behavior in open-source LLMs executed through the Ollama framework. Three representative models (Gemma 2B, Mistral 7B Instruct, and Phi-3 Mini) were evaluated on the TruthfulQA benchmark using zero-shot, few-shot, and “say-I-don’t-know” prompting paradigms. Performance was measured through exact match, token-level F1, semantic similarity, and embedding-based similarity metrics. Two-way ANOVA and Tukey post-hoc analyses revealed that system instruction significantly influenced factual accuracy across all models, while temperature effects were comparatively minor. Few-shot prompting achieved the highest mean F1 score (0.1889), indicating that example conditioning effectively constrained hallucinations. Conversely, “say-I-don’t-know” prompts increased semantic alignment but reduced precision, suggesting a conservative refusal bias. Embedding-based similarity analyses confirmed higher semantic consistency for zero-shot responses. The results highlight that prompt design exerts a stronger and more interpretable influence on hallucination than sampling stochasticity, offering practical guidance for improving the factual reliability of open-source LLMs.
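
The factorial design described in the abstract (model × temperature × instruction strategy) could be laid out roughly as in the sketch below. It only builds request payloads in the shape accepted by Ollama's `/api/generate` endpoint (`model`, `prompt`, `system`, `stream`, `options.temperature`) and sends nothing. The model tags, temperature values, and system prompt wording are all assumptions for illustration; the abstract does not report them.

```python
from itertools import product

# Assumed Ollama model tags for the three models named in the abstract.
MODELS = ["gemma:2b", "mistral:7b-instruct", "phi3:mini"]

# Hypothetical temperature levels; the abstract does not list the values tested.
TEMPERATURES = [0.0, 0.5, 1.0]

# Hypothetical system prompts for the three instruction strategies.
SYSTEM_PROMPTS = {
    "zero_shot": "Answer the question concisely.",
    "few_shot": (
        "Answer the question concisely, as in this example.\n"
        "Q: What is the capital of France?\nA: Paris."
    ),
    "say_i_dont_know": (
        "Answer the question concisely. "
        "If you are not sure of the answer, say \"I don't know\"."
    ),
}


def build_jobs(question):
    """Return one job per (model, temperature, strategy) combination."""
    jobs = []
    for model, temperature, (strategy, system) in product(
        MODELS, TEMPERATURES, SYSTEM_PROMPTS.items()
    ):
        jobs.append(
            {
                "strategy": strategy,
                "payload": {
                    "model": model,
                    "prompt": question,
                    "system": system,
                    "stream": False,  # request a single JSON response
                    "options": {"temperature": temperature},
                },
            }
        )
    return jobs
```

With three levels per factor this yields 27 jobs per question. Each payload could then be posted to a local Ollama server and the responses scored per condition, which gives the model-by-temperature-by-strategy table that a two-way ANOVA needs.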

Authors

Abdullah Talha Kabakuş

Journals

Gazi University Journal of Science Part A Engineering and Innovation

Institutions

Düzce Üniversitesi
