Large Language Models (LLMs) have demonstrated impressive generative and reasoning abilities, yet their tendency to produce factually incorrect or fabricated information (so-called hallucinations) remains a key limitation. This study systematically examines how temperature and system instruction strategies affect hallucination behavior in open-source LLMs executed through the Ollama framework. Three representative models (Gemma 2B, Mistral 7B Instruct, and Phi-3 Mini) were evaluated on the TruthfulQA benchmark using zero-shot, few-shot, and “say-I-don’t-know” prompting paradigms. Performance was measured with exact match, token-level F1, semantic similarity, and embedding-based similarity metrics. Two-way ANOVA and Tukey post-hoc analyses revealed that system instruction significantly influenced factual accuracy across all models, while temperature effects were comparatively minor. Few-shot prompting achieved the highest mean F1 score (0.1889), indicating that example conditioning effectively constrained hallucinations. Conversely, “say-I-don’t-know” prompts increased semantic alignment but reduced precision, suggesting a conservative refusal bias. Embedding-based similarity analyses confirmed higher semantic consistency for zero-shot responses. The results highlight that prompt design exerts a stronger and more interpretable influence on hallucination than sampling stochasticity, offering practical guidance for improving the factual reliability of open-source LLMs.
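The abstract names exact match and token-level F1 without defining them. The sketch below shows one common way to compute both for a model answer against a reference answer. It assumes SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the paper's actual normalization and tokenization are not stated here, and the function names `normalize`, `exact_match`, and `token_f1` are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the study may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, precision is the share of predicted tokens that also appear in the reference, so an answer padded with extra tokens loses precision even when it contains the reference answer.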
Abdullah Talha Kabakuş
Gazi University Journal of Science Part A Engineering and Innovation
Düzce Üniversitesi
Abdullah Talha Kabakuş studied this question.
www.synapsesocial.com/papers/696c79cde45ebfc9113cd3de — DOI: https://doi.org/10.54287/gujsa.1819131