What question did this study set out to answer?

The research aims to determine whether self-referential capabilities in language models arise from pretraining or from instruction tuning.

April 19, 2026 | Open Access

Self-Reference in Base vs Instruction-Tuned Large Language Models

Key Points

The research aims to determine whether self-referential capabilities in language models arise from pretraining or from instruction tuning.
Compared 8 base language models with their instruction-tuned versions across 4 families.
Analyzed 6,400 text generations using content coding and entropy profiling.
Employed linear probing to assess differences in internal activations.
Instruction tuning amplified self-referential behavior by 2.4 times (Cohen's d = 0.80).
Base models displayed significant self-reference, over 90 times above control baselines.
Linear probes achieved 89-99% accuracy in distinguishing base from instruction-tuned models.
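
The effect size quoted above (Cohen's d) can be illustrated with a minimal sketch. It assumes the pooled-standard-deviation form of d, which the source does not confirm, and the score lists are hypothetical values, not data from the study.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical per-generation self-reference scores (illustration only).
instruct_scores = [3.0, 4.0, 5.0, 6.0]
base_scores = [1.0, 2.0, 3.0, 4.0]

d = cohens_d(instruct_scores, base_scores)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, d near 0.8 is read as a large effect, which is how the 2.4x amplification figure should be interpreted.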

Abstract

When language models produce self-referential text - discussing their own processes, expressing uncertainty about their nature - is this a capacity learned during pretraining, or a behavioral pattern installed by instruction tuning? We disentangle these possibilities by comparing 8 base models with their instruction-tuned counterparts across 4 families (Llama, Qwen, Mistral, Gemma), analyzing 6,400 generations with content coding, entropy profiling, and linear probing of internal activations. Instruction tuning amplifies self-referential behavior by 2.4x (Cohen's d = 0.80), but base models produce non-trivial self-reference even without post-training (over 90x above control baselines). Critically, these differences are representationally deep: linear probes decode base-vs-instruct status at 89-99% accuracy from mid-layer activations, and instruction tuning reduces first-token entropy with an effect size of d = 3.38. Our results suggest that self-referential capacities emerge during pretraining, and that instruction tuning amplifies and reshapes them rather than creating them.
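
As a rough illustration of what "first-token entropy" measures, the sketch below computes the Shannon entropy of a next-token distribution obtained from logits. The logit vectors are hypothetical toy values over a four-token vocabulary, and the choice of bits as the unit is an assumption; the source does not describe its exact entropy pipeline.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical first-token logits (illustration only).
flat_logits = [0.0, 0.0, 0.0, 0.0]     # no preferred first token
peaked_logits = [10.0, 0.0, 0.0, 0.0]  # one strongly preferred first token

h_flat = entropy_bits(softmax(flat_logits))
h_peaked = entropy_bits(softmax(peaked_logits))
print(f"flat: {h_flat:.3f} bits, peaked: {h_peaked:.3f} bits")
```

A lower first-token entropy means the model is more committed to a particular way of opening its response, which is the direction of the shift the abstract reports for instruction-tuned models.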

Authors

Jędrzej Paweł Maczan
