What question did this study set out to answer?

The research aims to determine whether self-referential capabilities in language models arise from pretraining or from instruction tuning.

April 19, 2026 · Open Access

Self-Reference in Base vs Instruction-Tuned Large Language Models

Key Points

The research aims to determine whether self-referential capabilities in language models arise from pretraining or from instruction tuning.
The study compared 8 base language models with their instruction-tuned counterparts across 4 families (Llama, Qwen, Mistral, Gemma).
It analyzed 6,400 text generations using content coding and entropy profiling.
It used linear probing to test whether internal activations differ between base and instruction-tuned models.
Instruction tuning amplified self-referential behavior by 2.4 times (Cohen's d = 0.80).
Base models still displayed non-trivial self-reference without post-training, at over 90 times the rate of control baselines.
Linear probes achieved 89-99% accuracy in distinguishing base from instruction-tuned models.
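
For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of how Cohen's d can be computed for two groups of scores. It assumes the common pooled-standard-deviation variant for independent samples; the study summary does not state which variant was used. The function name `cohens_d` and the example scores are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical per-generation self-reference scores, NOT data from the study.
instruct_scores = [3.0, 5.0, 4.0, 6.0]
base_scores = [2.0, 4.0, 3.0, 5.0]

d = cohens_d(instruct_scores, base_scores)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values around 0.8 count as a large effect, which is how the reported d = 0.80 would typically be read.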

Abstract

When language models produce self-referential text - discussing their own processes, expressing uncertainty about their nature - is this a capacity learned during pretraining, or a behavioral pattern installed by instruction tuning? We disentangle these possibilities by comparing 8 base models with their instruction-tuned counterparts across 4 families (Llama, Qwen, Mistral, Gemma), analyzing 6,400 generations with content coding, entropy profiling, and linear probing of internal activations. Instruction tuning amplifies self-referential behavior by 2.4x (Cohen's d = 0.80), but base models produce non-trivial self-reference even without post-training (over 90x above control baselines). Critically, these differences are representationally deep: linear probes decode base-vs-instruct status at 89-99% accuracy from mid-layer activations, and instruction tuning reduces first-token entropy with a very large effect size (d = 3.38). Our results suggest that self-reference emerges during pretraining, and that instruction tuning amplifies and reshapes it rather than creating it.
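
To make the first-token entropy measure concrete, here is a minimal sketch of Shannon entropy computed over a next-token distribution. It assumes entropy is taken over the softmax of the model's logits for the first generated token and is reported in bits; the abstract does not specify either detail. The helper names and the toy four-token logits are hypothetical.

```python
from math import exp, log2


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical logits over a toy 4-token vocabulary, NOT values from the study.
flat_logits = [0.0, 0.0, 0.0, 0.0]      # model is undecided about the first token
peaked_logits = [10.0, 0.0, 0.0, 0.0]   # model strongly prefers one first token

flat_entropy = entropy_bits(softmax(flat_logits))
peaked_entropy = entropy_bits(softmax(peaked_logits))
print(f"flat: {flat_entropy:.3f} bits, peaked: {peaked_entropy:.3f} bits")
```

A lower value means the model concentrates probability on fewer candidate first tokens, which is the direction of change the abstract reports for instruction-tuned models.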
