When language models produce self-referential text (discussing their own processes, expressing uncertainty about their nature), is this a capacity learned during pretraining or a behavioral pattern installed by instruction tuning? We disentangle these possibilities by comparing 8 base models with their instruction-tuned counterparts across 4 families (Llama, Qwen, Mistral, Gemma), analyzing 6,400 generations with content coding, entropy profiling, and linear probing of internal activations. Instruction tuning amplifies self-referential behavior by 2.4x (Cohen's d = 0.80), but base models produce non-trivial self-reference even without post-training (over 90x above control baselines). Critically, the differences between base and instruction-tuned models are representationally deep: linear probes decode base-vs-instruct status at 89-99% accuracy from mid-layer activations, and instruction tuning reduces first-token entropy (d = 3.38). Our results suggest that self-reference emerges during pretraining and that instruction tuning amplifies and reshapes it rather than creating it.
Jędrzej Paweł Maczan (Mon,) studied this question.